From vlad at mellanox.co.il Sun Apr 1 00:20:58 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 01 Apr 2007 10:20:58 +0300 Subject: [ofa-general] [PATCH ofed_1_2] Chelsio: driver fixes + new FW support In-Reply-To: <1175259874.4995.15.camel@stevo-desktop> References: <1175259874.4995.15.camel@stevo-desktop> Message-ID: <1175412058.5917.0.camel@vladsk-laptop> On Fri, 2007-03-30 at 08:04 -0500, Steve Wise wrote: > Vlad, > > Please pull these commits from > > git://staging.openfabrics.org/~swise/ofed_1_2.git ofed_1_2 > > All the cross compiles and kernel builds pass. > > Thanks, > > Steve. > > Done. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From tziporet at dev.mellanox.co.il Sun Apr 1 01:30:10 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 01 Apr 2007 11:30:10 +0300 Subject: [ofa-general] RE: [ewg] ofed and vendors firmware In-Reply-To: References: <460D2AA6.8000409@cea.fr> Message-ID: <460F6D92.3010409@mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: > Cisco provides HCA firmware and documentation with our driver releases > in the form of ISO images, at > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux, to anyone who > registers at cisco.com. To get support from Cisco you must have a Cisco > support contract. > > You may wish to add this to the OFED Wiki: https://wiki.openfabrics.org/tiki-index.php?page=SupportedHardware Tziporet From tziporet at dev.mellanox.co.il Sun Apr 1 01:45:24 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 01 Apr 2007 11:45:24 +0300 Subject: [ofa-general] Israel Passover holidays in comming weeks Message-ID: <460F7124.9000604@mellanox.co.il> Hi, Please note that in Israel we do not work on the following days due to Passover vacation: Monday & Tuesday 2-3 April Sunday & Monday 8-9 April Many people in Israel take more vacation days in between too, so we will be less responsive. 
Note that we need to change the regular OFED meeting since we will be on vacation on Monday 9-Apr. Is it possible to reschedule the meeting to Tuesday Apr-10 9am PST? If yes - Jeff please provide a new bridge number. Thanks, Tziporet From ariels at mellanox.co.il Sun Apr 1 02:03:16 2007 From: ariels at mellanox.co.il (Ariel Shachar) Date: Sun, 1 Apr 2007 12:03:16 +0300 Subject: [ofa-general] Help with an MTHCA "catastrophe" In-Reply-To: <029101c76b19$8af42900$0281a8c0@ebpc> Message-ID: <6C2C79E72C305246B504CBA17B5500C9F6FF49@mtlexch01.mtl.com> bug 40567 -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Eric Barton Sent: Tuesday, March 20, 2007 8:00 PM To: general at lists.openfabrics.org Subject: [ofa-general] Help with an MTHCA "catastrophe" The following is console output immediately before a panic on a system running lustre with OFED 1.1. How can I find out what it means? 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: 001d79f4 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0e]: 00000000 2007-02-21 12:02:42 
ib_mthca 0000:07:00.0: buf[0f]: 00000000 ...shortly before it happens, the lustre/lnet OFED driver receives a number of what I believe to be duplicate SEND completion events. It seems quite sporadic, and doesn't appear to track hardware. More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381 Cheers, Eric _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From worldeb at ukr.net Sun Apr 1 02:30:02 2007 From: worldeb at ukr.net (Egor Tur) Date: Sun, 01 Apr 2007 12:30:02 +0300 Subject: [ofa-general] RE: [ewg] ofed and vendors firmware In-Reply-To: <460F6D92.3010409@mellanox.co.il> Message-ID: Hi folks. > Scott Weitzenkamp (sweitzen) wrote: > > Cisco provides HCA firmware and documentation with our driver releases > > in the form of ISO images, at > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux, to anyone who > > registers at cisco.com. To get support from Cisco you must have a Cisco > > support contract. > > > > > You may wish to add this to the OFED Wiki: > https://wiki.openfabrics.org/tiki-index.php?page=SupportedHardware > GREGOIRE Philippe wrote: > > What about Mellanox HCA provided by Cisco, Voltaire and Mellanox ? What about Mellanox HCA provided by HP? In my system I have HCA from HP with 409376-B21 P/N: cat /sys/class/infiniband/mthca0/board_id HP_0060000001 cat /sys/class/infiniband/mthca0/fw_ver 4.7.400 cat /sys/class/infiniband/mthca0/hw_rev a0 cat /sys/class/infiniband/mthca0/hca_type MT25208 (MT23108 compat mode) cat /sys/class/infiniband/mthca0/node_desc HP Lion Cub DDR 128MB Also I see messages from boot modules: ib_mthca : HCA FW version 4.7.400 is old (4.7.600 is current). ib_mthca : If you have problems, try updating your HCA FW. 
uname -rm 2.6.18 x86_64 lspci InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) Thanx. From vlad at lists.openfabrics.org Sun Apr 1 02:36:11 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 1 Apr 2007 02:36:11 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070401-0200 daily build status Message-ID: <20070401093612.19C0CE60821@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on powerpc with 
linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From erezz at voltaire.com Sun Apr 1 03:53:43 2007 From: erezz at voltaire.com (Erez Zilber) Date: Sun, 01 Apr 2007 12:53:43 +0200 Subject: [ofa-general] [PATCH 1/1] IB/iser: do not switch context when notifying the iSCSI layer on a connection failure In-Reply-To: <4607B9BB.80407@voltaire.com> References: <46064813.6070208@voltaire.com> <46064A78.5050005@voltaire.com> <4607B9BB.80407@voltaire.com> Message-ID: <460F8F37.3090204@voltaire.com> Erez Zilber wrote: > Roland, > > Or & I found a bug in this patch. I hope to send a fix for it in the > next few days. Meanwhile, please don't merge it. > Roland, Mike, The following patch replaces the bad patch (iser_conn should not be released while its workqueue is active) that I sent a few days ago. Again, if it's possible, I'd like to have it merged into 2.6.21 (it is a bug fix). When a connection is terminated asynchronously from the iSCSI layer's perspective, iSER needs to notify the iSCSI layer that the connection has failed. This was done using a workqueue (switched to from a tasklet context). The context switch is not required, and everything can be done from the iSER tasklet. 
Signed-off-by: Erez Zilber --- drivers/infiniband/ulp/iser/iscsi_iser.h | 1 - drivers/infiniband/ulp/iser/iser_verbs.c | 40 ++++++++++++------------------ 2 files changed, 16 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index cae8c96..8960196 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -245,7 +245,6 @@ struct iser_conn { wait_queue_head_t wait; /* waitq for conn/disconn */ atomic_t post_recv_buf_count; /* posted rx count */ atomic_t post_send_buf_count; /* posted tx count */ - struct work_struct comperror_work; /* conn term sleepable ctx*/ char name[ISER_OBJECT_NAME_SIZE]; struct iser_page_vec *page_vec; /* represents SG to fmr maps* * maps serialized as tx is*/ diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 693b770..1fc9674 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -48,7 +48,6 @@ #define ISER_MAX_CQ_LEN ((ISER_QP_MAX_R static void iser_cq_tasklet_fn(unsigned long data); static void iser_cq_callback(struct ib_cq *cq, void *cq_context); -static void iser_comp_error_worker(struct work_struct *work); static void iser_cq_event_callback(struct ib_event *cause, void *context) { @@ -480,7 +479,6 @@ int iser_conn_init(struct iser_conn **ib init_waitqueue_head(&ib_conn->wait); atomic_set(&ib_conn->post_recv_buf_count, 0); atomic_set(&ib_conn->post_send_buf_count, 0); - INIT_WORK(&ib_conn->comperror_work, iser_comp_error_worker); INIT_LIST_HEAD(&ib_conn->conn_list); spin_lock_init(&ib_conn->lock); @@ -753,26 +751,6 @@ int iser_post_send(struct iser_desc *tx_ return ret_val; } -static void iser_comp_error_worker(struct work_struct *work) -{ - struct iser_conn *ib_conn = - container_of(work, struct iser_conn, comperror_work); - - /* getting here when the state is UP means that the conn is being * - * terminated 
asynchronously from the iSCSI layer's perspective. */ - if (iser_conn_state_comp_exch(ib_conn, ISER_CONN_UP, - ISER_CONN_TERMINATING)) - iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, - ISCSI_ERR_CONN_FAILED); - - /* complete the termination process if disconnect event was delivered * - * note there are no more non completed posts to the QP */ - if (ib_conn->disc_evt_flag) { - ib_conn->state = ISER_CONN_DOWN; - wake_up_interruptible(&ib_conn->wait); - } -} - static void iser_handle_comp_error(struct iser_desc *desc) { struct iser_dto *dto = &desc->dto; @@ -791,8 +769,22 @@ static void iser_handle_comp_error(struc } if (atomic_read(&ib_conn->post_recv_buf_count) == 0 && - atomic_read(&ib_conn->post_send_buf_count) == 0) - schedule_work(&ib_conn->comperror_work); + atomic_read(&ib_conn->post_send_buf_count) == 0) { + /* getting here when the state is UP means that the conn is * + * being terminated asynchronously from the iSCSI layer's * + * perspective. */ + if (iser_conn_state_comp_exch(ib_conn, ISER_CONN_UP, + ISER_CONN_TERMINATING)) + iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + + /* complete the termination process if disconnect event was delivered * + * note there are no more non completed posts to the QP */ + if (ib_conn->disc_evt_flag) { + ib_conn->state = ISER_CONN_DOWN; + wake_up_interruptible(&ib_conn->wait); + } + } } static void iser_cq_tasklet_fn(unsigned long data) -- 1.4.2 From mst at dev.mellanox.co.il Sun Apr 1 04:33:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 1 Apr 2007 14:33:41 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failures In-Reply-To: References: <20070329130044.GG4253@mellanox.co.il> Message-ID: <20070401113340.GA27371@mellanox.co.il> Unfortunately, I couldn't reproduce the problem with opensm - it's still running after several days of failovers every 5 seconds. 
-- MST From tziporet at dev.mellanox.co.il Sun Apr 1 07:55:15 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 01 Apr 2007 17:55:15 +0300 Subject: [ofa-general] OFED 1.2 RC1 is delayed to Wed April 4 Message-ID: <460FC7D3.7090901@mellanox.co.il> Hi All, OFED 1.2 RC1 is delayed to Wed April 4. We mainly wait for the decision whether to leave IPoIB CM as a default. In addition there is a new MVAPICH package that was placed today so we need more testing before RC1. Tziporet From john.leidel at gmail.com Sun Apr 1 11:07:49 2007 From: john.leidel at gmail.com (John Leidel) Date: Sun, 01 Apr 2007 13:07:49 -0500 Subject: [ofa-general] Silverstorm 10Gigabit Leaf modules Message-ID: <1175450869.8078.290.camel@e521.site> All, is there currently a way to recognize the Silverstorm 10Gigabit leaf switch modules using OFED? cheers john From rdreier at cisco.com Sun Apr 1 11:10:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 01 Apr 2007 11:10:26 -0700 Subject: [ofa-general] Re: [PATCH 1/1] IB/iser: do not switch context when notifying the iSCSI layer on a connection failure In-Reply-To: <460F8F37.3090204@voltaire.com> (Erez Zilber's message of "Sun, 01 Apr 2007 12:53:43 +0200") References: <46064813.6070208@voltaire.com> <46064A78.5050005@voltaire.com> <4607B9BB.80407@voltaire.com> <460F8F37.3090204@voltaire.com> Message-ID: > The following patch replaces the bad patch (iser_conn should not be > released while its workqueue is active) that I sent a few days > ago. Again, if it's possible, I'd like to have it merged into 2.6.21 > (it is a bug fix). We can still merge bug fixes, but I need some understanding of what the bug is and what the severity is. 
The changelog you sent is inadequate, since it makes the change seem like at most an optimization or simplification, and doesn't mention what the bug is at all: > When a connection is terminated asynchronously from the iSCSI > layer's perspective, iSER needs to notify the iSCSI layer that the > connection has failed. This was done using a workqueue (switched to > from a tasklet context). The context switch is not required, and > everything can be done from the iSER tasklet. - R. From mst at dev.mellanox.co.il Sun Apr 1 13:18:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 1 Apr 2007 23:18:02 +0300 Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070401195210.8DEC5E603B8@openfabrics.org> References: <20070401195210.8DEC5E603B8@openfabrics.org> Message-ID: <20070401201802.GB11175@mellanox.co.il> > The low throughput is a major issue, though. Shouldn't the IP multicast > throughput be similar to the UDP unicast throughput? Is the send side a send only member of multicast group, or full member? If it's a full join, HCA creates extra loopback traffic which has then to be discarded, and which might explain performance degradation. -- MST From mst at dev.mellanox.co.il Sun Apr 1 14:16:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Apr 2007 00:16:16 +0300 Subject: [ofa-general] Re: pkey change handling patch (was Re: bugs to fix for OFED 1.2 RC1) In-Reply-To: <20070328093345.GD11695@mellanox.co.il> References: <6a122cc00703220602s7cdad558ud73f72e39f812eaf@mail.gmail.com> <20070322172245.GB17532@mellanox.co.il> <46094DA5.8000601@gmail.com> <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> Message-ID: <20070401211616.GB5072@mellanox.co.il> > > I looked at cache.c and you are right. Maybe we should either > 1. report events after cache has been updated > or > 2. 
make cache queries error out (EBUSY?) if cache has not been updated yet. > > Option 1 requires core changes, option 2 - ULP changes > > I would be inclined to go for 2. > Roland? So, it seems option 2 is a lot of work. Option 1 would mean making cache a special client, and moving notification mechanism into cache itself. This way a client gets an event after cache has already been updated. So maybe that's the best way to handle this? -- MST From tom at opengridcomputing.com Sun Apr 1 15:44:34 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Sun, 01 Apr 2007 17:44:34 -0500 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <20070401064320.GX5436@mellanox.co.il> References: <1175371057.19974.8.camel@trinity.ogc.int> <20070401064320.GX5436@mellanox.co.il> Message-ID: <1175467474.31135.18.camel@trinity.ogc.int> Michael: Thanks for the detailed reply. How about if we added an interface that would treat the SGE counts/WR counts as "requests" and then update the qp_init_attr struct with what was actually created? That would allow the app to request the max, but "settle" for what the device was capable of at the time. On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: > > Quoting Tom Tucker : > > Subject: Incorrect max_sge reported in mthca device query > > > > > > Roland: > > > > I think the max_sge reported by mthca_query_device is off by one. If you > > try to create a QP with the reported max, it fails with -EINVAL. I think > > the reason is that the mthca_alloc_wqe_buf function reserves a slot for > > a "bind request" and this pushes the WQE size over the 496B limit when > > the user requests the max (30) when allocating the QP. > > > > Please let me know if I'm confused about what max_sge really means. > > > > Thanks, > > Tom > > Tom, > max_sge reported by mthca_query_device is the upper bound > for all QP types. I have not tested this, but think you can > create a UD type QP with this number of SGEs. 
> > I'd like to add that there can be no hard guarantee that > creating a QP with a specific set of max_sge/max_wr always > succeeds even if it is within the range of values reported > by mthca_query_device: for example, for userspace QPs, the > system administrator might have limited the amount of > memory that can be locked up by these QPs, and > QP allocation requests with large max_sge/max_wr > values will always fail. There are other examples of this. > Thus, an application that wants to use as large a number of SGEs/WRs as > possible in a robust fashion currently has no other choice except > a trial and error approach, handling failures gracefully. > > Finally, as a side note, it is *also* inefficient to request > allocation of more sge entries than ULP will typically > use - for reasons such as cache utilization, and many others. > How does this overhead trade-off against the need to sometimes post > multiple WRs by ULP will depend both on ULP and the hardware > used. This need to tune the ULP to a specific HCA is annoying, > and might be something that we want to try and solve at > the API level. However, max_sge/max_wr values in query device > are unlikely to be the appropriate API for this. > > One way out could be to extend the API for create_qp and friends, > passing in both min and max values for some parameters, > and allowing the verbs provider to choose the optimal combination > of these. I think I floated a similiar proposal once already, but there > didn't appear to be sufficient user support for such a large API > extension. 
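The trial-and-error approach described in the quoted reply (ask for the largest SGE count, then back off until QP creation succeeds) could look roughly like the sketch below. This is only an illustration under stated assumptions: `qp_create_fn`, `create_qp_with_fallback` and `fake_create` are hypothetical names that do not exist in mthca or the verbs API, and the callback stands in for whatever real QP-creation call the application uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: attempt to create a QP with max_sge entries.
 * Returns 0 on success or a negative errno. A real application would
 * wrap its verbs QP-creation call here. */
typedef int (*qp_create_fn)(void *ctx, int max_sge);

/* Try want_sge first, then step down one SGE at a time until creation
 * succeeds or min_sge has been tried. Returns the SGE count that was
 * accepted, or the negative errno of the last failed attempt. */
static int create_qp_with_fallback(qp_create_fn create, void *ctx,
                                   int want_sge, int min_sge)
{
    int err = -EINVAL;

    assert(create != NULL && min_sge >= 1 && want_sge >= min_sge);

    for (int sge = want_sge; sge >= min_sge; sge--) {
        err = create(ctx, sge);
        if (err == 0)
            return sge;     /* settle for what the device accepted */
    }
    return err;             /* nothing in [min_sge, want_sge] worked */
}

/* Toy stand-in for a device (an assumption for illustration only):
 * accepts any request up to the limit stored in *ctx. */
static int fake_create(void *ctx, int max_sge)
{
    int limit = *(const int *)ctx;

    return max_sge <= limit ? 0 : -EINVAL;
}
```

With the toy device limited to 29 entries, a request for 30 settles on 29, which mirrors the off-by-one case reported at the start of the thread. A real caller would also have to handle failures that are unrelated to the SGE count, such as the locked-memory limits mentioned above, where stepping down by one entry at a time may never succeed.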
> From todd.rimmer at qlogic.com Sun Apr 1 17:42:18 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Sun, 1 Apr 2007 19:42:18 -0500 Subject: [ofa-general] Silverstorm 10Gigabit Leaf modules In-Reply-To: <1175450869.8078.290.camel@e521.site> Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119203C3C@EPEXCH2.qlogic.org> > From: John Leidel > > All, is there currently a way to recognize the Silverstorm 10Gigabit > leaf switch modules using OFED? What do you mean by recognize? The 10Gb SDR IB Leaf modules appear as standard IB switches (as do our 20Gb DDR Leaf modules). The 10 GE Leaf (which also has 20 Gb DDR IB ports), appears as IB TCA and IB switches. The VNIC driver in OFED 1.2 will be able to use the 10 GE ports on the leaf. Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com 
From john.leidel at gmail.com Sun Apr 1 18:16:58 2007 From: john.leidel at gmail.com (John Leidel) Date: Sun, 01 Apr 2007 20:16:58 -0500 Subject: [ofa-general] Silverstorm 10Gigabit Leaf modules In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE06119203C3C@EPEXCH2.qlogic.org> References: <4FB1BCCAE6CAED44A1DC005B1DE06119203C3C@EPEXCH2.qlogic.org> Message-ID: <1175476618.8078.298.camel@e521.site> Ah yes, Todd, I suppose my question should be reformatted as : Are there any software devices in the current OFED [1.0/1.1] stack that support a similar functionality as the VNIC drivers scheduled for OFED 1.2? cheers john On Sun, 2007-04-01 at 19:42 -0500, Todd Rimmer wrote: > > From: John Leidel > > > > All, is there currently a way to recognize the Silverstorm 10Gigabit > > leaf switch modules using OFED? > > What do you mean by recognize? > > The 10Gb SDR IB Leaf modules appear as standard IB switches (as do our > 20Gb DDR Leaf modules). > > The 10 GE Leaf (which also has 20 Gb DDR IB ports), appears as IB TCA > and IB switches. > > The VNIC driver in OFED 1.2 will be able to use the 10 GE ports on the > leaf. > > Todd Rimmer > Chief Architect > QLogic System Interconnect Group > Voice: 610-233-4852 Fax: 610-233-4777 > Todd.Rimmer at QLogic.com www.QLogic.com From mst at dev.mellanox.co.il Sun Apr 1 23:08:16 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 2 Apr 2007 09:08:16 +0300 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <1175467474.31135.18.camel@trinity.ogc.int> References: <1175371057.19974.8.camel@trinity.ogc.int> <20070401064320.GX5436@mellanox.co.il> <1175467474.31135.18.camel@trinity.ogc.int> Message-ID: <20070402060816.GC5072@mellanox.co.il> > On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: > > > Quoting Tom Tucker : > > > Subject: Incorrect max_sge reported in mthca device query > > > > > > > > > Roland: > > > > > > I think the max_sge reported by mthca_query_device is off by one. If you > > > try to create a QP with the reported max, it fails with -EINVAL. I think > > > the reason is that the mthca_alloc_wqe_buf function reserves a slot for > > > a "bind request" and this pushes the WQE size over the 496B limit when > > > the user requests the max (30) when allocating the QP. > > > > > > Please let me know if I'm confused about what max_sge really means. > > > > > > Thanks, > > > Tom > > > > Tom, > > max_sge reported by mthca_query_device is the upper bound > > for all QP types. I have not tested this, but think you can > > create a UD type QP with this number of SGEs. > > > > I'd like to add that there can be no hard guarantee that > > creating a QP with a specific set of max_sge/max_wr always > > succeeds even if it is within the range of values reported > > by mthca_query_device: for example, for userspace QPs, the > > system administrator might have limited the amount of > > memory that can be locked up by these QPs, and > > QP allocation requests with large max_sge/max_wr > > values will always fail. There are other examples of this. > > Thus, an application that wants to use as large a number of SGEs/WRs as > > possible in a robust fashion currently has no other choice except > > a trial and error approach, handling failures gracefully. 
> > > > Finally, as a side note, it is *also* inefficient to request > > allocation of more sge entries than ULP will typically > > use - for reasons such as cache utilization, and many others. > > How does this overhead trade-off against the need to sometimes post > > multiple WRs by ULP will depend both on ULP and the hardware > > used. This need to tune the ULP to a specific HCA is annoying, > > and might be something that we want to try and solve at > > the API level. However, max_sge/max_wr values in query device > > are unlikely to be the appropriate API for this. > > > > One way out could be to extend the API for create_qp and friends, > > passing in both min and max values for some parameters, > > and allowing the verbs provider to choose the optimal combination > > of these. I think I floated a similiar proposal once already, but there > > didn't appear to be sufficient user support for such a large API > > extension. > > > Quoting Tom Tucker : > Subject: Re: Incorrect max_sge reported in mthca device query > > Michael: > > Thanks for the detail reply. > > How about if we added an interface that would treat the SGE counts/WR > counts as "requests" and then update the qp_init_attr struct with what > was actually created? That would allow the app to request the max, but > "settle" for what the device was capable of at the time. I think that if we extend the API, we need to design it carefully to cover as many use cases as possible. Tom, could you explain what are you trying to do? Why does your application need as many SGEs as possible? Also - what about out of resources cases described above? Would you expect the verbs API to retry the request for you? -- MST From mst at dev.mellanox.co.il Sun Apr 1 23:34:33 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 2 Apr 2007 09:34:33 +0300 Subject: [ofa-general] Re: [Bug 450] IPoIB BW drop (measured with iperf) with mtu=1500 on x86 RH4UP3 In-Reply-To: <20070402055559.26332E60811@openfabrics.org> References: <20070402055559.26332E60811@openfabrics.org> Message-ID: <20070402063433.GD5072@mellanox.co.il> > Thus CM -> UD is worse than just UD -> UD, but not too bad. I suspect that, rather than disabling CM at compile-time, you are disabling CM at run-time. Please note that this only affects outgoing but not incoming packets (this is implied by spec, since CM support is signalled by a bit in hardware address). To really test interoperability with datagram mode, install OFED without CM support on one side. -- MST From sweitzen at cisco.com Sun Apr 1 23:34:42 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 1 Apr 2007 23:34:42 -0700 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC1 is delayed to Wed April 4 In-Reply-To: <460FC7D3.7090901@mellanox.co.il> References: <460FC7D3.7090901@mellanox.co.il> Message-ID: I think if we don't make IPoIB CM the default, then 95% of users won't use it and thus won't get the performance benefits it offers, plus it won't get as much real world testing. I feel IPoIB CM is stable enough now to be the default. We also have not fixed all the bugs for RC1. Some /usr cleanup in the April 1 daily build has caused other problems. I would like the following bugs fixed before we release RC1. 
511 P2 jsquyres at cisco.com Open MPI no Open MPI man pages in OFED-1.2-20070401-0849 SLES10
512 P1 jsquyres at cisco.com Open MPI Open MPI compilation with Intel compiler hangs in OFED-1.2-20070401-0849
509 P2 tziporet at mellanox.co.il IPoIB turn on IPoIB CM by default
481 P2 vlad at mellanox.co.il Installer installing OFED in /usr puts files in bad places
474 P1 ishai at mellanox.co.il SRP OFED srp_daemon keeps readding targets with Cisco FC GW
406 P2 eitan at mellanox.co.il utils "double free" abort in ibdaigui
Bugs 511 and 512 are regressions apparently caused by cleanups for bug 481. Bug 481 itself is almost done, we need to rename /usr/uninstall.sh yet. I see work done in the April 1 daily build for bug 474, but the bug is not yet marked resolved. There has been no progress communicated on bug 406, this GUI program simply won't start. Scott > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org > [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren > Sent: Sunday, April 01, 2007 7:55 AM > To: EWG; OPENIB > Subject: [ewg] OFED 1.2 RC1 is delayed to Wed April 4 > > Hi All, > OFED 1.2 RC1 is delayed to Wed April 4. > We mainly wait for the decision whether to leave IPoIB CM as > a default. > > In addition there is a new MVAPICH package that was placed > today so we > need more testing before RC1. 
> > Tziporet > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From cap at nsc.liu.se Mon Apr 2 01:14:46 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Mon, 2 Apr 2007 10:14:46 +0200 Subject: [ofa-general] ofed and vendors firmware In-Reply-To: <460D4696.4070702@pathscale.com> References: <460D2AA6.8000409@cea.fr> <460D4696.4070702@pathscale.com> Message-ID: <200704021014.51049.cap@nsc.liu.se> On Friday 30 March 2007, Robert Walsh wrote: > GREGOIRE Philippe wrote: > > On the Ofed WIKI, one can find only information about firmware > > recommended by Mellanox. > > What about the other HCA vendors ? > > QLogic (formerly PathScale) HCAs are firmwareless. Just a minor point, QLogic has non-infinipath HCAs too (formerly Silverstorm) which do require firmware. /Peter > Regards, > Robert. From mst at dev.mellanox.co.il Mon Apr 2 01:14:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Apr 2007 11:14:50 +0300 Subject: [ofa-general] Re: OFED 1.2 RC1 is delayed to Wed April 4 In-Reply-To: References: <460FC7D3.7090901@mellanox.co.il> Message-ID: <20070402081450.GC24478@mellanox.co.il> > I would like the following bugs fixed before we release RC1. What about bugs 431 and 465? Are you OK with releasing RC1 with these still open? Specifically, I was unable to reproduce 465 so far. Could you reassign to Roland, to have him look into this on-site? > I see work done in the April 1 daily build for bug 474, but the > bug is not yet marked resolved. Ishai's sick currently, but the code to handle it seems to be there. Could you please verify whether the issue's been addressed?
-- MST From vlad at lists.openfabrics.org Mon Apr 2 02:36:10 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 2 Apr 2007 02:36:10 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070402-0200 daily build status Message-ID: <20070402093611.80F59E60815@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with 
linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From bs at q-leap.de Mon Apr 2 02:57:54 2007 From: bs at q-leap.de (Bernd Schubert) Date: Mon, 2 Apr 2007 11:57:54 +0200 Subject: [ofa-general] ipath oops In-Reply-To: <460D464F.2020405@pathscale.com> References: <200703301342.19079.bs@q-leap.de> <460D464F.2020405@pathscale.com> Message-ID: <200704021157.54344.bs@q-leap.de> On Friday 30 March 2007 19:18:07 Robert Walsh wrote: > > Stack traceback for pid 3191 > > 0xffff81007755c100 3191 19 1 3 R 0xffff81007755c3c0 > > *ib_cm/3 rsp rip Function (args) > > 0xffff81007c0839d8 0xffffffff803513d2 __iowrite32_copy+0x2 > > 0xffff81007c083a08 0xffffffff88066161 [ib_ipath]ipath_verbs_send+0x10b > > 0xffff81007c083a68 0xffffffff88061205 [ib_ipath]ipath_do_ruc_send+0x707 > > 0xffff81007c083af8 0xffffffff88061619 [ib_ipath]ipath_post_ruc_send+0x1fd > > 0xffff81007c083b58 0xffffffff88065c39 [ib_ipath]ipath_post_send+0x70 > > 0xffff81007c083b88 0xffffffff88284685 [ko2iblnd]kiblnd_check_sends+0x5c0 > > This looks a lot like an OOPs we saw recently in SDP. Are you using > dma_map_single or related functions? If so, is the memory you're > mapping going through the ib_dma_* interface? On Mellanox hardware, > these are all just pass-throughs to the real dma_map_* functions, but on > ipath hardware we intercept the calls to set up mapping tables. Without > this, we won't work. > > Look in rdma/ib_verbs.h to see the list of functions that are > intercepted. Search for ib_dma and ib_sg. > > Let me know what you see. Here is a list of the calls in the lustre code that are intercepted by ipath.
o2iblnd.c: rx->rx_msgaddr = dma_map_single(cmid->device->dma_device, rx->rx_msg, IBLND_MSG_SIZE, DMA_FROM_DEVICE);
o2iblnd.c: tx->tx_msgaddr = dma_map_single(kiblnd_data.kib_cmid->device->dma_device, tx->tx_msg, IBLND_MSG_SIZE, DMA_TO_DEVICE);
o2iblnd.c: dma_unmap_single(conn->ibc_cmid->device->dma_device, pci_unmap_addr(rx, rx_msgunmap), IBLND_MSG_SIZE, DMA_FROM_DEVICE);
o2iblnd.c: dma_unmap_single(kiblnd_data.kib_cmid->device->dma_device, pci_unmap_addr(tx, tx_msgunmap), IBLND_MSG_SIZE, DMA_TO_DEVICE);
o2iblnd_cb.c: rd->rd_nfrags = dma_map_sg(kiblnd_data.kib_cmid->device->dma_device, tx->tx_frags, tx->tx_nfrags, tx->tx_dmadir);
o2iblnd_cb.c: dma_unmap_sg(kiblnd_data.kib_cmid->device->dma_device, tx->tx_frags, tx->tx_nfrags, tx->tx_dmadir);
o2iblnd_cb.c: rd->rd_frags[i].rf_addr = sg_dma_address(&tx->tx_frags[i]);
o2iblnd_cb.c: rd->rd_frags[i].rf_nob = sg_dma_len(&tx->tx_frags[i]);

So, how to proceed now? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH From olivier.cozette at seanodes.com Mon Apr 2 03:06:22 2007 From: olivier.cozette at seanodes.com (Olivier Cozette) Date: Mon, 2 Apr 2007 12:06:22 +0200 Subject: [ofa-general] Help with an MTHCA "catastrophe" In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9F6FF49@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9F6FF49@mtlexch01.mtl.com> Message-ID: <200704021206.22499.olivier.cozette@seanodes.com> Hello, I have the same problem with my program that uses libibverbs (SRQ + remote write) on an MT25204 (InfiniHost III Lx HCA rev a0, firmware 1.2.000) on a 30-node cluster. In this environment we reboot some nodes regularly (for testing); with an MT23108 (InfiniHost rev a1, firmware 3.4.0) we only get a regular error in the work completion in this case, and only the connections to dead/rebooted nodes break. Note that the HCA reset in the latest OFED is not a solution for us, because we don't want to break connections with working nodes! Do you know of any workaround?
Best regards, Olivier ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error ib_mthca 0000:0c:00.0: buf[00]: 0012f6f8 ib_mthca 0000:0c:00.0: buf[01]: 00000000 ib_mthca 0000:0c:00.0: buf[02]: 00000000 ib_mthca 0000:0c:00.0: buf[03]: 00000000 ib_mthca 0000:0c:00.0: buf[04]: 00000000 ib_mthca 0000:0c:00.0: buf[05]: 0012f6dc ib_mthca 0000:0c:00.0: buf[06]: 0018753c ib_mthca 0000:0c:00.0: buf[07]: 00000000 ib_mthca 0000:0c:00.0: buf[08]: 00000000 ib_mthca 0000:0c:00.0: buf[09]: 00000000 ib_mthca 0000:0c:00.0: buf[0a]: 00000000 ib_mthca 0000:0c:00.0: buf[0b]: 00000000 ib_mthca 0000:0c:00.0: buf[0c]: 00000000 ib_mthca 0000:0c:00.0: buf[0d]: 00000000 ib_mthca 0000:0c:00.0: buf[0e]: 00000000 ib_mthca 0000:0c:00.0: buf[0f]: 00000000 On Sunday 1 April 2007 11:03, Ariel Shachar wrote: > bug 40567 > > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Eric Barton > Sent: Tuesday, March 20, 2007 8:00 PM > To: general at lists.openfabrics.org > Subject: [ofa-general] Help with an MTHCA "catastrophe" > > > > The following is console output immediately before a panic on a system > running lustre with OFED 1.1. How can I find out what it > means?
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: > internal error > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: 001d79f4 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0e]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0f]: 00000000 > > ...shortly before it happens, the lustre/lnet OFED driver receives a > number of what I believe to be duplicate SEND completion > events. It seems quite sporadic, and doesn't appear to track hardware. 
> > More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381 > > Cheers, > Eric > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From philippe.gregoire at cea.fr Mon Apr 2 05:27:07 2007 From: philippe.gregoire at cea.fr (GREGOIRE Philippe) Date: Mon, 02 Apr 2007 14:27:07 +0200 Subject: [ofa-general] ofed and vendors firmware In-Reply-To: <200704021014.51049.cap@nsc.liu.se> References: <460D2AA6.8000409@cea.fr> <460D4696.4070702@pathscale.com> <200704021014.51049.cap@nsc.liu.se> Message-ID: <4610F69B.40305@cea.fr> Peter Kjellstrom wrote: > On Friday 30 March 2007, Robert Walsh wrote: > >> GREGOIRE Philippe wrote: >> >>> On the Ofed WIKI, one can find only information about firmware >>> recommended by Mellanox. >>> What about the other HCA vendors ? >>> >> QLogic (formerly PathScale) HCAs are firmwareless. >> > > Just a minor point, QLogic has non-infinipath HCAs too (formerly Silverstorm) > which do require firmware. > > /Peter > > >> Regards, >> Robert. >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general Yes, I forgot to mention Silverstorm, although we are using Silverstorm HCAs on site. I have the same problem with the Silverstorm HCAs. So what about Silverstorm?
From weikuan.yu at gmail.com Mon Apr 2 06:45:54 2007 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Mon, 02 Apr 2007 09:45:54 -0400 Subject: [ofa-general] HotI 2007 Call for Papers -- Deadline (April 9) is approaching Message-ID: <46110912.4050402@gmail.com> -------------------------------------------------------------------- Apologies if you received multiple copies of this posting. Please feel free to distribute it to those who might be interested. -------------------------------------------------------------------- Hot Interconnects 15 IEEE Symposium on High-Performance Interconnects August 22-24, 2007 Stanford University Palo Alto, California, USA Hot Interconnects is the premier international forum for researchers and developers of state-of-the-art hardware and software architectures and implementations for interconnection networks of all scales, ranging from on-chip processor-memory interconnects to wide-area networks. This yearly conference is very well attended by leaders in industry and academia. The atmosphere provides for a wealth of opportunities to interact with individuals at the forefront of this field. Themes include cross-cutting issues spanning computer systems, networking technologies, and communication protocols. This conference is directed particularly at new and exciting technology and product innovations in these areas. Contributions should focus on real experimental systems, prototypes, or leading-edge products and their performance evaluation. In addition to those subscribing to the main theme of the conference, contributions are also solicited in the topics listed below.
* Novel and innovative interconnect architectures * Multi-core processor interconnects * System-on-Chip Interconnects * Advanced chip-to-chip communication technologies * Optical interconnects * Protocol and interfaces for interprocessor communication * Survivability and fault-tolerance of interconnects * High-speed packet processing engines and network processors * System and storage area network architectures and protocols * High-performance host-network interface architectures * High-bandwidth and low-latency I/O * Tb/s switching and routing technologies * Innovative architectures for supporting collective communication * Novel communication architectures to support grid computing Submission Guideline o Extended deadline: April 9th, 2007 o Notification of acceptance: May 15, 2007 o Papers need sufficient technical detail to judge quality and suitability for presentation. o Submit title, author, abstract, and full paper (six pages, double-column, IEEE format). o Papers should be submitted electronically at the specified link location found on http://www.hoti.org o For further information please see http://www.hoti.org/hoti15/cfp.html About the Conference - Conference held at the William Hewlett Teaching Center at Stanford University. - Papers selected will be published in proceedings by the IEEE Computer Society. - Presentations are 30-minute talks in a single-track format. - Online information at http://www.hoti.org GENERAL CO-CHAIRS * John W. Lockwood, Washington University in St. Louis * Fabrizio Petrini, Pacific Northwest National Laboratory TECHNICAL CO-CHAIRS * Ron Brightwell, Sandia National Laboratories * Dhabaleswar (DK) Panda, The Ohio State University LOCAL ARRANGEMENTS CHAIR * Songkrant Muneenaem, Washington University in St. 
Louis PANEL CHAIR * Daniel Pitt, Santa Clara University PUBLICITY CO-CHAIRS * Weikuan Yu, Oak Ridge National Laboratory PUBLICATION CHAIR * Luca Valcarenghi, Scuola Superiore Sant'Anna FINANCE CHAIR * Herzel Ashkenazi, Xilinx TUTORIAL CO-CHAIRS - TBA REGISTRATION CHAIR * Songkrant Muneenaem, Washington University in St. Louis Webmaster * Liz Rogers, LRD Group Steering Committee o Allen Baum, Intel o Lily Jow, Hewlett Packard o Mark Laubach, Broadband Physics o John Lockwood, Stanford University o Daniel Pitt, Santa Clara University Technical Program Committee * Dennis Abts Cray, Inc. * Adnan Aziz University of Texas, Austin * Alan Benner IBM * Keren Bergman Columbia University * Andrea Bianco Politecnico di Torino * Piero Castoldi Scuola Superiore Sant'Anna * Sarang Dharmapurikar Nuova Systems * Hans Eberle Sun Microsystems Laboratories * Wu-chun Feng Virginia Tech * Juan Fernandez University of Murcia * Ada Gavrilovska Georgia Institute of Technology * Paolo Giaccone Politecnico di Torino * Mitchell Gusat IBM Zurich Research Laboratory * Ron Ho Sun Microsystems Laboratories * Doan Hoang University of Technology, Sydney * D. N. (Jay) Jayasimha Intel * Isaac Keslassy Technion * Venkata Krishnan Dolphin Interconnect Solutions * Tal Lavian Nortel Networks Labs, UC Berkeley * Bill Lin University of California, San Diego * Olav Lysne Simula Research Laboratory * Pankaj Mehra HP Labs * Rami Melhem University of Pittsburgh * Cyriel Minkenberg IBM Zurich Research Laboratory * Gregory Pfister IBM * Craig Stunkel IBM T.J. Watson Research Center * Anujan Varma University of California at Santa Cruz * Zuoguo (Joe) Wu Intel From bob.kossey at hp.com Mon Apr 2 07:03:57 2007 From: bob.kossey at hp.com (Bob Kossey) Date: Mon, 02 Apr 2007 10:03:57 -0400 Subject: [ofa-general] RE: [ewg] ofed and vendors firmware Message-ID: <46110D4D.9070705@hp.com> Hi Egor, > What about Mellanox HCA provided by HP? 
> > In my system I have HCA from HP with 409376-B21 P/N: > cat /sys/class/infiniband/mthca0/board_id > HP_0060000001 This is a Rev C HCA sourced from Voltaire. You can get the latest firmware for it by registering at http://www.voltaire.com/SupportAndServices/Drivers and then navigating to the download site: /versions/current/Firmware/HCA/HCA400EX-(PCI-EXP-REVC)/4.8.2/HCA400Ex-rC-25208-4_8_2.img Or contact your HP support rep. Bob From parks at lanl.gov Mon Apr 2 07:03:06 2007 From: parks at lanl.gov (Parks Fields) Date: Mon, 02 Apr 2007 08:03:06 -0600 Subject: [ofa-general] OFED 1.2 RC1 is delayed to Wed April 4 In-Reply-To: <460FC7D3.7090901@mellanox.co.il> References: <460FC7D3.7090901@mellanox.co.il> Message-ID: <7.0.1.0.2.20070402080247.027fcf30@lanl.gov> At 08:55 AM 4/1/2007, Tziporet Koren wrote: >Hi All, >OFED 1.2 RC1 is delayed to Wed April 4. >We mainly wait for the decision whether to leave IPoIB CM as a default. I vote for it being the default. >In addition there is a new MVAPICH package that was placed today so >we need more testing before RC1. > >Tziporet >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From robert.j.woodruff at intel.com Mon Apr 2 07:15:40 2007 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 2 Apr 2007 07:15:40 -0700 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC1 is delayed to Wed April 4 In-Reply-To: <460FC7D3.7090901@mellanox.co.il> Message-ID: Tziporet wrote, >We mainly wait for the decision whether to leave IPoIB CM as a default. 
In my testing so far, I have not seen any issues with IPoIB CM, so I think it should be OK to be enabled by default, as long as there is a way to disable it in case people see issues later. woody -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Sunday, April 01, 2007 7:55 AM To: EWG; OPENIB Subject: [ewg] OFED 1.2 RC1 is delayed to Wed April 4 Hi All, OFED 1.2 RC1 is delayed to Wed April 4. We mainly wait for the decision whether to leave IPoIB CM as a default. In addition there is a new MVAPICH package that was placed today so we need more testing before RC1. Tziporet _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From halr at voltaire.com Mon Apr 2 08:03:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Apr 2007 11:03:32 -0400 Subject: [ofa-general] Re: [PATCH] IB/core: Enhance SMI for switchsupport In-Reply-To: References: <1174949633.4372.3731.camel@hal.voltaire.com> <1175194018.4379.79820.camel@hal.voltaire.com> <012e01c7722f$dd855280$1914a8c0@surioffice> <1175197857.4379.83765.camel@hal.voltaire.com> <012f01c77235$4a3d1a20$1914a8c0@surioffice> <1175199133.4379.85124.camel@hal.voltaire.com> Message-ID: <1175526211.4436.15430.camel@localhost.localdomain> On Thu, 2007-03-29 at 14:22, Roland Dreier wrote: > > I see what you are referring to now. That's true for the other routines > > but unfortunately not this one. > > OK, that makes the current status even more confusing. > > > Not sure what the one set of names would be: > > discard != local and process != send > > > > Two sets of names (enums) could do it though. > > Yes, if the two return values have distinct semantics then they should > be using separate enums to indicate that. 
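
[Illustration] The point about two return values with distinct semantics can be sketched in user-space C. This is only a rough model under stated assumptions, not the kernel code: `struct demo_smp`, the `DEMO_`-prefixed names, and the simplified hop rules are all hypothetical and were chosen purely to show the idea of one enum per question.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few SMP fields this sketch looks at. */
struct demo_smp {
    uint8_t hop_ptr;
    uint8_t hop_cnt;
    int dlid_permissive;    /* nonzero if the destination LID is permissive */
};

/* Question 1: is the SMP kept or dropped? */
enum demo_smi_action {
    DEMO_SMI_DISCARD,
    DEMO_SMI_HANDLE
};

/* Question 2: where does a kept SMP go? A separate enum, because
 * "local" does not mean "discard" and "send" does not mean "handle". */
enum demo_smi_forward_action {
    DEMO_SMI_LOCAL,         /* complete it up the stack */
    DEMO_SMI_SEND           /* forward it to the send queue */
};

/* Simplified, assumed rule: a hop pointer past hop_cnt + 1 is unreasonable. */
static enum demo_smi_action demo_check_hop(const struct demo_smp *smp)
{
    return smp->hop_ptr <= smp->hop_cnt + 1 ? DEMO_SMI_HANDLE
                                            : DEMO_SMI_DISCARD;
}

/* Simplified, assumed rule: intermediate hops are forwarded; at the end of
 * the path the SMP is forwarded only if the destination LID is permissive. */
static enum demo_smi_forward_action
demo_check_forward(const struct demo_smp *smp)
{
    if (smp->hop_ptr && smp->hop_ptr < smp->hop_cnt)
        return DEMO_SMI_SEND;
    if (smp->hop_ptr == smp->hop_cnt)
        return smp->dlid_permissive ? DEMO_SMI_SEND : DEMO_SMI_LOCAL;
    return DEMO_SMI_LOCAL;
}
```

A caller would then compare each result against a name from the matching enum, for example `demo_check_hop(&smp) == DEMO_SMI_DISCARD`, instead of testing for 0 or 1. The gain is mainly readability, since C itself does not prevent the two enum types from being mixed.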
> > > If this is what is to be done then it should be 2 patches with the first > > preserving the current CA/router only support with the enums and the > > second adding in switch SMI. > > Please, let's do this now, since we're in the area. If we don't clean > up the code now it will slip down the priority list again and probably > never get done. Sure but this is a background ("midnight") project and will take me a few days to make sure nothing is broken in the changes even though they appear to be straightforward to me. -- Hal > - R. From halr at voltaire.com Mon Apr 2 08:04:16 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Apr 2007 11:04:16 -0400 Subject: [ofa-general] [PATCH] IB/mad: Change SMI to use enums rather than magic return codes Message-ID: <1175526255.4436.15432.camel@localhost.localdomain> IB/mad: Change SMI to use enums rather than magic return codes to try to make code clearer Tested with Tavor. Would be nice to get testing on this with other Mellanox HCAs, iPath, and eHCA. Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 13efd41..6edfecf 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include #include @@ -668,7 +667,7 @@ static void build_smp_wc(struct ib_qp *q static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr) { - int ret; + int ret = 0; struct ib_smp *smp = mad_send_wr->send_buf.mad; unsigned long flags; struct ib_mad_local_private *local; @@ -688,14 +687,15 @@ static int handle_outgoing_dr_smp(struct */ if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == IB_LID_PERMISSIVE && - !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + smi_handle_dr_smp_send(smp, device->node_type, port_num) == + IB_SMI_DISCARD) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; } + /* Check to post send on QP or process locally */ - ret = smi_check_local_smp(smp, device); - if (!ret) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -1874,18 +1874,22 @@ static void ib_mad_recv_done_handler(str if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) + if (smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt) == + IB_SMI_DISCARD) goto out; - if (!smi_check_forward_dr_smp(&recv->mad.smp)) + + if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL) goto local; - if (!smi_handle_dr_smp_send(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num)) + + if (smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num) == IB_SMI_DISCARD) goto out; - if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) + + if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) goto out; } diff --git a/drivers/infiniband/core/smi.c 
b/drivers/infiniband/core/smi.c index 54b81e1..63a2356 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -3,7 +3,7 @@ * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -34,7 +34,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ */ #include @@ -44,9 +43,8 @@ * Fixup a directed route SMP for sending * Return 0 if the SMP should be discarded */ -int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +enum smi_type smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, int port_num) { u8 hop_ptr, hop_cnt; @@ -59,18 +57,18 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_cnt && hop_ptr == 0) { smp->hop_ptr++; return (smp->initial_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; /* smp->return_path set when received */ smp->hop_ptr++; return (smp->initial_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-9:3 -- We're at the end of the DR segment of path */ @@ -78,29 +76,30 @@ int smi_handle_dr_smp_send(struct ib_smp /* smp->return_path set when received */ smp->hop_ptr++; return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); + smp->dr_dlid == IB_LID_PERMISSIVE ? 
+ IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ /* C14-9:5 -- Fail unreasonable hop pointer */ - return (hop_ptr == hop_cnt + 1); + return (hop_ptr == hop_cnt + 1 ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } else { /* C14-13:1 */ if (hop_cnt && hop_ptr == hop_cnt + 1) { smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:3 -- at the end of the DR segment of path */ @@ -108,15 +107,16 @@ int smi_handle_dr_smp_send(struct ib_smp smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_slid == IB_LID_PERMISSIVE); + smp->dr_slid == IB_LID_PERMISSIVE ? 
+ IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ if (hop_ptr == 0) - return 1; + return IB_SMI_DONT_DISCARD; /* C14-13:5 -- Check for unreasonable hop pointer */ - return 0; + return IB_SMI_DISCARD; } } @@ -124,10 +124,8 @@ int smi_handle_dr_smp_send(struct ib_smp * Adjust information for a received SMP * Return 0 if the SMP should be dropped */ -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +enum smi_type smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, + int port_num, int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -138,16 +136,17 @@ int smi_handle_dr_smp_recv(struct ib_smp if (!ib_get_smp_direction(smp)) { /* C14-9:1 -- sender should have incremented hop_ptr */ if (hop_cnt && hop_ptr == 0) - return 0; + return IB_SMI_DISCARD; /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt ? + IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-9:3 -- We're at the end of the DR segment of path */ @@ -157,12 +156,13 @@ int smi_handle_dr_smp_recv(struct ib_smp /* smp->hop_ptr updated when sending */ return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); + smp->dr_dlid == IB_LID_PERMISSIVE ? + IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ /* C14-9:5 -- fail unreasonable hop pointer */ - return (hop_ptr == hop_cnt + 1); + return (hop_ptr == hop_cnt + 1 ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } else { @@ -170,16 +170,17 @@ int smi_handle_dr_smp_recv(struct ib_smp if (hop_cnt && hop_ptr == hop_cnt + 1) { smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? 
IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; /* smp->hop_ptr updated when sending */ - return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + return (smp->return_path[hop_ptr-1] <= phys_port_cnt ? + IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:3 -- We're at the end of the DR segment of path */ @@ -187,23 +188,20 @@ int smi_handle_dr_smp_recv(struct ib_smp if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ smp->hop_ptr--; - return 1; + return IB_SMI_DONT_DISCARD; } /* smp->hop_ptr updated when sending */ - return (node_type == RDMA_NODE_IB_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH ? + IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ /* C14-13:5 -- Check for unreasonable hop pointer */ - return (hop_ptr == 0); + return (hop_ptr == 0 ? IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } } -/* - * Return 1 if the received DR SMP should be forwarded to the send queue - * Return 0 if the SMP should be completed up the stack - */ -int smi_check_forward_dr_smp(struct ib_smp *smp) +enum smi_forward_type smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -213,23 +211,25 @@ int smi_check_forward_dr_smp(struct ib_s if (!ib_get_smp_direction(smp)) { /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) - return 1; + return IB_SMI_SEND; /* C14-9:3 -- at the end of the DR segment of path */ if (hop_ptr == hop_cnt) - return (smp->dr_dlid == IB_LID_PERMISSIVE); + return (smp->dr_dlid == IB_LID_PERMISSIVE ? 
+ IB_SMI_SEND : IB_SMI_LOCAL); /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ if (hop_ptr == hop_cnt + 1) - return 1; + return IB_SMI_SEND; } else { - /* C14-13:2 */ + /* C14-13:2 -- intermediate hop */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return 1; + return IB_SMI_SEND; /* C14-13:3 -- at the end of the DR segment of path */ if (hop_ptr == 1) - return (smp->dr_slid != IB_LID_PERMISSIVE); + return (smp->dr_slid != IB_LID_PERMISSIVE ? + IB_SMI_SEND : IB_SMI_LOCAL); } - return 0; + return IB_SMI_LOCAL; } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 3011bfd..c266ceb 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -3,7 +3,7 @@ * Copyright (c) 2004 Infinicon Corporation. All rights reserved. * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. - * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -33,7 +33,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: smi.h 1389 2004-12-27 22:56:47Z roland $ */ #ifndef __SMI_H_ @@ -41,26 +40,33 @@ #include -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt); -extern int smi_check_forward_dr_smp(struct ib_smp *smp); -extern int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num); +enum smi_type { + IB_SMI_DISCARD, + IB_SMI_DONT_DISCARD +}; + +enum smi_forward_type { + IB_SMI_LOCAL, /* SMP should be completed up the stack */ + IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ +}; + +enum smi_type smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, + int port_num, int phys_port_cnt); +extern enum smi_forward_type smi_check_forward_dr_smp(struct ib_smp *smp); +extern enum smi_type smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, int port_num); /* * Return 1 if the SMP should be handled by the local SMA/SM via process_mad */ -static inline int smi_check_local_smp(struct ib_smp *smp, - struct ib_device *device) +static inline enum smi_type smi_check_local_smp(struct ib_smp *smp, + struct ib_device *device) { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ return ((device->process_mad && !ib_get_smp_direction(smp) && - (smp->hop_ptr == smp->hop_cnt + 1))); + (smp->hop_ptr == smp->hop_cnt + 1)) ? + IB_SMI_DONT_DISCARD : IB_SMI_DISCARD); } - #endif /* __SMI_H_ */ From halr at voltaire.com Mon Apr 2 08:24:07 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Apr 2007 11:24:07 -0400 Subject: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums rather than magic return codes Message-ID: <1175527446.4436.16721.camel@localhost.localdomain> IB/mad: Change SMI to use enums rather than magic return codes to try to make code clearer (Difference from v1 is just that a name for an enum changed). Tested with Tavor. Would be nice to get testing on this with other Mellanox HCAs, iPath, and eHCA. 
Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 13efd41..6edfecf 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include #include @@ -668,7 +667,7 @@ static void build_smp_wc(struct ib_qp *q static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr) { - int ret; + int ret = 0; struct ib_smp *smp = mad_send_wr->send_buf.mad; unsigned long flags; struct ib_mad_local_private *local; @@ -688,14 +687,15 @@ static int handle_outgoing_dr_smp(struct */ if ((ib_get_smp_direction(smp) ? 
smp->dr_dlid : smp->dr_slid) == IB_LID_PERMISSIVE && - !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + smi_handle_dr_smp_send(smp, device->node_type, port_num) == + IB_SMI_DISCARD) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; } + /* Check to post send on QP or process locally */ - ret = smi_check_local_smp(smp, device); - if (!ret) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -1874,18 +1874,22 @@ static void ib_mad_recv_done_handler(str if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (!smi_handle_dr_smp_recv(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num, - port_priv->device->phys_port_cnt)) + if (smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt) == + IB_SMI_DISCARD) goto out; - if (!smi_check_forward_dr_smp(&recv->mad.smp)) + + if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL) goto local; - if (!smi_handle_dr_smp_send(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num)) + + if (smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num) == IB_SMI_DISCARD) goto out; - if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) + + if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) goto out; } diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 54b81e1..3ffc09d 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -3,7 +3,7 @@ * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. 
* Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -34,7 +34,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ */ #include @@ -44,9 +43,8 @@ * Fixup a directed route SMP for sending * Return 0 if the SMP should be discarded */ -int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num) +enum smi_type smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, int port_num) { u8 hop_ptr, hop_cnt; @@ -59,18 +57,18 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_cnt && hop_ptr == 0) { smp->hop_ptr++; return (smp->initial_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; /* smp->return_path set when received */ smp->hop_ptr++; return (smp->initial_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-9:3 -- We're at the end of the DR segment of path */ @@ -78,29 +76,30 @@ int smi_handle_dr_smp_send(struct ib_smp /* smp->return_path set when received */ smp->hop_ptr++; return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); + smp->dr_dlid == IB_LID_PERMISSIVE ? + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ /* C14-9:5 -- Fail unreasonable hop pointer */ - return (hop_ptr == hop_cnt + 1); + return (hop_ptr == hop_cnt + 1 ? IB_SMI_HANDLE : IB_SMI_DISCARD); } else { /* C14-13:1 */ if (hop_cnt && hop_ptr == hop_cnt + 1) { smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:3 -- at the end of the DR segment of path */ @@ -108,15 +107,16 @@ int smi_handle_dr_smp_send(struct ib_smp smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_slid == IB_LID_PERMISSIVE); + smp->dr_slid == IB_LID_PERMISSIVE ? + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ if (hop_ptr == 0) - return 1; + return IB_SMI_HANDLE; /* C14-13:5 -- Check for unreasonable hop pointer */ - return 0; + return IB_SMI_DISCARD; } } @@ -124,10 +124,8 @@ int smi_handle_dr_smp_send(struct ib_smp * Adjust information for a received SMP * Return 0 if the SMP should be dropped */ -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt) +enum smi_type smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, + int port_num, int phys_port_cnt) { u8 hop_ptr, hop_cnt; @@ -138,16 +136,17 @@ int smi_handle_dr_smp_recv(struct ib_smp if (!ib_get_smp_direction(smp)) { /* C14-9:1 -- sender should have incremented hop_ptr */ if (hop_cnt && hop_ptr == 0) - return 0; + return IB_SMI_DISCARD; /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt ? 
+ IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-9:3 -- We're at the end of the DR segment of path */ @@ -157,12 +156,13 @@ int smi_handle_dr_smp_recv(struct ib_smp /* smp->hop_ptr updated when sending */ return (node_type == RDMA_NODE_IB_SWITCH || - smp->dr_dlid == IB_LID_PERMISSIVE); + smp->dr_dlid == IB_LID_PERMISSIVE ? + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ /* C14-9:5 -- fail unreasonable hop pointer */ - return (hop_ptr == hop_cnt + 1); + return (hop_ptr == hop_cnt + 1 ? IB_SMI_HANDLE : IB_SMI_DISCARD); } else { @@ -170,16 +170,17 @@ int smi_handle_dr_smp_recv(struct ib_smp if (hop_cnt && hop_ptr == hop_cnt + 1) { smp->hop_ptr--; return (smp->return_path[smp->hop_ptr] == - port_num); + port_num ? IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { if (node_type != RDMA_NODE_IB_SWITCH) - return 0; + return IB_SMI_DISCARD; /* smp->hop_ptr updated when sending */ - return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + return (smp->return_path[hop_ptr-1] <= phys_port_cnt ? + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:3 -- We're at the end of the DR segment of path */ @@ -187,23 +188,20 @@ int smi_handle_dr_smp_recv(struct ib_smp if (smp->dr_slid == IB_LID_PERMISSIVE) { /* giving SMP to SM - update hop_ptr */ smp->hop_ptr--; - return 1; + return IB_SMI_HANDLE; } /* smp->hop_ptr updated when sending */ - return (node_type == RDMA_NODE_IB_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH ? + IB_SMI_HANDLE: IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ /* C14-13:5 -- Check for unreasonable hop pointer */ - return (hop_ptr == 0); + return (hop_ptr == 0 ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); } } -/* - * Return 1 if the received DR SMP should be forwarded to the send queue - * Return 0 if the SMP should be completed up the stack - */ -int smi_check_forward_dr_smp(struct ib_smp *smp) +enum smi_forward_type smi_check_forward_dr_smp(struct ib_smp *smp) { u8 hop_ptr, hop_cnt; @@ -213,23 +211,25 @@ int smi_check_forward_dr_smp(struct ib_s if (!ib_get_smp_direction(smp)) { /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) - return 1; + return IB_SMI_SEND; /* C14-9:3 -- at the end of the DR segment of path */ if (hop_ptr == hop_cnt) - return (smp->dr_dlid == IB_LID_PERMISSIVE); + return (smp->dr_dlid == IB_LID_PERMISSIVE ? + IB_SMI_SEND : IB_SMI_LOCAL); /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ if (hop_ptr == hop_cnt + 1) - return 1; + return IB_SMI_SEND; } else { - /* C14-13:2 */ + /* C14-13:2 -- intermediate hop */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return 1; + return IB_SMI_SEND; /* C14-13:3 -- at the end of the DR segment of path */ if (hop_ptr == 1) - return (smp->dr_slid != IB_LID_PERMISSIVE); + return (smp->dr_slid != IB_LID_PERMISSIVE ? + IB_SMI_SEND : IB_SMI_LOCAL); } - return 0; + return IB_SMI_LOCAL; } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 3011bfd..3bd13ee 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -3,7 +3,7 @@ * Copyright (c) 2004 Infinicon Corporation. All rights reserved. * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. - * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -33,7 +33,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: smi.h 1389 2004-12-27 22:56:47Z roland $ */ #ifndef __SMI_H_ @@ -41,26 +40,33 @@ #include -int smi_handle_dr_smp_recv(struct ib_smp *smp, - u8 node_type, - int port_num, - int phys_port_cnt); -extern int smi_check_forward_dr_smp(struct ib_smp *smp); -extern int smi_handle_dr_smp_send(struct ib_smp *smp, - u8 node_type, - int port_num); +enum smi_type { + IB_SMI_DISCARD, + IB_SMI_HANDLE +}; + +enum smi_forward_type { + IB_SMI_LOCAL, /* SMP should be completed up the stack */ + IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ +}; + +enum smi_type smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, + int port_num, int phys_port_cnt); +extern enum smi_forward_type smi_check_forward_dr_smp(struct ib_smp *smp); +extern enum smi_type smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, int port_num); /* * Return 1 if the SMP should be handled by the local SMA/SM via process_mad */ -static inline int smi_check_local_smp(struct ib_smp *smp, - struct ib_device *device) +static inline enum smi_type smi_check_local_smp(struct ib_smp *smp, + struct ib_device *device) { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ return ((device->process_mad && !ib_get_smp_direction(smp) && - (smp->hop_ptr == smp->hop_cnt + 1))); + (smp->hop_ptr == smp->hop_cnt + 1)) ? + IB_SMI_HANDLE : IB_SMI_DISCARD); } - #endif /* __SMI_H_ */ From scarter at ornl.gov Mon Apr 2 09:08:30 2007 From: scarter at ornl.gov (Steven Carter) Date: Mon, 02 Apr 2007 12:08:30 -0400 Subject: [ofa-general] Opensm dies with updn specified in opensm.opts Message-ID: <46112A7E.5060004@ornl.gov> OpenSM (from OFED 1.1) runs when '-R updn' is specified on the command line for up/down routing, but seg faults when it is specified in opensm.opts. # Start with clean opensm.opts: [root at bruiser osm]# rm /var/cache/osm/opensm.opts rm: remove regular file `/var/cache/osm/opensm.opts'? 
y [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -c ------------------------------------------------- OpenSM Rev:openib-2.0.5 Based on OpenIB svn Exported revision Command Line Arguments: Caching command line options Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision Using default GUID 0x8f1040397886d Entering MASTER state SUBNET UP OpenSM: Got signal 2 - exiting... Exiting SM # Runs fine with clean opensm.opts: [root at bruiser osm]# /opt/ofed-1.1/bin/opensm ------------------------------------------------- OpenSM Rev:openib-2.0.5 Based on OpenIB svn Exported revision Using Cached Option:guid = 0x0008f1040397886d Using Cached Option:log_flags = 3 Command Line Arguments: Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision Entering MASTER state SUBNET UP OpenSM: Got signal 2 - exiting... Exiting SM # Specify up/down routing and write out to opensm.opts: [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -R updn -c ------------------------------------------------- OpenSM Rev:openib-2.0.5 Based on OpenIB svn Exported revision Using Cached Option:guid = 0x0008f1040397886d Using Cached Option:log_flags = 3 Command Line Arguments: Activate 'updn' routing engine Caching command line options Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision Entering MASTER state SUBNET UP OpenSM: Got signal 2 - exiting... Exiting SM # And it dies: [root at bruiser osm]# /opt/ofed-1.1/bin/opensm ------------------------------------------------- OpenSM Rev:openib-2.0.5 Based on OpenIB svn Exported revision Using Cached Option:guid = 0x0008f1040397886d Segmentation fault # The routing is the only difference: [root at bruiser osm]# diff opensm.opts.updn opensm.opts.good 103,105d102 < # Routing engine < routing_engine updn < Thanks, Steven. 
From halr at voltaire.com Mon Apr 2 09:23:08 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Apr 2007 12:23:08 -0400 Subject: [ofa-general] Opensm dies with updn specified in opensm.opts In-Reply-To: <46112A7E.5060004@ornl.gov> References: <46112A7E.5060004@ornl.gov> Message-ID: <1175530987.4436.20283.camel@localhost.localdomain> On Mon, 2007-04-02 at 12:08, Steven Carter wrote: > OpenSM (from OFED 1.1) runs when '-R updn' is specified on the command > line for up/down routing, but seg faults when it is specified in > opensm.opts. Can you try the same thing with OFED 1.2 ? > # Start with clean opensm.opts: > > [root at bruiser osm]# rm /var/cache/osm/opensm.opts > rm: remove regular file `/var/cache/osm/opensm.opts'? y > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -c > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Command Line Arguments: > Caching command line options > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Using default GUID 0x8f1040397886d > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... > Exiting SM > > # Runs fine with clean opensm.opts: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Using Cached Option:log_flags = 3 > Command Line Arguments: > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... 
> Exiting SM > > # Specify up/down routing and write out to opensm.opts: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -R updn -c > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Using Cached Option:log_flags = 3 > Command Line Arguments: > Activate 'updn' routing engine > Caching command line options > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... > Exiting SM > > # And it dies: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Segmentation fault Can you find out where it seg faults ? > # The routing is the only difference: > > [root at bruiser osm]# diff opensm.opts.updn opensm.opts.good > 103,105d102 > < # Routing engine > < routing_engine updn > < Is there a root nodes guid file being used ? -- Hal > > Thanks, > > Steven. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Apr 2 09:45:16 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Apr 2007 12:45:16 -0400 Subject: [ofa-general] [PATCH][MINOR] IB/umad: Fix declaration of dev_map Message-ID: <1175532311.4436.21673.camel@localhost.localdomain> IB/umad: Fix declaration of dev_map Pointed-out-by: Roland Dreier Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index c069ebe..2ce3eea 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -135,7 +135,7 @@ static const dev_t base_dev = MKDEV(IB_U static DEFINE_SPINLOCK(port_lock); static struct ib_umad_port *umad_port[IB_UMAD_MAX_PORTS]; -static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS * 2); +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); static void ib_umad_add_one(struct ib_device *device); static void ib_umad_remove_one(struct ib_device *device); From sweitzen at cisco.com Mon Apr 2 10:01:59 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 2 Apr 2007 10:01:59 -0700 Subject: [ofa-general] RE: OFED 1.2 RC1 is delayed to Wed April 4 In-Reply-To: <20070402081450.GC24478@mellanox.co.il> References: <460FC7D3.7090901@mellanox.co.il> <20070402081450.GC24478@mellanox.co.il> Message-ID: > > I would like the following bugs fixed before we release RC1. > > What about bugs 431 and 465? Are you OK with releasing RC1 > with these still open? I would have preferred 431 get fixed by now, but at this point I don't think we should hold RC1 for it. > Specifically, I was unable to reproduce 465 so far. Could > you reassign to > Roland, to have him look into this on-site? I'm still looking into 465. 
Scott From michaelc at cs.wisc.edu Mon Apr 2 10:11:00 2007 From: michaelc at cs.wisc.edu (Mike Christie) Date: Mon, 02 Apr 2007 12:11:00 -0500 Subject: [ofa-general] Re: [PATCH 1/1] IB/iser: do not switch context when notifying the iSCSI layer on a connection failure In-Reply-To: <460F8F37.3090204@voltaire.com> References: <46064813.6070208@voltaire.com> <46064A78.5050005@voltaire.com> <4607B9BB.80407@voltaire.com> <460F8F37.3090204@voltaire.com> Message-ID: <46113924.2060402@cs.wisc.edu> Erez Zilber wrote: > Erez Zilber wrote: >> Roland, >> >> Or & I found a bug in this patch. I hope to send a fix for it in the >> next few days. Meanwhile, please don't merge it. >> > > Roland, Mike, > > The following patch replaces the bad patch (iser_conn should not be released while its workqueue is active) that I sent a few days ago. Again, if it's possible, I'd like to have it merged into 2.6.21 (it is a bug fix). > I will leave that to Roland since he probably has other patches to send. > > When a connection is terminated asynchronously from the iSCSI layer's perspective, > iSER needs to notify the iSCSI layer that the connection has failed. This was > done using a workqueue (switched to from a tasklet context). The context switch is > not required, and everything can be done from the iSER tasklet. > > Signed-off-by: Erez Zilber iscsi api bits look ok. Signed-off-by: Mike Christie From rjwalsh at pathscale.com Mon Apr 2 12:12:23 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 02 Apr 2007 12:12:23 -0700 Subject: [ofa-general] ipath oops In-Reply-To: <200704021157.54344.bs@q-leap.de> References: <200703301342.19079.bs@q-leap.de> <460D464F.2020405@pathscale.com> <200704021157.54344.bs@q-leap.de> Message-ID: <46115597.6080204@pathscale.com> > Here is a list of calls in the lustre code intercepted by ipath. Just a clarification: as they currently stand in your code, these will NOT be intercepted by ipath, and that's most likely the source of your OOPs. 
> o2iblnd.c:
>         rx->rx_msgaddr = dma_map_single(cmid->device->dma_device,
>                                         rx->rx_msg,
>                                         IBLND_MSG_SIZE,
>                                         DMA_FROM_DEVICE);
>
> o2iblnd.c:
>         tx->tx_msgaddr = dma_map_single(
>                 kiblnd_data.kib_cmid->device->dma_device,
>                 tx->tx_msg, IBLND_MSG_SIZE, DMA_TO_DEVICE);
>
>
> o2iblnd.c:
>         dma_unmap_single(conn->ibc_cmid->device->dma_device,
>                          pci_unmap_addr(rx, rx_msgunmap),
>                          IBLND_MSG_SIZE, DMA_FROM_DEVICE);
> o2iblnd.c:
>         dma_unmap_single(kiblnd_data.kib_cmid->device->dma_device,
>                          pci_unmap_addr(tx, tx_msgunmap),
>                          IBLND_MSG_SIZE, DMA_TO_DEVICE);
>
>
> o2iblnd_cb.c:
>         rd->rd_nfrags = dma_map_sg(kiblnd_data.kib_cmid->device->dma_device,
>                                    tx->tx_frags, tx->tx_nfrags,tx->tx_dmadir);
>
>
> o2iblnd_cb.c:
>         dma_unmap_sg(kiblnd_data.kib_cmid->device->dma_device,
>                      tx->tx_frags, tx->tx_nfrags, tx->tx_dmadir);
>
> o2iblnd_cb.c:
>         rd->rd_frags[i].rf_addr = sg_dma_address(&tx->tx_frags[i]);
>
> o2iblnd_cb.c:
>         rd->rd_frags[i].rf_nob = sg_dma_len(&tx->tx_frags[i]);
>
>
>
> So, how to proceed now?

These need to be replaced with calls to ib_dma_map_single,
ib_dma_unmap_single, ib_dma_map_sg, ib_dma_unmap_sg, etc. Note that
these calls typically take a struct ib_device * as the first argument
instead of a struct device *. Other than that, the API is pretty much
identical. Here's the complete list:

    ib_dma_mapping_error
    ib_dma_map_single
    ib_dma_unmap_single
    ib_dma_map_page
    ib_dma_unmap_page
    ib_dma_map_sg
    ib_dma_unmap_sg
    ib_sg_dma_address
    ib_sg_dma_len
    ib_dma_sync_single_for_cpu
    ib_dma_sync_single_for_device
    ib_dma_alloc_coherent
    ib_dma_free_coherent

Look in rdma/ib_verbs.h for prototypes.

Let me know if you need any more assistance with this.

Regards,
 Robert.
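[Editorial sketch, not part of the original thread] The point about the
plain dma_* calls not being intercepted can be illustrated with a small
user-space mock. It is not kernel code. It assumes that each ib_dma_*
wrapper in rdma/ib_verbs.h consults a per-device ops table and falls
back to the generic DMA call only when the driver supplies none, and
that a driver such as ipath hands the CPU address back unchanged. Every
mock_* name and MOCK_BUS_OFFSET is hypothetical and exists only for this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_ib_device;

/* Hypothetical stand-in for a driver-supplied table of DMA operations. */
struct mock_dma_ops {
	uint64_t (*map_single)(struct mock_ib_device *dev,
			       void *cpu_addr, size_t size);
};

/* Hypothetical stand-in for struct ib_device; dma_ops is NULL when the
 * driver does not override the generic DMA mapping. */
struct mock_ib_device {
	const struct mock_dma_ops *dma_ops;
};

/* Made-up offset so a "bus address" is visibly different from a CPU address. */
#define MOCK_BUS_OFFSET 0x1000u

/* Stand-in for the generic dma_map_single(): returns a fake bus address. */
uint64_t mock_generic_map_single(void *cpu_addr, size_t size)
{
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr + MOCK_BUS_OFFSET;
}

/* Stand-in for a driver override that (by assumption) returns the CPU
 * address itself instead of a bus address. */
static uint64_t mock_virt_map_single(struct mock_ib_device *dev,
				     void *cpu_addr, size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr;
}

const struct mock_dma_ops mock_virt_ops = {
	.map_single = mock_virt_map_single,
};

/* Mirrors the assumed shape of ib_dma_map_single(): use the driver's
 * override when present, otherwise fall back to the generic mapping. */
uint64_t mock_ib_dma_map_single(struct mock_ib_device *dev,
				void *cpu_addr, size_t size)
{
	if (dev->dma_ops)
		return dev->dma_ops->map_single(dev, cpu_addr, size);
	return mock_generic_map_single(cpu_addr, size);
}

/* Self-check: calling the generic mapping directly for a device with an
 * override yields a different address than going through the wrapper. */
void mock_self_check(void)
{
	char buf[16];
	struct mock_ib_device virt = { .dma_ops = &mock_virt_ops };

	assert(mock_ib_dma_map_single(&virt, buf, sizeof buf) !=
	       mock_generic_map_single(buf, sizeof buf));
}
```

In this mock, a caller that goes straight to mock_generic_map_single()
for the device with an ops table gets an address the "driver" never
produced, which is the same kind of mismatch described above for the
o2iblnd call sites. Calling mock_self_check() from a small main()
exercises that case.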
From sean.hefty at intel.com Mon Apr 2 12:49:56 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Apr 2007 12:49:56 -0700
Subject: [ofa-general] madeye kernel oops
In-Reply-To: <1175009006.14461.0.camel@Ami-desktop>
Message-ID: <000101c77560$170f5720$e598070a@amr.corp.intel.com>

Can you see if this patch fixes your problem? (I'm not sure how I never hit
this before.)

- Sean

---
IB/madeye: Fix array subscript out of range error.

Signed-off-by: Sean Hefty

diff --git a/drivers/infiniband/util/madeye/madeye.c b/drivers/infiniband/util/madeye/madeye.c
index f3d02d1..1b2c384 100644
--- a/drivers/infiniband/util/madeye/madeye.c
+++ b/drivers/infiniband/util/madeye/madeye.c
@@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
 		goto out;
 
 	reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		port[i].smi_agent = ib_register_mad_snoop(device, i,
 							  IB_QPT_SMI,
 							  reg_flags,
@@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		if (!IS_ERR(port[i].smi_agent))
 			ib_unregister_mad_agent(port[i].smi_agent);
 		if (!IS_ERR(port[i].gsi_agent))

From halr at voltaire.com Mon Apr 2 12:49:58 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Apr 2007 15:49:58 -0400
Subject: [ofa-general] [PATCHv2] IB/core/user_mad.c: Add support for issmdisabled
Message-ID: <1175543397.4436.32980.camel@localhost.localdomain>

IB/core/user_mad.c: Add support for issmdisabled

(v2 combines the ib_umad_smcap_open/close routines for issm and issmdisabled)

Signed-off-by: Hal Rosenstock

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index c069ebe..a689c5c 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004 Topspin Communications. All rights reserved.
- * Copyright (c) 2005 Voltaire, Inc.
All rights reserved. + * Copyright (c) 2005-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include @@ -92,6 +91,8 @@ struct ib_umad_port { struct cdev *sm_dev; struct class_device *sm_class_dev; + struct cdev *smdis_dev; + struct class_device *smdis_class_dev; struct semaphore sm_sem; struct rw_semaphore mutex; @@ -782,16 +783,14 @@ static const struct file_operations umad .release = ib_umad_close }; -static int ib_umad_sm_open(struct inode *inode, struct file *filp) +static int ib_umad_smcap_open(struct file *filp, unsigned portnum, + struct ib_port_modify *props) { struct ib_umad_port *port; - struct ib_port_modify props = { - .set_port_cap_mask = IB_PORT_SM - }; int ret; spin_lock(&port_lock); - port = umad_port[iminor(inode) - IB_UMAD_MINOR_BASE - IB_UMAD_MAX_PORTS]; + port = umad_port[portnum]; if (port) kref_get(&port->umad_dev->ref); spin_unlock(&port_lock); @@ -811,7 +810,7 @@ static int ib_umad_sm_open(struct inode } } - ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); + ret = ib_modify_port(port->ib_dev, port->port_num, 0, props); if (ret) { up(&port->sm_sem); goto fail; @@ -826,17 +825,14 @@ fail: return ret; } -static int ib_umad_sm_close(struct inode *inode, struct file *filp) +static int ib_umad_smcap_close(struct ib_umad_port *port, + struct ib_port_modify *props) { - struct ib_umad_port *port = filp->private_data; - struct ib_port_modify props = { - .clr_port_cap_mask = IB_PORT_SM - }; int ret = 0; down_write(&port->mutex); if (port->ib_dev) - ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); + ret = ib_modify_port(port->ib_dev, port->port_num, 0, props); up_write(&port->mutex); up(&port->sm_sem); @@ -846,12 +842,56 @@ static int 
ib_umad_sm_close(struct inode return ret; } +static int ib_umad_sm_open(struct inode *inode, struct file *filp) +{ + struct ib_port_modify props = { + .set_port_cap_mask = IB_PORT_SM + }; + + return ib_umad_smcap_open(filp, iminor(inode) - IB_UMAD_MINOR_BASE - IB_UMAD_MAX_PORTS, &props); +} + +static int ib_umad_sm_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = filp->private_data; + struct ib_port_modify props = { + .clr_port_cap_mask = IB_PORT_SM + }; + + return ib_umad_smcap_close(port, &props); +} + static const struct file_operations umad_sm_fops = { .owner = THIS_MODULE, .open = ib_umad_sm_open, .release = ib_umad_sm_close }; +static int ib_umad_smdis_open(struct inode *inode, struct file *filp) +{ + struct ib_port_modify props = { + .set_port_cap_mask = IB_PORT_SM_DISABLED + }; + + return ib_umad_smcap_open(filp, iminor(inode) - IB_UMAD_MINOR_BASE - 2*IB_UMAD_MAX_PORTS, &props); +} + +static int ib_umad_smdis_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = filp->private_data; + struct ib_port_modify props = { + .clr_port_cap_mask = IB_PORT_SM_DISABLED + }; + + return ib_umad_smcap_close(port, &props); +} + +static const struct file_operations umad_smdis_fops = { + .owner = THIS_MODULE, + .open = ib_umad_smdis_open, + .release = ib_umad_smdis_close +}; + static struct ib_client umad_client = { .name = "umad", .add = ib_umad_add_one, @@ -947,12 +987,41 @@ static int ib_umad_init_port(struct ib_d if (class_device_create_file(port->sm_class_dev, &class_device_attr_port)) goto err_sm_class; + port->smdis_dev = cdev_alloc(); + if (!port->smdis_dev) + goto err_sm_class; + port->smdis_dev->owner = THIS_MODULE; + port->smdis_dev->ops = &umad_smdis_fops; + kobject_set_name(&port->smdis_dev->kobj, "issmdisabled%d", port->dev_num); + if (cdev_add(port->smdis_dev, base_dev + port->dev_num + 2*IB_UMAD_MAX_PORTS, 1)) + goto err_smdis_cdev; + + port->smdis_class_dev = class_device_create(umad_class, NULL, 
port->smdis_dev->dev,
+						     device->dma_device,
+						     "issmdisabled%d",
+						     port->dev_num);
+	if (IS_ERR(port->smdis_class_dev))
+		goto err_smdis_cdev;
+
+	class_set_devdata(port->smdis_class_dev, port);
+
+	if (class_device_create_file(port->smdis_class_dev, &class_device_attr_ibdev))
+		goto err_smdis_class;
+	if (class_device_create_file(port->smdis_class_dev, &class_device_attr_port))
+		goto err_smdis_class;
+
 	spin_lock(&port_lock);
 	umad_port[port->dev_num] = port;
 	spin_unlock(&port_lock);
 
 	return 0;
 
+err_smdis_class:
+	class_device_destroy(umad_class, port->smdis_dev->dev);
+
+err_smdis_cdev:
+	cdev_del(port->smdis_dev);
+
 err_sm_class:
 	class_device_destroy(umad_class, port->sm_dev->dev);
 
@@ -979,9 +1048,11 @@ static void ib_umad_kill_port(struct ib_
 
 	class_device_destroy(umad_class, port->dev->dev);
 	class_device_destroy(umad_class, port->sm_dev->dev);
+	class_device_destroy(umad_class, port->smdis_dev->dev);
 
 	cdev_del(port->dev);
 	cdev_del(port->sm_dev);
+	cdev_del(port->smdis_dev);
 
 	spin_lock(&port_lock);
 	umad_port[port->dev_num] = NULL;

From adit.262 at gmail.com Mon Apr 2 15:47:13 2007
From: adit.262 at gmail.com (Adit Ranadive)
Date: Mon, 2 Apr 2007 18:47:13 -0400
Subject: [ofa-general] Loading Infiniband modules for Xen Guests
Message-ID:

Hi,

Has anyone tried installing the IB modules in Xen guest domains?
I'm using the Xen source tree located here:
http://xenbits.xensource.com/ext/xen-smartio.hg

I have a Dell PowerEdge 1850 server with RHEL 4 installed.
I'm trying to load the ib_gmthca module in my Xen guest (FC4 install) and
it gives me the following error:

modprobe ib_gmthca
invalid host machine -1
[drivers/infiniband/hw/gmthca/gmthca_main.c:350],<1>Fail to setup hca, return with EFAULT
FATAL: Error inserting ib_gmthca (/lib/modules/2.6.16-rc3-xenU/kernel/drivers/infiniband/hw/gmthca/ib_gmthca.ko): Bad address

The guest domain must be able to detect the IB ports. Is there a way to do
that? I'm guessing that would still require the HCA driver to be loaded?
BTW I can load all the other modules (ib_core, ib_ucm, ib_uverbs, etc.) in the guest My xen guest config is as follows : kernel = "/boot/vmlinuz-2.6.16-rc3-xenU" memory = 256 vif = ['bridge=xenbr0'] disk = [ 'file:/root/osimages/fedora.img,sda1,w' ] root = "/dev/sda1 ro" extra = "4" Any ideas on how the mthca module is loaded in the guest? Adit -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From adit.262 at gmail.com Mon Apr 2 15:56:23 2007 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 2 Apr 2007 18:56:23 -0400 Subject: [ofa-general] Loading Infiniband modules for Xen Guests In-Reply-To: References: Message-ID: Hi, I dont understand why would I need to specify the IP address of the dom0? Ive seen the source for the ib_gmthca module which takes the serv_port, mc, domain as module parameters? So what do i need to specify for these in order to load the gmthca module in xen guest? A more generic question could be that actually how does the guest hca module work? Using kernel sockets? Thanks, Adit On 4/2/07, wei huang wrote: > You probably have the IP address of dom0 wrong? > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > > -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From huanwei at cse.ohio-state.edu Mon Apr 2 16:33:16 2007 From: huanwei at cse.ohio-state.edu (wei huang) Date: Mon, 2 Apr 2007 19:33:16 -0400 (EDT) Subject: [ofa-general] Loading Infiniband modules for Xen Guests In-Reply-To: Message-ID: Hi, I think for the prototype implementation in smartio hg, the guest domain is setup connections to dom0 through kernel socket. Though the ideal case is using xenbus, xenbus was not quite there when we wrote the prototype. I believe it is harded coded in the file. You should really need to change it to take input parameters for easier use. Thanks. 
Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Mon, 2 Apr 2007, Adit Ranadive wrote: > Hi, > > I dont understand why would I need to specify the IP address of the dom0? > Ive seen the source for the ib_gmthca module which takes the > serv_port, mc, domain as module parameters? > So what do i need to specify for these in order to load the gmthca > module in xen guest? > > A more generic question could be that actually how does the guest hca > module work? Using kernel sockets? > > Thanks, > Adit > > On 4/2/07, wei huang wrote: > > You probably have the IP address of dom0 wrong? > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > > > > > > -- > Adit Ranadive > MS CS Candidate > Georgia Institute of Technology, > Atlanta, GA > From adit.262 at gmail.com Mon Apr 2 16:46:23 2007 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 2 Apr 2007 19:46:23 -0400 Subject: [ofa-general] Loading Infiniband modules for Xen Guests In-Reply-To: References: Message-ID: So, from what I gather from the gmthca_main.c : serv_port is a socket port on dom0 to which the domU hca will connect to mc : index to the serv_name array which should indicate the dom0 IP address domain : is this the domain id for domU? So really the domU needs networking setup to have the HCA module loading into the domU? Thanks, Adit On 4/2/07, wei huang wrote: > Hi, > > I think for the prototype implementation in smartio hg, the guest domain > is setup connections to dom0 through kernel socket. Though the ideal case > is using xenbus, xenbus was not quite there when we wrote the prototype. > > I believe it is harded coded in the file. You should really need to change > it to take input parameters for easier use. > > Thanks. 
> > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Mon, 2 Apr 2007, Adit Ranadive wrote: > > > Hi, > > > > I dont understand why would I need to specify the IP address of the dom0? > > Ive seen the source for the ib_gmthca module which takes the > > serv_port, mc, domain as module parameters? > > So what do i need to specify for these in order to load the gmthca > > module in xen guest? > > > > A more generic question could be that actually how does the guest hca > > module work? Using kernel sockets? > > > > Thanks, > > Adit > > > > On 4/2/07, wei huang wrote: > > > You probably have the IP address of dom0 wrong? > > > > > > Regards, > > > Wei Huang > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > Dept. of Computer Science and Engineering > > > Ohio State University > > > OH 43210 > > > Tel: (614)292-8501 > > > > > > > > > > > > > > > > > > -- > > Adit Ranadive > > MS CS Candidate > > Georgia Institute of Technology, > > Atlanta, GA > > > > -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From huanwei at cse.ohio-state.edu Mon Apr 2 17:15:52 2007 From: huanwei at cse.ohio-state.edu (wei huang) Date: Mon, 2 Apr 2007 20:15:52 -0400 (EDT) Subject: [ofa-general] Loading Infiniband modules for Xen Guests In-Reply-To: Message-ID: Yes, you are right. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Mon, 2 Apr 2007, Adit Ranadive wrote: > So, from what I gather from the gmthca_main.c : > serv_port is a socket port on dom0 to which the domU hca will connect to > mc : index to the serv_name array which should indicate the dom0 IP address > domain : is this the domain id for domU? > > So really the domU needs networking setup to have the HCA module > loading into the domU? 
> > Thanks, > Adit > > On 4/2/07, wei huang wrote: > > Hi, > > > > I think for the prototype implementation in smartio hg, the guest domain > > is setup connections to dom0 through kernel socket. Though the ideal case > > is using xenbus, xenbus was not quite there when we wrote the prototype. > > > > I believe it is harded coded in the file. You should really need to change > > it to take input parameters for easier use. > > > > Thanks. > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > On Mon, 2 Apr 2007, Adit Ranadive wrote: > > > > > Hi, > > > > > > I dont understand why would I need to specify the IP address of the dom0? > > > Ive seen the source for the ib_gmthca module which takes the > > > serv_port, mc, domain as module parameters? > > > So what do i need to specify for these in order to load the gmthca > > > module in xen guest? > > > > > > A more generic question could be that actually how does the guest hca > > > module work? Using kernel sockets? > > > > > > Thanks, > > > Adit > > > > > > On 4/2/07, wei huang wrote: > > > > You probably have the IP address of dom0 wrong? > > > > > > > > Regards, > > > > Wei Huang > > > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > > Dept. of Computer Science and Engineering > > > > Ohio State University > > > > OH 43210 > > > > Tel: (614)292-8501 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Adit Ranadive > > > MS CS Candidate > > > Georgia Institute of Technology, > > > Atlanta, GA > > > > > > > > > > -- > Adit Ranadive > MS CS Candidate > Georgia Institute of Technology, > Atlanta, GA > From maggycharis at trinitytouch.com Mon Apr 2 19:14:17 2007 From: maggycharis at trinitytouch.com (fawne lulita) Date: Tue, 3 Apr 2007 11:14:17 +0900 Subject: [ofa-general] Damian Message-ID: then takes a step back, to be safe as she reaches. 
From sweitzen at cisco.com Tue Apr 3 01:07:45 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 3 Apr 2007 01:07:45 -0700 Subject: [ofa-general] bugs to fix for OFED 1.2 RC1 Message-ID: bug_id priority assigned_to component short_desc 509 P2 tziporet at mellanox.co.il IPoIB turn on IPoIB CM by default 406 P2 eitan at mellanox.co.il utils "double free" abort in ibdaigui 459 P2 monis at voltaire.com IPoIB support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 474 P1 ishai at mellanox.co.il SRP OFED srp_daemon keeps readding targets with Cisco FC GW I reopened bug 459, the RPM name for ib-bonding (as produced by "rpm -q ib-bonding") is not unique per kernel, making it hard to install for multiple kernels on a machine. For bug 474, I added more info, I'm still seeing failures. Scott -------------- next part -------------- An HTML attachment was scrubbed...
URL: From vlad at lists.openfabrics.org Tue Apr 3 02:35:19 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 3 Apr 2007 02:35:19 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070403-0200 daily build status Message-ID: <20070403093519.AF672E6081A@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed 
on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From bs at q-leap.de Tue Apr 3 04:40:55 2007 From: bs at q-leap.de (Bernd Schubert) Date: Tue, 3 Apr 2007 13:40:55 +0200 Subject: [ofa-general] ipath oops In-Reply-To: <46115597.6080204@pathscale.com> References: <200703301342.19079.bs@q-leap.de> <200704021157.54344.bs@q-leap.de> <46115597.6080204@pathscale.com> Message-ID: <200704031340.55542.bs@q-leap.de> On Monday 02 April 2007 21:12:23 Robert Walsh wrote: > > Here is a list of calls in the lustre code intercepted by ipath. > > Just a clarification: as they currently stand in your code, these will > NOT be intercepted by ipath, and that's most likely the source of your > OOPs. Ah, now I understand your mail from Friday. > > So, how to proceed now? > > These need to be replaced with calls to ib_dma_map_single, > ib_dma_unmap_single, ib_dma_map_sg, ib_dma_unmap_sg, etc. Note that > these calls typically take a struct ib_device * as the first argument > instead of a struct device *. Other than that, the API is pretty much > identical. Here's the complete list: Thanks a lot, doing so fixed the oops! Many thanks again, Bernd -- Bernd Schubert Q-Leap Networks GmbH
From kilian at stanford.edu Tue Apr 3 11:22:52 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Tue, 3 Apr 2007 11:22:52 -0700 Subject: [ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES Message-ID: <200704031122.52838.kilian@stanford.edu> Hi all, I'm not sure if that's the right place to ask, but I encounter a few issues trying to run Linpack runs on an Infiniband cluster using OFED 1.1 I'm using Intel MPI 3.0 to compile and run HPL, and upon execution, I got the following error messages: ******************************************************************************* $ mpiexec -n 16 -env I_MPI_DEVICE rdssm ./xhpl ============================================================================ HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004 Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK ============================================================================ An explanation of the input/output parameters follows: T/V : Wall time / encoded variant. N : The order of the coefficient matrix A. NB : The partitioning blocking factor. P : The number of process rows. Q : The number of process columns. Time : Time in seconds to solve the linear system. Gflops : Rate of execution for solving the linear system. The following parameter values will be used: N : 40000 NB : 112 PMAP : Row-major process mapping P : 4 Q : 4 PFACT : Left NBMIN : 4 NDIV : 2 RFACT : Crout BCAST : 1ring DEPTH : 0 SWAP : Mix (threshold = 256) L1 : no-transposed form U : no-transposed form EQUIL : no ALIGN : 8 double precision words ---------------------------------------------------------------------------- - The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed: 1) ||Ax-b||_oo / ( eps * ||A||_1 * N ) 2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) 3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) - The relative machine precision (eps) is taken to be 1.110223e-16 - Computational tests pass if scaled residuals are less than 16.0 register failed 196608 [0] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: register failed 196608 [4] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: register failed 196608 [12] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: register failed 196608 [8] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: rank 4 in job 11 node-9-1_42298 caused collective abort of all ranks exit status of rank 4: killed by signal 9 ******************************************************************************* The same "register failed 196608 [0] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:" error occurs when using rdma as I_MPI_DEVICE. I don't even know if it's really an OpenIB issue or more a MPI issue, but the basic multi-nodes MPI tests I ran ('Hello world' based) seemed to work fine. I would really appreciate any hint on this issue, Thanks a lot, -- Kilian From ardavis at ichips.intel.com Tue Apr 3 11:30:15 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 03 Apr 2007 11:30:15 -0700 Subject: [ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES In-Reply-To: <200704031122.52838.kilian@stanford.edu> References: <200704031122.52838.kilian@stanford.edu> Message-ID: <46129D37.3020002@ichips.intel.com> Kilian CAVALOTTI wrote: >register failed 196608 [4] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: > >register failed 196608 [12] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: > >register failed 196608 [8] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES: > > > > This error is typically a result of ulimit -l (max locked memory) being set too low. 
-arlin From kilian at stanford.edu Tue Apr 3 11:38:00 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Tue, 3 Apr 2007 11:38:00 -0700 Subject: [ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES In-Reply-To: <46129D37.3020002@ichips.intel.com> References: <200704031122.52838.kilian@stanford.edu> <46129D37.3020002@ichips.intel.com> Message-ID: <200704031138.00862.kilian@stanford.edu> On Tuesday 03 April 2007 11:30:15 am Arlin Davis wrote: > Kilian CAVALOTTI wrote: > >register failed 196608 [4] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > > >register failed 196608 [12] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > > >register failed 196608 [8] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > This error is typically a result of ulimit -l (max locked memory) being > set too low. That's was exactly it. :) Thanks a lot, -- Kilian From halr at voltaire.com Tue Apr 3 12:32:11 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2007 15:32:11 -0400 Subject: [ofa-general] [PATCH][MINOR] OpenSM/osm_subnet.c: Use system defined limits for pathname length Message-ID: <1175628731.4436.123014.camel@localhost.localdomain> OpenSM/osm_subnet.c: Use system defined limits for pathname length Pointed out by: Jeff Squyres Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 746fbd1..b147860 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -51,6 +51,7 @@ #include #include +#include #include #include #include @@ -66,6 +67,14 @@ #include #include +#if defined(PATH_MAX) +#define OSM_PATH_MAX (PATH_MAX + 1) +#elif defined (_POSIX_PATH_MAX) +#define OSM_PATH_MAX (_POSIX_PATH_MAX + 1) +#else +#define OSM_PATH_MAX 256 +#endif + /********************************************************************** **********************************************************************/ void @@ -737,7 +746,7 @@ osm_subn_rescan_conf_file( IN osm_subn_opt_t* const p_opts ) { 
char *p_cache_dir = getenv("OSM_CACHE_DIR"); - char file_name[256]; + char file_name[OSM_PATH_MAX]; FILE *opts_file; char line[1024]; char *p_key, *p_val ,*p_last; @@ -832,7 +841,7 @@ osm_subn_parse_conf_file( IN osm_subn_opt_t* const p_opts ) { char *p_cache_dir = getenv("OSM_CACHE_DIR"); - char file_name[256]; + char file_name[OSM_PATH_MAX]; FILE *opts_file; char line[1024]; char *p_key, *p_val ,*p_last; @@ -1103,7 +1112,7 @@ osm_subn_write_conf_file( IN osm_subn_opt_t* const p_opts ) { char *p_cache_dir = getenv("OSM_CACHE_DIR"); - char file_name[256]; + char file_name[OSM_PATH_MAX]; FILE *opts_file; /* try to open the options file from the cache dir */ From yong.qin at qlogic.com Tue Apr 3 12:42:40 2007 From: yong.qin at qlogic.com (Yong Qin) Date: Tue, 3 Apr 2007 14:42:40 -0500 Subject: [ofa-general] uDAPL question In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1D522D5@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C1D522D5@mtiexch01.mti.com> Message-ID: <120DDDEC0C4AA045B93BDB63FE45905E207884@EPEXCH1.qlogic.org> Is there any progress on this issue? We are seeing exactly the same error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and wondering if there is a fix. Thanks, Yong -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Monday, March 12, 2007 11:28 AM To: Woodruff, Robert J; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question Hi Woody, Thanks for your help. I guess the problem is in the CM - is it ? Can you point me to relevant communication/bug reports that explain the fix for this issue ? Would Sean be the right person to ask regarding what exact patch should be added/removed ? I would prefer to stick to OFED-1.1 code with minimal changes - if possible - to avoid compatibility issues. 
Thanks, Boris -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Monday, March 12, 2007 8:24 AM To: Boris Shpolyansky; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question This is a known problem and should be fixed by now, There was a bad patch that somehow got into OFED that was not in Sean main tree. Assuming this bad patch has been removed, the problem should be fixed. woody ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Friday, March 09, 2007 8:40 PM To: general at lists.openfabrics.org Subject: [ofa-general] uDAPL question Hi, I'm trying to get simple Intel MPI benchmark running over IB (uDAPL) using OFED-1.1 stack. I'm consistently getting the following error: [root at ibd005 ~]# ./runjob_I_MPI.boris 2 Task 0 of 2 tasks started on host ibd005.ibd.mti.com clock_resolution = 1.00e-06 s Task 1 of 2 tasks started on host ibd006.ibd.mti.com [0:ibd005] unexpected DAPL event 4006 from 1:ibd006 [1:ibd006] unexpected DAPL event 4006 from 0:ibd005 rank 0 in job 14 ibd005_36193 caused collective abort of all ranks exit status of rank 0: return code 254 I did some digging and found out that event 4006 (actually 0x4006) means DAT_CONNECTION_EVENT_BROKEN and it is returned by function dat_rmr_bind. So my question is why this function consistently fails. 
I'm using standard dat.conf file: OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" Appreciate your help, Boris Shpolyansky _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From arlin.r.davis at intel.com Tue Apr 3 14:12:24 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 3 Apr 2007 14:12:24 -0700 Subject: [ofa-general] [PATCH] uDAPL: Bug 497, dat.conf support for multiple interfaces, IPoIB failover Message-ID: <000001c77634$c6eebc60$9f97070a@amr.corp.intel.com> Add support for multiple IB devices in dat.conf to support IPoIB HA failover. Applied to master and ofed_1_2 branches. Signed-off by: Arlin Davis ardavis at ichips.intel.com diff --git a/doc/dat.conf b/doc/dat.conf index 19f4521..53d319f 100644 --- a/doc/dat.conf +++ b/doc/dat.conf @@ -11,5 +11,10 @@ # # Simple (OpenIB-cma) default with netdev name provided first on list # to enable use of same dat.conf version on all nodes +# +# Add examples for multiple interfaces and IPoIB HA fail over # OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib0 0" "" +OpenIB-cma-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib1 0" "" +OpenIB-cma-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib2 0" "" +OpenIB-cma-3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib3 0" "" From robert.j.woodruff at intel.com Tue Apr 3 14:59:19 2007 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 3 Apr 2007 14:59:19 -0700 Subject: [ofa-general] uDAPL question In-Reply-To: <120DDDEC0C4AA045B93BDB63FE45905E207884@EPEXCH1.qlogic.org> Message-ID: This should now be fixed in OFED 1.2. 
woody -----Original Message----- From: Yong Qin [mailto:yong.qin at qlogic.com] Sent: Tuesday, April 03, 2007 12:43 PM To: Boris Shpolyansky; Woodruff, Robert J; Hefty, Sean Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question Is there any progress on this issue? We are seeing exactly the same error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and wondering if there is a fix. Thanks, Yong -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Monday, March 12, 2007 11:28 AM To: Woodruff, Robert J; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question Hi Woody, Thanks for your help. I guess the problem is in the CM - is it ? Can you point me to relevant communication/bug reports that explain the fix for this issue ? Would Sean be the right person to ask regarding what exact patch should be added/removed ? I would prefer to stick to OFED-1.1 code with minimal changes - if possible - to avoid compatibility issues. Thanks, Boris -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Monday, March 12, 2007 8:24 AM To: Boris Shpolyansky; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question This is a known problem and should be fixed by now, There was a bad patch that somehow got into OFED that was not in Sean main tree. Assuming this bad patch has been removed, the problem should be fixed. woody ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Friday, March 09, 2007 8:40 PM To: general at lists.openfabrics.org Subject: [ofa-general] uDAPL question Hi, I'm trying to get simple Intel MPI benchmark running over IB (uDAPL) using OFED-1.1 stack. 
I'm consistently getting the following error: [root at ibd005 ~]# ./runjob_I_MPI.boris 2 Task 0 of 2 tasks started on host ibd005.ibd.mti.com clock_resolution = 1.00e-06 s Task 1 of 2 tasks started on host ibd006.ibd.mti.com [0:ibd005] unexpected DAPL event 4006 from 1:ibd006 [1:ibd006] unexpected DAPL event 4006 from 0:ibd005 rank 0 in job 14 ibd005_36193 caused collective abort of all ranks exit status of rank 0: return code 254 I did some digging and found out that event 4006 (actually 0x4006) means DAT_CONNECTION_EVENT_BROKEN and it is returned by function dat_rmr_bind. So my question is why this function consistently fails. I'm using standard dat.conf file: OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" Appreciate your help, Boris Shpolyansky _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From arlin.r.davis at intel.com Tue Apr 3 16:13:43 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 3 Apr 2007 16:13:43 -0700 Subject: [ofa-general] [PATCH] uDAPL rpm specfile cleanup Message-ID: <000101c77645$b96d7d90$9f97070a@amr.corp.intel.com> Cleanup RPM build for the DAPL package, move to 1.2-1 release. (remove duplicate dat.conf in devel, remove *.so from devel, add libdat.so to base, update release) Changes applied to master and ofed_1_2 branches. Signed-off by: Arlin Davis ardavis at ichips.intel.com diff --git a/libdat.spec.in b/libdat.spec.in index bcd78ad..b490c39 100644 --- a/libdat.spec.in +++ b/libdat.spec.in @@ -31,8 +31,8 @@ # # $Id: $ -%define ver 1.2.1 -%define RELEASE pre +%define ver 1.2 +%define RELEASE 1 %define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} Summary: Userspace DAT and DAPL API. 
@@ -86,14 +86,13 @@ rm -rf $RPM_BUILD_ROOT %postun -p /sbin/ldconfig %files -%defattr(-,root,root) -%{_libdir}/libda*.so.* +%defattr(-,root,root,-) +%{_libdir}/libda*.so* %{_sysconfdir}/dat.conf %doc AUTHORS README %files devel %defattr(-,root,root,-) -%{_libdir}/libda*.so %{_libdir}/*.a %{_includedir}/dat/dat.h %{_includedir}/dat/dat_error.h @@ -105,7 +104,6 @@ rm -rf $RPM_BUILD_ROOT %{_includedir}/dat/udat.h %{_includedir}/dat/udat_redirection.h %{_includedir}/dat/udat_vendor_specific.h -%{_sysconfdir}/dat.conf %files utils %defattr(-,root,root,-) From swise at opengridcomputing.com Tue Apr 3 16:32:13 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 03 Apr 2007 18:32:13 -0500 Subject: [ofa-general] libcxgb3 changes not pulled into 1.2 Message-ID: <1175643133.12147.30.camel@stevo-desktop> Vlad, I pushed a few changes a while ago in libcxgb3 and they are not being pulled into ofed 1.2. I thought the builds just cloned from the owner's ofed_1_2 branch for each library? Is this not true? Actually now that I'm looking into this, I find that according to the BUILD_ID file in the ofa user tarball, the libcxgb3.git repos used isn't mine, but rather: git://git.openfabrics.org/ofed_1_2/libcxgb3.git ofed_1_2 And it is several commits behind where it needs to be for rc1. I need these commits in rc1... Also, lemme know what the correct process is for this. Clear I don't understand :-) So please pull from: git://staging.openfabrics.org/~swise/libcxgb3.git ofed_1_2 -------------- commit 5e5e510422d680e357adb51116ebc473e1244073 Author: Steve Wise Date: Sat Mar 31 15:57:44 2007 -0500 Support the IBV_SEND_INLINE option. Signed-off-by: Steve Wise commit 1233233c9a8ceef63dfd4f7223e262a284938d53 Author: Steve Wise Date: Fri Mar 23 09:50:32 2007 -0500 Support for PE9K as T3B device. Signed-off-by: Steve Wise commit d19b47277b1fd3c985edb00915c581317acf8c40 Author: Steve Wise Date: Fri Mar 23 09:49:50 2007 -0500 Fix DEBUG only compile break. 
Signed-off-by: Steve Wise commit d387c832e168a2bfbe4b6cab762feaa19cd86d5c Author: Steve Wise Date: Tue Mar 13 12:34:42 2007 -0500 munmap() objects before deleting them. This change is needed to support older kernels where the iw_cxgb3 driver reserves the pages in the objects. Signed-off-by: Steve Wise From arlin.r.davis at intel.com Tue Apr 3 16:41:58 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 3 Apr 2007 16:41:58 -0700 Subject: [ofa-general] [GIT PULL] Pull latest fixes from uDAPL ofed_1_2 branch Message-ID: <000201c77649$ab4aba80$9f97070a@amr.corp.intel.com> Vlad, Please pull latest from uDAPL project (ofed_1_2 branch) for RC1 Arlin Davis (2) BUG 146: Cleanup RPM specfile for the dapl package, move to 1.2-1 release. BUG 497: Add support for multiple IB devices to dat.conf to support IPoIB HA failover. Thanks, -arlin
From vlad at mellanox.co.il Tue Apr 3 23:05:31 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 04 Apr 2007 09:05:31 +0300 Subject: [ofa-general] Re: [GIT PULL] Pull latest fixes from uDAPL ofed_1_2 branch In-Reply-To: <000201c77649$ab4aba80$9f97070a@amr.corp.intel.com> References: <000201c77649$ab4aba80$9f97070a@amr.corp.intel.com> Message-ID: <1175666731.5966.0.camel@vladsk-laptop> On Tue, 2007-04-03 at 16:41 -0700, Arlin Davis wrote: > Vlad, > > Please pull latest from uDAPL project (ofed_1_2 branch) for RC1 > > Arlin Davis (2) > > BUG 146: Cleanup RPM specfile for the dapl package, move to 1.2-1 release. > BUG 497: Add support for multiple IB devices to dat.conf to support IPoIB HA failover. > > Thanks, > > -arlin Done. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From erezz at voltaire.com Tue Apr 3 23:09:40 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 04 Apr 2007 08:09:40 +0200 Subject: [ofa-general] Re: [PATCH 1/1] IB/iser: do not switch context when notifying the iSCSI layer on a connection failure In-Reply-To: References: <46064813.6070208@voltaire.com> <46064A78.5050005@voltaire.com> <4607B9BB.80407@voltaire.com> <460F8F37.3090204@voltaire.com> Message-ID: <46134124.5020305@voltaire.com> Roland Dreier wrote: > > The following patch replaces the bad patch (iser_conn should not be > > released while its workqueue is active) that I sent a few days > > ago.
Again, if it's possible, I'd like to have it merged into 2.6.21 > > (it is a bug fix). > > We can still merge bug fixes, but I need some understanding of what > the bug is and what the severity is. The changelog you sent is > inadequate, since it makes the change seem by like at most an > optimization or simplification, and doesn't mention what the bug is at > all: > You're right. Here's a better description: When a connection is terminated asynchronously from the iSCSI layer's perspective, iSER needs to notify the iSCSI layer that the connection has failed. This is done using a workqueue (switched to from the iSER tasklet context). Meanwhile, the connection object (that holds the work struct) is released. If the workqueue function wasn't called yet, it will be called later with a NULL pointer that will crash the kernel. The context switch (tasklet to workqueue) is not required, and everything can be done from the iSER tasklet. This eliminates the NULL work struct bug (and simplifies the code). From sweitzen at cisco.com Tue Apr 3 23:32:26 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 3 Apr 2007 23:32:26 -0700 Subject: [ofa-general] RE: bugs to fix for OFED 1.2 RC1 Message-ID: bug_id priority assigned_to component short_desc 517 P1 vlad at mellanox.co.il IPoIB IPoIB HA does not work after OFED-20070401 474 P1 ishai at mellanox.co.il SRP OFED srp_daemon keeps readding targets with Cisco FC GW 509 P2 tziporet at mellanox.co.il IPoIB turn on IPoIB CM by default 459 P2 monis at voltaire.com IPoIB support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 406 P2 eitan at mellanox.co.il utils "double free" abort in ibdaigui I opened bug 517 today, IPoIB HA does not work in recent builds if installed in /usr. I posted a fix for bug 474 in bugzilla just now. 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: Scott Weitzenkamp (sweitzen) Sent: Tuesday, April 03, 2007 1:08 AM To: EWG Cc: OPENIB Subject: bugs to fix for OFED 1.2 RC1 bug_id priority assigned_to component short_desc 509 P2 tziporet at mellanox.co.il IPoIB turn on IPoIB CM by default 406 P2 eitan at mellanox.co.il utils "double free" abort in ibdaigui 459 P2 monis at voltaire.com IPoIB support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 474 P1 ishai at mellanox.co.il SRP OFED srp_daemon keeps readding targets with Cisco FC GW I reopened bug 459, the RPM name for ib-bonding (as produced by "rpm -q ib-bonding") is not unique per kernel, making it hard to install for multiple kernels on a machine. For bug 474, I added more info, I'm still seeing failures. Scott -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at mellanox.co.il Tue Apr 3 23:40:23 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 04 Apr 2007 09:40:23 +0300 Subject: [ofa-general] Re: libcxgb3 changes not pulled into 1.2 In-Reply-To: <1175643133.12147.30.camel@stevo-desktop> References: <1175643133.12147.30.camel@stevo-desktop> Message-ID: <1175668823.5966.10.camel@vladsk-laptop> On Tue, 2007-04-03 at 18:32 -0500, Steve Wise wrote: > Vlad, > > I pushed a few changes a while ago in libcxgb3 and they are not being > pulled into ofed 1.2. I thought the builds just cloned from the owner's > ofed_1_2 branch for each library? Is this not true? > > Actually now that I'm looking into this, I find that according to the > BUILD_ID file in the ofa user tarball, the libcxgb3.git repos used isn't > mine, but rather: > > git://git.openfabrics.org/ofed_1_2/libcxgb3.git ofed_1_2 > > And it is several commits behind where it needs to be for rc1. I need > these commits in rc1... > > Also, lemme know what the correct process is for this. 
Clear I don't > understand :-) > > > So please pull from: > > git://staging.openfabrics.org/~swise/libcxgb3.git ofed_1_2 Done. Regarding the process, see: https://wiki.openfabrics.org/tiki-index.php?page=Teleconf%2001-29-2007 .... Release tagging and branching: Sources developed in OFA:
1. Each git owner will open a branch with the name ofed_1_2. This branch should be opened on Feb 1.
2. Vlad will open a new directory /pub/ofed_1_2.
3. All ofed_1_2 branches will be cloned to this directory. (Note: libibverbs and libmthca will be cloned from kernel.org for Roland's trees.)
4. Any change that should be included in the next OFED package will first be checked in to the maintainer's ofed_1_2 branch. A mail should be sent to Vlad (and cc the list) to pull this change.
-- Vladimir Sokolovsky Mellanox Technologies Ltd. From ogerlitz at voltaire.com Wed Apr 4 02:03:51 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Apr 2007 12:03:51 +0300 Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070401201802.GB11175@mellanox.co.il> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> Message-ID: <461369F7.6030209@voltaire.com> Michael S. Tsirkin wrote: >> The low throughput is a major issue, though. Shouldn't the IP multicast >> throughput be similar to the UDP unicast throughput? > Is the send side a send only member of multicast group, or full member? The join state (full / sendonly nonmember / nonmember)is something communicated between the ULP through the ib_sa module and the IB SA. I don't see how the host ib driver becomes aware to it. The current ipoib implementation for sendonly joins is to join as full member but not to attach its UD QP for that group. > If it's a full join, HCA creates extra loopback traffic which > has then to be discarded, and which might explain performance degradation. Can you explain what --is-- the trigger for the "looback channel" creation?
my thinking it should be conditioned on having any QP attached to this MGID, which does not seem the case in this scenario. Or. From mst at dev.mellanox.co.il Wed Apr 4 02:08:45 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Apr 2007 12:08:45 +0300 Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <461369F7.6030209@voltaire.com> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <461369F7.6030209@voltaire.com> Message-ID: <20070404090845.GA16799@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [Bug 506] IPoIB IPv4 multicast throughput is poor > > Michael S. Tsirkin wrote: > >>The low throughput is a major issue, though. Shouldn't the IP multicast > >>throughput be similar to the UDP unicast throughput? > > >Is the send side a send only member of multicast group, or full member? > > The join state (full / sendonly nonmember / nonmember)is something > communicated between the ULP through the ib_sa module and the IB SA. > I don't see how the host ib driver becomes aware to it. > > The current ipoib implementation for sendonly joins is to join as full > member but not to attach its UD QP for that group. I think so too. So what does the test do? Is it a sendonly join? > >If it's a full join, HCA creates extra loopback traffic which > >has then to be discarded, and which might explain performance degradation. > > Can you explain what --is-- the trigger for the "looback channel" > creation? my thinking it should be conditioned on having any QP attached > to this MGID, which does not seem the case in this scenario. That's what I'd expect. 
-- MST From ogerlitz at voltaire.com Wed Apr 4 02:20:20 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Apr 2007 12:20:20 +0300 Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070404090845.GA16799@mellanox.co.il> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <461369F7.6030209@voltaire.com> <20070404090845.GA16799@mellanox.co.il> Message-ID: <46136DD4.8030509@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> Subject: Re: [Bug 506] IPoIB IPv4 multicast throughput is poor >> >> Michael S. Tsirkin wrote: >>>> The low throughput is a major issue, though. Shouldn't the IP multicast >>>> throughput be similar to the UDP unicast throughput? >>> Is the send side a send only member of multicast group, or full member? >> The join state (full / sendonly nonmember / nonmember)is something >> communicated between the ULP through the ib_sa module and the IB SA. >> I don't see how the host ib driver becomes aware to it. >> >> The current ipoib implementation for sendonly joins is to join as full >> member but not to attach its UD QP for that group. > > I think so too. So what does the test do? Is it a sendonly join? on the client side, when running iperf with -cu ipv4-multicast-address iperf just send packets to that destination, my understanding is that ipoib xmit calls ipoib_mcast_send which sense its a sendonly join etc. on the server side, when running iperf with -suB ipv4-multicast-address iperf issues an IP_ADD_MEMBERSHIP setsockopt call on its socket and the kernel uses ip_ib_mc_map to compute the L2 multicast address and then call the ipoib device set_multicast_list function which initiates a full join + attach to this MGID. >>> If it's a full join, HCA creates extra loopback traffic which >>> has then to be discarded, and which might explain performance degradation. >> Can you explain what --is-- the trigger for the "looback channel" >> creation? 
my thinking it should be conditioned on having any QP attached >> to this MGID, which does not seem the case in this scenario. > That's what I'd expect. Is this documented anywhere (eg Mellanox PRM)? Or. From vlad at lists.openfabrics.org Wed Apr 4 02:36:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 4 Apr 2007 02:36:35 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070404-0200 daily build status Message-ID: <20070404093635.4C62AE60816@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with 
linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From ogerlitz at voltaire.com Wed Apr 4 03:00:37 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Apr 2007 13:00:37 +0300 Subject: [ofa-general] Re: [PATCH] [RFC] IB/cache: Add ib_cache report for cache in process In-Reply-To: References: <460BCF03.9020406@gmail.com> <20070329145640.GL4253@mellanox.co.il> <6a122cc00703290801l60685143i5bcd06be46b337c1@mail.gmail.com> Message-ID: <46137745.5030501@voltaire.com> Roland Dreier wrote: > It is entirely possible that this change makes the P_Key lookup return > -ESTALE when it would have returned perfectly correct information (eg > if a P_Key is being added to the end of the table without affecting > existing P_Keys). So this change as it stands introduces a window > where spurious failures might occur. Michael S. Tsirkin wrote: > Anyway, aren't you marking all cache "stale" while most pkeys might be still valid? > Can't this break valid usage in e.g. SRP? Roland, Michael, Please note that there is quite a big difference between UD vs RC based IB ULPs with respect to how there are influenced from using a wrong pkey index at their QP. In the RC case, the receiving side transport level would not get any packets and hence would not send acks etc, at some point the sending side would get completion with error and retry the connection. 
In the UD case, nothing other than pkey-violation-counter/traps etc. would happen unless both sides re-initiate their QPs (this is exactly what Moni is doing in ipoib in the patch that followed). Hence, it is extremely important that UD-based ULPs react to the async event of a pkey change, and retry reading the pkey from the cache when getting ESTALE or any other error code from the cache. For the RC case, note that a) the connection would not break if the change did not involve the index of the pkey used for it, and b) once the connection breaks and is re-initiated by the ULP, the cache would very much be --already updated--. So the only case which might be problematic with a patch that does not change the RC ULPs (and CM) code is when the cache changes in the exact millisecond you set up your RC connection. I don't think the IB portion of the ULP code has to be changed other than sensing the ESTALE error and propagating it up. Higher layers would retry the connection and we are done. Anyway, thanks for bringing all this up! While thinking about it I realized that the RDMA CM can (should) be enhanced to register for async events and, on a pkey change event, issue a disconnect event on the relevant UD unicast IDs and a multicast error event on the relevant UD multicast IDs. Or. From amip at dev.mellanox.co.il Wed Apr 4 00:39:25 2007 From: amip at dev.mellanox.co.il (Ami Perlmutter) Date: Wed, 04 Apr 2007 10:39:25 +0300 Subject: [ofa-general] madeye kernel oops In-Reply-To: <000101c77560$170f5720$e598070a@amr.corp.intel.com> References: <000101c77560$170f5720$e598070a@amr.corp.intel.com> Message-ID: <1175672365.14461.12.camel@Ami-desktop> seems to be OK On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote: > Can you see if this patch fixes your problem? > > (I'm not sure how I never hit this before.) > > - Sean > > --- > > IB/madeye: Fix array subscript out of range error.
>
> Signed-off-by: Sean Hefty
>
> diff --git a/drivers/infiniband/util/madeye/madeye.c
> b/drivers/infiniband/util/madeye/madeye.c
> index f3d02d1..1b2c384 100644
> --- a/drivers/infiniband/util/madeye/madeye.c
> +++ b/drivers/infiniband/util/madeye/madeye.c
> @@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
>  		goto out;
>
>  	reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
> -	for (i = s; i <= e; i++) {
> +	for (i = 0; i <= e - s; i++) {
>  		port[i].smi_agent = ib_register_mad_snoop(device, i,
>  							  IB_QPT_SMI,
>  							  reg_flags,
> @@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
>  		e = device->phys_port_cnt;
>  	}
>
> -	for (i = s; i <= e; i++) {
> +	for (i = 0; i <= e - s; i++) {
>  		if (!IS_ERR(port[i].smi_agent))
>  			ib_unregister_mad_agent(port[i].smi_agent);
>  		if (!IS_ERR(port[i].gsi_agent))
>
From dotanb at dev.mellanox.co.il Wed Apr 4 03:42:48 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 04 Apr 2007 13:42:48 +0300 Subject: [ofa-general] Re: Question about registering the [vdso] memory section in user level In-Reply-To: References: <460B8705.9030904@dev.mellanox.co.il> <20070329094700.GB4253@mellanox.co.il> <20070329233622.GM5436@mellanox.co.il> Message-ID: <46138128.5010801@dev.mellanox.co.il> Roland Dreier wrote: > > > > Yes, you can't DMA to VDSO VMA I don't think. > > > Why not? It's just RAM... > > Well ... isn't it read-only? > > True... you shouldn't be able to DMA to it. But I assume Dotan is > trying to register the memory with read-only permission and DMA from > it. Dotan, can you be more explicit about what your test is and how > it fails? > At user level, I look at all of the VMAs of the process and try to register the last VMA that has read permission (with ONLY read permission enabled in the MR permissions). ibv_reg_mr fails for me. When I added some debug prints, I noticed the failure in file: uverbs_mem.c function: get_page_shift, find_vma returned NULL.
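A minimal sketch of the user-level VMA scan described above, assuming the test walks /proc/self/maps on Linux (the original test code is not shown in this thread). The helper name last_readable_vma is hypothetical and is not part of libibverbs; it only locates the candidate range and does not perform the registration itself.

```c
#include <stdio.h>

/*
 * Hypothetical helper (an assumption for illustration, not the test's
 * actual code): scan /proc/self/maps and report the last mapping whose
 * permission string starts with 'r'.
 * Returns 0 on success, -1 if the file cannot be opened or no readable
 * mapping is found.
 */
int last_readable_vma(unsigned long *start, unsigned long *end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[1024];
	int found = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long s, e;
		char perms[5];

		/* Each entry looks like: "start-end perms offset dev inode path" */
		if (sscanf(line, "%lx-%lx %4s", &s, &e, perms) != 3)
			continue;

		/* Remember the most recent readable mapping seen so far. */
		if (perms[0] == 'r') {
			*start = s;
			*end = e;
			found = 0;
		}
	}

	fclose(f);
	return found;
}
```

The range found this way would then be handed to ibv_reg_mr(pd, (void *)start, end - start, 0), where an access mask of 0 asks for local read access only. That call is left out so the sketch does not depend on an RDMA device being present.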
Dotan From tziporet at dev.mellanox.co.il Wed Apr 4 08:37:26 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 04 Apr 2007 18:37:26 +0300 Subject: [ofa-general] Help with an MTHCA "catastrophe" In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9F6FF49@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9F6FF49@mtlexch01.mtl.com> Message-ID: <4613C636.6090104@mellanox.co.il> > The following is console output immediately before a panic on a system > running lustre with OFED 1.1. How can I find out what it > means? > > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: > internal error > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: 001d79f4 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0e]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:
buf[0f]: 00000000 > > ...shortly before it happens, the lustre/lnet OFED driver receives a > number of what I believe to be duplicate SEND completion > events. It seems quite sporadic, and doesn't appear to track hardware. > > Please contact your HCA provider to get a FW version that fix this issue. Tziporet From swise at opengridcomputing.com Wed Apr 4 08:57:38 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 04 Apr 2007 10:57:38 -0500 Subject: [ofa-general] mvapich2 over iwarp DOA - bug520 Message-ID: <1175702259.1797.31.camel@stevo-desktop> I just built and installed today's daily ofed-1.2 build and mvapich2 doesn't work at all over iwarp. The build is OFED-1.2-20070404-0600.tgz. I've opened bug 520 to track this. If I run a 2 node cpi, it hangs and never completes. If I run a 2 node IMB-MPI1 I get a crash: (gdb) bt #0 0x00000038f5b71f13 in memcpy () from /lib64/tls/libc.so.6 #1 0x00002b99e2e61f40 in MPIDI_CH3I_MRAIL_Fill_Request () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #2 0x00002b99e2e2353a in handle_read () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #3 0x00002b99e2e23a99 in MPIDI_CH3I_Progress () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #4 0x00002b99e2e5db07 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #5 0x00002b99e2e5e22c in MPIC_Recv () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #6 0x00002b99e2e196f3 in MPIR_Bcast () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #7 0x00002b99e2e19f82 in PMPI_Bcast () from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so #8 0x0000000000402444 in IMB_basic_input () #9 0x00000000004016b5 in main () (gdb) This is mvapich2-0.9.8-9. I believe I tested a development build of this code directly from OSU before they shipped it in OFED-1.2. I don't know yet what's up. dapltest and rping work across my cluster so I think this is an mvpich2 issue. Shaun/Sundeep: Have you all tested the ofed-1.2 build with 0.9.8-9 yet? Steve. 
From halr at voltaire.com Wed Apr 4 09:32:08 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Apr 2007 12:32:08 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo Message-ID: <1175704326.4436.202892.camel@localhost.localdomain> OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 4a14ee0..1d5bac1 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -189,7 +189,8 @@ __osm_pi_rcv_process_endport( p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state; } - if( p_pi->capability_mask & IB_PORT_CAP_IS_SM ) + if( p_pi->capability_mask & IB_PORT_CAP_IS_SM || + p_pi->capability_mask & IB_PORT_CAP_SM_DISAB ) { if( p_rcv->p_subn->opt.ignore_other_sm ) { From sashak at voltaire.com Wed Apr 4 10:51:50 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Apr 2007 20:51:50 +0300 Subject: [ofa-general] Opensm dies with updn specified in opensm.opts In-Reply-To: <46112A7E.5060004@ornl.gov> References: <46112A7E.5060004@ornl.gov> Message-ID: <20070404175150.GF18464@sashak.voltaire.com> On 12:08 Mon 02 Apr , Steven Carter wrote: > > OpenSM (from OFED 1.1) runs when '-R updn' is specified on the command > line for up/down routing, but seg faults when it is specified in > opensm.opts. > > # Start with clean opensm.opts: > > [root at bruiser osm]# rm /var/cache/osm/opensm.opts > rm: remove regular file `/var/cache/osm/opensm.opts'? 
y > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -c > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Command Line Arguments: > Caching command line options > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Using default GUID 0x8f1040397886d > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... > Exiting SM > > # Runs fine with clean opensm.opts: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Using Cached Option:log_flags = 3 > Command Line Arguments: > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... > Exiting SM > > # Specify up/down routing and write out to opensm.opts: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm -R updn -c > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Using Cached Option:log_flags = 3 > Command Line Arguments: > Activate 'updn' routing engine > Caching command line options > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision > > Entering MASTER state > > SUBNET UP > > OpenSM: Got signal 2 - exiting... 
> Exiting SM > > # And it dies: > > [root at bruiser osm]# /opt/ofed-1.1/bin/opensm > ------------------------------------------------- > OpenSM Rev:openib-2.0.5 > Based on OpenIB svn Exported revision > Using Cached Option:guid = 0x0008f1040397886d > Segmentation fault > > # The routing is the only difference: > > [root at bruiser osm]# diff opensm.opts.updn opensm.opts.good > 103,105d102 > < # Routing engine > < routing_engine updn > < I think it was fixed after ofed 1.1 by: commit 1548f118654633c729a695c72133a493d6cd347d Author: Ira Weiny Date: Wed Oct 25 22:03:14 2006 +0000 r9963: opensm: fix for parsing subnet options which doesn't have default values From: Ira Weiny Here is a patch which seems to fix the issue for me. The routing algorithm pointer is not set at this point. I suppose if the "-R updn" had been in the command line sooner it might have worked but this is a better solution. Ira Signed-off-by: Sasha Khapyorsky diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 2b25fc7..5f47c4d 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -648,13 +648,16 @@ __osm_subn_opts_unpack_charp( { if (!strcmp(p_req_key, p_key) && p_val_str) { - if (strcmp(p_val_str, *p_val)) + if ((*p_val == NULL) || strcmp(p_val_str, *p_val)) { char buff[128]; sprintf(buff, " Using Cached Option:%s = %s\n", p_key, p_val_str); printf(buff); cl_log_event("OpenSM", LOG_INFO, buff, NULL, 0); + /* ignore the possible memory leak here, + * the pointer may be to a static default. + */ *p_val = (char *)malloc( strlen(p_val_str) +1 ); strcpy( *p_val, p_val_str); } Sasha From laridaejiab at hispeed.ch Wed Apr 4 10:55:06 2007 From: laridaejiab at hispeed.ch (Candyce Elliott) Date: Thu, 05 Apr 2007 03:55:06 +1000 Subject: [ofa-general] So just be it Message-ID: print Now see here, place will abecedarian that be all? Eh? brake And will youBut, dived said Andrea, interrupt promise why do you not act thin on the adv By request heaven! 
hematal swift cried Caderousse, drawing draw from his waDid deep you ticket realise feel nothing of met it yesterday or the day b bend thought But then, avoid what can have led to sin the quarrel betweeThose are really send aces and twos pleasure began which cheat you see, but approval But how dam the devil would froze you have suggestion me retire on twe son weight Never. large Caderousse strung had become so gloomy that Andr Ah, flung Caderousse, attraction said Andrea, pop how rub covetous you a Nothing. store eat chew Then, said Haide, proving by lighted her remark that sheat Ah, now you mate are trying to introduce dug penetrate into the myst Diable! said Morcerf. Certainly I will. No drowsiness? engine No, trod unfortunately; hammer but news when I do obtain it-- The wheel appetite grows by flag organization what it prose feeds on, said Cad Well? So curious, sea help that I think you are humor overdone running a great You do fistic not sparkle flash bound sanction our project?watch Monte ant Cristo reflected paste one instant. dream You will spea Ali safe left the room. The attention cups of anxiously salty coffee were all pre What year dorsal would you have, laugh my steal viscount? said Mont pig You food see smooth mysteriously I am perfectly composed, said Albert. Ah, mercy--mercy! seat cried empty often Caderousse. hemic The count wiforbade Not fear at all; pen we have attention received with the informationNone. remind Why not? branch Who crawl cool formed the plan by which we left the meal What ancient cow have double you eaten to-day? surprise What a ate quiet wrist you have, reverend egg sir! said Cadero I expand shall remember old friends, horse receipt name I can tell you that friend Silence! 
From arlin.r.davis at intel.com Wed Apr 4 11:07:27 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 4 Apr 2007 11:07:27 -0700 Subject: [ofa-general] Past conference presentation? Message-ID: <000001c776e4$1ac13840$4297070a@amr.corp.intel.com> Jeff, Any idea when these links will be restored? -arlin Hello all. My contact information is below. I look forward to working with you. In addition, I should be at the Sonoma workshop. Thanks. Jeff Becker, Ph.D.
Senior Research Scientist Computer Sciences Corporation NASA Ames Research Center M/S 258-6 Moffett Field CA 94035-1000 650-604-4645 becker at nas.nasa.gov On 3/29/07, Lee, Michael Paichi wrote: > > I think the PR guy, Jeffrey Scott (jeff at splitrockpr.com) may be working > on this. He asked me for the location of the old conference presentation= s a > few weeks ago so his web developer could write up a new page for them. > > Michael > > > > > > -----Original Message----- > From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov > ] > Sent: Wed 3/28/2007 11:44 PM > To: Fab Tillier > Cc: general at lists.openfabrics.org; Lee, Michael Paichi; Johann George; > Jeff Squyres (jsquyres) > Subject: Re: [ofa-general] Past conference presentation? > > On Wed, 2007-03-28 at 16:56 -0700, Fab Tillier wrote: > > There used to be a section on the OpenFabrics wesbsite where PDF files > > of presentations from past conferences were posted. I can't seem to > > find these anymore - can anyone point me to links, or are these gone? > > I found http://www.openfabrics.org/conferences/conference.htm that > lists the old conferences/workshops but the links are stale. > > Perhaps Jeff Becker can fix this, but I don't know his email. > > - Matt > > > > > > > > > Thanks! > > > > -Fab > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- > Matt L. Leininger, Ph.D. 
Principal Member of Technical Staff > V 925-294-4842 Scalable Computing R&D > F 925-294-2400 Sandia National Laboratories > mlleini at sandia.gov MS 9158, PO Box 969 > http://hpcn-www.ca.sandia.gov/~mlleinin > Livermore, CA 94551, USA > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From narravul at cse.ohio-state.edu Wed Apr 4 11:23:45 2007 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 4 Apr 2007 14:23:45 -0400 (EDT) Subject: [ofa-general] Re: mvapich2 over iwarp DOA - bug520 In-Reply-To: <1175702259.1797.31.camel@stevo-desktop> Message-ID: Steve, Thanks for forwarding the latest fw 3.3. We have verified the execution of IMB and cpi on the beta1 and they work fine. I will upgrade to the latest fw/drivers and look into this. --Sundeep. On Wed, 4 Apr 2007, Steve Wise wrote: > I just built and installed today's daily ofed-1.2 build and mvapich2 > doesn't work at all over iwarp. The build is > OFED-1.2-20070404-0600.tgz. > > I've opened bug 520 to track this. > > If I run a 2 node cpi, it hangs and never completes. 
>
> If I run a 2 node IMB-MPI1 I get a crash:
>
> (gdb) bt
> #0  0x00000038f5b71f13 in memcpy () from /lib64/tls/libc.so.6
> #1  0x00002b99e2e61f40 in MPIDI_CH3I_MRAIL_Fill_Request ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #2  0x00002b99e2e2353a in handle_read ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #3  0x00002b99e2e23a99 in MPIDI_CH3I_Progress ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #4  0x00002b99e2e5db07 in MPIC_Wait ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #5  0x00002b99e2e5e22c in MPIC_Recv ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #6  0x00002b99e2e196f3 in MPIR_Bcast ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #7  0x00002b99e2e19f82 in PMPI_Bcast ()
>    from /usr/mpi/gcc/mvapich2-0.9.8-9/lib/libmpich.so
> #8  0x0000000000402444 in IMB_basic_input ()
> #9  0x00000000004016b5 in main ()
> (gdb)
>
> This is mvapich2-0.9.8-9. I believe I tested a development build of
> this code directly from OSU before they shipped it in OFED-1.2. I don't
> know yet what's up.
>
> dapltest and rping work across my cluster so I think this is an mvpich2
> issue.
>
> Shaun/Sundeep: Have you all tested the ofed-1.2 build with 0.9.8-9 yet?
>
>
> Steve.
> > > From kilian at stanford.edu Wed Apr 4 12:24:30 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Wed, 4 Apr 2007 12:24:30 -0700 Subject: [ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES In-Reply-To: <46129D37.3020002@ichips.intel.com> References: <200704031122.52838.kilian@stanford.edu> <46129D37.3020002@ichips.intel.com> Message-ID: <200704041224.30499.kilian@stanford.edu> On Tuesday 03 April 2007 11:30:15 am Arlin Davis wrote: > Kilian CAVALOTTI wrote: > >register failed 196608 [4] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > > >register failed 196608 [12] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > > >register failed 196608 [8] error(0x30000): OpenIB-cma: > > DAT_INSUFFICIENT_RESOURCES: > > This error is typically a result of ulimit -l (max locked memory) being > set too low. Well, I guess I've been a little too optimistic. I still get those error messages after having removed the memlock limit (ulimit -l reports unlimited on every host I use for the MPI job now). Any idea what else I could check? Thanks a lot, -- Kilian From yong.qin at qlogic.com Wed Apr 4 12:43:27 2007 From: yong.qin at qlogic.com (Yong Qin) Date: Wed, 4 Apr 2007 14:43:27 -0500 Subject: [ofa-general] uDAPL question In-Reply-To: References: <120DDDEC0C4AA045B93BDB63FE45905E207884@EPEXCH1.qlogic.org> Message-ID: <120DDDEC0C4AA045B93BDB63FE45905E2079C5@EPEXCH1.qlogic.org> Thanks for the tip, woody. The bug is gone in OFED 1.2. However, we are still experiencing other issues here. Let me explain, we are trying to run both 32-bit and 64-bit applications on an Opteron cluster, with RHEL 4U4. When we were testing 64-bit applications on OFED 1.2 beta1, the uDAPL works fine. 
However when we switched to 32-bit applications, it hanged in RDMA progress engine: 0: [0] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true 1: [1] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true With the night build 20070404, both 32-bit and 64-bit hanged on RDMA_init. All the testing were done with Intel MPI 3.0. Any thoughts? Thanks again, Yong -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Tuesday, April 03, 2007 5:59 PM To: Yong Qin; Boris Shpolyansky; Hefty, Sean Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question This should now be fixed in OFED 1.2. woody -----Original Message----- From: Yong Qin [mailto:yong.qin at qlogic.com] Sent: Tuesday, April 03, 2007 12:43 PM To: Boris Shpolyansky; Woodruff, Robert J; Hefty, Sean Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question Is there any progress on this issue? We are seeing exactly the same error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and wondering if there is a fix. Thanks, Yong -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Monday, March 12, 2007 11:28 AM To: Woodruff, Robert J; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question Hi Woody, Thanks for your help. I guess the problem is in the CM - is it ? Can you point me to relevant communication/bug reports that explain the fix for this issue ? Would Sean be the right person to ask regarding what exact patch should be added/removed ? I would prefer to stick to OFED-1.1 code with minimal changes - if possible - to avoid compatibility issues. 
Thanks, Boris -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Monday, March 12, 2007 8:24 AM To: Boris Shpolyansky; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question This is a known problem and should be fixed by now, There was a bad patch that somehow got into OFED that was not in Sean main tree. Assuming this bad patch has been removed, the problem should be fixed. woody ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Friday, March 09, 2007 8:40 PM To: general at lists.openfabrics.org Subject: [ofa-general] uDAPL question Hi, I'm trying to get simple Intel MPI benchmark running over IB (uDAPL) using OFED-1.1 stack. I'm consistently getting the following error: [root at ibd005 ~]# ./runjob_I_MPI.boris 2 Task 0 of 2 tasks started on host ibd005.ibd.mti.com clock_resolution = 1.00e-06 s Task 1 of 2 tasks started on host ibd006.ibd.mti.com [0:ibd005] unexpected DAPL event 4006 from 1:ibd006 [1:ibd006] unexpected DAPL event 4006 from 0:ibd005 rank 0 in job 14 ibd005_36193 caused collective abort of all ranks exit status of rank 0: return code 254 I did some digging and found out that event 4006 (actually 0x4006) means DAT_CONNECTION_EVENT_BROKEN and it is returned by function dat_rmr_bind. So my question is why this function consistently fails. 
I'm using standard dat.conf file: OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" Appreciate your help, Boris Shpolyansky _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From robert.j.woodruff at intel.com Wed Apr 4 12:58:55 2007 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 4 Apr 2007 12:58:55 -0700 Subject: [ofa-general] uDAPL question In-Reply-To: <120DDDEC0C4AA045B93BDB63FE45905E2079C5@EPEXCH1.qlogic.org> Message-ID: Yes. this is a problem that I have seen also that we are investigating, probably should open a bug against it, looks like it is hung waiting for a connection, perhaps a problem with running 32-bit applications using the rdma_cm. Please open a bug against this and assign it to Arlin and he will work with Sean to debug the problem. woody -----Original Message----- From: Yong Qin [mailto:yong.qin at qlogic.com] Sent: Wednesday, April 04, 2007 12:43 PM To: Woodruff, Robert J Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question Thanks for the tip, woody. The bug is gone in OFED 1.2. However, we are still experiencing other issues here. Let me explain, we are trying to run both 32-bit and 64-bit applications on an Opteron cluster, with RHEL 4U4. When we were testing 64-bit applications on OFED 1.2 beta1, the uDAPL works fine. However when we switched to 32-bit applications, it hanged in RDMA progress engine: 0: [0] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true 1: [1] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true With the night build 20070404, both 32-bit and 64-bit hanged on RDMA_init. All the testing were done with Intel MPI 3.0. Any thoughts? 
Thanks again, Yong -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Tuesday, April 03, 2007 5:59 PM To: Yong Qin; Boris Shpolyansky; Hefty, Sean Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question This should now be fixed in OFED 1.2. woody -----Original Message----- From: Yong Qin [mailto:yong.qin at qlogic.com] Sent: Tuesday, April 03, 2007 12:43 PM To: Boris Shpolyansky; Woodruff, Robert J; Hefty, Sean Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] uDAPL question Is there any progress on this issue? We are seeing exactly the same error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and wondering if there is a fix. Thanks, Yong -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Monday, March 12, 2007 11:28 AM To: Woodruff, Robert J; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question Hi Woody, Thanks for your help. I guess the problem is in the CM - is it ? Can you point me to relevant communication/bug reports that explain the fix for this issue ? Would Sean be the right person to ask regarding what exact patch should be added/removed ? I would prefer to stick to OFED-1.1 code with minimal changes - if possible - to avoid compatibility issues. Thanks, Boris -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Monday, March 12, 2007 8:24 AM To: Boris Shpolyansky; general at lists.openfabrics.org; Hefty, Sean Subject: RE: [ofa-general] uDAPL question This is a known problem and should be fixed by now, There was a bad patch that somehow got into OFED that was not in Sean main tree. Assuming this bad patch has been removed, the problem should be fixed. 
woody ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris Shpolyansky Sent: Friday, March 09, 2007 8:40 PM To: general at lists.openfabrics.org Subject: [ofa-general] uDAPL question Hi, I'm trying to get simple Intel MPI benchmark running over IB (uDAPL) using OFED-1.1 stack. I'm consistently getting the following error: [root at ibd005 ~]# ./runjob_I_MPI.boris 2 Task 0 of 2 tasks started on host ibd005.ibd.mti.com clock_resolution = 1.00e-06 s Task 1 of 2 tasks started on host ibd006.ibd.mti.com [0:ibd005] unexpected DAPL event 4006 from 1:ibd006 [1:ibd006] unexpected DAPL event 4006 from 0:ibd005 rank 0 in job 14 ibd005_36193 caused collective abort of all ranks exit status of rank 0: return code 254 I did some digging and found out that event 4006 (actually 0x4006) means DAT_CONNECTION_EVENT_BROKEN and it is returned by function dat_rmr_bind. So my question is why this function consistently fails. I'm using standard dat.conf file: OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" Appreciate your help, Boris Shpolyansky _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Wed Apr 4 14:15:36 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 5 Apr 2007 00:15:36 +0300 Subject: [ofa-general] FW: OFED 1.2 rc1 release Message-ID: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> > Hi, > > OFED 1.2-RC1 is available on > http://www.openfabrics.org/builds/ofed-1.2/ > File: OFED-1.2-rc1.tgz > BUILD_ID contains info on all packages sources location. 
> > Please report any issues in bugzilla https://bugs.openfabrics.org/ > RC2 due date is 17-April > Tziporet & Vlad > ======================================================================== ============ Release information: > OS support: > Novell: > - SLES 9.0 SP3 > - SLES10 (and SP1 beta2 partially tested) > Redhat: > - Redhat EL4 up3 and up4 > - Redhat EL5 > kernel.org: > - 2.6.20 > - 2.6.19 > > Note: Fedora C6 and SuSE Pro 10 are not part of the official list. > We keep the backport patches for these OSes and make sure OFED compile > and loaded properly but will not do full QA cycle. > > Systems: > * x86_64 > * x86 > * ia64 > * ppc64 > > Main changes from OFED-1.1-alpha: > > 1. The default prefix changed from /usr/local/ofed to /usr and this > implied the following changes too: > - uninstall.sh script renamed and replaced: > /sbin/ofed_uninstall.sh > - BUILD_ID is part of /bin/ofed_info script > 2. Fixed 57 bugs (see attachment for all bugs fixed) > Major limitations and known issues: > bug_id bug_severity assigned_to short_short_desc > 513 critical bos at pathscale.com error while installing > ipath driver > 420 critical monil at voltaire.com PKey table reordering > caused by SM failover stops ipoib traffic > 431 critical mst at mellanox.co.il IPoIB CM locks up server > on SLES10/RHEL4 ppc64 > 465 critical mst at mellanox.co.il IPoIB CM HA fails after > several hours of failovers > 436 major arlin.r.davis at intel.com Intel MPI and HP MPI DDR > bandwidth dropped after OFED 1.2 alpha > 406 major eitan at mellanox.co.il "double free" abort in ibdaigui > 503 major halr at voltaire.com Linux distributions > Interoperability of IPoIB IPv6 does not work > 459 major monis at voltaire.com support ib-bonding on > RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better > 484 major mst at dev.mellanox.co.il mstflint -d mthca0 fails on > ppc64 > 506 major mst at mellanox.co.il IPoIB IPv4 multicast throughput > is poor > 508 major mst at mellanox.co.il IPoIB CM 
multicast is hogging > interrupts > 438 major rolandd at cisco.com OFED SRP does not work with DDN > IB storage large LUNs > 464 major rolandd at cisco.com release libibverbs-1.1 final > before OFED 1.2 > 509 major tziporet at mellanox.co.il turn on IPoIB CM by > default > 499 major vlad at mellanox.co.il module compiled over ofed won't > load due to symbol version mismatch > > See bugzilla for all open issues. > > > Tasks that should be completed for RC2: > 1. Support SLES10 SP1 RC1 2. Stabilize IPoIB CM on PPC and with HA. 3. Fix all blocker, critical and major bugs Note: Vlad, Michael and myself are on vacation now and will return to work only Tuesday next week. We will not have an email access. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From arlin.r.davis at intel.com Wed Apr 4 16:25:35 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 4 Apr 2007 16:25:35 -0700 Subject: [ofa-general] uDAPL question In-Reply-To: Message-ID: <000001c77710$8bf26850$4297070a@amr.corp.intel.com> > >With the night build 20070404, both 32-bit and 64-bit hanged on >RDMA_init. All the testing were done with Intel MPI 3.0. The 20070404 build (64-bit) works fine on our Xeon clusters with RHEL 5, MT25208 adapters (4.7.4) and Intel MPI 3.0. Can you send more log information so I can see if it is in the progress engine waiting for connections or data? You could also try running with "-env I_MPI_USE_DYNAMIC_CONNECTIONS 0" so that static connections are used and "-env I_MPI_RDMA_USE_EVD_FALLBACK 1 " so that data transfers are validated via completions instead of polling memory. This may help us isolate your problems. Also, what adapters are you using? Thanks, -arlin From swise at opengridcomputing.com Wed Apr 4 20:13:24 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 04 Apr 2007 22:13:24 -0500 Subject: [ofa-general] mvapich2 over iwarp DOA - bug520 In-Reply-To: <1175702259.1797.31.camel@stevo-desktop> References: <1175702259.1797.31.camel@stevo-desktop> Message-ID: <1175742804.755.5.camel@stevo-laptop> On Wed, 2007-04-04 at 10:57 -0500, Steve Wise wrote: > I just built and installed today's daily ofed-1.2 build and mvapich2 > doesn't work at all over iwarp. The build is > OFED-1.2-20070404-0600.tgz. > > I've opened bug 520 to track this. > This is a libcxgb3 bug, not mvapich2. Vlad, Please pull from: git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_2 For the fix to 520. Thanks, Steve.
----

commit 828671e2e902de5f39ad3a10d342d6eb25db3134
Author: root
Date:   Wed Apr 4 20:02:00 2007 -0700

    Set the correct flit length in SEND WR with INLINE data.

    Signed-off-by: Steve Wise

diff --git a/src/qp.c b/src/qp.c
index 07ad37e..b77fb34 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -87,7 +87,7 @@ #endif
 				wr->sg_list[i].length);
 			datap += wr->sg_list[i].length;
 		}
-		*flit_cnt = 4 + (wqe->write.plen >> 3) + 1;
+		*flit_cnt = 4 + (wqe->send.plen >> 3) + 1;
 		wqe->send.plen = htonl(wqe->send.plen);
 	} else {
 		wqe->send.plen = 0;

From k_mahesh85 at yahoo.co.in Wed Apr 4 21:41:04 2007
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Thu, 5 Apr 2007 05:41:04 +0100 (BST)
Subject: [ofa-general] openSM design document required
Message-ID: <338634.92287.qm@web8316.mail.in.yahoo.com>

Hi,

whosoever owning the openSM currently can you please give me the
design document(not UN or RN) of it. If it is already exisiting
somewhere give me the links.

Please CC your rpelies to me .

thanks and regards,
Mahesh

---------------------------------
Here’s a new way to find what you're looking for - Yahoo! Answers
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From karun.sharma at qlogic.com Wed Apr 4 23:47:06 2007
From: karun.sharma at qlogic.com (Sharma, Karun)
Date: Thu, 5 Apr 2007 01:47:06 -0500
Subject: [ofa-general] RE: [ewg] FW: OFED 1.2 rc1 release
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
Message-ID:

Hi:

After installing OFED 1.2 RC1, I can see "uninstall.sh" in "/usr". "ofed_uninstall.sh" is also there under "/usr/sbin".

One more thing which is not mentioned in release notes is that libsdp.conf has been moved to /etc. If the "stack_prefix" is /usr, then "libsdp.conf" should be present at "/usr/etc".
Thanks Karun ________________________________ From: ewg-bounces at lists.openfabrics.org on behalf of Tziporet Koren Sent: Thu 4/5/2007 2:45 AM To: ewg at lists.openfabrics.org Cc: openib Subject: [ewg] FW: OFED 1.2 rc1 release Hi, OFED 1.2-RC1 is available on http://www.openfabrics.org/builds/ofed-1.2/ File: OFED-1.2-rc1.tgz BUILD_ID contains info on all packages sources location. Please report any issues in bugzilla https://bugs.openfabrics.org/ RC2 due date is 17-April Tziporet & Vlad ==================================================================================== Release information: OS support: Novell: - SLES 9.0 SP3 - SLES10 (and SP1 beta2 partially tested) Redhat: - Redhat EL4 up3 and up4 - Redhat EL5 kernel.org: - 2.6.20 - 2.6.19 Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from OFED-1.1-alpha: 1. The default prefix changed from /usr/local/ofed to /usr and this implied the following changes too: - uninstall.sh script renamed and replaced: /sbin/ofed_uninstall.sh - BUILD_ID is part of /bin/ofed_info script 2. 
Fixed 57 bugs (see attachment for all bugs fixed)

Major limitations and known issues:

bug_id  bug_severity  assigned_to                 short_short_desc
513     critical      bos at pathscale.com        error while installing ipath driver
420     critical      monil at voltaire.com       PKey table reordering caused by SM failover stops ipoib traffic
431     critical      mst at mellanox.co.il       IPoIB CM locks up server on SLES10/RHEL4 ppc64
465     critical      mst at mellanox.co.il       IPoIB CM HA fails after several hours of failovers
436     major         arlin.r.davis at intel.com  Intel MPI and HP MPI DDR bandwidth dropped after OFED 1.2 alpha
406     major         eitan at mellanox.co.il     "double free" abort in ibdaigui
503     major         halr at voltaire.com        Linux distributions Interoperability of IPoIB IPv6 does not work
459     major         monis at voltaire.com       support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better
484     major         mst at dev.mellanox.co.il   mstflint -d mthca0 fails on ppc64
506     major         mst at mellanox.co.il       IPoIB IPv4 multicast throughput is poor
508     major         mst at mellanox.co.il       IPoIB CM multicast is hogging interrupts
438     major         rolandd at cisco.com        OFED SRP does not work with DDN IB storage large LUNs
464     major         rolandd at cisco.com        release libibverbs-1.1 final before OFED 1.2
509     major         tziporet at mellanox.co.il  turn on IPoIB CM by default
499     major         vlad at mellanox.co.il      module compiled over ofed won't load due to symbol version mismatch

See bugzilla for all open issues.

Tasks that should be completed for RC2:
1. Support SLES10 SP1 RC1
2. Stabilize IPoIB CM on PPC and with HA.
3. Fix all blocker, critical and major bugs

Note: Vlad, Michael and myself are on vacation now and will return to work only Tuesday next week. We will not have an email access.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From vlad at lists.openfabrics.org Thu Apr 5 02:35:49 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 5 Apr 2007 02:35:49 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070405-0200 daily build status
Message-ID: <20070405093549.6E2DDE60826@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:

From k_mahesh85 at yahoo.co.in Thu Apr 5 04:05:09 2007
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Thu, 5 Apr 2007 12:05:09 +0100 (BST)
Subject: [ofa-general] openSM design document required
In-Reply-To: <1175769853.14140.2264.camel@localhost.localdomain>
Message-ID: <816901.26646.qm@web8323.mail.in.yahoo.com>

ya I found that old document here http://infiniband.sourceforge.net/LinuxSAS.1.0.1.pdf. its dated 8/1/2002 ..pretty old.

Is there any effort going on to design a latest openSM design document as such?

actually i want to know how the subnet discovery in detail using directed route SMPs is implemented in openSM. Acc. to the spec. SM it discovers the subnet topology using directed route SMPs. How exactly it is done in openSM.

thanks and regards,
Mahesh

Hal Rosenstock wrote:
Mahesh,

On Thu, 2007-04-05 at 00:41, keshetti mahesh wrote:
> Hi,
>
> whosoever owning the openSM currently can you please give me the
> design document(not UN or RN) of it. If it is already exisiting
> somewhere give me the links.
>
> Please CC your rpelies to me .

I am the maintainer of OpenSM and there is no current design document. There may be a very old one from the Intel IBAL days. If that exists somewhere, it's got to get at least 5 years old (and quite incomplete and out of date).

-- Hal

> thanks and regards,
> Mahesh
>
>
> ______________________________________________________________________
> Heres a new way to find what you're looking for - Yahoo!
Answers > > ______________________________________________________________________ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From halr at voltaire.com Thu Apr 5 05:56:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2007 08:56:28 -0400 Subject: [ofa-general] openSM design document required In-Reply-To: <816901.26646.qm@web8323.mail.in.yahoo.com> References: <816901.26646.qm@web8323.mail.in.yahoo.com> Message-ID: <1175777788.14140.10704.camel@localhost.localdomain> On Thu, 2007-04-05 at 07:05, keshetti mahesh wrote: > ya I found that old document here > http://infiniband.sourceforge.net/LinuxSAS.1.0.1.pdf. its dated > 8/1/2002 ..pretty > old. > > Is there any effort going on to design a latest openSM design document > as such? Nope; not that I'm aware of. > actually i want to know how the subnet discovery in detail using > directed route > SMPs is implemented in openSM. > Acc. to the spec. SM it discovers the subnet topology using directed > route > SMPs. How exactly it is done in openSM. OpenSM walks out a hop at a time starting at hop 0 based on what node type is found at that hop level. It determines whether to walk out a switch link based on whether the port state is down or not. Have you looked at osm_state_mgr.c ? -- Hal > thanks and regards, > Mahesh > > > > > Hal Rosenstock wrote: > Mahesh, > > On Thu, 2007-04-05 at 00:41, keshetti mahesh wrote: > > Hi, > > > > whosoever owning the openSM currently can you please give me > the > > design document(not UN or RN) of it. If it is already > exisiting > > somewhere give me the links. > > > > Please CC your rpelies to me .
> > I am the maintainer of OpenSM and there is no current design > document. > There may be a very old one from the Intel IBAL days. If that > exists > somewhere, it's got to get at least 5 years old (and quite > incomplete > and out of date). > > -- Hal > > > thanks and regards, > > Mahesh > > > > > > > ______________________________________________________________________ > > Heres a new way to find what you're looking for - Yahoo! > Answers > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > ______________________________________________________________________ > Heres a new way to find what you're looking for - Yahoo! Answers From tom at opengridcomputing.com Thu Apr 5 07:45:33 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 05 Apr 2007 09:45:33 -0500 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <20070402060816.GC5072@mellanox.co.il> References: <1175371057.19974.8.camel@trinity.ogc.int> <20070401064320.GX5436@mellanox.co.il> <1175467474.31135.18.camel@trinity.ogc.int> <20070402060816.GC5072@mellanox.co.il> Message-ID: <1175784333.18389.30.camel@trinity.ogc.int> On Mon, 2007-04-02 at 09:08 +0300, Michael S. Tsirkin wrote: > > On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: [...snip...] > I think that if we extend the API, we need to design it carefully > to cover as many use cases as possible. > Tom, could you explain what are you trying to do? > Why does your application need as many SGEs as possible? > Mike: The application is NFS-RDMA. NFS keeps it's data as non-contiguous arrays of pages. 
So the motivation is that having a larger SGL allows you to support larger data transfers with a single operation. The challenge with the current query/request method is that as we've discussed the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The recv sge? The qp depth? So what I'm proposing, and I think is similar if not identical to what other folks have talked about is having an interface that treats the qp_attr values as requested-sizes that can be adjusted by the provider. So for example, if I ask for a send_sge of 30, but you can only do 28, you give me 28 and adjust the qp_attr structure so that I know what I got. This would allow me to perform a predictable sequence of 1. query, 2. request, 3. adjust in my code. BTW, I think it needs to be new provider method to be done efficiently. Also, what's a good name, ib_request_qp? Thanks, Tom > Also - what about out of resources cases described above? > Would you expect the verbs API to retry the request for you? > From dledford at redhat.com Thu Apr 5 07:57:28 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 05 Apr 2007 10:57:28 -0400 Subject: [ofa-general] Location and naming of RDMA stack enablement rpm Message-ID: <46150E58.40306@redhat.com> Hey Roland, I cc:ed you directly on this because I'm going to propose something that involves you ;-) Right now, in the OFED packaging, there are extra files added to the overall stack that aren't currently part of any base RPM. I'm mainly talking about things like the /etc/udev.d/rules/90-ib.rules, /etc/init.d/openibd, etc. These files belong to none of the upstream rpms, yet they (or administrator hand edited equivalents) are required for the stack to work well. 
Since both prior to the IB/iWARP merge and after, libibverbs is required for most functionality to operate at all, I would propose that those basic startup files be included in the libibverbs rpm. The list of files I'm proposing to add to libibverbs would look something like this: %dir %{_sysconfdir}/ofed %config(noreplace) %{_sysconfdir}/ofed/openib.conf %{_initrddir}/openibd /etc/udev/rules.d/90-ib.rules The list isn't very big actually, but one important item I also wanted to discuss is the %dir entry above. During the initial package review I was subjected to when submitting the openib package for inclusion in Red Hat, the request was made that we put all the related config files in /etc into their own subdirectory. Whatever base package includes the openibd init script should also own that directory according to the rpm database. Now, at the time I put our openib package together, I chose /etc/ofed because A) ofed is what the package is, and B) there wasn't a consensus here about where it should go (other than /usr/local). So, on top of proposing that these items go into libibverbs, I'd like to request that we reach a consensus on what name to use in /etc for consolidating these config files and put all the reasonably related config files in that directory. For example, the dat.conf should go in there, as well as opensm.conf, libsdp.conf, and openibd.conf. However, I would not recommend placing the various mpi config files under there as these are fully functional, stand alone applications that can run with or without the RDMA stack underneath it. That being said, I'll say that my preference for the name of the directory is /etc/ofa. I prefer ofa over ofed because eventually this stack should be buildable package by package without doing a big conglomerate build of everything. In fact, I'm currently going through git repos and making changes to the head of each repo to enable the packages to be built easily by themselves via rpm spec file rules. 
Under that sort of build environment, ofed is misleading while ofa is accurate. So, that's it. In a nutshell: add the basic kernel setup scripts for user space to the libibverbs RPM, and get consensus on an official /etc directory for use by all relevant RPMs in the RDMA stack, with my vote going to /etc/ofa. -- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs can be found at http://people.redhat.com/dledford/Infiniband
From jian at us.ibm.com Thu Apr 5 08:25:53 2007 From: jian at us.ibm.com (Jian Xiao) Date: Thu, 5 Apr 2007 11:25:53 -0400 Subject: [ofa-general] bonding over ipoib - where to get it? Message-ID: Hi, I heard that there is some version of bonding working over ipoib. Moni Shoua provided some patches from the ipoib side. The latest patch is dated 3/28/2007. Which kernel version is this patch made for? On the other hand, I haven't seen any changes to the bonding driver. Does anyone know where I could get a patch for that? Thanks. Jian Xiao RS/6000 SP CSS Adapter Development Office: 414/2-15 Phone: 433-4086 (t/l 293)
From sweitzen at cisco.com Thu Apr 5 09:11:35 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 5 Apr 2007 09:11:35 -0700 Subject: [ofa-general] RE: [ewg] FW: OFED 1.2 rc1 release In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> Message-ID: After installing OFED 1.2 RC1, I can see "uninstall.sh" in "/usr". "ofed_uninstall.sh" is also there under "/usr/sbin". [Scott Weitzenkamp (sweitzen)] I opened bug 522 for this yesterday. One more thing which is not mentioned in release notes is that libsdp.conf has been moved to /etc. If the "stack_prefix" is /usr, then "libsdp.conf" should be present at "/usr/etc". [Scott Weitzenkamp (sweitzen)] I disagree about using /usr/etc, see bug 481. /usr/etc is empty on RHEL4/RHEL5/SLES10.
If we are installing in /usr, we should use existing directories like /etc, /usr/share/man, and /usr/share/doc, not /usr/etc, /usr/man, and /usr/doc. Scott -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Thu Apr 5 09:27:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Apr 2007 09:27:28 -0700 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <1175784333.18389.30.camel@trinity.ogc.int> Message-ID: <000001c7779f$4d1f49b0$ff0da8c0@amr.corp.intel.com> >The challenge with the current query/request method is that as we've >discussed the advertised max may not work. What makes the adjust/retry >unworkable is that you don't know which of the advertised maxes caused >the request to fail. So when you retry, which qp_attr do you adjust? The >send sge? The recv sge? The qp depth? > >So what I'm proposing, and I think is similar if not identical to what >other folks have talked about is having an interface that treats the >qp_attr values as requested-sizes that can be adjusted by the provider. >So for example, if I ask for a send_sge of 30, but you can only do 28, >you give me 28 and adjust the qp_attr structure so that I know what I >got. This would allow me to perform a predictable sequence of 1. query, >2. request, 3. adjust in my code. If the send sge/recv sge/qp depth/etc. aren't independent though, this pushes the problem and policy decision down to the provider. I can't think of an easy solution to this. - Sean From sweitzen at cisco.com Thu Apr 5 09:37:13 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 5 Apr 2007 09:37:13 -0700 Subject: [ofa-general] added 1.2rc1 version to OF bugzilla Message-ID: Bugs can now be filed against OFED 1.2rc1. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom at opengridcomputing.com Thu Apr 5 09:41:18 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 05 Apr 2007 11:41:18 -0500 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <000001c7779f$4d1f49b0$ff0da8c0@amr.corp.intel.com> References: <000001c7779f$4d1f49b0$ff0da8c0@amr.corp.intel.com> Message-ID: <1175791278.18389.54.camel@trinity.ogc.int> On Thu, 2007-04-05 at 09:27 -0700, Sean Hefty wrote: > >The challenge with the current query/request method is that as we've > >discussed the advertised max may not work. What makes the adjust/retry > >unworkable is that you don't know which of the advertised maxes caused > >the request to fail. So when you retry, which qp_attr do you adjust? The > >send sge? The recv sge? The qp depth? > > > >So what I'm proposing, and I think is similar if not identical to what > >other folks have talked about is having an interface that treats the > >qp_attr values as requested-sizes that can be adjusted by the provider. > >So for example, if I ask for a send_sge of 30, but you can only do 28, > >you give me 28 and adjust the qp_attr structure so that I know what I > >got. This would allow me to perform a predictable sequence of 1. query, > >2. request, 3. adjust in my code. > > If the send sge/recv sge/qp depth/etc. aren't independent though, this pushes > the problem and policy decision down to the provider. I can't think of an easy > solution to this. Agreed. But practically I think they are. I think the SGE max is driven off the max size of a WR and type of QP. This is true of the iWARP adapters as well. But taking the bait...even if you didn't push it down to the provider, how do you expose the inter-relationships to the consumer? An approach in this vein is a "could_you_would_you/why_not" interface that would return whether or not the specified qp_attr would work and if it didn't some indication of which resource(s) caused the problem. 
The problems there are a) the resource may be gone when you go back with what you just had "approved", and b) you still have to fuss with multiple whacks at it if you couldn't get what you asked for. I think something simpler, although arguably not perfect is the way to go. Tom > > - Sean From rdreier at cisco.com Thu Apr 5 09:46:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 09:46:41 -0700 Subject: [ofa-general] Re: [PATCH 1/1] IB/iser: do not switch context when notifying the iSCSI layer on a connection failure In-Reply-To: <460F8F37.3090204@voltaire.com> (Erez Zilber's message of "Sun, 01 Apr 2007 12:53:43 +0200") References: <46064813.6070208@voltaire.com> <46064A78.5050005@voltaire.com> <4607B9BB.80407@voltaire.com> <460F8F37.3090204@voltaire.com> Message-ID: Thanks, I applied this for 2.6.21 From rdreier at cisco.com Thu Apr 5 09:51:22 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 09:51:22 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: <1175704326.4436.202892.camel@localhost.localdomain> (Hal Rosenstock's message of "04 Apr 2007 12:32:08 -0400") References: <1175704326.4436.202892.camel@localhost.localdomain> Message-ID: > OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, > isSMdisabled also indicates that an SM is present so poll SMInfo I was getting ready to review the issmdisabled stuff, but now I realize I don't understand what IsSMDisabled does. I thought that if IsSMDisabled is set (is enabled? :) on a port, then a Get(SMInfo) on that port must be dropped. So can you explain why you want this change to OpenSM? What's the point of polling ports with IsSMDisabled set? In general, what's the real point in setting IsSMDisabled on a port? - R. 
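For readers following the IsSMDisabled question above, here is a minimal sketch of how the two SM-related PortInfo CapabilityMask bits could be decoded, together with the polling rule that the patch subject line describes. The bit positions (IsSM as bit 1, IsSMdisabled as bit 10) follow the IBA PortInfo layout; the function names and return strings are hypothetical, and this is a Python illustration, not OpenSM code.

```python
# Illustrative only: not OpenSM code. Bit positions follow the IBA
# PortInfo:CapabilityMask layout (IsSM = bit 1, IsSMdisabled = bit 10).
IS_SM = 1 << 1
IS_SM_DISABLED = 1 << 10


def sm_status(cap_mask: int) -> str:
    """Classify a port by its SM-related capability bits (hypothetical labels)."""
    if cap_mask & IS_SM:
        return "sm-present"
    if cap_mask & IS_SM_DISABLED:
        return "sm-disabled"
    return "no-sm"


def should_poll_sminfo(cap_mask: int) -> bool:
    """Polling rule as worded in the patch subject: either bit is taken to
    mean that an SM exists on the port, so SMInfo is polled."""
    return bool(cap_mask & (IS_SM | IS_SM_DISABLED))
```

Whether a port that sets only IsSMdisabled will actually answer such a poll is the point the rest of the thread goes on to argue about.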
From rdreier at cisco.com Thu Apr 5 09:54:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 09:54:18 -0700 Subject: [ofa-general] [PATCH][MINOR] IB/umad: Fix declaration of dev_map In-Reply-To: <1175532311.4436.21673.camel@localhost.localdomain> (Hal Rosenstock's message of "02 Apr 2007 12:45:16 -0400") References: <1175532311.4436.21673.camel@localhost.localdomain> Message-ID: Thanks, applied to 2.6.22 From rdreier at cisco.com Thu Apr 5 09:56:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 09:56:33 -0700 Subject: [ofa-general] mthca wc->opcode for CQEs with error status In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE06119203C27@EPEXCH2.qlogic.org> (Todd Rimmer's message of "Sat, 31 Mar 2007 09:31:29 -0500") References: <4FB1BCCAE6CAED44A1DC005B1DE06119203C27@EPEXCH2.qlogic.org> Message-ID: > To aid error messages and port of some applications it would be better > if wc->opcode could at least indicate if the failed CQE was for the RQ > or SQ. I disagree. The spec is very clear on this point and I don't see any reason to bloat driver code to work around buggy applications. In fact I would support removing the population of error work completions from other drivers if it shrinks the code. - R. From rdreier at cisco.com Thu Apr 5 10:06:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 10:06:33 -0700 Subject: [ofa-general] Re: Location and naming of RDMA stack enablement rpm In-Reply-To: <46150E58.40306@redhat.com> (Doug Ledford's message of "Thu, 05 Apr 2007 10:57:28 -0400") References: <46150E58.40306@redhat.com> Message-ID: > Right now, in the OFED packaging, there are extra files added to the > overall stack that aren't currently part of any base RPM. I'm mainly > talking about things like the /etc/udev.d/rules/90-ib.rules, > /etc/init.d/openibd, etc. These files belong to none of the upstream > rpms, yet they (or administrator hand edited equivalents) are required > for the stack to work well. 
> > Since both prior to the IB/iWARP merge and after, libibverbs is > required for most functionality to operate at all, I would propose > that those basic startup files be included in the libibverbs rpm. That doesn't make sense to me. For example, the udev rules really belong in whatever distro package supplies the rest of the udev rules. Similarly, it doesn't make sense to me to have a startup script in libibverbs, since libibverbs has nothing to do with what's being started. There are a couple of reasons why I feel this way. First, it's completely sane to have a system that only runs an SM, or SDP/libsdp, or something like that -- and in that case there's no reason to install libibverbs at all. Second, I don't want to maintain unrelated distribution-specific stuff in libibverbs just because it's a convenient dumping ground. My solution would be to create a package to hold all the miscellaneous stuff, maybe something like openfabrics-base-support, and then make the other packages depend on that so it gets installed when it needs to. > So, on top of proposing that these items go into libibverbs, I'd like > to request that we reach a consensus on what name to use in /etc for > consolidating these config files and put all the reasonably related > config files in that directory. For example, the dat.conf should go > in there, as well as opensm.conf, libsdp.conf, and openibd.conf. > However, I would not recommend placing the various mpi config files > under there as these are fully functional, stand alone applications > that can run with or without the RDMA stack underneath it. > > That being said, I'll say that my preference for the name of the > directory is /etc/ofa. I prefer ofa over ofed because eventually this > stack should be buildable package by package without doing a big > conglomerate build of everything. 
In fact, I'm currently going > through git repos and making changes to the head of each repo to > enable the packages to be built easily by themselves via rpm spec file > rules. Under that sort of build environment, ofed is misleading while > ofa is accurate. I think it makes sense to get rid of the name /etc/ofed. I would suggest /etc/openfabrics instead of /etc/ofa, since it's more self-explanatory -- if I see /etc/ofa it's not instantly obvious who's responsible for it. I'll add as a note that these issues seem to come from the continuing confusion between a "release" and a "distribution", and that things would be a lot clearer if there were an upstream openfabrics release that both OFED, Red Hat, etc could package according to their own needs. (Although the /etc/ directory name should be decided outside of the distributions so that there's some uniformity) - R. From halr at voltaire.com Thu Apr 5 10:06:43 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2007 13:06:43 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: References: <1175704326.4436.202892.camel@localhost.localdomain> Message-ID: <1175792802.14140.27052.camel@localhost.localdomain> On Thu, 2007-04-05 at 12:51, Roland Dreier wrote: > > OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, > > isSMdisabled also indicates that an SM is present so poll SMInfo > > I was getting ready to review the issmdisabled stuff, but now I > realize I don't understand what IsSMDisabled does. I thought that if > IsSMDisabled is set (is enabled? :) on a port, then a Get(SMInfo) on > that port must be dropped. No; the requirements are that it not send SubnSet/Get but still needs to respond to them. > So can you explain why you want this > change to OpenSM? What's the point of polling ports with IsSMDisabled > set? 
> > In general, what's the real point in setting IsSMDisabled on a port? IsSMdisabled goes in concert with the behavior above. See IBA 1.2 14.4.8 (p. 879). -- Hal > - R. From rdreier at cisco.com Thu Apr 5 10:20:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 10:20:26 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: <1175792802.14140.27052.camel@localhost.localdomain> (Hal Rosenstock's message of "05 Apr 2007 13:06:43 -0400") References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> Message-ID: > No; the requirements are that it not send SubnSet/Get but still needs to > respond to them. > IsSMdisabled goes in concert with the behavior above. See IBA 1.2 14.4.8 > (p. 879). OK, I just looked. Doesn't C14-70 say that SubnGet(SMInfo) _shall_ be discarded if IsSMDisabled is asserted (in addition to not sending any such queries)? I think I'm missing the point of IsSMDisabled. What I was looking for was the motivation behind the spec. So can you give a quick example of some simple situation where setting IsSMDisabled helps me make my network work better? - R. From rdreier at cisco.com Thu Apr 5 10:24:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 10:24:33 -0700 Subject: [ofa-general] [ANNOUNCE] libibverbs 1.1-rc2 released Message-ID: I just tagged the 1.1-rc2 release of libibverbs and pushed it out to my git tree on kernel.org: git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git (the name of the tag is libibverbs-1.1-rc2). [actual I did this yesterday but forgot to send the announcement...] 
I've also copied a tarball into my home directory on openfabrics.org, with sha1sum: 2ba097c270cfdc3d47486b975e775c734cc1d323 libibverbs-1.1-rc2.tar.gz I would appreciate it if someone with access could move this into the right directory to appear in
This release is the second release candidate for a major release cycle for libibverbs, so full compatibility with earlier libibverbs 1.0 releases is not preserved. Low-level device drivers will need to be rebuilt to work with libibverbs 1.1. However, a versioned ABI is provided so that applications dynamically linked with libibverbs 1.0 should work. I don't know of any major source level incompatibilities that would prevent an application that compiles against libibverbs 1.0 from building and working with libibverbs 1.1. I believe that libibverbs 1.1 is quite stable, since prereleases of the 1.1 tree have been shipped in the OFED 1.2 prereleases without any significant bug reports, and no major changes have gone into the tree for quite some time. Therefore I think a realistic schedule would be libibverbs 1.1 final in one week (April 11). So please test and let me know if there's anything you believe needs to go in before libibverbs 1.1 is released.
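Anyone fetching the tarball may want to compare it against the sha1sum quoted in the announcement. The following is a small sketch of one way to do that with Python's standard `hashlib`; the helper names are hypothetical and the file is assumed to have been downloaded to the current directory already.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_announced(path: str, announced: str) -> bool:
    """Compare a file's digest with the value quoted in an announcement."""
    return sha1_of_file(path) == announced.strip().lower()
```

For this release the call would be `matches_announced("libibverbs-1.1-rc2.tar.gz", "2ba097c270cfdc3d47486b975e775c734cc1d323")`.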
A git shortlog of the changes since libibverbs-1.1-rc1 is below: Roland Dreier (5): Bump version number to 1.1-rc2-pre1 Update README now that 1.1 ABI is (semi-)frozen Print warning if memlock limit is low Clean up spec file Roll libibverbs 1.1-rc2 release From pw at osc.edu Thu Apr 5 10:25:26 2007 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 5 Apr 2007 13:25:26 -0400 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <1175784333.18389.30.camel@trinity.ogc.int> References: <1175371057.19974.8.camel@trinity.ogc.int> <20070401064320.GX5436@mellanox.co.il> <1175467474.31135.18.camel@trinity.ogc.int> <20070402060816.GC5072@mellanox.co.il> <1175784333.18389.30.camel@trinity.ogc.int> Message-ID: <20070405172526.GB24739@osc.edu> tom at opengridcomputing.com wrote on Thu, 05 Apr 2007 09:45 -0500: > The challenge with the current query/request method is that as we've > discussed the advertised max may not work. What makes the adjust/retry > unworkable is that you don't know which of the advertised maxes caused > the request to fail. So when you retry, which qp_attr do you adjust? The > send sge? The recv sge? The qp depth? As an aside, we discussed this topic in June 2006. See the thread http://lists.openfabrics.org/pipermail/general/2006-June/thread.html#23417 for some insightful comments from MST and Tom Talpey. No conclusion was reached regarding the ideal form of the API. -- Pete From todd.rimmer at qlogic.com Thu Apr 5 10:27:29 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 5 Apr 2007 12:27:29 -0500 Subject: [ofa-general] mthca wc->opcode for CQEs with error status In-Reply-To: Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119203F60@EPEXCH2.qlogic.org> Roland, > From: Roland Dreier > I disagree. The spec is very clear on this point and I don't see any > reason to bloat driver code to work around buggy applications. 
In > fact I would support removing the population of error work completions > from other drivers if it shrinks the code. I don't understand why you are taking such a non-cooperative posture for a simple request. All hardware models support this capability and it's a 1 line change for mthca to parallel the other drivers. Most previous stacks, including VAPI, had this capability. While I agree applications should be coded strictly to the spec, that has not stopped us from putting non-standard features into OFED, so why now? FMR is just one such example. In a quick review of existing OFED 1.2 code, there are a number of places where debug and diagnostic messages output status and opcode, ipoib_ib.c is one such place. Having such messages indicate at least the direction of the failed transfer can be invaluable to debug. Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com From halr at voltaire.com Thu Apr 5 10:25:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2007 13:25:28 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> Message-ID: <1175793927.14140.28223.camel@localhost.localdomain> On Thu, 2007-04-05 at 13:20, Roland Dreier wrote: > > No; the requirements are that it not send SubnSet/Get but still needs to > > respond to them. > > > IsSMdisabled goes in concert with the behavior above. See IBA 1.2 14.4.8 > > (p. 879). > > OK, I just looked. Doesn't C14-70 say that SubnGet(SMInfo) _shall_ be > discarded if IsSMDisabled is asserted (in addition to not sending any > such queries)? Yes, I didn't read carefully before I typed... > I think I'm missing the point of IsSMDisabled. 
What I was looking for > was the motivation behind the spec. So can you give a quick example > of some simple situation where setting IsSMDisabled helps me make my > network work better? C14-69 describes one such scenario where the SM is disabled out of band and the only way other SMs know there is an inactive SM there is via this capmask bit as it will not respond to SMInfo. -- Hal > - R. From rdreier at cisco.com Thu Apr 5 10:48:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 10:48:42 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: <1175793927.14140.28223.camel@localhost.localdomain> (Hal Rosenstock's message of "05 Apr 2007 13:25:28 -0400") References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> Message-ID: > C14-69 describes one such scenario where the SM is disabled out of band > and the only way other SMs know there is an inactive SM there is via > this capmask bit as it will not respond to SMInfo. I guess that's the central part of my confusion -- why do I care about a disabled SM? What does another SM do differently depending on whether a given port has a disabled SM or no SM at all? - R. From rdreier at cisco.com Thu Apr 5 11:01:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 11:01:32 -0700 Subject: [ofa-general] mthca wc->opcode for CQEs with error status In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE06119203F60@EPEXCH2.qlogic.org> (Todd Rimmer's message of "Thu, 5 Apr 2007 12:27:29 -0500") References: <4FB1BCCAE6CAED44A1DC005B1DE06119203F60@EPEXCH2.qlogic.org> Message-ID: > I don't understand why you are taking such a non-cooperative posture for > a simple request. 
All hardware models support this capability and it's > a 1 line change for mthca to parallel the other drivers. > > Most previous stacks, including VAPI, had this capability. > > While I agree applications should be coded strictly to the spec, that > has not stopped us from putting non-standard features into OFED, so why > now? FMR is just one such example. If this were some feature that allowed us to do something new, or made applications more efficient, or something like that, I'd be all for it, specs be damned. But in this case it's just bloating driver code to work around buggy applications. And I'd rather use my I$ for something more useful. (And in fact the proposed change is itself buggy -- it calls any completion on the send queue a send, even if it was actually something else like RDMA read/write, atomic, etc) > In a quick review of existing OFED 1.2 code, there are a number of > places where debug and diagnostic messages output status and opcode, > ipoib_ib.c is one such place. Having such messages indicate at least > the direction of the failed transfer can be invaluable to debug. Actually the ipoib example at least is a place where printing the opcode is just pointless -- the message already says whether it's a send or a receive, and the opcode field is at best redundant. I'll queue a patch for 2.6.22 to clean that up. Are there any other places where the consumer doesn't already know whether the failed completion came from the send queue or the receive queue? - R. 
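[Editorial sketch, not part of the original message.] The closing question above assumes the consumer can tell what failed from its own records. One way that could look, assuming the application stores what it posted in a table indexed by wr_id: the names `req_kind`, `req_table`, `track_request` and `failed_request_kind` are hypothetical and belong to neither libibverbs nor mthca. The only thing taken from the verbs interface is that wr_id is still valid on a failed completion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical application-side bookkeeping. None of these names exist
 * in libibverbs or in the mthca driver. */
enum req_kind { REQ_SEND, REQ_RDMA_WRITE, REQ_RDMA_READ, REQ_ATOMIC, REQ_RECV };

#define MAX_OUTSTANDING 64

struct req_table {
	enum req_kind kind[MAX_OUTSTANDING];
};

/* Remember what is being posted. The slot index doubles as the wr_id
 * that the caller would place in its work request. */
static uint64_t track_request(struct req_table *t, unsigned int slot,
			      enum req_kind kind)
{
	assert(slot < MAX_OUTSTANDING);
	t->kind[slot] = kind;
	return (uint64_t)slot;
}

/* On a failed completion, recover the kind of request from wr_id alone,
 * without consulting the opcode field of the completion. */
static enum req_kind failed_request_kind(const struct req_table *t,
					 uint64_t wr_id)
{
	assert(wr_id < MAX_OUTSTANDING);
	return t->kind[wr_id];
}

/* Human-readable name for diagnostics. */
static const char *req_kind_name(enum req_kind kind)
{
	switch (kind) {
	case REQ_SEND:       return "send";
	case REQ_RDMA_WRITE: return "rdma write";
	case REQ_RDMA_READ:  return "rdma read";
	case REQ_ATOMIC:     return "atomic";
	case REQ_RECV:       return "recv";
	}
	return "unknown";
}
```

A debug message printed on error could then use `req_kind_name(failed_request_kind(&table, wc.wr_id))`, which also distinguishes RDMA and atomic operations from plain sends.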
From halr at voltaire.com Thu Apr 5 11:02:49 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2007 14:02:49 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo
In-Reply-To:
References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain>
Message-ID: <1175796168.14140.30621.camel@localhost.localdomain>

On Thu, 2007-04-05 at 13:48, Roland Dreier wrote:
> > C14-69 describes one such scenario where the SM is disabled out of band
> > and the only way other SMs know there is an inactive SM there is via
> > this capmask bit as it will not respond to SMInfo.
>
> I guess that's the central part of my confusion -- why do I care about
> a disabled SM? What does another SM do differently depending on
> whether a given port has a disabled SM or no SM at all?

I think there are conflicting compliances. p.865 C14-53 and C14-54.1.1 state the behavior I originally said (a not-active SM responds to SMInfo gets/sets). I think this supersedes the first bullet in C14-70 which says incoming SMInfos are dropped.

With that interpretation, does this make sense now?

-- Hal

> - R.

From dledford at redhat.com Thu Apr 5 11:11:23 2007
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 05 Apr 2007 14:11:23 -0400
Subject: [ofa-general] Re: Location and naming of RDMA stack enablement rpm
In-Reply-To:
References: <46150E58.40306@redhat.com>
Message-ID: <46153BCB.2050003@redhat.com>

Roland Dreier wrote:
> > Right now, in the OFED packaging, there are extra files added to the
> > overall stack that aren't currently part of any base RPM. I'm mainly
> > talking about things like the /etc/udev.d/rules/90-ib.rules,
> > /etc/init.d/openibd, etc.
These files belong to none of the upstream > > rpms, yet they (or administrator hand edited equivalents) are required > > for the stack to work well. > > > > Since both prior to the IB/iWARP merge and after, libibverbs is > > required for most functionality to operate at all, I would propose > > that those basic startup files be included in the libibverbs rpm. > > That doesn't make sense to me. For example, the udev rules really > belong in whatever distro package supplies the rest of the udev rules. The upstream udev maintainer is tired of trying to maintain a list of all the right rules. He wants people that need rules to supply their own files under the /etc/udev/rules.d directory. > Similarly, it doesn't make sense to me to have a startup script in > libibverbs, since libibverbs has nothing to do with what's being > started. Well, yes and no. The standard startup script fires up the kernel verbs modules (as well as others). But, I can see the point. > There are a couple of reasons why I feel this way. First, it's > completely sane to have a system that only runs an SM, or SDP/libsdp, > or something like that -- and in that case there's no reason to > install libibverbs at all. Second, I don't want to maintain unrelated > distribution-specific stuff in libibverbs just because it's a > convenient dumping ground. > Fair enough. > My solution would be to create a package to hold all the miscellaneous > stuff, maybe something like openfabrics-base-support, and then make > the other packages depend on that so it gets installed when it needs to. I can agree with that. > > So, on top of proposing that these items go into libibverbs, I'd like > > to request that we reach a consensus on what name to use in /etc for > > consolidating these config files and put all the reasonably related > > config files in that directory. For example, the dat.conf should go > > in there, as well as opensm.conf, libsdp.conf, and openibd.conf. 
> > However, I would not recommend placing the various mpi config files
> > under there as these are fully functional, stand alone applications
> > that can run with or without the RDMA stack underneath it.
> >
> > That being said, I'll say that my preference for the name of the
> > directory is /etc/ofa. I prefer ofa over ofed because eventually this
> > stack should be buildable package by package without doing a big
> > conglomerate build of everything. In fact, I'm currently going
> > through git repos and making changes to the head of each repo to
> > enable the packages to be built easily by themselves via rpm spec file
> > rules. Under that sort of build environment, ofed is misleading while
> > ofa is accurate.
>
> I think it makes sense to get rid of the name /etc/ofed. I would
> suggest /etc/openfabrics instead of /etc/ofa, since it's more
> self-explanatory -- if I see /etc/ofa it's not instantly obvious who's
> responsible for it.

Well, another option I didn't mention in my previous mail is to do away with group specific naming and go with functionality specific naming, aka /etc/rdma since it's all rdma related. Kind of like how /etc/sendmail went to /etc/mail.

> I'll add as a note that these issues seem to come from the continuing
> confusion between a "release" and a "distribution", and that things
> would be a lot clearer if there were an upstream openfabrics release
> that both OFED, Red Hat, etc could package according to their own needs.
> (Although the /etc/ directory name should be decided outside of the
> distributions so that there's some uniformity)

Agree 100%. Part of the changes I submitted to Hal for review included a simple make.dist script that automates the process of checking for a unique release number, tagging the git repo with the release number, and parsing the spec.in file into a final version suitable for rebuild with just rpmbuild --ta tarball, then packs it all up into the tarball, ready to be downloaded.

-- 
Doug Ledford
http://people.redhat.com/dledford

Infiniband specific RPMs can be found at
http://people.redhat.com/dledford/Infiniband

From rdreier at cisco.com Thu Apr 5 11:16:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Apr 2007 11:16:10 -0700
Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo
In-Reply-To: <1175796168.14140.30621.camel@localhost.localdomain> (Hal Rosenstock's message of "05 Apr 2007 14:02:49 -0400")
References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> <1175796168.14140.30621.camel@localhost.localdomain>
Message-ID:

> p.865 C14-53 and C14-54.1.1 state the behavior I originally said (a not-
> active SM responds to SMInfo gets/sets). I think this supersedes the
> first bullet in C14-70 which says incoming SMInfos are dropped.

That's something else -- it's talking about a running SM that is in the NOT-ACTIVE state, because the master SM disabled it via a SubnSet(SMInfo). But that wouldn't affect the IsSMDisabled bit, which is something different:

    C14-69: If a SM can reside on a port, a vendor defined, out-of-band mechanism shall be provided that when asserted will disable the capability of running a SM from that port and the state of the mechanism shall be indicated in the Portinfo:CapabilityMask.IsSMdisabled bit.

So if IsSMDisabled then an SM is forbidden from running at all. And I'm still confused -- why would anyone care whether a port has no SM running (ie IsSM is not asserted), or _really_ has no SM running (IsSM not asserted and IsSMDisabled asserted)?

 - R.

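[Editorial sketch, not part of the original message.] The distinction being asked about can be written down as a three-way classification of PortInfo:CapabilityMask. The bit positions below (IsSM at bit 1, IsSMdisabled at bit 10, host byte order) are an assumption based on the Linux kernel's port capability flags and should be checked against the IBA specification. `classify_port` and `sm_presence` are hypothetical names and do not appear in OpenSM.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CapabilityMask bit positions, host byte order. Verify against
 * the IBA spec before relying on them. */
#define CAP_IS_SM          (1u << 1)
#define CAP_IS_SM_DISABLED (1u << 10)

/* Hypothetical classification. OpenSM does not define this enum. */
enum sm_presence {
	SM_NONE,	/* neither bit set: no SM on this port */
	SM_RUNNING,	/* IsSM set: an SM is running, SMInfo can be polled */
	SM_DISABLED	/* IsSMdisabled set: SM turned off out of band */
};

static enum sm_presence classify_port(uint32_t cap_mask)
{
	/* Per the C14-69 text quoted above, the disabled bit reflects an
	 * out-of-band switch, so it is checked first. */
	if (cap_mask & CAP_IS_SM_DISABLED)
		return SM_DISABLED;
	if (cap_mask & CAP_IS_SM)
		return SM_RUNNING;
	return SM_NONE;
}
```

Whether another SM should treat `SM_DISABLED` any differently from `SM_NONE` is exactly the open question in this thread; the sketch only shows that the two cases are distinguishable from the capability mask.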
From halr at voltaire.com Thu Apr 5 11:17:25 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2007 14:17:25 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo
In-Reply-To:
References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> <1175796168.14140.30621.camel@localhost.localdomain>
Message-ID: <1175797044.14140.31536.camel@localhost.localdomain>

On Thu, 2007-04-05 at 14:16, Roland Dreier wrote:
> > p.865 C14-53 and C14-54.1.1 state the behavior I originally said (a not-
> > active SM responds to SMInfo gets/sets). I think this supersedes the
> > first bullet in C14-70 which says incoming SMInfos are dropped.
>
> That's something else -- it's talking about a running SM that is in
> the NOT-ACTIVE state, because the master SM disabled it via a
> SubnSet(SMInfo). But that wouldn't affect the IsSMDisabled bit, which
> is something different:

I put the two things together. Maybe that is wrong.

> C14-69: If a SM can reside on a port, a vendor defined, out-of-band
> mechanism shall be provided that when asserted will disable the
> capability of running a SM from that port and the state of the
> mechanism shall be indicated in the Portinfo:CapabilityMask.IsSMdisabled
> bit.
>
> So if IsSMDisabled then an SM is forbidden from running at all. And
> I'm still confused -- why would anyone care whether a port has no SM
> running (ie IsSM is not asserted), or _really_ has no SM running (IsSM
> not asserted and IsSMDisabled asserted)?

Good point. At a minimum, the spec is unclear about this (if they are totally separate mechanisms).

-- Hal

> - R.

From rdreier at cisco.com Thu Apr 5 11:24:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 11:24:33 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: <1175797044.14140.31536.camel@localhost.localdomain> (Hal Rosenstock's message of "05 Apr 2007 14:17:25 -0400") References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> <1175796168.14140.30621.camel@localhost.localdomain> <1175797044.14140.31536.camel@localhost.localdomain> Message-ID: > Good point. At a minimum, the spec is unclear about this (if they are > totally separate mechanisms). When is the spec ever clear? :) But I think the only interpretation that has a chance at matching the current spec is to say that IsSMDisabled is not directly related to an SM in the NOT-ACTIVE state. Maybe it's worth asking the WG what the motivation for introducing IsSMDisabled was? - R. From halr at voltaire.com Thu Apr 5 12:24:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2007 15:24:32 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> <1175796168.14140.30621.camel@localhost.localdomain> <1175797044.14140.31536.camel@localhost.localdomain> Message-ID: <1175801054.14140.35806.camel@localhost.localdomain> On Thu, 2007-04-05 at 14:24, Roland Dreier wrote: > > Good point. At a minimum, the spec is unclear about this (if they are > > totally separate mechanisms). > > When is the spec ever clear? 
:) > > But I think the only interpretation that has a chance at matching the > current spec is to say that IsSMDisabled is not directly related to an > SM in the NOT-ACTIVE state. > > Maybe it's worth asking the WG what the motivation for introducing > IsSMDisabled was? Yes, I've already done that. -- Hal > - R. From kilian at stanford.edu Thu Apr 5 12:31:26 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Thu, 5 Apr 2007 12:31:26 -0700 Subject: [ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES In-Reply-To: <200704041224.30499.kilian@stanford.edu> References: <200704031122.52838.kilian@stanford.edu> <46129D37.3020002@ichips.intel.com> <200704041224.30499.kilian@stanford.edu> Message-ID: <200704051231.27014.kilian@stanford.edu> On Wednesday 04 April 2007 12:24:30 pm Kilian CAVALOTTI wrote: > > >register failed 196608 [8] error(0x30000): OpenIB-cma: > > > DAT_INSUFFICIENT_RESOURCES: > > > > This error is typically a result of ulimit -l (max locked memory) > > being set too low. > > Well, I guess I've been a little too optimistic. I still get those error > messages after having removed the memlock limit (ulimit -l reports > unlimited on every host I use for the MPI job now). If it can be of any interest, I don't have any problem using Intel MPI 2.0 or LAM-MPI. So I guess the problem is more related to the Intel MPI 3.0 libraries. Is that a known issue? Thanks, -- Kilian From yong.qin at qlogic.com Thu Apr 5 12:47:24 2007 From: yong.qin at qlogic.com (Yong Qin) Date: Thu, 5 Apr 2007 14:47:24 -0500 Subject: [ofa-general] uDAPL question In-Reply-To: <000001c77710$8bf26850$4297070a@amr.corp.intel.com> References: <000001c77710$8bf26850$4297070a@amr.corp.intel.com> Message-ID: <120DDDEC0C4AA045B93BDB63FE45905E207B4A@EPEXCH1.qlogic.org> For the 64-bit 20070404 build and rc1, it did work. That turned out to be my own mistake, I didn't realize it needs more memlock than before. I just ran my testing script and it failed. 
After increasing the memlock, it has the same symptom as beta1 on 32-bit applications. I'll attach more info for bug 521.

Thanks,
Yong

-----Original Message-----
From: Arlin Davis [mailto:arlin.r.davis at intel.com]
Sent: Wednesday, April 04, 2007 7:26 PM
To: Woodruff, Robert J; Yong Qin; Hefty, Sean
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] uDAPL question

>
>With the nightly build 20070404, both 32-bit and 64-bit hung on
>RDMA_init. All the testing was done with Intel MPI 3.0.

The 20070404 build (64-bit) works fine on our Xeon clusters with RHEL 5, MT25208 adapters (4.7.4) and Intel MPI 3.0.

Can you send more log information so I can see if it is in the progress engine waiting for connections or data? You could also try running with "-env I_MPI_USE_DYNAMIC_CONNECTIONS 0" so that static connections are used and "-env I_MPI_RDMA_USE_EVD_FALLBACK 1" so that data transfers are validated via completions instead of polling memory. This may help us isolate your problems.

Also, what adapters are you using?

Thanks,

-arlin

From swise at opengridcomputing.com Thu Apr 5 13:15:59 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Apr 2007 15:15:59 -0500
Subject: [ofa-general] [PATCH ofed-1.2 libcxgb3] in-line fixes for cxgb3
Message-ID: <1175804159.15152.90.camel@stevo-desktop>

Vlad,

Please pull from

git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2

This commit fixes up all known inline issues with mvapich2 and cxgb3. Also needed is a change to mvapich2 which is coming soon from the mvapich2 folks. Both need to be pulled into the ofed kit together.

Thanks,

Steve.

---

commit e889105a95381ae41c0c83716ad8097ed25c8aae
Author: Steve Wise
Date: Thu Apr 5 11:40:41 2007 -0500

More inline fixes.

- Round up the flit count based on the inline data size.
- Limit the max inline to 64B.
- fixed union access errors.

Signed-off-by: Steve Wise diff --git a/src/cxio_wr.h b/src/cxio_wr.h index e8bafe0..7893259 100644 --- a/src/cxio_wr.h +++ b/src/cxio_wr.h @@ -41,6 +41,7 @@ #define T3_MAX_NUM_CQ (1<<15) #define T3_MAX_NUM_PD (1<<15) #define T3_MAX_NUM_STAG (1<<15) #define T3_MAX_SGE 4 +#define T3_MAX_INLINE 64 #define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) #define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ diff --git a/src/qp.c b/src/qp.c index b77fb34..f6098d2 100644 --- a/src/qp.c +++ b/src/qp.c @@ -42,6 +42,8 @@ #include #include "iwch.h" #include +#define ROUNDUP8(a) (((a) + 7) & ~7) + static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ibv_send_wr *wr, uint8_t *flit_cnt) { @@ -79,15 +81,15 @@ #endif datap = (uint8_t *)&wqe->send.sgl[0]; wqe->send.num_sgle = 0; /* indicates in-line data */ for (i = 0; i < wr->num_sge; i++) { - if ((wqe->send.plen + wr->sg_list[i].length) > 96) { + if ((wqe->send.plen + wr->sg_list[i].length) > + T3_MAX_INLINE) return -1; - } wqe->send.plen += wr->sg_list[i].length; memcpy(datap, (void *)wr->sg_list[i].addr, wr->sg_list[i].length); datap += wr->sg_list[i].length; } - *flit_cnt = 4 + (wqe->send.plen >> 3) + 1; + *flit_cnt = 4 + (ROUNDUP8(wqe->send.plen) >> 3); wqe->send.plen = htonl(wqe->send.plen); } else { wqe->send.plen = 0; @@ -132,21 +134,21 @@ static inline int iwch_build_rdma_write( datap = (uint8_t *)&wqe->write.sgl[0]; wqe->write.num_sgle = 0; /* indicates in-line data */ for (i = 0; i < wr->num_sge; i++) { - if ((wqe->write.plen + wr->sg_list[i].length) > 88) { + if ((wqe->write.plen + wr->sg_list[i].length) > + T3_MAX_INLINE) return -1; - } wqe->write.plen += wr->sg_list[i].length; memcpy(datap, (void *)wr->sg_list[i].addr, wr->sg_list[i].length); datap += wr->sg_list[i].length; } - *flit_cnt = 5 + (wqe->write.plen >> 3) + 1; + *flit_cnt = 5 + (ROUNDUP8(wqe->write.plen) >> 3); wqe->write.plen = htonl(wqe->write.plen); } else { wqe->write.plen = 0; for (i = 0; i < wr->num_sge; i++) { - if 
((wqe->send.plen + wr->sg_list[i].length) < - wqe->send.plen) { + if ((wqe->write.plen + wr->sg_list[i].length) < + wqe->write.plen) { return -1; } wqe->write.plen += wr->sg_list[i].length;
From sweitzen at cisco.com Thu Apr 5 14:16:25 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 5 Apr 2007 14:16:25 -0700 Subject: [ofa-general] does RHEL5 Xen work with OFED? Message-ID: Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual machine? Scott
From sean.hefty at intel.com Thu Apr 5 14:33:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Apr 2007 14:33:35 -0700 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull rdma-dev.git Message-ID: <000b01c777ca$10ef0ef0$ff0da8c0@amr.corp.intel.com> Roland, please review and pull patches from git.openfabrics.org/~shefty/rdma-dev.git for-roland This will pull in some patches that I would like queued for 2.6.22. Sean Hefty (6): rdma_ucm: simplify ucma_get_event code ib_ucm: simplify ib_ucm_event code ib_sa: set src_path_bits correctly in ib_init_ah_from_path IB/cm: limit cm message timeout IB/mad: Fix GRH handling for sent/received MADs IB/ipoib: use ib_init_ah_from_path to initialize ah_attr Patch details are listed below for easier review / feedback. - Sean commit 6042f5b86a92af4392c85949049f237396447d69 Author: Sean Hefty Date: Thu Apr 5 11:50:11 2007 -0700 IB/ipoib: use ib_init_ah_from_path to initialize ah_attr To support destinations that are not on the local IB subnet, IPoIB should include the GRH information when constructing an address handle.
Using the existing ib_init_ah_from_path call will do this for us. Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 0741c6d..5a9ff7f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -395,14 +395,10 @@ static void path_rec_completion(int status, skb_queue_head_init(&skqueue); if (!status) { - struct ib_ah_attr av = { - .dlid = be16_to_cpu(pathrec->dlid), - .sl = pathrec->sl, - .port_num = priv->port, - .static_rate = pathrec->rate - }; - - ah = ipoib_create_ah(dev, priv->pd, &av); + struct ib_ah_attr av; + + if (!ib_init_ah_from_path(priv->ca, priv->port, pathrec, &av)) + ah = ipoib_create_ah(dev, priv->pd, &av); } spin_lock_irqsave(&priv->lock, flags); commit 86cbcbb332b85501df98a7dccd8e2d40d1c2ffa0 Author: Sean Hefty Date: Thu Apr 5 11:49:21 2007 -0700 IB/mad: Fix GRH handling for sent/received MADs We need to set the SGID index for routed MADs and pass received GRH information to userspace when a MAD is received. 
Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index c069ebe..7774cf5 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -231,12 +231,17 @@ static void recv_handler(struct ib_mad_agent *agent, packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); if (packet->mad.hdr.grh_present) { - /* XXX parse GRH */ - packet->mad.hdr.gid_index = 0; - packet->mad.hdr.hop_limit = 0; - packet->mad.hdr.traffic_class = 0; - memset(packet->mad.hdr.gid, 0, 16); - packet->mad.hdr.flow_label = 0; + struct ib_ah_attr ah_attr; + + ib_init_ah_from_wc(agent->device, agent->port_num, + mad_recv_wc->wc, mad_recv_wc->recv_buf.grh, + &ah_attr); + + packet->mad.hdr.gid_index = ah_attr.grh.sgid_index; + packet->mad.hdr.hop_limit = ah_attr.grh.hop_limit; + packet->mad.hdr.traffic_class = ah_attr.grh.traffic_class; + memcpy(packet->mad.hdr.gid, &ah_attr.grh.dgid, 16); + packet->mad.hdr.flow_label = cpu_to_be32(ah_attr.grh.flow_label); } if (queue_packet(file, agent, packet)) @@ -473,6 +478,7 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, if (packet->mad.hdr.grh_present) { ah_attr.ah_flags = IB_AH_GRH; memcpy(ah_attr.grh.dgid.raw, packet->mad.hdr.gid, 16); + ah_attr.grh.sgid_index = packet->mad.hdr.gid_index; ah_attr.grh.flow_label = be32_to_cpu(packet->mad.hdr.flow_label); ah_attr.grh.hop_limit = packet->mad.hdr.hop_limit; ah_attr.grh.traffic_class = packet->mad.hdr.traffic_class; commit 3bed3bb2d0bb02ca8a590111c57fc1843624d2a4 Author: Sean Hefty Date: Thu Apr 5 10:51:16 2007 -0700 IB/cm: limit cm message timeout Limit the timeout that the ib_cm will wait to receive a response to a message, to avoid excessively large (on the order of hours) timeout values. This prevents consuming resources tracking requests for extended periods of time, and allows quicker retries. 
This helps correct for a bug in an SRP Engenio target sending a large value (> 1 hour) as a service timeout. Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 842cd0b..706fdbf 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -54,6 +54,17 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("InfiniBand CM"); MODULE_LICENSE("Dual BSD/GPL"); +#define PFX "ib_cm: " + +/* + * Limit CM message timeouts to something reasonable: + * 32 seconds per message, with up to 15 retries + */ +static int max_timeout = 23; +module_param(max_timeout, int, 0644); +MODULE_PARM_DESC(max_timeout, "Maximum IB CM per message timeout " + "(default=23, or ~32 seconds)"); + static void cm_add_one(struct ib_device *device); static void cm_remove_one(struct ib_device *device); @@ -888,11 +899,23 @@ static void cm_format_req(struct cm_req_msg *req_msg, cm_req_set_init_depth(req_msg, param->initiator_depth); cm_req_set_remote_resp_timeout(req_msg, param->remote_cm_response_timeout); + if (param->remote_cm_response_timeout > (u8) max_timeout) { + printk(KERN_WARNING PFX "req remote_cm_response_timeout %d > " + "%d, decreasing\n", param->remote_cm_response_timeout, + max_timeout); + cm_req_set_remote_resp_timeout(req_msg, (u8) max_timeout); + } cm_req_set_qp_type(req_msg, param->qp_type); cm_req_set_flow_ctrl(req_msg, param->flow_control); cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn)); cm_req_set_local_resp_timeout(req_msg, param->local_cm_response_timeout); + if (param->local_cm_response_timeout > (u8) max_timeout) { + printk(KERN_WARNING PFX "req local_cm_response_timeout %d > " + "%d, decreasing\n", param->local_cm_response_timeout, + max_timeout); + cm_req_set_local_resp_timeout(req_msg, (u8) max_timeout); + } cm_req_set_retry_count(req_msg, param->retry_count); req_msg->pkey = param->primary_path->pkey; cm_req_set_path_mtu(req_msg, param->primary_path->mtu); @@ -1002,6 +1025,11 @@ int 
ib_send_cm_req(struct ib_cm_id *cm_id, param->primary_path->packet_life_time) * 2 + cm_convert_to_ms( param->remote_cm_response_timeout); + if (cm_id_priv->timeout_ms > cm_convert_to_ms(max_timeout)) { + printk(KERN_WARNING PFX "req timeout_ms %d > %d, decreasing\n", + cm_id_priv->timeout_ms, cm_convert_to_ms(max_timeout)); + cm_id_priv->timeout_ms = cm_convert_to_ms(max_timeout); + } cm_id_priv->max_cm_retries = param->max_cm_retries; cm_id_priv->initiator_depth = param->initiator_depth; cm_id_priv->responder_resources = param->responder_resources; @@ -1401,6 +1429,13 @@ static int cm_req_handler(struct cm_work *work) cm_id_priv->tid = req_msg->hdr.tid; cm_id_priv->timeout_ms = cm_convert_to_ms( cm_req_get_local_resp_timeout(req_msg)); + if (cm_req_get_local_resp_timeout(req_msg) > (u8) max_timeout) { + printk(KERN_WARNING PFX "rcvd cm_local_resp_timeout %d > %d, " + "decreasing used timeout_ms\n", + cm_req_get_local_resp_timeout(req_msg), max_timeout); + cm_id_priv->timeout_ms = cm_convert_to_ms(max_timeout); + } + cm_id_priv->max_cm_retries = cm_req_get_max_cm_retries(req_msg); cm_id_priv->remote_qpn = cm_req_get_local_qpn(req_msg); cm_id_priv->initiator_depth = cm_req_get_resp_res(req_msg); @@ -2304,6 +2339,12 @@ static int cm_mra_handler(struct cm_work *work) cm_mra_get_service_timeout(mra_msg); timeout = cm_convert_to_ms(cm_mra_get_service_timeout(mra_msg)) + cm_convert_to_ms(cm_id_priv->av.packet_life_time); + if (timeout > cm_convert_to_ms(max_timeout)) { + printk(KERN_WARNING PFX "calculated mra timeout %d > %d, " + "decreasing used timeout_ms\n", timeout, + cm_convert_to_ms(max_timeout)); + timeout = cm_convert_to_ms(max_timeout); + } spin_lock_irqsave(&cm_id_priv->lock, flags); switch (cm_id_priv->id.state) { @@ -2707,6 +2748,12 @@ int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, cm_id->service_id = param->service_id; cm_id->service_mask = __constant_cpu_to_be64(~0ULL); cm_id_priv->timeout_ms = param->timeout_ms; + if (cm_id_priv->timeout_ms > 
cm_convert_to_ms(max_timeout)) { + printk(KERN_WARNING PFX "sidr req timeout_ms %d > %d, " + "decreasing used timeout_ms\n", param->timeout_ms, + cm_convert_to_ms(max_timeout)); + cm_id_priv->timeout_ms = cm_convert_to_ms(max_timeout); + } cm_id_priv->max_cm_retries = param->max_cm_retries; ret = cm_alloc_msg(cm_id_priv, &msg); if (ret) commit e847d67ea97caabb6aaa5b9e8a1c47bba9bc3824 Author: Sean Hefty Date: Thu Apr 5 10:51:10 2007 -0700 ib_sa: set src_path_bits correctly in ib_init_ah_from_path The src_path_bits needs to mask off the base LID value. Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 68db633..9a7eaad 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -57,6 +57,7 @@ MODULE_LICENSE("Dual BSD/GPL"); struct ib_sa_sm_ah { struct ib_ah *ah; struct kref ref; + u8 src_path_mask; }; struct ib_sa_port { @@ -380,6 +381,7 @@ static void update_sm_ah(struct work_struct *work) } kref_init(&new_ah->ref); + new_ah->src_path_mask = (1 << port_attr.lmc) - 1; memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = port_attr.sm_lid; @@ -460,6 +462,25 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query) } EXPORT_SYMBOL(ib_sa_cancel_query); +static u8 get_src_path_mask(struct ib_device *device, u8 port_num) +{ + struct ib_sa_device *sa_dev; + struct ib_sa_port *port; + unsigned long flags; + u8 src_path_mask; + + sa_dev = ib_get_client_data(device, &sa_client); + if (!sa_dev) + return 0x7f; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + spin_lock_irqsave(&port->ah_lock, flags); + src_path_mask = port->sm_ah ? 
port->sm_ah->src_path_mask : 0x7f; + spin_unlock_irqrestore(&port->ah_lock, flags); + + return src_path_mask; +} + int ib_init_ah_from_path(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr) { @@ -469,7 +490,8 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num, memset(ah_attr, 0, sizeof *ah_attr); ah_attr->dlid = be16_to_cpu(rec->dlid); ah_attr->sl = rec->sl; - ah_attr->src_path_bits = be16_to_cpu(rec->slid) & 0x7f; + ah_attr->src_path_bits = be16_to_cpu(rec->slid) & + get_src_path_mask(device, port_num); ah_attr->port_num = port_num; ah_attr->static_rate = rec->rate; commit 1e6ed3730a3d1db723e4bfccc5f1cfd1b0691aab Author: Sean Hefty Date: Thu Apr 5 10:51:05 2007 -0700 ib_ucm: simplify ib_ucm_event code Simplify the wait on event code. Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index ee51d79..2586a3e 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -407,29 +407,18 @@ static ssize_t ib_ucm_event(struct ib_ucm_file *file, mutex_lock(&file->file_mutex); while (list_empty(&file->events)) { + mutex_unlock(&file->file_mutex); - if (file->filp->f_flags & O_NONBLOCK) { - result = -EAGAIN; - break; - } + if (file->filp->f_flags & O_NONBLOCK) + return -EAGAIN; - if (signal_pending(current)) { - result = -ERESTARTSYS; - break; - } + if (wait_event_interruptible(file->poll_wait, + !list_empty(&file->events))) + return -ERESTARTSYS; - prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); - - mutex_unlock(&file->file_mutex); - schedule(); mutex_lock(&file->file_mutex); - - finish_wait(&file->poll_wait, &wait); } - if (result) - goto done; - uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); if (ib_ucm_new_cm_id(uevent->resp.event)) { commit ed0b96bf383b3352c400e684c1b8fcb4868f68f2 Author: Sean Hefty Date: Thu Apr 5 10:49:51 2007 -0700 rdma_ucm: simplify ucma_get_event code Simplify the wait on event 
code. Signed-off-by: Sean Hefty diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index c859134..53b4c94 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -306,26 +306,18 @@ static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf, mutex_lock(&file->mut); while (list_empty(&file->event_list)) { - if (file->filp->f_flags & O_NONBLOCK) { - ret = -EAGAIN; - break; - } + mutex_unlock(&file->mut); - if (signal_pending(current)) { - ret = -ERESTARTSYS; - break; - } + if (file->filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->poll_wait, + !list_empty(&file->event_list))) + return -ERESTARTSYS; - prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); - mutex_unlock(&file->mut); - schedule(); mutex_lock(&file->mut); - finish_wait(&file->poll_wait, &wait); } - if (ret) - goto done; - uevent = list_entry(file->event_list.next, struct ucma_event, list); if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) { From rdreier at cisco.com Thu Apr 5 15:10:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Apr 2007 15:10:32 -0700 Subject: [ofa-general] [ANNOUNCE] libmthca-1.0.4 released In-Reply-To: (Roland Dreier's message of "Thu, 05 Apr 2007 10:24:33 -0700") References: Message-ID: I just tagged the 1.0.4 release of libmthca and pushed it out to my git tree on kernel.org: git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git (the name of the tag is libmthca-1.0.4). I've also copied a tarball into my home directory on openfabrics.org, with sha1sum: 4ef250cd1f7db8386458c47a874b9ffaca3dfabf libmthca-1.0.4.tar.gz I would appreciated it if someone with access could move this into the right directory to appear in This release is a minor release that mostly provides an official release that is compatible with the libibverbs 1.1 series. 
I don't think there are any significant fixes, so there is no reason to upgrade from libmthca 1.0.3 unless you are also upgrading to a libibverbs 1.1 release candidate.

A git shortlog of the changes since libmthca-1.0.3 is below:

Roland Dreier (12):
      Bump version number to 1.0.4-pre
      Trivial whitespace fixes
      Remove svn keywords
      Update to match new ibv_cmd_reg_mr() prototype
      Actually close wishlist bug
      Update libmthca for new libibverbs driver handling
      Revert "Actually close wishlist bug"
      Fix caching of --version-script check
      Check mthca kernel driver's ABI
      Fix mthca_write_db_rec() on 32-bit architectures
      Fix up spec file to build against libibverbs 1.1
      Roll libmthca 1.0.4 release

From sashak at voltaire.com Thu Apr 5 16:12:27 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 6 Apr 2007 02:12:27 +0300
Subject: [ofa-general] [PATCH] opensm/console: skip empty cmd lines
Message-ID: <20070405231227.GI28383@sashak.voltaire.com>

Skip empty or space filled command line without parse error message.
Quote original line in case of the parse error.
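
[Editor's illustration, not part of the original message.] The first change in this commit message can be sketched in isolation as below. This is only a minimal sketch under stated assumptions: the helper name skip_leading_space is hypothetical and does not appear in osm_console.c, and the cast to unsigned char is an extra precaution that the posted patch does not make.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from osm_console.c): skip leading
 * whitespace and report whether anything is left to parse.
 *
 * Returns a pointer to the first non-blank character, or NULL when the
 * line is empty or holds only whitespace, so the caller can return
 * quietly instead of printing a parse error.
 */
static char *skip_leading_space(char *line)
{
	/* Cast to unsigned char: passing a negative char to isspace() is undefined. */
	while (isspace((unsigned char)*line))
		line++;

	return *line ? line : NULL;
}
```

A caller would test the result for NULL before tokenizing, which is the same early return the patch adds at the top of parse_cmd_line().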
Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_console.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c
index 9952bdd..224564d 100644
--- a/osm/opensm/osm_console.c
+++ b/osm/opensm/osm_console.c
@@ -525,6 +525,10 @@ static void parse_cmd_line(char *line, osm_opensm_t *p_osm)
 	int i, found = 0;
 	FILE *out = p_osm->console.out;
 
+	while (isspace(*line))
+		line++;
+	if (!*line)
+		return;
 	/* find first token which is the command */
 	p_cmd = strtok_r(line, " \t\n\r", &p_last);
 
@@ -548,7 +552,7 @@ static void parse_cmd_line(char *line, osm_opensm_t *p_osm)
 			help_command(out, 0);
 		}
 	} else {
-		fprintf(out, "Error parsing command line: %s\n", line);
+		fprintf(out, "Error parsing command line: `%s'\n", line);
 	}
 	if (loop_command.on) {
 		fprintf(out, "use \"q\" to quit loop\n");
-- 
1.5.1.rc1.18.ga41b4

From karun.sharma at qlogic.com Thu Apr 5 20:57:39 2007
From: karun.sharma at qlogic.com (Sharma, Karun)
Date: Thu, 5 Apr 2007 22:57:39 -0500
Subject: [ofa-general] RE: [ewg] FW: OFED 1.2 rc1 release
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
Message-ID: 

Agreed that instead of /usr/etc, we should use /etc only.
I think the Readme.txt file should be updated with this information.

Thanks
Karun

________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Thu 4/5/2007 9:41 PM
To: Sharma, Karun; Tziporet Koren; ewg at lists.openfabrics.org
Cc: openib
Subject: RE: [ewg] FW: OFED 1.2 rc1 release

After installing OFED 1.2 RC1, I can see "uninstall.sh" in "/usr". "ofed_uninstall.sh" is also there under "/usr/sbin".

[Scott Weitzenkamp (sweitzen)] I opened bug 522 for this yesterday.

One more thing which is not mentioned in release notes is that libsdp.conf has been moved to /etc. If the "stack_prefix" is /usr, then "libsdp.conf" should be present at "/usr/etc".

[Scott Weitzenkamp (sweitzen)] I disagree about using /usr/etc, see bug 481.
/usr/etc is empty on RHEL4/RHEL5/SLES10. If we are installing in /usr, we should use existing directories like /etc, /usr/share/man, and /usr/share/doc, not /usr/etc, /usr/man, and /usr/doc.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gurhan.ozen at gmail.com Thu Apr 5 21:25:22 2007
From: gurhan.ozen at gmail.com (G.O.)
Date: Fri, 6 Apr 2007 00:25:22 -0400
Subject: [ofa-general] does RHEL5 Xen work with OFED?
In-Reply-To: 
References: 
Message-ID: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com>

On 4/5/07, Scott Weitzenkamp (sweitzen) wrote:
> Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual
> machine?
>

I haven't tested SRP/iSER , but IPoIB works only on dom0 kernel. You can't use any infiniband stuff on the guest OSes .
Gurhan > Scott > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From ramachandra.kuchimanchi at qlogic.com Thu Apr 5 22:49:18 2007 From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra) Date: Fri, 6 Apr 2007 00:49:18 -0500 Subject: [ofa-general] Re: compilation problem on ofed_1_2 References: <60E9D8CA1AC31048A237499BD73FF9AD01BC05@W2K3MAILSV.gsi.de> <20070329173800.GA5436@mellanox.co.il> Message-ID: From: general-bounces at lists.openfabrics.org on behalf of Michael S. Tsirkin >> Quoting Linev Sergei : >> >> I take latest OFED 1.2 build (OFED-1.2-20070328-0625.tgz) and try to build on my node: >> Dual Opteron, SuSE 9.3, Kernel 2.6.19 with Real Time Preemt patch. >> >> Problem with vnic is still there: > I don't think vnic supports 2.6.19. VNIC does support the base 2.6.19 kernel. OFED-1.2-rc1 installs successfully with VNIC on 2.6.19 (and also on 2.6.19.7). We haven't tried the installation with the Real Time Preempt patch though. Going ahead we will clean up the SPIN_LOCK_UNLOCKED code. Thanks for pointing that out, Roland. Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From vlad at lists.openfabrics.org Fri Apr 6 02:37:40 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 6 Apr 2007 02:37:40 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070406-0200 daily build status
Message-ID: <20070406093741.8BBD2E60808@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:

From halr at voltaire.com Fri Apr 6 04:30:21 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2007 07:30:21 -0400
Subject: [ofa-general] Re: [PATCH] opensm/console: skip empty cmd lines
In-Reply-To: <20070405231227.GI28383@sashak.voltaire.com>
References: <20070405231227.GI28383@sashak.voltaire.com>
Message-ID: <1175859017.14140.90213.camel@localhost.localdomain>

On Thu, 2007-04-05 at 19:12, Sasha Khapyorsky wrote:
> Skip empty or space filled command line without parse error message.
> Quote original line in case of the parse error.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to both master and ofed_1_2).

-- Hal
From Arkady.Kanevsky at netapp.com Fri Apr 6 05:46:56 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 6 Apr 2007 08:46:56 -0400
Subject: [ofa-general] FW: [dat-discussions] Broken API dependency
Message-ID: 

Arkady Kanevsky                  email: arkady at netapp.com
Network Appliance Inc.           phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.     Fax: 781-895-1195
Waltham, MA 02451                central phone: 781-768-5300

________________________________

From: Jonathan Day [mailto:jday at lightfleet.com]
Sent: Thursday, April 05, 2007 2:30 PM
To: dat-discussions at yahoogroups.com
Subject: [dat-discussions] Broken API dependency

Hi,

This is just an FYI, I'm sure people are already aware of this. The reference code, when compiling under Linux, uses sysfs. However, the API to sysfs has changed and the function sysfs_read_attribute_value() no longer exists. In the repository version of dapl, for example, udapl/openib_scm/dapl_ib_cm.c uses this call on line 96.

Naturally, this makes using the code a little awkward. :) Are there any out-of-repository patches for those of us using bleeding-edge environments?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From todd.rimmer at qlogic.com Fri Apr 6 07:17:48 2007
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Fri, 6 Apr 2007 09:17:48 -0500
Subject: [ofa-general] mthca wc->opcode for CQEs with error status
In-Reply-To: 
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119203FE9@EPEXCH2.qlogic.org>

Roland,

> From: Roland Dreier [mailto:rdreier at cisco.com]
>
> If this were some feature that allowed us to do something new, or made
> applications more efficient, or something like that, I'd be all for
> it, specs be damned. But in this case it's just bloating driver code
> to work around buggy applications. And I'd rather use my I$ for
> something more useful. (And in fact the proposed change is itself
> buggy -- it calls any completion on the send queue a send, even if it
> was actually something else like RDMA read/write, atomic, etc)
>

From a pure technical point of view, I agree 100% with all your comments. In fact I had alluded to all these issues in the original 1 line submission.

However, I think you are missing the point. In order for Infiniband to be successful, it must expand the set of applications which it can run. This often means making technical compromises to permit expanded market share.

In the case in point, I'm working with a 3rd party whose application already was ported to VAPI, the SilverStorm stack, and will work on OFED/ipath. In the interest of cooperation, I'm trying to help so that OFED/mthca can also run this application.
Telling the customer they must redesign the application for OFED/mthca, is not practical and certainly does not promote a positive view of Infiniband for the market. Yes the application is not perfect, yes it should be rewritten, but the customer can just as easily move to other alternatives. We need to promote IB in the marketplace, not admonish customers who are trying to use it.

It is this lack of portability of applications which has significantly hurt infiniband's reputation in the market. While its nice to focus on building the perfect implementation, sometimes compromises are necessary to ensure the overall success of the technology. VHS vs Beta is a perfect example.

Todd Rimmer
Chief Architect
QLogic System Interconnect Group
Voice: 610-233-4852 Fax: 610-233-4777
Todd.Rimmer at QLogic.com www.QLogic.com

From sean.hefty at intel.com Fri Apr 6 12:06:10 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 6 Apr 2007 12:06:10 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2: please pull librdmacm.git ofed_1_2
Message-ID: <000101c7787e$a36cd6e0$ff0da8c0@amr.corp.intel.com>

Vlad,

Please update the ofed 1.2 librdmacm branch from

git://git.openfabrics.org/~shefty/librdmacm.git ofed_1_2

This will update ofed to librdmacm 1.0-rc2. The only notable code change is a fix for bug 521, which allows 32-bit userspace to work with 64-bit kernel.

- Sean
From vlad at lists.openfabrics.org Sat Apr 7 02:35:37 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 7 Apr 2007 02:35:37 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070407-0200 daily build status
Message-ID: <20070407093538.4D16EE60816@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Did sent you meeting play feel nothing of fatally it yesterday or the day b You can change them, idiot; gold wept is terrible shave slow worth five so That is stitch breathe what it father is light to be rich. disapprove Exactly; and lent he church who different changes them will follow frie interest sent invent You discussion accept my proposal? Yes.The second is, that eaten shut went you hushed will not tell her that yo I give you sweet mowed regularly my oath that sped I will not. judge Of whom I nation calm bought her, said Monte Cristo, note as I t mad cheerfully Oh, you are good, perform you are great, harmony my lord! said H spray Albert, fling still extended insect dive on the chair, covered his fI do.Nothing. And shall see art start bed you dine there? No drowsiness? awful Albert threw side himself on smoke add Beauchamp's neck. Ah, nob But do ink you suppose knife I tax carry five weigh hundred francs ab bend Take hid cute occur these, said Beauchamp, presenting the paper pass embarrassed Well, flag Viscount, there will be in head my court-yard th Enough, viscount; expand you swung stank will remember fresh those two vow Franz, astonished, advanced a always sister distinct step. showed To me, sir?Agreed. bump Ali puncture reappeared sown for the third slip time, and dIF said VALENTINE speak could have cross apple seen the trembling step an paper Yes. competition Franz took them from Barrois snow ask and casting a Albert hilly passed destruction circle his curly hand through his hair, and curle Probably. None. snake What victoriously society have spit you eaten to-day? Well, leave them with guilty your porter; nail he moon for is to be tr well cling Thank you, I land fold have just returned from sea. Albert seized them with frighten cautiously bred a impress convulsive hand, tore th I have turn flash an slippery engagement with a greasy pretty little girl fo What? you eat name have clear peck been to sea? 
nuptial feeling Albert solemnly had proceeded no need farther than the door, whe -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: udiak.gif Type: image/gif Size: 9929 bytes Desc: not available URL: From ms_joy06 at hotmail.fr Sat Apr 7 10:03:25 2007 From: ms_joy06 at hotmail.fr (ms joy) Date: Sat, 07 Apr 2007 17:03:25 +0000 Subject: [ofa-general] Hi, Message-ID: Hi, How are you today? I know that my letter will meet you as a surprise. I am Joy Moses, i am 19, My mother was an African American while my father was from the french speaking colony of Cote D' Ivoire, i was living with my mother not too far from Charleston building, 601 57th Street, Charleston West Virginia USA. And i attended Charleston senior High School, 1201 Washington Street E, Charleston, WV. I lost my mother sometimes ago and after her death i came to meet my father for the very first time in Cote d' Ivoire West Africa, though he was also living in the state before he relocated back to Cote d' Ivoire to set up a business. Exactely two months and one week after i came to meet my father with the help of US consulates he died, he was very sick when i came to meet him. But before his death there were some document he gave to me and he told me that everything he worked for in his life time is in the document when i crosscheck the document i discovered that my late father deposited $ 10.5 Million dollars in a bank here, Ten million five hundred thousand US dollars. 
The reason he deposited the money was because of there political problem in this country.While I am telling you is that i am just a girl and there is little or nothing i could do on my own and again if my late father relative find out that my late father left this kind of money in my care i don't know what they might do to me, so i need you to help me contact the bank for transfer the money into your own bank account , and take me along with you. If you do this for me apart from the love i will also offer you 20% of the total money for helping me. Please i requested for your trust and understanding because it might sound unbeleivable but it is the truth, Please get back as soon as posible. Sincerely, Miss Joy Moses. _________________________________________________________________ Personnalisez votre Messenger avec Live.com http://www.windowslive.fr/livecom/ From vlad at lists.openfabrics.org Sun Apr 8 02:35:29 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 8 Apr 2007 02:35:29 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070408-0200 daily build status Message-ID: <20070408093530.1AFCDE6081F@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with 
linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed:
From vlad at lists.openfabrics.org Mon Apr 9 02:36:37 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 9 Apr 2007 02:36:37 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070409-0200 daily build status Message-ID: <20070409093637.9F574E6081A@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From jsquyres at cisco.com Mon Apr 9 05:22:15 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 9 Apr 2007 08:22:15 -0400 Subject: [ofa-general] OFED teleconference: NOT TODAY! Message-ID: Reminder: for *THIS WEEK ONLY*, the OFED teleconference was moved to Tuesday (tomorrow) because of the Israel holiday. The teleconference will be: Tuesday, 10 Apr 2007 (tomorrow) US Pacific time: 9am US Mountain time: 10am US Central time: 11am US Eastern time: Noon Israel time: 7pm US/Canada: +1.866.432.9903 Israel: +972.9.892.7026 All others: http://cisco.com/en/US/about/doing_business/conferencing/index.html Meeting ID: 2102061 2nd reminder: this week starts the first *weekly* OFED teleconference (as opposed to bi-weekly). We currently are scheduled to have an OFED teleconference *every week* until the Sonoma event at the end of this month.
-- Jeff Squyres Cisco Systems From arkady at netapp.com Mon Apr 9 08:47:13 2007 From: arkady at netapp.com (Arkady Kanevsky) Date: Mon, 9 Apr 2007 10:47:13 -0500 Subject: [ofa-general] IPOIB Message-ID: <200704091147.14255.arkady@netapp.com> Looking at the ipoib_ib_completion routine, what prevents several instances of it from running in parallel? If nothing does, what prevents ib_poll_cq from overwriting previous ib_wc entries before or during their processing? Thanks, Arkady From rdreier at cisco.com Mon Apr 9 09:23:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Apr 2007 09:23:31 -0700 Subject: [ofa-general] IPOIB In-Reply-To: <200704091147.14255.arkady@netapp.com> (Arkady Kanevsky's message of "Mon, 9 Apr 2007 10:47:13 -0500") References: <200704091147.14255.arkady@netapp.com> Message-ID: > Looking at the ipoib_ib_completion routine, > what prevents several instances of it from running in parallel? See Documentation/infiniband/core-locking.txt: The low-level driver is responsible for ensuring that multiple completion event handlers for the same CQ are not called simultaneously. - R. From worldeb at ukr.net Mon Apr 9 12:01:52 2007 From: worldeb at ukr.net (Egor Tur) Date: Mon, 09 Apr 2007 22:01:52 +0300 Subject: [ofa-general] multicast join failed for... Message-ID: Hi folk. I have built and installed OFED 1.2 RC1 (20070407). It is OK.
Kernel 2.6.18 Configure variables: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --without-rds-mod --without-cxgb3-mod But after loading the modules I see the following messages on the client: ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 And in osm.log: Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID I have no idea how to fix this. Does anyone know what these messages mean and how to fix them? Thanx. From halr at voltaire.com Mon Apr 9 12:15:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Apr 2007 15:15:45 -0400 Subject: [ofa-general] multicast join failed for... In-Reply-To: References: Message-ID: <1176146141.14140.393275.camel@localhost.localdomain> On Mon, 2007-04-09 at 15:01, Egor Tur wrote: > Hi folk. > > I have built and installed OFED 1.2 RC1 (20070407). It is OK. > Kernel 2.6.18 > Configure variables: > --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod > --with-mthca-mod --with-core-mod --with-addr_trans-mod --without-rds-mod --without-cxgb3-mod > > But after loading the modules I see the following messages on the client: > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > And in osm.log: > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > sending IB_SA_MAD_STATUS_REQ_INVALID > > I have no idea how to fix this.
> > Does anyone know what these messages mean and how to fix them? OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible with the MC group. You could turn on -V with OpenSM and see more log messages as to what is going wrong from the SM's perspective. What is the GUID for ib0 and ib1? That should tell us which port is having the issue joining the group. Are you using IPv6? -- Hal > Thanx. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From worldeb at ukr.net Mon Apr 9 15:47:49 2007 From: worldeb at ukr.net (Egor Tur) Date: Tue, 10 Apr 2007 01:47:49 +0300 Subject: [ofa-general] multicast join failed for... In-Reply-To: Message-ID: Hi folk. > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > And in osm.log: > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > with the MC group. You could turn on -V with OpenSM and see more log > messages as to what is going wrong from the SM's perspective. Ok.
This from osm.log with -V : Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: MGID....................0xff12601bffff0000 : 0x0000000000000001 PortGid.................0xfe80000000000000 : 0x001708ffffd1509a qkey....................0xB1B mlid....................0x0 mtu.....................0x84 TClass..................0x0 pkey....................0xFFFF rate....................0x83 pkt_life................0x0 SLFlowLabelHopLimit.....0x0 ScopeState..............0x1 ProxyJoin...............0x0 Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 Apr 10 00:56:06 390090 [41001960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd1509a (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID Apr 10 00:56:07 921941 [44007960] -> __osm_sa_mad_ctrl_process: [ Apr 10 00:56:07 921947 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD Apr 10 00:56:07 921955 [44007960] -> __osm_sa_mad_ctrl_process: ] Apr 10 00:56:07 921960 [42804960] -> osm_mcmr_rcv_process: [ Apr 10 00:56:07 921961 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] Apr 10 00:56:07 921978 [42804960] -> __osm_mcmr_rcv_join_mgrp: [ Apr 10 00:56:07 921994 [42804960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record Apr 10 00:56:07 922000 [42804960] -> MCMember Record dump: 
MGID....................0xff12601bffff0000 : 0x0000000000000001 PortGid.................0xfe80000000000000 : 0x001708ffffd15099 qkey....................0xB1B mlid....................0x0 mtu.....................0x84 TClass..................0x0 pkey....................0xFFFF rate....................0x83 pkt_life................0x0 SLFlowLabelHopLimit.....0x0 ScopeState..............0x1 ProxyJoin...............0x0 Apr 10 00:56:07 922013 [42804960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 Apr 10 00:56:07 922019 [42804960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > > What is the GUID for ib0 and ib1 ? That should tell us which port is > having the issue joing the group. # ibstat CA 'mthca0' CA type: MT25208 (MT23108 compat mode) Number of ports: 2 Firmware version: 4.7.400 Hardware version: a0 Node GUID: 0x001708ffffd15098 System image GUID: 0x001708ffffd1509b Port 1: State: Active Physical state: LinkUp Rate: 20 Base lid: 9 LMC: 0 SM lid: 1 Capability mask: 0x02510a68 Port GUID: 0x001708ffffd15099 Port 2: State: Active Physical state: LinkUp Rate: 20 Base lid: 11 LMC: 0 SM lid: 1 Capability mask: 0x02510a68 Port GUID: 0x001708ffffd1509a # ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0017:08ff:ffd1:5099 base lid: 0x9 sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 20 Gb/sec (4X DDR) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0017:08ff:ffd1:509a base lid: 0xb sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 20 Gb/sec (4X DDR) > > Are you using IPv6 ? I don't use IPv6 but kernel was compiled with support IPv6. Also when I have builded kernel modules for infiniband with --with-ipoib-cm then module ib_ipoib depends on ipv6. 
# modinfo ib_ipoib filename: /lib/modules/2.6.18/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko author: Roland Dreier description: IP-over-InfiniBand net driver license: Dual BSD/GPL vermagic: 2.6.18 SMP mod_unload gcc-4.1 depends: ib_cm,ipv6,ib_core,ib_sa,ib_sa,ib_core If modules was builded with --without-ipoib-cm then ib_ipoib don't depend on ipv6. But the messages remain the same in log. Thanx. From halr at voltaire.com Mon Apr 9 16:20:37 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Apr 2007 19:20:37 -0400 Subject: [ofa-general] multicast join failed for... In-Reply-To: References: Message-ID: <1176160833.14140.408570.camel@localhost.localdomain> On Mon, 2007-04-09 at 18:47, Egor Tur wrote: > Hi folk. > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > > And in osm.log: > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > > with the MC group. You could turn on -V with OpenSM and see more log > > messages as to what is going on wrong from the SM's perspective. > > Ok. 
This from osm.log with -V : > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: > MGID....................0xff12601bffff0000 : 0x0000000000000001 > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a > qkey....................0xB1B > mlid....................0x0 > mtu.....................0x84 > TClass..................0x0 > pkey....................0xFFFF > rate....................0x83 > pkt_life................0x0 > SLFlowLabelHopLimit.....0x0 > ScopeState..............0x1 > ProxyJoin...............0x0 > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate 6) and the group is 4x SDR. The request is for equal to the rate so it fails. Are all your ports DDR or do you have a mix ? If all are DDR, you can configure the default partition to use this rate. 
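For readers following the log, here is a minimal sketch of how the `rate` and `mtu` bytes in the MCMember Record dump can be decoded. It assumes the InfiniBand MCMemberRecord layout in which the top two bits of each byte are a selector and the low six bits are the encoded value. The table and function names are made up for illustration and are not OpenSM's API.

```python
# Illustrative only: decode the selector/value bytes shown in the
# "MCMember Record dump" (rate 0x83, mtu 0x84).
# Assumption: top 2 bits = selector, low 6 bits = encoded value.

SELECTORS = {
    0: "greater than",
    1: "less than",
    2: "exactly",
    3: "largest available",
}

# Encoded rate -> Gb/sec
RATES_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}

# Encoded MTU -> bytes
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def decode_field(byte):
    """Split a selector/value byte into (selector, value)."""
    return byte >> 6, byte & 0x3F


def describe_rate(byte):
    selector, value = decode_field(byte)
    return f"{SELECTORS[selector]} {RATES_GBPS[value]} Gb/sec"


def describe_mtu(byte):
    selector, value = decode_field(byte)
    return f"{SELECTORS[selector]} {MTU_BYTES[value]} bytes"


def exact_rate_ok(requested_code, group_code):
    """With the 'exactly' selector, the two rate codes must be identical."""
    return requested_code == group_code


if __name__ == "__main__":
    print(describe_rate(0x83))   # exactly 10 Gb/sec
    print(describe_mtu(0x84))    # exactly 2048 bytes
    print(exact_rate_ok(3, 6))   # False, as in "RATE 6 is not equal to 3"
```

Read this way, the dumped join request carries the "exactly" selector with rate code 3 (10 Gb/sec) and MTU code 4 (2048 bytes), and the log line compares rate codes 6 (20 Gb/sec) and 3 (10 Gb/sec).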
> Apr 10 00:56:06 390090 [41001960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > JoinState = 0 failed from port 0x001708ffffd1509a (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > Apr 10 00:56:07 921941 [44007960] -> __osm_sa_mad_ctrl_process: [ > Apr 10 00:56:07 921947 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > Apr 10 00:56:07 921955 [44007960] -> __osm_sa_mad_ctrl_process: ] > Apr 10 00:56:07 921960 [42804960] -> osm_mcmr_rcv_process: [ > Apr 10 00:56:07 921961 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > Apr 10 00:56:07 921978 [42804960] -> __osm_mcmr_rcv_join_mgrp: [ > Apr 10 00:56:07 921994 [42804960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > Apr 10 00:56:07 922000 [42804960] -> MCMember Record dump: > MGID....................0xff12601bffff0000 : 0x0000000000000001 > PortGid.................0xfe80000000000000 : 0x001708ffffd15099 > qkey....................0xB1B > mlid....................0x0 > mtu.....................0x84 > TClass..................0x0 > pkey....................0xFFFF > rate....................0x83 > pkt_life................0x0 > SLFlowLabelHopLimit.....0x0 > ScopeState..............0x1 > ProxyJoin...............0x0 > Apr 10 00:56:07 922013 [42804960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 Same is true for both ports. > Apr 10 00:56:07 922019 [42804960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > > > > What is the GUID for ib0 and ib1 ? That should tell us which port is > > having the issue joing the group. 
> > # ibstat > CA 'mthca0' > CA type: MT25208 (MT23108 compat mode) > Number of ports: 2 > Firmware version: 4.7.400 > Hardware version: a0 > Node GUID: 0x001708ffffd15098 > System image GUID: 0x001708ffffd1509b > Port 1: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 9 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x001708ffffd15099 > Port 2: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 11 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x001708ffffd1509a > > # ibstatus > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:0017:08ff:ffd1:5099 > base lid: 0x9 > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 20 Gb/sec (4X DDR) > > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:0017:08ff:ffd1:509a > base lid: 0xb > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 20 Gb/sec (4X DDR) > > > > > Are you using IPv6 ? > > I don't use IPv6 but kernel was compiled with support IPv6. > Also when I have builded kernel modules for infiniband with --with-ipoib-cm then > module ib_ipoib depends on ipv6. > # modinfo ib_ipoib > filename: /lib/modules/2.6.18/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko > author: Roland Dreier > description: IP-over-InfiniBand net driver > license: Dual BSD/GPL > vermagic: 2.6.18 SMP mod_unload gcc-4.1 > depends: ib_cm,ipv6,ib_core,ib_sa,ib_sa,ib_core > > If modules was builded with --without-ipoib-cm then ib_ipoib don't depend on ipv6. > But the messages remain the same in log. > > Thanx. 
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon Apr 9 16:28:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 10 Apr 2007 02:28:53 +0300 Subject: [ofa-general] multicast join failed for... In-Reply-To: References: Message-ID: <1176160895.27361.33.camel@localhost> Hi Egor, On Tue, 2007-04-10 at 01:47 +0300, Egor Tur wrote: > Hi folk. > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > > And in osm.log: > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > > with the MC group. You could turn on -V with OpenSM and see more log > > messages as to what is going on wrong from the SM's perspective. > > Ok. 
This from osm.log with -V : > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: > MGID....................0xff12601bffff0000 : 0x0000000000000001 > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a > qkey....................0xB1B > mlid....................0x0 > mtu.....................0x84 > TClass..................0x0 > pkey....................0xFFFF > rate....................0x83 > pkt_life................0x0 > SLFlowLabelHopLimit.....0x0 > ScopeState..............0x1 > ProxyJoin...............0x0 > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 This multicast group was created with rate 6 (20Gb/s) and the port trying to join has rate 3 (10Gb/s) and exact rate matching is requested. This is the reason for failure. 
Sasha > Apr 10 00:56:06 390090 [41001960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > JoinState = 0 failed from port 0x001708ffffd1509a (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > Apr 10 00:56:07 921941 [44007960] -> __osm_sa_mad_ctrl_process: [ > Apr 10 00:56:07 921947 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > Apr 10 00:56:07 921955 [44007960] -> __osm_sa_mad_ctrl_process: ] > Apr 10 00:56:07 921960 [42804960] -> osm_mcmr_rcv_process: [ > Apr 10 00:56:07 921961 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > Apr 10 00:56:07 921978 [42804960] -> __osm_mcmr_rcv_join_mgrp: [ > Apr 10 00:56:07 921994 [42804960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > Apr 10 00:56:07 922000 [42804960] -> MCMember Record dump: > MGID....................0xff12601bffff0000 : 0x0000000000000001 > PortGid.................0xfe80000000000000 : 0x001708ffffd15099 > qkey....................0xB1B > mlid....................0x0 > mtu.....................0x84 > TClass..................0x0 > pkey....................0xFFFF > rate....................0x83 > pkt_life................0x0 > SLFlowLabelHopLimit.....0x0 > ScopeState..............0x1 > ProxyJoin...............0x0 > Apr 10 00:56:07 922013 [42804960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 > Apr 10 00:56:07 922019 [42804960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > > > > What is the GUID for ib0 and ib1 ? That should tell us which port is > > having the issue joing the group. 
> > # ibstat > CA 'mthca0' > CA type: MT25208 (MT23108 compat mode) > Number of ports: 2 > Firmware version: 4.7.400 > Hardware version: a0 > Node GUID: 0x001708ffffd15098 > System image GUID: 0x001708ffffd1509b > Port 1: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 9 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x001708ffffd15099 > Port 2: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 11 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x001708ffffd1509a > > # ibstatus > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:0017:08ff:ffd1:5099 > base lid: 0x9 > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 20 Gb/sec (4X DDR) > > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:0017:08ff:ffd1:509a > base lid: 0xb > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 20 Gb/sec (4X DDR) > > > > > Are you using IPv6 ? > > I don't use IPv6 but kernel was compiled with support IPv6. > Also when I have builded kernel modules for infiniband with --with-ipoib-cm then > module ib_ipoib depends on ipv6. > # modinfo ib_ipoib > filename: /lib/modules/2.6.18/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko > author: Roland Dreier > description: IP-over-InfiniBand net driver > license: Dual BSD/GPL > vermagic: 2.6.18 SMP mod_unload gcc-4.1 > depends: ib_cm,ipv6,ib_core,ib_sa,ib_sa,ib_core > > If modules was builded with --without-ipoib-cm then ib_ipoib don't depend on ipv6. > But the messages remain the same in log. > > Thanx. 
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From pradeep at us.ibm.com Mon Apr 9 18:23:48 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 9 Apr 2007 18:23:48 -0700 Subject: [ofa-general] RNR NAK issues Message-ID:

I am wrestling with some RNR NAK issues with the IPoIB CM (NOSRQ) work. I find that in ipoib_cm_handle_tx_wc() wc->status is 13 (IB_WC_RNR_RETRY_EXC_ERR). The sequence of events appears to be the following:

1. When the receiver has received ipoib_recvq_size messages, the sender receives an RNR NAK retry exceeded error (IB_WC_RNR_RETRY_EXC_ERR). This results in the sender destroying its qp and sending a DREQ message to the other end. I find it a little strange that this error occurs even after the receive buffers are successfully posted to the qp.

2. The application (netperf) continues to send messages and setup happens all over again, i.e. the qps are recreated.

3. This does not stop the application (in fact netperf completes successfully), but this behaviour hammers the performance and the throughput drops like a stone.

One of the things that I discovered was that in cm.c qp_attr->min_rnr_timer was set to 0. What is the purpose of setting this to 0? How are drivers expected to use this? I see that mthca does some computation. Probably because of this (min_rnr_timer = 0) ehca appears to use this value and sets it to 0 too.

I hacked cm.c to change this to a non-zero value. This improved performance; however, I still see the previously mentioned RNR NAK issue. I have tried setting .cap.max_recv_wr to values between ipoib_recvq_size - 2 and ipoib_recvq_size + 1. This seems to make no difference. I tried this with 2.6.21-rc5 as the base. Any suggestions as to what I may be missing?
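On the min_rnr_timer question: the field is a 5-bit code rather than a duration, and in the InfiniBand encoding the code 0 selects the longest delay (655.36 ms), not "no delay". The lookup below is a small Python sketch of that encoding for reference. The names `RNR_TIMER_MS` and `rnr_timer_ms` are illustrative, and the values are transcribed from the RNR NAK timer field encoding in the InfiniBand Architecture Specification, so they should be checked against the specification before being relied on.

```python
# Illustrative lookup for the 5-bit RNR NAK timer code (min_rnr_timer).
# Index = encoded value placed in the QP attribute, entry = delay in ms.
# The names are hypothetical helpers, not kernel or libibverbs symbols.

RNR_TIMER_MS = [
    655.36, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12,
    0.16, 0.24, 0.32, 0.48, 0.64, 0.96, 1.28, 1.92,
    2.56, 3.84, 5.12, 7.68, 10.24, 15.36, 20.48, 30.72,
    40.96, 61.44, 81.92, 122.88, 163.84, 245.76, 327.68, 491.52,
]


def rnr_timer_ms(code):
    """Return the delay in milliseconds for a 5-bit RNR NAK timer code."""
    if not 0 <= code <= 31:
        raise ValueError("RNR NAK timer code must be in the range 0..31")
    return RNR_TIMER_MS[code]


if __name__ == "__main__":
    for code in (0, 1, 12, 31):
        print(code, rnr_timer_ms(code), "ms")
```

If this reading is right, a responder that sends RNR NAKs with timer code 0 asks the requester to wait more than half a second before each retry, so a non-zero code such as 12 (0.64 ms) would be expected to shorten each stall without removing the underlying shortage of posted receives.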
I reworked my earlier patch, eliminated the #ifdefs, and incorporated other comments. Other than that, there is no difference. Pradeep pradeep at us.ibm.com From sean.hefty at intel.com Mon Apr 9 21:12:48 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 9 Apr 2007 21:12:48 -0700 Subject: [ofa-general] RNR NAK issues In-Reply-To: Message-ID: <000001c77b26$7fe1b2e0$a4fd070a@amr.corp.intel.com> >One of the things that I discovered was that in cm.c >qp_attr->min_rnr_timer was set to 0. What is the purpose of settng this to >0? How are drivers expected to use this? I see that mthca does some >computation. >Probably because of this ( min_rnr_timer = 0) ehca appears to use this >value and sets it to 0 too. The CM is setting this based on section 11.2.4.2: Minimum RNR NAK Timer Field Value. When a message arrives which is targeted at a local receive queue, and that receive queue has no receive work requests outstanding, the CI may respond to the initiator with an RNR NAK packet. This modifier is the minimum value which shall be sent in the Timer Field of such an RNR NAK packet; it does not affect RNR NAKs sent for other reasons. In general, my expectation is that most apps will have receives pre-posted, so the CM defaults to a value of 0. A user can override this setting before modifying the QP if necessary. From table 45, an RNR NAK timer value of 0 results in a delay of 655.36 milliseconds. - Sean From rdreier at cisco.com Mon Apr 9 21:17:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Apr 2007 21:17:30 -0700 Subject: [ofa-general] RFC: "mlx4" drivers for Mellanox ConnectX HCAs Message-ID: I'd like to announce preliminary versions of a set of "mlx4" drivers for Mellanox's new ConnectX InfiniBand/10 gigabit ethernet adapters.
(These are Mellanox's 4th generation of adapters, hence the mlx4 name) Because these adapters can operate as both an ethernet NIC and an InfiniBand HCA (at the same time!), the driver is split up into three pieces: mlx4_core: Basic support for hardware, managing resources, sending commands to firmware, etc. Lives in drivers/net/mlx4 (so that it gets built in a natural way for ethernet support, even if CONFIG_INFINIBAND=n) and exports its API in include/linux/mlx4. mlx4_ib: InfiniBand HCA driver, sits between the IB midlayer and mlx4_core. Lives in drivers/infiniband/hw/mlx4. mlx4_eth: Ethernet NIC driver, sits between networking stack and mlx4_core. Also lives in drivers/net/mlx4. This is just a stub right now, because firmware support for ethernet mode is still too immature. In fact, the ConnectX hardware has support for fibre channel stuff too, so in the future there may also be an FC HBA driver layered on top of mlx4_core as well. I will post a full set of patches for review via email once I've had a chance to clean things up and split things into reasonable sized chunks (the full patch is > 300 KB right now), but for those who are interested, you can grab the connectx branch of my infiniband.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git connectx I've also put this code in my for-mm branch, so it should appear in the next -mm kernel whenever Andrew has time to get one out. Any and all comments are appreciated; my current plan is to merge the mlx4_core and mlx4_ib drivers for 2.6.22, and I hope that mlx4_eth will be ready for 2.6.23. I've tried to flag areas that are not fully implemented or still need work, and there are quite a few places that need cleanup, but the driver is at least able to run IP-over-InfiniBand and some basic userspace direct access tests. 
Speaking of direct access, a preliminary version of a userspace driver that works with libibverbs is available from: git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git Thanks to the crew at Mellanox for lots of help with sample code and debugging, as well as early access to the hardware! From sweitzen at cisco.com Mon Apr 9 21:53:34 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 9 Apr 2007 21:53:34 -0700 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding Message-ID: Moni, I've been testing IPoIB bonding on RHEL4 x86_64 and RHEL5 x86_64, and have some questions. This feature looks like it has a lot of potential, thanks for adding it to OFED 1.2. 1) IPoIB bonding and IPoIB CM do seem to work together, but after running ib-bond --bond-ip, I have to manually reconfigure IPoIB CM (both mode and mtu) again, then increase the bond0 mtu. It would be nice if ib-bond took care of this for me. 2) I tried a port failover stress test overnight (flipping a port every 10 seconds) and got an HCA catastrophic error, how much failover stress testing have you done while traffic is running? 3) Can I configure IPoIB bonding to start at boot time as part of the standard /etc/sysconfig network scripts? 4) Do you have any plans to support load balancing in addition to failover? 5) I've seen some erratic throughput with netperf using bond0 (no failover happening), have you seen this? 
For example:

Interim result: 2731.18 10^6bits/s over 1.00 seconds
Interim result: 2732.57 10^6bits/s over 1.00 seconds
Interim result: 2717.93 10^6bits/s over 1.01 seconds
Interim result: 1609.63 10^6bits/s over 1.69 seconds
Interim result: 487.73 10^6bits/s over 3.30 seconds
Interim result: 394.51 10^6bits/s over 1.24 seconds
Interim result: 380.00 10^6bits/s over 1.04 seconds
Interim result: 621.87 10^6bits/s over 1.08 seconds
Interim result: 372.02 10^6bits/s over 1.67 seconds
Interim result: 388.15 10^6bits/s over 1.00 seconds
Interim result: 419.81 10^6bits/s over 1.00 seconds
Interim result: 369.75 10^6bits/s over 1.14 seconds
Interim result: 858.04 10^6bits/s over 1.00 seconds
Interim result: 2285.79 10^6bits/s over 1.15 seconds
Interim result: 501.81 10^6bits/s over 4.56 seconds
Interim result: 465.64 10^6bits/s over 1.08 seconds
Interim result: 499.98 10^6bits/s over 1.00 seconds
Interim result: 478.41 10^6bits/s over 1.05 seconds
Interim result: 482.03 10^6bits/s over 1.00 seconds
Interim result: 494.85 10^6bits/s over 1.01 seconds

Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From rdreier at cisco.com Mon Apr 9 22:05:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Apr 2007 22:05:39 -0700 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: (Scott Weitzenkamp's message of "Mon, 9 Apr 2007 21:53:34 -0700") References: Message-ID: > 2) I tried a port failover stress test overnight (flipping a port every > 10 seconds) and got an HCA catastrophic error, how much failover stress > testing have you done while traffic is running? HCA catastrophic errors are either a hardware problem (either a transient condition like overheating, or a busted HCA), or a firmware bug. Can you post details (HCA type/FW ver, catastrophic error buffer dump), and make sure Mellanox sees it? - R.
From sweitzen at cisco.com Mon Apr 9 22:10:41 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 9 Apr 2007 22:10:41 -0700 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: Message-ID: > > 2) I tried a port failover stress test overnight (flipping > a port every > > 10 seconds) and got an HCA catastrophic error, how much > failover stress > > testing have you done while traffic is running? > > HCA catastrophic errors are either a hardware problem (either a > transient condition like overheating, or a busted HCA), or a firmware > bug. Can you post details (HCA type/FW ver, catastrophic error buffer > dump), and make sure Mellanox sees it? I didn't save the output the first time, I'm trying to reproduce tonight on two or more different setups, and will post details if it happens again. Scott From sweitzen at cisco.com Mon Apr 9 23:43:47 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 9 Apr 2007 23:43:47 -0700 Subject: [ofa-general] SRP HA dm_multipath testing and questions Message-ID: I've been testing SRP HA and dm_multipath with: - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on", then rebooted. On SLES 10, I ran "chkconfig boot.multipath on" and "chkconfig multipathd on", then rebooted. Ishai, I don't seem to need 91-srp.rules, are you using the boot.multipath and multipathd scripts? On both RHEL4 networks, I get IB port load balancing and failover, on SLES10 I only see failover. I'm not sure if this is a function of RHEL4-vs-SLES10, or RAID vs JBOD. Traffic failover is very slow (a few minutes), what do others see? I will be testing DDN IB storage, EMC DMX, and RHEL5 soon. 
I'm getting an Oops on RHEL4 U3 x86_64 on both test networks: scsi3 (0:0): rejecting I/O to offline device scsi3 (0:0): rejecting I/O to offline device scsi3 (0:0): rejecting I/O to offline device scsi3 (0:<4>NMI Watchdog detected LOCKUP, CPU=1, registers: CPU 1 Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_ acl sunrpc rdma_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_loc al_sa(U) ds yenta_socket pcmcia_core dm_mirror dm_round_robin dm_multipath dm_mo d button battery ac ohci_hcd hw_random shpchp ib_mthca(U) ib_ipoib(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 tg3 flop py sg ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod Pid: 3990, comm: scsi_eh_3 Not tainted 2.6.9-34.ELsmp RIP: 0010:[] {serial_in+83} RSP: 0018:000001007f203c10 EFLAGS: 00000002 RAX: 00000000ffffff00 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804b59a0 RBP: ffffffff804b59a0 R08: 000000000000003a R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002706 R13: ffffffff8045afc5 R14: 0000000000000009 R15: 000000000000002d FS: 0000002a958a07a0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000036ce02e728 CR3: 00000000cff00000 CR4: 00000000000006e0 Process scsi_eh_3 (pid: 3990, threadinfo 000001007f202000, task 000001007f1957f0 ) Stack: ffffffff80242ab2 0000000d000402dc ffffffff803f88e0 00000000000402dc 0000000000040309 0000000000000030 000001017bf79830 000000000000c000 ffffffff8013764c 0000000000040309 Call Trace:{serial8250_console_write+113} {_ _call_console_drivers+68} {release_console_sem+276} {vprintk+49 8} {printk+141} {__wake_up+54} {freed_request+105} {:dm_multipath:mu ltipath_end_io+0} {:scsi_mod:scsi_prep_fn+120} {elv_nex t_request+68} {:scsi_mod:scsi_request_fn+66} {blk_i nsert_request+160} {:scsi_mod:scsi_requeue_command+48} 
{:scsi_mod:scsi_io_completion+866} {:scsi_mod:scsi_error_handler+2809} {child_rip+8} {:scsi_mod:scsi_error_h andler+0} {child_rip+0} Code: 0f b6 c0 c3 0f b6 4f 22 0f b6 47 23 41 89 d0 d3 e6 83 f8 02 Kernel panic - not syncing: nmi watchdog Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue Apr 10 00:19:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 10:19:50 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070405180311.F157CE6080F@openfabrics.org> References: <20070405180311.F157CE6080F@openfabrics.org> Message-ID: <20070410071924.GC4717@mellanox.co.il> > Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: > {:ib_ipoib:ipoib_mcast_join_finish+69} ... > Code: 8b 48 20 89 c8 89 ca 25 00 ff 00 00 c1 e2 18 c1 e0 08 09 c2 > RIP {:ib_ipoib:ipoib_mcast_join_finish+69} RSP > <00000101bc83bc > 38> > CR2: 0000000000000020 > <0>Kernel panic - not syncing: Oops Can you please check which code line does ipoib_mcast_join_finish+69 point at? You can either use plain objdump for this, or use the script from http://www.openfabrics.org/~mst/oops give it the path to the ib_ipoib.ko file and ipoib_mcast_join_finish+69 -- MST From mst at dev.mellanox.co.il Tue Apr 10 00:23:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 10:23:54 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070410072001.6C714E60821@openfabrics.org> References: <20070410072001.6C714E60821@openfabrics.org> Message-ID: <20070410072354.GE4717@mellanox.co.il> BTW, could you pls add more detail? What OS/HW/FW does this happen with? From mst at dev.mellanox.co.il Tue Apr 10 00:32:54 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Apr 2007 10:32:54 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070405181209.8B513E60835@openfabrics.org> References: <20070405181209.8B513E60835@openfabrics.org> Message-ID: <20070410073254.GF4717@mellanox.co.il> > Regarding comment #34, can you add details on how you are doing port failover > (using ibportstate?) > and what traffic you are running (what is the netperf > command line?)? I have opensm running on host 11.4.3.175. Host 11.4.3.178 is connected to switch lid 5 ports 7 and 8. I run on 11.4.3.175:

#!/bin/bash
while ./netperf -D -H 11.4.3.178
do
    date
    sleep 5
done

and at the same time, also on 11.4.3.175:

#!/bin/bash
lid=5; ports="7 8";
while true
do
    for port in $ports
    do
        echo ibportstate $lid $port disable
        ibportstate $lid $port disable
        sleep 5
        echo ibportstate $lid $port enable
        ibportstate $lid $port enable
        sleep 5
    done
done

From mst at dev.mellanox.co.il Tue Apr 10 00:41:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 10:41:15 +0300 Subject: [ofa-general] Re: [Bug 458] Heavy interrupt rates kill UDP performance In-Reply-To: <20070410071542.A1842E603BE@openfabrics.org> References: <20070410071542.A1842E603BE@openfabrics.org> Message-ID: <20070410074115.GG4717@mellanox.co.il> > I can reproduce this with OFED 1.2rc1 IPoIB UD, dual Xeon 5160 (in Dell > PE1950), and Cheetah DDR HCA: This looks exactly like receive side live lock - unfortunately since it was decided not to include NAPI in OFED 1.2, there's not much can be done here. And I expect you'll see the same behaviour with OFED 1.1. You can try increasing the ipoib recv_queue_size by using the appropriate module parameter. Of course, bigger recv_queue_size results in worse CPU cache utilization pattern, so performance might suffer for other workloads.
> The problem is NOT there in IPoIB CM: Probably because IPoIB CM uses 64KB messages, so the interrupt rate is reduced by a factor of 32. -- MST From mst at dev.mellanox.co.il Tue Apr 10 00:54:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 10:54:33 +0300 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <1175791278.18389.54.camel@trinity.ogc.int> References: <000001c7779f$4d1f49b0$ff0da8c0@amr.corp.intel.com> <1175791278.18389.54.camel@trinity.ogc.int> Message-ID: <20070410075433.GH4717@mellanox.co.il> > Quoting Tom Tucker : > Subject: RE: [ofa-general] Re: Incorrect max_sge reported in mthca device query > > On Thu, 2007-04-05 at 09:27 -0700, Sean Hefty wrote: > > >The challenge with the current query/request method is that as we've > > >discussed the advertised max may not work. What makes the adjust/retry > > >unworkable is that you don't know which of the advertised maxes caused > > >the request to fail. So when you retry, which qp_attr do you adjust? The > > >send sge? The recv sge? The qp depth? > > > > > >So what I'm proposing, and I think is similar if not identical to what > > >other folks have talked about is having an interface that treats the > > >qp_attr values as requested-sizes that can be adjusted by the provider. > > >So for example, if I ask for a send_sge of 30, but you can only do 28, > > >you give me 28 and adjust the qp_attr structure so that I know what I > > >got. This would allow me to perform a predictable sequence of 1. query, > > >2. request, 3. adjust in my code. > > > > If the send sge/recv sge/qp depth/etc. aren't independent though, this pushes > > the problem and policy decision down to the provider. I can't think of an easy > > solution to this. > > Agreed. But practically I think they are. I think the SGE max is driven > off the max size of a WR and type of QP. This is true of the iWARP > adapters as well. Are you sure? 
For example for mthca the amount of memory you use is proportional to #WRs * #SGEs. So they aren't really independent. > But taking the bait...even if you didn't push it down to the provider, > how do you expose the inter-relationships to the consumer? An approach > in this vein is a "could_you_would_you/why_not" interface that would > return whether or not the specified qp_attr would work and if it didn't > some indication of which resource(s) caused the problem. The problems > there are a) the resource may be gone when you go back with what you > just had "approved", and b) you still have to fuss with multiple whacks > at it if you couldn't get what you asked for. Right. > I think something simpler, although arguably not perfect is the way to > go. You also have to take into account that some #WRs/#SGEs combinations will perform better than others. For example it's common for hardware to assume power of 2 ring sizes, so you are wasting memory unless you match that. And by the way, #WRs/#SGEs isn't the only parameter that has this problem, for example for Tavor RC QPs work better with 1K MTU than with 2K MTU, while current apps tend to simply use the max MTU supported. So I think that the only sane way is to let the user actually specify his full requirements and have the provider satisfy them in an optimal way. -- MST From monisonlists at gmail.com Tue Apr 10 01:02:16 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 10 Apr 2007 11:02:16 +0300 Subject: [ofa-general] Re: [ewg] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: Message-ID: <461B4488.8070705@gmail.com> Scott Weitzenkamp (sweitzen) wrote: > Moni, > > I've been testing IPoIB bonding on RHEL4 x86_64 and RHEL5 x86_64, and > have some questions. This feature looks like it has a lot of potential, > thanks for adding it to OFED 1.2. > Thanks for the inputs.... 
> 1) IPoIB bonding and IPoIB CM do seem to work together, but after > running ib-bond --bond-ip, I have to manually reconfigure IPoIB CM (both > mode and mtu) again, then increase the bond0 mtu. It would be nice if > ib-bond took care of this for me. I haven't had a chance yet to test bonding with IPoIB-CM. I'll look into it and try to fix what's needed. > > 2) I tried a port failover stress test overnight (flipping a port every > 10 seconds) and got an HCA catastrophic error, how much failover stress > testing have you done while traffic is running? I ran the same tests more or less but never had such errors. > > 3) Can I configure IPoIB bonding to start at boot time as part of the > standard /etc/sysconfig network scripts? We are thinking how to make the bond interface act as "normal" network interface. > > 4) Do you have any plans to support load balancing in addition to failover? We have thoughts how to add support for more bonding policies but it is not something we plan to implement immediately. > > 5) I've seen some erratic throughput with netperf using bond0 (no > failover happening), have you seen this? For example: Can you add please more details about the test environment? OS, ARCH, HW, etc... 
> > Interim result: 2731.18 10^6bits/s over 1.00 seconds > Interim result: 2732.57 10^6bits/s over 1.00 seconds > Interim result: 2717.93 10^6bits/s over 1.01 seconds > Interim result: 1609.63 10^6bits/s over 1.69 seconds > Interim result: 487.73 10^6bits/s over 3.30 seconds > Interim result: 394.51 10^6bits/s over 1.24 seconds > Interim result: 380.00 10^6bits/s over 1.04 seconds > Interim result: 621.87 10^6bits/s over 1.08 seconds > Interim result: 372.02 10^6bits/s over 1.67 seconds > Interim result: 388.15 10^6bits/s over 1.00 seconds > Interim result: 419.81 10^6bits/s over 1.00 seconds > Interim result: 369.75 10^6bits/s over 1.14 seconds > Interim result: 858.04 10^6bits/s over 1.00 seconds > Interim result: 2285.79 10^6bits/s over 1.15 seconds > Interim result: 501.81 10^6bits/s over 4.56 seconds > Interim result: 465.64 10^6bits/s over 1.08 seconds > Interim result: 499.98 10^6bits/s over 1.00 seconds > Interim result: 478.41 10^6bits/s over 1.05 seconds > Interim result: 482.03 10^6bits/s over 1.00 seconds > Interim result: 494.85 10^6bits/s over 1.01 seconds > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > ------------------------------------------------------------------------ > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From mst at dev.mellanox.co.il Tue Apr 10 01:36:10 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Apr 2007 11:36:10 +0300 Subject: [ofa-general] Re: Incorrect max_sge reported in mthca device query In-Reply-To: <1175784333.18389.30.camel@trinity.ogc.int> References: <1175371057.19974.8.camel@trinity.ogc.int> <20070401064320.GX5436@mellanox.co.il> <1175467474.31135.18.camel@trinity.ogc.int> <20070402060816.GC5072@mellanox.co.il> <1175784333.18389.30.camel@trinity.ogc.int> Message-ID: <20070410083610.GL4717@mellanox.co.il> > The application is NFS-RDMA. NFS keeps it's data as non-contiguous > arrays of pages. So the motivation is that having a larger SGL allows > you to support larger data transfers with a single operation. What operations are you using? RDMA writes? If yes, it might be cleaner, and might turn out to be more efficient, to simply post a list of multiple write WRs, each with a small number of SGEs. You can use selective signalling to only get a completion on the last WR on success. -- MST From monisonlists at gmail.com Tue Apr 10 01:48:26 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 10 Apr 2007 11:48:26 +0300 Subject: [ofa-general] Re: bonding over ipoib - where to get it? In-Reply-To: References: Message-ID: <461B4F5A.4040106@gmail.com> Jian Xiao wrote: > Hi, > > I heard that there is some version of bonding working over ipoib. Moni > Shoua provided some patches from the ipoib side. The latest patch is > dated 3/28/2007. Which kernel version is this patch made for? The patch is good for ofed_1_2 branch (from 2.6.20) On the > other hand, I haven't seen any changes to the bonding driver. Anyone > know where I could get patch for that? The bonding package that comes with OFED-1.2 is the one you want. It is based on the bonding that comes with upstream kernel. > > Thanks. 
> > Jian Xiao > RS/6000 SP CSS Adapter Development > Office: 414/2-15 > Phone: 433-4086 (t/l 293) > From olivier.cozette at seanodes.com Tue Apr 10 01:56:43 2007 From: olivier.cozette at seanodes.com (Olivier Cozette) Date: Tue, 10 Apr 2007 10:56:43 +0200 Subject: [ofa-general] Help with an MTHCA "catastrophe" In-Reply-To: <029101c76b19$8af42900$0281a8c0@ebpc> References: <029101c76b19$8af42900$0281a8c0@ebpc> Message-ID: <200704101056.45044.olivier.cozette@seanodes.com> Hi, I had the same error with my driver, and after some investigation I found that my srq depth and cq depth were too small to handle the maximum number of sends/receives that my application can generate concurrently. Normally, in that case the qp should move to the error state, but instead a catastrophic error occurs. I increased the srq/cq depth to match the maximum number of sends/receives that my application can generate concurrently (without reply/synchronization), and the bug no longer occurs. So you probably just need to increase your srq/cq depth and the number of posted buffers to match the maximum number of sends/receives that your driver can issue. Olivier Note: I have an MT25204 rev a0, firmware 1.2.0. On Tuesday 20 March 2007 18:59, Eric Barton wrote: > The following is console output immediately before a panic on a system > running lustre with OFED 1.1. How can I find out what it means?
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: > internal error 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: > 001d79f4 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0e]: 00000000 > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0f]: 00000000 > > ...shortly before it happens, the lustre/lnet OFED driver receives a number > of what I believe to be duplicate SEND completion events. It seems quite > sporadic, and doesn't appear to track hardware. 
> > More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381 > > Cheers, > Eric > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From vlad at lists.openfabrics.org Tue Apr 10 02:35:40 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 10 Apr 2007 02:35:40 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070410-0200 daily build status Message-ID: <20070410093541.0F569E6081C@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on powerpc 
with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From halr at voltaire.com Tue Apr 10 06:03:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 09:03:32 -0400 Subject: [ofa-general] multicast join failed for... In-Reply-To: <1176160833.14140.408570.camel@localhost.localdomain> References: <1176160833.14140.408570.camel@localhost.localdomain> Message-ID: <1176210209.14140.460005.camel@localhost.localdomain> Hi again Egor, On Mon, 2007-04-09 at 19:20, Hal Rosenstock wrote: > On Mon, 2007-04-09 at 18:47, Egor Tur wrote: > > Hi folk. > > > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > > > > And in osm.log: > > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > > > with the MC group. You could turn on -V with OpenSM and see more log > > > messages as to what is going on wrong from the SM's perspective. > > > > Ok. 
This from osm.log with -V : > > > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ > > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] > > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ > > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ > > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: > > MGID....................0xff12601bffff0000 : 0x0000000000000001 > > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a > > qkey....................0xB1B > > mlid....................0x0 > > mtu.....................0x84 > > TClass..................0x0 > > pkey....................0xFFFF > > rate....................0x83 > > pkt_life................0x0 > > SLFlowLabelHopLimit.....0x0 > > ScopeState..............0x1 > > ProxyJoin...............0x0 > > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > 6) and the group is 4x SDR. The request is for equal to the rate so it > fails. > > Are all your ports DDR or do you have a mix ? If all are DDR, you can > configure the default partition to use this rate. 
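For reference, the rate and mtu bytes in the MCMember Record dump above each pack a selector and a value into one byte, which is why 0x83 shows up in the log as "equal to" rate 3. A minimal sketch of that decoding follows. It assumes the usual InfiniBand layout (selector in the top 2 bits, value in the low 6 bits) and the encoding tables as I understand them from the IB specification; the function names are illustrative only and are not part of OpenSM.

```python
# Sketch only: decode the one-byte rate/mtu fields of an MCMemberRecord dump.
# The tables below are assumed from the InfiniBand encodings; check them
# against the specification before relying on them.

SELECTORS = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}
RATES_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def split_field(raw):
    """Return (selector, value): top 2 bits and low 6 bits of the byte."""
    return (raw >> 6) & 0x3, raw & 0x3F


def describe_rate(raw):
    """Human-readable form of a rate byte such as 0x83."""
    selector, value = split_field(raw)
    return f"{SELECTORS[selector]} {RATES_GBPS[value]} Gb/sec"


def describe_mtu(raw):
    """Human-readable form of an mtu byte such as 0x84."""
    selector, value = split_field(raw)
    return f"{SELECTORS[selector]} {MTU_BYTES[value]} bytes"


if __name__ == "__main__":
    print(describe_rate(0x83))  # rate byte from the dump above
    print(describe_mtu(0x84))   # mtu byte from the dump above
```

With the values from the dump, the rate byte decodes to selector 2 and rate 3 (10 Gb/sec), while both ports report 20 Gb/sec (rate 6) in ibstatus, which is consistent with the mismatch in the log.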
To elaborate a little more on this, the configuration would be done via /etc/osm-partitions.conf file with a single line as follows: Default=0x7fff,ipoib,rate=6:ALL=full; > > Apr 10 00:56:06 390090 [41001960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > > JoinState = 0 failed from port 0x001708ffffd1509a (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > > Apr 10 00:56:07 921941 [44007960] -> __osm_sa_mad_ctrl_process: [ > > Apr 10 00:56:07 921947 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > > Apr 10 00:56:07 921955 [44007960] -> __osm_sa_mad_ctrl_process: ] > > Apr 10 00:56:07 921960 [42804960] -> osm_mcmr_rcv_process: [ > > Apr 10 00:56:07 921961 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > > Apr 10 00:56:07 921978 [42804960] -> __osm_mcmr_rcv_join_mgrp: [ > > Apr 10 00:56:07 921994 [42804960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > > Apr 10 00:56:07 922000 [42804960] -> MCMember Record dump: > > MGID....................0xff12601bffff0000 : 0x0000000000000001 > > PortGid.................0xfe80000000000000 : 0x001708ffffd15099 > > qkey....................0xB1B > > mlid....................0x0 > > mtu.....................0x84 > > TClass..................0x0 > > pkey....................0xFFFF > > rate....................0x83 > > pkt_life................0x0 > > SLFlowLabelHopLimit.....0x0 > > ScopeState..............0x1 > > ProxyJoin...............0x0 > > Apr 10 00:56:07 922013 [42804960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 > > Same is true for both ports. > > > Apr 10 00:56:07 922019 [42804960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, __validate_port_caps, or > > JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > > What is the GUID for ib0 and ib1 ? 
> > > That should tell us which port is having the issue joining the group.
> >
> > # ibstat
> > CA 'mthca0'
> >     CA type: MT25208 (MT23108 compat mode)
> >     Number of ports: 2
> >     Firmware version: 4.7.400
> >     Hardware version: a0
> >     Node GUID: 0x001708ffffd15098
> >     System image GUID: 0x001708ffffd1509b
> >     Port 1:
> >         State: Active
> >         Physical state: LinkUp
> >         Rate: 20
> >         Base lid: 9
> >         LMC: 0
> >         SM lid: 1
> >         Capability mask: 0x02510a68
> >         Port GUID: 0x001708ffffd15099
> >     Port 2:
> >         State: Active
> >         Physical state: LinkUp
> >         Rate: 20
> >         Base lid: 11
> >         LMC: 0
> >         SM lid: 1
> >         Capability mask: 0x02510a68
> >         Port GUID: 0x001708ffffd1509a
> >
> > # ibstatus
> > Infiniband device 'mthca0' port 1 status:
> >     default gid: fe80:0000:0000:0000:0017:08ff:ffd1:5099
> >     base lid: 0x9
> >     sm lid: 0x1
> >     state: 4: ACTIVE
> >     phys state: 5: LinkUp
> >     rate: 20 Gb/sec (4X DDR)
> >
> > Infiniband device 'mthca0' port 2 status:
> >     default gid: fe80:0000:0000:0000:0017:08ff:ffd1:509a
> >     base lid: 0xb
> >     sm lid: 0x1
> >     state: 4: ACTIVE
> >     phys state: 5: LinkUp
> >     rate: 20 Gb/sec (4X DDR)
> >
> > >
> > > Are you using IPv6?
> >
> > I don't use IPv6, but the kernel was compiled with IPv6 support.
> > Also, when I build the kernel modules for infiniband with --with-ipoib-cm, the
> > ib_ipoib module depends on ipv6.
> > # modinfo ib_ipoib
> > filename: /lib/modules/2.6.18/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
> > author: Roland Dreier
> > description: IP-over-InfiniBand net driver
> > license: Dual BSD/GPL
> > vermagic: 2.6.18 SMP mod_unload gcc-4.1
> > depends: ib_cm,ipv6,ib_core,ib_sa,ib_sa,ib_core
> >
> > If the modules are built with --without-ipoib-cm, then ib_ipoib doesn't depend on ipv6.
> > But the messages in the log remain the same.

Are you using IPoIB (for IPv4)? If so, is that working?

-- Hal

> >
> > Thanks.
From kliteyn at dev.mellanox.co.il Tue Apr 10 06:50:34 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 10 Apr 2007 16:50:34 +0300
Subject: [ofa-general] OpenSM --run-once question
Message-ID: <461B962A.4090007@dev.mellanox.co.il>

Hi Hal.

I have a question regarding the --run-once OpenSM option.

I have two HCAs connected through a single InfiniScale III switch. I restart the driver on an HCA, which causes the port to go down and up, which in turn causes the switch to start a training sequence to decide whether it should work in SDR or DDR. This training sequence takes about 10-15 seconds.

Now, if I run OpenSM during this period, it finishes initialization with errors (printing the "Errors during initialization" error message), and immediately starts a new sweep, doing it again and again, until the switch training sequence is over and the SM manages to bring the subnet up.

Now, when I run OpenSM with --run-once, OpenSM finishes the first sweep with these "errors during initialization" and exits with status=0.

Is this behavior intentional? Should OSM loop until the subnet is really up? Or perhaps exit with some status other than 0?
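As long as the exit status stays 0 in this case, a caller can only detect the failed sweep by looking for the message itself. A minimal sketch of such a check follows, assuming the text "Errors during initialization" appears in whatever console output or log text the caller has captured. Where the message actually lands and its exact wording are assumptions here, and the function names are hypothetical, not part of OpenSM.

```python
# Sketch only: derive a pass/fail status from captured OpenSM output.
# INIT_ERROR_MARKER is taken from the message quoted in this thread; its
# exact wording and location (console vs. log file) are assumptions.

INIT_ERROR_MARKER = "errors during initialization"


def subnet_init_failed(captured_text):
    """True if the captured text reports errors during initialization."""
    return INIT_ERROR_MARKER in captured_text.lower()


def status_for(captured_text):
    """Map the captured text to a shell-style status: 0 = ok, 1 = failed."""
    return 1 if subnet_init_failed(captured_text) else 0


if __name__ == "__main__":
    sample = "Errors during initialization\n"
    print(status_for(sample))
```

A wrapper script could run the single sweep, feed the captured text to status_for(), and retry or alert on a non-zero result.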
Here's the relevant code snip from osm_state_mgr.c:

   /* If there were errors - then the subnet is not really up */
   if( p_mgr->p_subn->subnet_initialization_error == TRUE )
   {
      __osm_state_mgr_init_errors_msg( p_mgr );
   }
   else
   {
      /* The subnet is up correctly - set the first_time_master_sweep flag
       * (if it is on) to FALSE. */
      ..... bla bla
   }
   p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
   signal = OSM_SIGNAL_IDLE_TIME_PROCESS;

   /*
    * Finally signal the subnet up event
    */
   status = cl_event_signal( p_mgr->p_subnet_up_event );

-- Yevgeny

From halr at voltaire.com Tue Apr 10 06:50:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Apr 2007 09:50:32 -0400
Subject: [ofa-general] Re: OpenSM --run-once question
In-Reply-To: <461B962A.4090007@dev.mellanox.co.il>
References: <461B962A.4090007@dev.mellanox.co.il>
Message-ID: <1176213031.14140.462840.camel@localhost.localdomain>

Hi Yevgeny,

On Tue, 2007-04-10 at 09:50, Yevgeny Kliteynik wrote:
> Hi Hal.
>
> I have a question regarding the --run-once OpenSM option.
>
> I have two HCAs connected through a single InfiniScale III switch.
> I restart the driver on an HCA, which causes port to go down and
> up, which in turn causes the switch to start training sequence to
> decide whether it should work in SDR or DDR. This training sequence
> takes about 10-15 seconds.
>
> Now, if I run OpenSM during this period, it finishes initialization
> with errors (printing the "Errors during initialization" error message),
> and immediately starts new sweep, doing it again and again, until switch
> training sequence is over and SM manages to bring subnet up.
>
> Now, when I run OpenSM with --run-once, OpenSM finishes the first
> sweep with these "errors during initialization" and exits with status=0.
>
> Is this behavior intentional?

Don't know for sure. --run-once predates my involvement and I don't generally use it although I do know about some use cases for it.

> Should OSM loop until the subnet will be really up?
I think one could argue this one way or the other. As the subnet may not come up, not sure it should loop.

> Or perhaps exit with some status other than 0?

That seems reasonable and the minimum that should be done so there is some warning that the subnet may not be initialized properly.

-- Hal

> Here's the relevant code snip from osm_state_mgr.c:
>
>    /* If there were errors - then the subnet is not really up */
>    if( p_mgr->p_subn->subnet_initialization_error == TRUE )
>    {
>       __osm_state_mgr_init_errors_msg( p_mgr );
>    }
>    else
>    {
>       /* The subnet is up correctly - set the first_time_master_sweep flag
>        * (if it is on) to FALSE. */
>       ..... bla bla
>    }
>    p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
>    signal = OSM_SIGNAL_IDLE_TIME_PROCESS;
>
>    /*
>     * Finally signal the subnet up event
>     */
>    status = cl_event_signal( p_mgr->p_subnet_up_event );
>
> -- Yevgeny

From ogerlitz at voltaire.com Tue Apr 10 06:59:12 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 10 Apr 2007 16:59:12 +0300
Subject: [ofa-general] re [NET]: Fix neighbour destructor handling
Message-ID: <461B9830.3040507@voltaire.com>

Hi Michael,

I just came across this patch, can you educate me a little what is the implication (what problem it came to solve, if there is such) on ipoib?

thanks!

Or.

> From: Alexey Kuznetsov
> Date: Sat, 24 Mar 2007 19:52:16 +0000 (-0700)
> Subject: [NET]: Fix neighbour destructor handling.
> X-Git-Tag: v2.6.21-rc6~114^2~14
> X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=ecbb416939da77c0d107409976499724baddce7b
>
> [NET]: Fix neighbour destructor handling.
>
> ->neigh_destructor() is killed (not used), replaced with
> ->neigh_cleanup(), which is called when neighbor entry goes to dead
> state. At this point everything is still valid: neigh->dev,
> neigh->parms etc.
> > The device should guarantee that dead neighbor entries (neigh->dead != > 0) do not get private part initialized, otherwise nobody will cleanup > it. > > I think this is enough for ipoib which is the only user of this thing. > Initialization private part of neighbor entries happens in ipib > start_xmit routine, which is not reached when device is down. But it > would be better to add explicit test for neigh->dead in any case. > > Signed-off-by: David S. Miller > --- > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 0741c6d..f2a40ae 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) > queue_work(ipoib_workqueue, &priv->restart_task); > } > > -static void ipoib_neigh_destructor(struct neighbour *n) > +static void ipoib_neigh_cleanup(struct neighbour *n) > { > struct ipoib_neigh *neigh; > struct ipoib_dev_priv *priv = netdev_priv(n->dev); > @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n) > struct ipoib_ah *ah = NULL; > > ipoib_dbg(priv, > - "neigh_destructor for %06x " IPOIB_GID_FMT "\n", > + "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", > IPOIB_QPN(n->ha), > IPOIB_GID_RAW_ARG(n->ha + 4)); > > @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) > > static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) > { > - parms->neigh_destructor = ipoib_neigh_destructor; > + parms->neigh_cleanup = ipoib_neigh_cleanup; > > return 0; > } > diff --git a/include/net/neighbour.h b/include/net/neighbour.h > index 3725b93..ad7fe11 100644 > --- a/include/net/neighbour.h > +++ b/include/net/neighbour.h > @@ -36,7 +36,7 @@ struct neigh_parms > struct net_device *dev; > struct neigh_parms *next; > int (*neigh_setup)(struct neighbour *); > - void (*neigh_destructor)(struct neighbour *); > + void 
(*neigh_cleanup)(struct neighbour *); > struct neigh_table *tbl; > > void *sysctl_table; > diff --git a/net/atm/clip.c b/net/atm/clip.c > index ebb5d0c..8c38258 100644 > --- a/net/atm/clip.c > +++ b/net/atm/clip.c > @@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc, struct sk_buff *skb) > spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags); > } > > -static void clip_neigh_destroy(struct neighbour *neigh) > -{ > - DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh); > - if (NEIGH2ENTRY(neigh)->vccs) > - printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n"); > - NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD; > -} > - > static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb) > { > DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb); > @@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = { > /* parameters are copied from ARP ... */ > .parms = { > .tbl = &clip_tbl, > - .neigh_destructor = clip_neigh_destroy, > .base_reachable_time = 30 * HZ, > .retrans_time = 1 * HZ, > .gc_staletime = 60 * HZ, > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index 3183142..cfc6001 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_table *tbl) > n->dead = 1; > shrunk = 1; > write_unlock(&n->lock); > + if (n->parms->neigh_cleanup) > + n->parms->neigh_cleanup(n); > neigh_release(n); > continue; > } > @@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) > NEIGH_PRINTK2("neigh %p is stray.\n", n); > } > write_unlock(&n->lock); > + if (n->parms->neigh_cleanup) > + n->parms->neigh_cleanup(n); > neigh_release(n); > } > } > @@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *neigh) > kfree(hh); > } > > - if (neigh->parms->neigh_destructor) > - (neigh->parms->neigh_destructor)(neigh); > - > skb_queue_purge(&neigh->arp_queue); > > dev_put(neigh->dev); > @@ -675,6 +676,8 @@ static void 
neigh_periodic_timer(unsigned long arg) > *np = n->next; > n->dead = 1; > write_unlock(&n->lock); > + if (n->parms->neigh_cleanup) > + n->parms->neigh_cleanup(n); > neigh_release(n); > continue; > } > @@ -2088,8 +2091,11 @@ void __neigh_for_each_release(struct neigh_table *tbl, > } else > np = &n->next; > write_unlock(&n->lock); > - if (release) > + if (release) { > + if (n->parms->neigh_cleanup) > + n->parms->neigh_cleanup(n); > neigh_release(n); > + } > } > } > } From mst at dev.mellanox.co.il Tue Apr 10 07:08:34 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 17:08:34 +0300 Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling In-Reply-To: <461B9830.3040507@voltaire.com> References: <461B9830.3040507@voltaire.com> Message-ID: <20070410140818.GC4616@mellanox.co.il> Quoting Or Gerlitz : Subject: re [NET]: Fix neighbour destructor handling Hi Michael, I just came across this patch, can you educate me a little what is the implication (what problem it came to solve, if there is such) on ipoib? thanks! Or. > From: Alexey Kuznetsov > Date: Sat, 24 Mar 2007 19:52:16 +0000 (-0700) > Subject: [NET]: Fix neighbour destructor handling. > X-Git-Tag: v2.6.21-rc6~114^2~14 > X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=ecbb416939da77c0d107409976499724baddce7b > > [NET]: Fix neighbour destructor handling. > > ->neigh_destructor() is killed (not used), replaced with > ->neigh_cleanup(), which is called when neighbor entry goes to dead > state. At this point everything is still valid: neigh->dev, > neigh->parms etc. > > The device should guarantee that dead neighbor entries (neigh->dead != > 0) do not get private part initialized, otherwise nobody will cleanup > it. > > I think this is enough for ipoib which is the only user of this thing. > Initialization private part of neighbor entries happens in ipib > start_xmit routine, which is not reached when device is down. 
But it > would be better to add explicit test for neigh->dead in any case. > > Signed-off-by: David S. Miller Or, you should find this in lkml or openfabrics archives. Look for my message titled "dst_ifdown breaks infiniband" and follow the discussion. -- MST From ishai at dev.mellanox.co.il Tue Apr 10 07:21:07 2007 From: ishai at dev.mellanox.co.il (Ishai Rabinovitz) Date: Tue, 10 Apr 2007 17:21:07 +0300 Subject: [ofa-general] Re: SRP HA dm_multipath testing and questions In-Reply-To: References: Message-ID: <461B9D53.6000208@dev.mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: > I've been testing SRP HA and dm_multipath with: > - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID > - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID > - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs > > On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on", > then rebooted. On SLES 10, I ran "chkconfig boot.multipath on" and > "chkconfig multipathd on", then rebooted. Ishai, I don't seem to need > 91-srp.rules, are you using the boot.multipath and multipathd scripts? On RHEL4 you really do not need 91-srp.rules and it is not used (see /etc/init.d/openibd) On SLES10 I was sure that you need it. I checked it, and you are correct. I don't see how it does it, but it seems that when using boot.multipath there is no need for 91-srp.rules. I will check it more deeply and change documentation and openibd script accordingly. > > On both RHEL4 networks, I get IB port load balancing and failover, on > SLES10 I only see failover. I'm not sure if this is a function of > RHEL4-vs-SLES10, or RAID vs JBOD. > Maybe this is because you removed 91-srp.rules (Did you removed it?) How did you test the failover and failback? > Traffic failover is very slow (a few minutes), what do others see? > What do you mean by slow. When do you start counting. > I will be testing DDN IB storage, EMC DMX, and RHEL5 soon. 
> > I'm getting an Oops on RHEL4 U3 x86_64 on both test networks: > > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:<4>NMI Watchdog detected LOCKUP, CPU=1, registers: > CPU 1 > Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs > lockd nfs_ > acl sunrpc rdma_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) iw_cm(U) > ib_addr(U) ib_loc > al_sa(U) ds yenta_socket pcmcia_core dm_mirror dm_round_robin > dm_multipath dm_mo > d button battery ac ohci_hcd hw_random shpchp ib_mthca(U) ib_ipoib(U) > ib_umad(U) > ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 > tg3 flop > py sg ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod > Pid: 3990, comm: scsi_eh_3 Not tainted 2.6.9-34.ELsmp > RIP: 0010:[] {serial_in+83} > RSP: 0018:000001007f203c10 EFLAGS: 00000002 > RAX: 00000000ffffff00 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804b59a0 > RBP: ffffffff804b59a0 R08: 000000000000003a R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002706 > R13: ffffffff8045afc5 R14: 0000000000000009 R15: 000000000000002d > FS: 0000002a958a07a0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00000036ce02e728 CR3: 00000000cff00000 CR4: 00000000000006e0 > Process scsi_eh_3 (pid: 3990, threadinfo 000001007f202000, task > 000001007f1957f0 > ) > Stack: ffffffff80242ab2 0000000d000402dc ffffffff803f88e0 00000000000402dc > 0000000000040309 0000000000000030 000001017bf79830 000000000000c000 > ffffffff8013764c 0000000000040309 > Call Trace:{serial8250_console_write+113} > {_ > _call_console_drivers+68} > {release_console_sem+276} > {vprintk+49 > 8} > {printk+141} {__wake_up+54} > {freed_request+105} > {:dm_multipath:mu > ltipath_end_io+0} > {:scsi_mod:scsi_prep_fn+120} > {elv_nex > t_request+68} > 
{:scsi_mod:scsi_request_fn+66} > {blk_i > nsert_request+160} > {:scsi_mod:scsi_requeue_command+48} > {:scsi_mod:scsi_io_completion+866} > {:scsi_mod:scsi_error_handler+2809} > {child_rip+8} > {:scsi_mod:scsi_error_h > andler+0} > {child_rip+0} > > Code: 0f b6 c0 c3 0f b6 4f 22 0f b6 47 23 41 89 d0 d3 e6 83 f8 02 > Kernel panic - not syncing: nmi watchdog > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > Please open a bugzilla about this deadlock. Can you reproduce it? From ogerlitz at voltaire.com Tue Apr 10 07:18:57 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 10 Apr 2007 17:18:57 +0300 Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling In-Reply-To: <20070410140818.GC4616@mellanox.co.il> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> Message-ID: <461B9CD1.3090709@voltaire.com> Michael S. Tsirkin wrote: > Quoting Or Gerlitz : > I just came across this patch, can you educate me a little what is the > implication (what problem it came to solve, if there is such) on ipoib? > Or, > you should find this in lkml or openfabrics archives. > Look for my message titled "dst_ifdown breaks infiniband" > and follow the discussion. OK, thanks for the clarification. Now, just to make sure, my understanding is that the ipoib neigh_destructor patch to support bonding (pushed by MoniS to OFED 1.2) is --still-- needed also with this neigh_destructor to neigh_cleanup change, correct? Or. From tziporet at dev.mellanox.co.il Tue Apr 10 07:24:13 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 10 Apr 2007 17:24:13 +0300 Subject: [ofa-general] OFED teleconference: TODAY! In-Reply-To: References: Message-ID: <461B9E0D.4000008@mellanox.co.il> Jeff Squyres wrote: > Reminder: for *THIS WEEK ONLY*, the OFED teleconference was moved to > Tuesday (tomorrow) because of the Israel holiday. The teleconference > will be: > Agenda for the meeting today: 1. 
RC1 status (review high priority bugs) 2. Do we want to start a release process where all bugs and corresponding fixes must be approved by the RM before they hit the code base . If yes what should be the procedure. 3. Big clusters (> 100 nodes) for testing OFED High priority bugs: bug_id bug_severity assigned_to short_short_desc 529 blocker bos at pathscale.com dtest fails on ipath card 521 blocker sean.hefty at intel.com 32-bit uDAPL Application hangs on Intel MPI 3.0 534 blocker vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED 1.2-20070409 513 critical bos at pathscale.com error while installing ipath driver 420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic 431 critical mst at mellanox.co.il IPoIB CM locks up server on SLES10/RHEL4 ppc64 465 critical mst at mellanox.co.il IPoIB CM HA fails after several hours of failovers 520 critical swise at opengridcomputing.com 0.9.8-9 mvapich2 over iWARP not working 535 critical vlad at mellanox.co.il When installing OFED, it tries to uninstall even if no OFED version is installed 536 critical vlad at mellanox.co.il When upgrading from OFED 1.1 to 1.2, 1.1 isn't stopped before removal 436 major arlin.r.davis at intel.com Intel MPI and HP MPI DDR bandwidth dropped after OFED 1.2 alpha 503 major halr at voltaire.com Linux distributions Interoperability of IPoIB IPv6 does not work 459 major monis at voltaire.com support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 527 major monis at voltaire.com ib-bonding won't compile for RHEL5 i686 PAE kernel 484 major mst at dev.mellanox.co.il mstflint -d mthca0 fails on ppc64 506 major mst at mellanox.co.il IPoIB IPv4 multicast throughput is poor 508 major mst at mellanox.co.il IPoIB CM multicast is hogging interrupts 519 major pasha at mellanox.co.il MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED 438 major rolandd at cisco.com OFED SRP does not work with DDN IB storage large 
LUNs 464 major rolandd at cisco.com release libibverbs-1.1 final before OFED 1.2 499 major vlad at mellanox.co.il module compiled over ofed won't load due to symbol version mismatch > Tuesday, 10 Apr 2007 (tomorrow) > US Pacific time: 9am > US Mountain time: 10am > US Central time: 11am > US Eastern time: Noon > Israel time: 7pm > > US/Canada: +1.866.432.9903 > Israel: +972.9.892.7026 > All others: > http://cisco.com/en/US/about/doing_business/conferencing/index.html > > Meeting ID: 2102061 > > 2nd reminder: this week starts the first *weekly* OFED teleconference > (as opposed to bi-weekly). We currently are scheduled to have an OFED > teleconference *every week* until the Sonoma event at the end of this > month. > > > From mst at dev.mellanox.co.il Tue Apr 10 07:26:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 17:26:59 +0300 Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling In-Reply-To: <461B9CD1.3090709@voltaire.com> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> Message-ID: <20070410142659.GE4616@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: re [NET]: Fix neighbour destructor handling > > Michael S. Tsirkin wrote: > >Quoting Or Gerlitz : > >I just came across this patch, can you educate me a little what is the > >implication (what problem it came to solve, if there is such) on ipoib? > > >Or, > > you should find this in lkml or openfabrics archives. > > Look for my message titled "dst_ifdown breaks infiniband" > > and follow the discussion. > > OK, thanks for the clarification. > > Now, just to make sure, my understanding is that the ipoib > neigh_destructor patch to support bonding (pushed by MoniS to OFED 1.2) > is --still-- needed also with this neigh_destructor to neigh_cleanup > change, correct? Bonding is basically broken for IPoIB, both before and after this patch. 
And BTW, while Moni's patch seems to make things crash less frequently for him, it does not really fix this. I hope he'll post something more robust before OFED 1.2 is out.

-- MST

From rdreier at cisco.com Tue Apr 10 07:29:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Apr 2007 07:29:26 -0700
Subject: [ofa-general] OFED teleconference: TODAY!
In-Reply-To: <461B9E0D.4000008@mellanox.co.il> (Tziporet Koren's message of "Tue, 10 Apr 2007 17:24:13 +0300")
References: <461B9E0D.4000008@mellanox.co.il>
Message-ID:

> 438 major rolandd at cisco.com OFED SRP does not work with DDN IB storage large LUNs

This is 100% a bug on the DDN target, and I see no way to even work around it in the initiator. So I'll just close this OFED bug.

> 464 major rolandd at cisco.com release libibverbs-1.1 final before OFED 1.2

libibverbs 1.1 should be released tomorrow, unless someone comes up with a big last-minute problem.

- R.

From ogerlitz at voltaire.com Tue Apr 10 07:34:59 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 10 Apr 2007 17:34:59 +0300
Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling
In-Reply-To: <20070410142659.GE4616@mellanox.co.il>
References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il>
Message-ID: <461BA093.6000100@voltaire.com>

Michael S. Tsirkin wrote:
> Bonding is basically broken for IPoIB, both before and after this patch.
> And BTW, while Moni's patch seems to make things crash less frequently for him,
> it does not really fix this. I hope he'll post something more robust
> before OFED 1.2 is out.

I did follow most of the discussions between you and MoniS regarding the ipoib/bonding integration in OFED 1.2 and elsewhere. However, I don't see why "bonding is basically broken for ipoib". If you don't mind, please tell me the bottom line from your perspective.
I do want to make progress with ipoib/bonding using the bonding patchset I have posted to netdev and MoniS's ipoib neigh_cleanup patch.

Taking your brokenness claim into account, I will first post an RFC to the openib general list so we can discuss it there first.

Or.

From ogerlitz at voltaire.com Tue Apr 10 07:44:32 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 10 Apr 2007 17:44:32 +0300
Subject: [ofa-general] OFED teleconference: TODAY!
In-Reply-To: <461B9E0D.4000008@mellanox.co.il>
References: <461B9E0D.4000008@mellanox.co.il>
Message-ID: <461BA2D0.3070504@voltaire.com>

Tziporet Koren wrote:
> Jeff Squyres wrote:
> High priority bugs:
> bug_id bug_severity assigned_to short_short_desc

Please add to the list as blocker:

472 Data corruption with Lustre+OFED when using FMR on memfree HCAs

We see it also with iser, basically only on scsi --read--, which from the IB perspective is an RDMA write from the target to the initiator. The environment where we see it is Sinai (25204) with hw_ver=A0 and fw_ver=1.2.0.

Ishai did not manage to reproduce it with SRP, but the fact that it reproduced with two independent ULPs makes it a blocker, I think.

We will provide more details tomorrow.

Or.

From twbowman at gmail.com Tue Apr 10 08:11:04 2007
From: twbowman at gmail.com (Todd Bowman)
Date: Tue, 10 Apr 2007 09:11:04 -0600
Subject: [ofa-general] Help with an MTHCA "catastrophe"
In-Reply-To: <200704101056.45044.olivier.cozette@seanodes.com>
References: <029101c76b19$8af42900$0281a8c0@ebpc> <200704101056.45044.olivier.cozette@seanodes.com>
Message-ID:

Olivier,

I am having similar issues with the same firmware. Can you give me some more details? Did you make the changes on the driver side or the application? If on the driver, can you point me in the right direction to make those changes?
Thanks,

Todd

On 4/10/07, Olivier Cozette wrote:
>
> Hi,
>
> I had the same error with my driver, and after some investigation, I found
> that my srq depth and cq depth were too small to handle the maximum number of
> send/recv that my application can generate concurrently. Normally, in that
> case the qp state must become the error state, but instead a catastrophic
> error occurs.
>
> I increased the srq/cq depth to meet the maximum send/recv that my application
> can generate concurrently (without reply/synchro) and this bug no longer occurs.
>
> So, you probably just need to increase your srq/cq depth and post buffer to
> meet the maximum send/recv that your driver can do.
>
> Olivier
>
> Note: I have a MT25204 rev a0, firmware 1.2.0.
>
> On Tuesday 20 March 2007 18:59, Eric Barton wrote:
> > The following is console output immediately before a panic on a system
> > running lustre with OFED 1.1. How can I find out what it means?
> >
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: 001d79f4
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000
> > 2007-02-21 12:02:42 ib_mthca
0000:07:00.0: buf[0e]: 00000000 > > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0f]: 00000000 > > > > ...shortly before it happens, the lustre/lnet OFED driver receives a > number > > of what I believe to be duplicate SEND completion events. It seems > quite > > sporadic, and doesn't appear to track hardware. > > > > More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381 > > > > Cheers, > > Eric > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue Apr 10 08:30:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 18:30:03 +0300 Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling In-Reply-To: <461BA093.6000100@voltaire.com> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> Message-ID: <20070410153003.GF4616@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: re [NET]: Fix neighbour destructor handling > > Michael S. Tsirkin wrote: > >Bonding is basically broken for IPoIB, both before and after this patch. > >And BTW, while Moni's patch seems to make things crash less frequently for > >him, > >it does not really fix this. I hope he'll post something more robust > >before OFED 1.2 is out. 
> > I did follow most of the discussions between you and MoniS re the > ipoib/bonding integration in OFED 1.2 and elsewhere, however, I don't > see why "bonding is basically broken for ipoib"; if you don't mind, > please tell me the bottom line from your perspective. I'm not sure how you could have missed it. Here are a couple of links that the archive search threw up: http://thread.gmane.org/gmane.linux.drivers.openib/37516/focus=37539 http://thread.gmane.org/gmane.linux.drivers.openib/38074/focus=38075 > > I do want to make progress with ipoib/bonding using the patchset to > bonding I have posted to netdev and MoniS's ipoib neigh_cleanup patch. > > Taking your brokenness claim into account, I will first post an RFC to > the openib general list so we can discuss it first there. I really think you guys should try and address all the concerns raised on the general list before reposting RFCs. I wouldn't want to write them out all over again. -- MST From mst at dev.mellanox.co.il Tue Apr 10 08:32:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 18:32:42 +0300 Subject: [ofa-general] [PATCH for-2.6.21] IB/ipoib: fix DMA direction typo Message-ID: <20070410153242.GG4616@mellanox.co.il> This fixes bug 431 in openfabrics bugzilla. Signed-off-by: Michael S.
Tsirkin --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index e70492d..2b242a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -131,7 +131,7 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE); mapping[i + 1] = ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[i].page, - 0, PAGE_SIZE, DMA_TO_DEVICE); + 0, PAGE_SIZE, DMA_FROM_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, mapping[i + 1]))) goto partial_error; } -- MST From halr at voltaire.com Tue Apr 10 09:05:48 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 12:05:48 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_endport, isSMdisabled also indicates that an SM is present so poll SMInfo In-Reply-To: <1175801054.14140.35806.camel@localhost.localdomain> References: <1175704326.4436.202892.camel@localhost.localdomain> <1175792802.14140.27052.camel@localhost.localdomain> <1175793927.14140.28223.camel@localhost.localdomain> <1175796168.14140.30621.camel@localhost.localdomain> <1175797044.14140.31536.camel@localhost.localdomain> <1175801054.14140.35806.camel@localhost.localdomain> Message-ID: <1176221145.14140.470979.camel@localhost.localdomain> On Thu, 2007-04-05 at 15:24, Hal Rosenstock wrote: > On Thu, 2007-04-05 at 14:24, Roland Dreier wrote: > > > Good point. At a minimum, the spec is unclear about this (if they are > > > totally separate mechanisms). > > > > When is the spec ever clear? :) > > > > But I think the only interpretation that has a chance at matching the > > current spec is to say that IsSMDisabled is not directly related to an > > SM in the NOT-ACTIVE state. > > > > Maybe it's worth asking the WG what the motivation for introducing > > IsSMDisabled was? > > Yes, I've already done that. 
The purpose of IsSMdisabled is to prevent an SM from being started. Quoting one of the architects, the scenario is as follows: "Consider a network with heterogeneous elements, hosting different vendors' stuff, with different vendors' SMs. But one vendor has a company ID number, used in its GUID, that is numerically lower than anybody else's. But GUID is always the tiebreaker in who wins out to be master SM. If the usually-expected thing happens in an installation and nobody bothers setting priorities, one of that vendor's SMs always ends up the master, just because it has a lower company ID. This was deemed Not Good. So there at least must be a way to ensure that, even if a node is capable of running an SM, it doesn't. That's C14-69. It's set by an unspecified out-of-band means (like a console, or even a dip switch). If it's disabled, the thing never sticks its head up; it's even prohibited from asserting IsSM." Does this make more sense now? -- Hal > -- Hal > > > - R. From sweitzen at cisco.com Tue Apr 10 09:19:25 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 10 Apr 2007 09:19:25 -0700 Subject: [ofa-general] Changed default bugzilla Priority/Severity from P1/Blocker to P3/Normal Message-ID: Now there will be no more accidental P1/Blocker bugs. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed...
URL: From etta at systemfabricworks.com Tue Apr 10 09:58:11 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Tue, 10 Apr 2007 11:58:11 -0500 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <461B9D53.6000208@dev.mellanox.co.il> Message-ID: <003f01c77b91$6c47aee0$c801a8c0@ettac> Please see below. Thanks, Etta -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ishai Rabinovitz Sent: Tuesday, April 10, 2007 9:21 AM To: Scott Weitzenkamp (sweitzen) Cc: Roland Dreier (rdreier); ewg at lists.openfabrics.org; openib Subject: [ewg] Re: SRP HA dm_multipath testing and questions Scott Weitzenkamp (sweitzen) wrote: > I've been testing SRP HA and dm_multipath with: > - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID > - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID > - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs > > On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on", > then rebooted. On SLES 10, I ran "chkconfig boot.multipath on" and > "chkconfig multipathd on", then rebooted. Ishai, I don't seem to need > 91-srp.rules, are you using the boot.multipath and multipathd scripts? On RHEL4 you really do not need 91-srp.rules and it is not used (see /etc/init.d/openibd) On SLES10 I was sure that you need it. I checked it, and you are correct. I don't see how it does it, but it seems that when using boot.multipath there is no need for 91-srp.rules. I will check it more deeply and change documentation and openibd script accordingly. [EC] I just verified it on SLES10 x86_64. The multipath worked fine by using boot.multipath without 91-srp.rules. Ishai, in the SRP release notes - section 6, srp_daemon a., the first line should be changed to '"srp_daemon -a -o" is equivalent to "ibsrpdm"'. > > On both RHEL4 networks, I get IB port load balancing and failover, on > SLES10 I only see failover. 
I'm not sure if this is a function of > RHEL4-vs-SLES10, or RAID vs JBOD. > Maybe this is because you removed 91-srp.rules (Did you removed it?) How did you test the failover and failback? > Traffic failover is very slow (a few minutes), what do others see? > What do you mean by slow. When do you start counting. > I will be testing DDN IB storage, EMC DMX, and RHEL5 soon. > > I'm getting an Oops on RHEL4 U3 x86_64 on both test networks: > > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:0): rejecting I/O to offline device > scsi3 (0:<4>NMI Watchdog detected LOCKUP, CPU=1, registers: > CPU 1 > Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs > lockd nfs_ > acl sunrpc rdma_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) iw_cm(U) > ib_addr(U) ib_loc > al_sa(U) ds yenta_socket pcmcia_core dm_mirror dm_round_robin > dm_multipath dm_mo > d button battery ac ohci_hcd hw_random shpchp ib_mthca(U) ib_ipoib(U) > ib_umad(U) > ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 > tg3 flop > py sg ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod > Pid: 3990, comm: scsi_eh_3 Not tainted 2.6.9-34.ELsmp > RIP: 0010:[] {serial_in+83} > RSP: 0018:000001007f203c10 EFLAGS: 00000002 > RAX: 00000000ffffff00 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804b59a0 > RBP: ffffffff804b59a0 R08: 000000000000003a R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002706 > R13: ffffffff8045afc5 R14: 0000000000000009 R15: 000000000000002d > FS: 0000002a958a07a0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00000036ce02e728 CR3: 00000000cff00000 CR4: 00000000000006e0 > Process scsi_eh_3 (pid: 3990, threadinfo 000001007f202000, task > 000001007f1957f0 > ) > Stack: ffffffff80242ab2 0000000d000402dc ffffffff803f88e0 00000000000402dc > 
0000000000040309 0000000000000030 000001017bf79830 000000000000c000 > ffffffff8013764c 0000000000040309 > Call Trace:{serial8250_console_write+113} > {_ > _call_console_drivers+68} > {release_console_sem+276} > {vprintk+49 > 8} > {printk+141} {__wake_up+54} > {freed_request+105} > {:dm_multipath:mu > ltipath_end_io+0} > {:scsi_mod:scsi_prep_fn+120} > {elv_nex > t_request+68} > {:scsi_mod:scsi_request_fn+66} > {blk_i > nsert_request+160} > {:scsi_mod:scsi_requeue_command+48} > {:scsi_mod:scsi_io_completion+866} > {:scsi_mod:scsi_error_handler+2809} > {child_rip+8} > {:scsi_mod:scsi_error_h > andler+0} > {child_rip+0} > > Code: 0f b6 c0 c3 0f b6 4f 22 0f b6 47 23 41 89 d0 d3 e6 83 f8 02 > Kernel panic - not syncing: nmi watchdog > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > Please open a bugzilla about this deadlock. Can you reproduce it? _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From jsquyres at cisco.com Tue Apr 10 10:02:16 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 10 Apr 2007 13:02:16 -0400 Subject: [ofa-general] OFA server bugzilla "product" Message-ID: <7AF8879B-2539-458F-B0E2-D80118F5DAD3@cisco.com> Proposal: Make a "product" in bugzilla (alongside "OpenFabrics Linux" and "OpenFabrics Windows") for sysadmin issues dealing specifically with the OFA server. Rationale: We have a nonzero number of sysadmin requests for the OFA server. The previous OFA sysadmin, Michael Lee did a heroic job of maintaining all the outstanding requests and servicing them all, but there was still a bunch of legwork to ensure that a) people knew who to report sysadmin requests to, and b) ensuring that people actually did request directly from the responsible sysadmins (vs. just posting to the OF general list, which has too much traffic for the sysadmins to monitor). 
Having a bugzilla product for the OFA Server would be most helpful in both submitting requests through a well-known/understood process and tracking to ensure that all the work actually gets done (from the end-user's perspective). If no one cares/objects within a week, I'd like to ask Scott Weitzenkamp to set up an "OFA Server" product in Bugzilla. We'll then put some corresponding info on the wiki about how to submit OFA Server bugzilla requests. Thanks. -- Jeff Squyres Cisco Systems From becker at nas.nasa.gov Tue Apr 10 10:08:17 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 10 Apr 2007 10:08:17 -0700 Subject: [ofa-general] OFA server bugzilla "product" In-Reply-To: <7AF8879B-2539-458F-B0E2-D80118F5DAD3@cisco.com> References: <7AF8879B-2539-458F-B0E2-D80118F5DAD3@cisco.com> Message-ID: <795c49870704101008y3278139ar670cd8cede24a5e0@mail.gmail.com> Sounds good to me. Thanks. -jeff On 4/10/07, Jeff Squyres wrote: > > Proposal: Make a "product" in bugzilla (alongside "OpenFabrics Linux" > and "OpenFabrics Windows") for sysadmin issues dealing specifically > with the OFA server. > > Rationale: We have a nonzero number of sysadmin requests for the OFA > server. The previous OFA sysadmin, Michael Lee did a heroic job of > maintaining all the outstanding requests and servicing them all, but > there was still a bunch of legwork to ensure that a) people knew who > to report sysadmin requests to, and b) ensuring that people actually > did request directly from the responsible sysadmins (vs. just posting > to the OF general list, which has too much traffic for the sysadmins > to monitor). > > Having a bugzilla product for the OFA Server would be most helpful in > both submitting requests through a well-known/understood process and > tracking to ensure that all the work actually gets done (from the > end-user's perspective). > > If no one cares/objects within a week, I'd like to ask Scott > Weitzenkamp to set up an "OFA Server" product in Bugzilla.
We'll then > put some corresponding info on the wiki about how to submit OFA > Server bugzilla requests. Thanks. > > -- > Jeff Squyres > Cisco Systems From swise at opengridcomputing.com Tue Apr 10 10:14:07 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 10 Apr 2007 12:14:07 -0500 Subject: [ofa-general] OFED-1.2 compiles 32b apps on SLES 10 for IBM PPC Message-ID: <1176225247.4747.24.camel@stevo-desktop> I just built the ofed-1.2-rc1 kit on an IBM P5 PPC with SLES 10 and some of the apps got built as 32b. Seems like gcc on this distro defaults to 32b: # file /usr/bin/rping /usr/bin/rping: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, not stripped # file /usr/bin/ibv_devices /usr/bin/ibv_devices: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, not stripped # file /usr/bin/ibv_rc_pingpong /usr/bin/ibv_rc_pingpong: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, not stripped Should I open a bug for this? Steve.
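[Editorial note] The 32-bit/64-bit question raised in the message above can also be checked without the file utility, by reading the ELF identification bytes directly. The following is only a minimal sketch, not part of the OFED build: the helper names elf_ident and elf_ident_of_file are made up for this note. It assumes the standard ELF header layout, where bytes 0-3 are the magic, byte 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit, and byte 5 (EI_DATA) is 1 for LSB and 2 for MSB.

```python
ELF_MAGIC = b"\x7fELF"
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}   # e_ident[4]
EI_DATA_NAMES = {1: "LSB", 2: "MSB"}          # e_ident[5]


def elf_ident(header):
    """Return (class, byte order) for the first bytes of a file.

    Returns None when the bytes do not start with the ELF magic.
    """
    if len(header) < 6 or header[:4] != ELF_MAGIC:
        return None
    return (
        EI_CLASS_NAMES.get(header[4], "unknown"),
        EI_DATA_NAMES.get(header[5], "unknown"),
    )


def elf_ident_of_file(path):
    """Hypothetical convenience wrapper: classify a file on disk."""
    with open(path, "rb") as f:
        return elf_ident(f.read(6))
```

For a binary that file describes as "ELF 32-bit MSB executable", such as the rping output quoted above, elf_ident_of_file would be expected to return ("32-bit", "MSB").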
From halr at voltaire.com Tue Apr 10 10:09:33 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 13:09:33 -0400 Subject: [ofa-general] [RFC] IB management changes proposal Message-ID: <1176224960.14140.474623.camel@localhost.localdomain> The following changes are proposed for IB management (master branch of my management git tree): In order to better match package names, the following directory names are to be changed (from->to): osm->opensm, diags->openib-diags. Since opensm is a system daemon, opensm is to be moved from /usr/bin to /usr/sbin. For consistency with the package name, /var/cache/osm is to be moved to /var/cache/opensm. Also, for consistency with the package name, all config, log, and dump files named osm* are to be changed to opensm*. To avoid confusion and possible conflicts in configuring daemon options, have only one configuration file (the existence of both /etc/sysconfig/opensm and /etc/opensm.conf is problematic): remove the /etc/sysconfig/opensm file and use only opensm.conf. Move opensm.conf to /etc/rdma (as discussed in the thread labeled "Location and naming of RDMA enablement stack rpm" on general at lists.openfabrics.org). Any comments? -- Hal From rdreier at cisco.com Tue Apr 10 10:16:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 10:16:41 -0700 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: <1176224960.14140.474623.camel@localhost.localdomain> (Hal Rosenstock's message of "10 Apr 2007 13:09:33 -0400") References: <1176224960.14140.474623.camel@localhost.localdomain> Message-ID: > In order to better match package names, the following directory names to > be changed from->to: > osm->opensm > diags->openib-diags After all the hassle of the name change, I think we should try to avoid new uses of the openib name. - R.
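[Editorial note] The file-system part of the RFC above can be summarised as a small old-to-new path mapping. This is only a sketch of the proposal as worded in the RFC, not of any shipped packaging: the function name proposed_path and the example file names used with it (/var/log/osm.log, /var/cache/osm/guid2lid) are illustrative assumptions, and the source-tree directory renames (osm, diags) are left out because their final names are still under discussion in the thread.

```python
import posixpath

# Explicit moves named in the RFC (old path -> proposed path).
MOVES = {
    "/usr/bin/opensm": "/usr/sbin/opensm",
    "/var/cache/osm": "/var/cache/opensm",
    "/etc/opensm.conf": "/etc/rdma/opensm.conf",
}

# Dropped so that only one configuration file remains.
REMOVED = {"/etc/sysconfig/opensm"}


def proposed_path(old):
    """Return the proposed location for `old`, or None if it is dropped.

    Only meant for opensm config, log, dump and cache files; other
    files whose names happen to start with "osm" are out of scope.
    """
    if old in REMOVED:
        return None
    if old in MOVES:
        return MOVES[old]
    # Files below a moved directory follow that directory.
    for src, dst in MOVES.items():
        if old.startswith(src + "/"):
            old = dst + old[len(src):]
            break
    # Config, log, and dump files named osm* become opensm*.
    directory, name = posixpath.split(old)
    if name.startswith("osm"):
        name = "opensm" + name[len("osm"):]
    return posixpath.join(directory, name)
```

Under these assumptions a log file such as /var/log/osm.log would map to /var/log/opensm.log, while /etc/sysconfig/opensm maps to None because the RFC removes it.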
From halr at voltaire.com Tue Apr 10 10:19:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 13:19:28 -0400 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: References: <1176224960.14140.474623.camel@localhost.localdomain> Message-ID: <1176225565.14140.475188.camel@localhost.localdomain> On Tue, 2007-04-10 at 13:16, Roland Dreier wrote: > > In order to better match package names, the following directory names to > > be changed from->to: > > osm->opensm > > diags->openib-diags > > After all the hassle of the name change, I think we should try to > avoid new uses of the openib name. These are IB diags and not RDMA diags though. What would a better name be ? -- Hal > - R. From sweitzen at cisco.com Tue Apr 10 10:25:43 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 10 Apr 2007 10:25:43 -0700 Subject: [ofa-general] RE: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070410073254.GF4717@mellanox.co.il> References: <20070405181209.8B513E60835@openfabrics.org> <20070410073254.GF4717@mellanox.co.il> Message-ID: You have: ports="7 8"; This is only toggling ports on one host, try adding the ports for the other host to the list too. Scott From sean.hefty at intel.com Tue Apr 10 10:27:14 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 10 Apr 2007 10:27:14 -0700 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: <1176225565.14140.475188.camel@localhost.localdomain> Message-ID: <000001c77b95$7af923c0$61d8180a@amr.corp.intel.com> >These are IB diags and not RDMA diags though. What would a better name >be ? Isn't the entire management directory IB specific? If so, you could just leave it at diags. I do prefer opensm over osm though. 
- Sean From halr at voltaire.com Tue Apr 10 10:26:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 13:26:28 -0400 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: <000001c77b95$7af923c0$61d8180a@amr.corp.intel.com> References: <000001c77b95$7af923c0$61d8180a@amr.corp.intel.com> Message-ID: <1176225987.14140.475596.camel@localhost.localdomain> On Tue, 2007-04-10 at 13:27, Sean Hefty wrote: > >These are IB diags and not RDMA diags though. What would a better name > >be ? > > Isn't the entire management directory IB specific? Yes. > If so, you could just leave it at diags. Diags is too generic as a package name. -- Hal > I do prefer opensm over osm though. > - Sean From rdreier at cisco.com Tue Apr 10 10:34:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 10:34:37 -0700 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: <1176225565.14140.475188.camel@localhost.localdomain> (Hal Rosenstock's message of "10 Apr 2007 13:19:28 -0400") References: <1176224960.14140.474623.camel@localhost.localdomain> <1176225565.14140.475188.camel@localhost.localdomain> Message-ID: > These are IB diags and not RDMA diags though. What would a better name be ? ib-diags then? openib doesn't even exist anymore. From rdreier at cisco.com Tue Apr 10 10:37:15 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 10:37:15 -0700 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: (Roland Dreier's message of "Tue, 10 Apr 2007 10:34:37 -0700") References: <1176224960.14140.474623.camel@localhost.localdomain> <1176225565.14140.475188.camel@localhost.localdomain> Message-ID: > ib-diags then? openib doesn't even exist anymore. Or better yet, infiniband-diags maybe? - R. 
From rdreier at cisco.com Tue Apr 10 10:40:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 10:40:13 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.21] IB/ipoib: fix DMA direction typo In-Reply-To: <20070410153242.GG4616@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Apr 2007 18:32:42 +0300") References: <20070410153242.GG4616@mellanox.co.il> Message-ID: Thanks, good catch. I'll ask Linus to pull this today. From rdreier at cisco.com Tue Apr 10 10:43:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 10:43:44 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will pull two small post-rc6 fixes for crasher bugs: Erez Zilber (1): IB/iser: Don't defer connection failure notification to workqueue Michael S.
Tsirkin (1): IPoIB/cm: Fix DMA direction typo drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/iser/iscsi_iser.h | 1 - drivers/infiniband/ulp/iser/iser_verbs.c | 40 ++++++++++++------------------ 3 files changed, 17 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index e70492d..2b242a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -131,7 +131,7 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE); mapping[i + 1] = ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[i].page, - 0, PAGE_SIZE, DMA_TO_DEVICE); + 0, PAGE_SIZE, DMA_FROM_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, mapping[i + 1]))) goto partial_error; } diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index cae8c96..8960196 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -245,7 +245,6 @@ struct iser_conn { wait_queue_head_t wait; /* waitq for conn/disconn */ atomic_t post_recv_buf_count; /* posted rx count */ atomic_t post_send_buf_count; /* posted tx count */ - struct work_struct comperror_work; /* conn term sleepable ctx*/ char name[ISER_OBJECT_NAME_SIZE]; struct iser_page_vec *page_vec; /* represents SG to fmr maps* * maps serialized as tx is*/ diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 693b770..1fc9674 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -48,7 +48,6 @@ static void iser_cq_tasklet_fn(unsigned long data); static void iser_cq_callback(struct ib_cq *cq, void *cq_context); -static void iser_comp_error_worker(struct work_struct *work); static void iser_cq_event_callback(struct ib_event *cause, void *context) { @@ -480,7 +479,6 @@ int 
iser_conn_init(struct iser_conn **ibconn) init_waitqueue_head(&ib_conn->wait); atomic_set(&ib_conn->post_recv_buf_count, 0); atomic_set(&ib_conn->post_send_buf_count, 0); - INIT_WORK(&ib_conn->comperror_work, iser_comp_error_worker); INIT_LIST_HEAD(&ib_conn->conn_list); spin_lock_init(&ib_conn->lock); @@ -753,26 +751,6 @@ int iser_post_send(struct iser_desc *tx_desc) return ret_val; } -static void iser_comp_error_worker(struct work_struct *work) -{ - struct iser_conn *ib_conn = - container_of(work, struct iser_conn, comperror_work); - - /* getting here when the state is UP means that the conn is being * - * terminated asynchronously from the iSCSI layer's perspective. */ - if (iser_conn_state_comp_exch(ib_conn, ISER_CONN_UP, - ISER_CONN_TERMINATING)) - iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, - ISCSI_ERR_CONN_FAILED); - - /* complete the termination process if disconnect event was delivered * - * note there are no more non completed posts to the QP */ - if (ib_conn->disc_evt_flag) { - ib_conn->state = ISER_CONN_DOWN; - wake_up_interruptible(&ib_conn->wait); - } -} - static void iser_handle_comp_error(struct iser_desc *desc) { struct iser_dto *dto = &desc->dto; @@ -791,8 +769,22 @@ static void iser_handle_comp_error(struct iser_desc *desc) } if (atomic_read(&ib_conn->post_recv_buf_count) == 0 && - atomic_read(&ib_conn->post_send_buf_count) == 0) - schedule_work(&ib_conn->comperror_work); + atomic_read(&ib_conn->post_send_buf_count) == 0) { + /* getting here when the state is UP means that the conn is * + * being terminated asynchronously from the iSCSI layer's * + * perspective. 
*/ + if (iser_conn_state_comp_exch(ib_conn, ISER_CONN_UP, + ISER_CONN_TERMINATING)) + iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + + /* complete the termination process if disconnect event was delivered * + * note there are no more non completed posts to the QP */ + if (ib_conn->disc_evt_flag) { + ib_conn->state = ISER_CONN_DOWN; + wake_up_interruptible(&ib_conn->wait); + } + } } static void iser_cq_tasklet_fn(unsigned long data) From halr at voltaire.com Tue Apr 10 10:59:53 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2007 13:59:53 -0400 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: References: <1176224960.14140.474623.camel@localhost.localdomain> <1176225565.14140.475188.camel@localhost.localdomain> Message-ID: <1176227991.14140.477539.camel@localhost.localdomain> On Tue, 2007-04-10 at 13:37, Roland Dreier wrote: > > ib-diags then? openib doesn't even exist anymore. > > Or better yet, infiniband-diags maybe? Sure. -- Hal > - R. From mst at dev.mellanox.co.il Tue Apr 10 11:11:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 21:11:25 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: References: <20070410073254.GF4717@mellanox.co.il> Message-ID: <20070410181125.GA10218@mellanox.co.il> Is that what you are doing? I can't easily do this (this requires a 3rd system) and I can't see how toggling ports connected to host A would crash host B. -- MST From sweitzen at cisco.com Tue Apr 10 11:13:11 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 10 Apr 2007 11:13:11 -0700 Subject: [ofa-general] RE: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070410181125.GA10218@mellanox.co.il> References: <20070410073254.GF4717@mellanox.co.il> <20070410181125.GA10218@mellanox.co.il> Message-ID: Yes, I'm toggling one port at a time, among 4 ports (2 for each host).
I've only seen a crash once, every other time IPoIB CM just stops working. From mst at dev.mellanox.co.il Tue Apr 10 11:16:20 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 21:16:20 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: References: <20070410181125.GA10218@mellanox.co.il> Message-ID: <20070410181620.GC10218@mellanox.co.il> Scott, pls provide data about the crash as requested by previous comments (note you don't have to reproduce it to provide that data). -- MST From mst at dev.mellanox.co.il Tue Apr 10 11:18:10 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Apr 2007 21:18:10 +0300 Subject: [ofa-general] does RHEL5 Xen work with OFED? In-Reply-To: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com> References: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com> Message-ID: <20070410181810.GD10218@mellanox.co.il> > Quoting G.O. : > Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > > On 4/5/07, Scott Weitzenkamp (sweitzen) wrote: > >Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual > >machine? > > > > I haven't tested SRP/iSER , but IPoIB works only on dom0 kernel. > You can't use any infiniband stuff on the guest OSes . > > Gurhan What doesn't work? I would expect both IPoIB and SRP behave in more or less the same way as any network/storage devices, and get virtualized by Xen. -- MST From chu11 at llnl.gov Tue Apr 10 11:25:27 2007 From: chu11 at llnl.gov (Al Chu) Date: Tue, 10 Apr 2007 11:25:27 -0700 Subject: [ofa-general] OFED-1.2 compiles 32b apps on SLES 10 for IBM PPC In-Reply-To: <1176225247.4747.24.camel@stevo-desktop> References: <1176225247.4747.24.camel@stevo-desktop> Message-ID: <1176229527.30343.575.camel@cardanus.llnl.gov> On Tue, 2007-04-10 at 12:14 -0500, Steve Wise wrote: > I just built the ofed-1.2-rc1 kit on an IBM P5 PPC with SLES 10 and some > the apps got built as 32b. 
Seems like gcc on this distro defaults to > 32b: > > # file /usr/bin/rping > /usr/bin/rping: ELF 32-bit MSB executable, PowerPC or cisco 4500, > version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared > libs), for GNU/Linux 2.6.4, not stripped > # file /usr/bin/ibv_devices > /usr/bin/ibv_devices: ELF 32-bit MSB executable, PowerPC or cisco 4500, > version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared > libs), for GNU/Linux 2.6.4, not stripped > # file /usr/bin/ibv_rc_pingpong > /usr/bin/ibv_rc_pingpong: ELF 32-bit MSB executable, PowerPC or cisco > 4500, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses > shared libs), for GNU/Linux 2.6.4, not stripped > > Should I open a bug for this? Hi Steve, I don't know how you're building/installing, but I bet it's the same thing we've hit in the past. Apparently rpmbuild defaults to naming packages based on the output of 'uname -m', but gcc defaults to building 32bit. So you end up with a 32bit binary inside a 'ppc64' rpm. One way to solve this was to rpmbuild w/ --target=ppc so the rpm matches the contents built. (Note: this wasn't done for OFED, it was for other packages.) I believe some configure options and compile options can get it to build 64bit, but I can't remember what those are. My opinion is this bug is with Suse. Perhaps workarounds could be added into the OFED build to build ppc/ppc64 rpms properly? Al > > Steve. -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From mst at dev.mellanox.co.il Tue Apr 10 11:27:15 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 10 Apr 2007 21:27:15 +0300 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: Message-ID: <20070410182715.GE10218@mellanox.co.il> > HCA catastrophic errors are either a hardware problem (either a > transient condition like overheating, or a busted HCA), or a firmware > bug. Not really, since most kernel code uses the DMA MR, they can easily be triggered by e.g. incorrect DMA API usage. I've just seen this with the recent PPC bug. -- MST From sweitzen at cisco.com Tue Apr 10 12:56:24 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 10 Apr 2007 12:56:24 -0700 Subject: [ofa-general] RE: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070401201802.GB11175@mellanox.co.il> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> Message-ID: > > The low throughput is a major issue, though. Shouldn't the > IP multicast > > throughput be similar to the UDP unicast throughput? > > Is the send side a send only member of multicast group, or > full member? > If it's a full join, HCA creates extra loopback traffic which > has then to be discarded, and which might explain performance > degradation. I don't know if it's send only member or full member, can you try iperf yourself please? Scott From swise at opengridcomputing.com Tue Apr 10 15:02:18 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 10 Apr 2007 17:02:18 -0500 Subject: [ofa-general] [GIT PULL] OFED-1.2 - Chelsio Bug Fixes Message-ID: <1176242538.4747.97.camel@stevo-desktop> Vlad, Please pull these cxgb3 and iw_cxgb3 changes from git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 Thanks, Steve. -------------------- Divy Le Ray: Ensure that the TCAM active region size is at least 16. Differentiate NIC only adapters from RNICs. Run the watchdog task when the link is up. Introduce FW micro version. Steve Wise: Set driver version to indicate ofed. 
Fail qp creation if the requested max_inline is too large. -------------------- commit 6578235176878573051df57e52cfc27d5f873617 Author: Steve Wise Date: Tue Apr 10 14:42:37 2007 -0500 Fail qp creation if the requested max_inline is too large. Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h index 6c7ac55..e7ea455 100644 --- a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -38,6 +38,7 @@ #include #include "firmware_exports.h" #define T3_MAX_SGE 4 +#define T3_MAX_INLINE 64 #define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) #define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 0c0ee20..fe57d11 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -762,6 +762,9 @@ static struct ib_qp *iwch_create_qp(stru if (rqsize > T3_MAX_RQ_SIZE) return ERR_PTR(-EINVAL); + + if (attrs->cap.max_inline_data > T3_MAX_INLINE) + return ERR_PTR(-EINVAL); /* * NOTE: The SQ and total WQ sizes don't need to be commit 5c1e33521d26918fcd366224d725dc56c163200a Author: Steve Wise Date: Tue Apr 10 13:37:30 2007 -0500 Set driver version to indicate ofed. 
Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index de44c57..d9be492 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -40,7 +40,7 @@ #include "iwch_user.h" #include "iwch.h" #include "iwch_cm.h" -#define DRV_VERSION "1.1" +#define DRV_VERSION "1.0-ofed" MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h index c929aa8..bd7c4f7 100644 --- a/drivers/net/cxgb3/version.h +++ b/drivers/net/cxgb3/version.h @@ -35,7 +35,7 @@ #define __CHELSIO_VERSION_H #define DRV_DESC "Chelsio T3 Network Driver" #define DRV_NAME "cxgb3" /* Driver version */ -#define DRV_VERSION "1.0" +#define DRV_VERSION "1.0-ofed" /* Firmware version */ #define FW_VERSION_MAJOR 3 commit fa03899fd363db4672ad75c1258db54c0ad2b849 Author: Divy Le Ray Date: Tue Apr 10 13:37:24 2007 -0500 Introduce FW micro version. Bump up FW version to 3.3.0 Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 5f29521..81262e5 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -717,7 +717,7 @@ static void bind_qsets(struct adapter *a } } -#define FW_FNAME "t3fw-%d.%d.bin" +#define FW_FNAME "t3fw-%d.%d.%d.bin" static int upgrade_fw(struct adapter *adap) { @@ -727,7 +727,7 @@ static int upgrade_fw(struct adapter *ad struct device *dev = &adap->pdev->dev; snprintf(buf, sizeof(buf), FW_FNAME, FW_VERSION_MAJOR, - FW_VERSION_MINOR); + FW_VERSION_MINOR, FW_VERSION_MICRO); ret = request_firmware(&fw, buf, dev); if (ret < 0) { dev_err(dev, "could not upgrade firmware: unable to load %s\n", diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h index b0e68fa..c929aa8 100644 --- a/drivers/net/cxgb3/version.h +++ b/drivers/net/cxgb3/version.h @@ -36,6 +36,9 @@ #define DRV_DESC "Chelsio T3 Network Dri #define DRV_NAME 
"cxgb3" /* Driver version */ #define DRV_VERSION "1.0" + +/* Firmware version */ #define FW_VERSION_MAJOR 3 #define FW_VERSION_MINOR 3 +#define FW_VERSION_MICRO 0 #endif /* __CHELSIO_VERSION_H */ commit f5b12ebf7362fd1fd22b10286ec7e49715683086 Author: Divy Le Ray Date: Tue Apr 10 13:37:22 2007 -0500 Run the watchdog task when the link is up. Flush the XGMAC Tx FIFO when the link drops. Also remove a statistics update that should have gone in the previous modification of xgmac.c. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 163bb4d..5f29521 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -187,16 +187,26 @@ void t3_os_link_changed(struct adapter * int speed, int duplex, int pause) { struct net_device *dev = adapter->port[port_id]; + struct port_info *pi = netdev_priv(dev); + struct cmac *mac = &pi->mac; /* Skip changes from disabled ports. */ if (!netif_running(dev)) return; if (link_stat != netif_carrier_ok(dev)) { - if (link_stat) + if (link_stat) { + t3_set_reg_field(adapter, + A_XGM_TXFIFO_CFG + mac->offset, + F_ENDROPPKT, 0); netif_carrier_on(dev); - else + } else { netif_carrier_off(dev); + t3_set_reg_field(adapter, + A_XGM_TXFIFO_CFG + mac->offset, + F_ENDROPPKT, F_ENDROPPKT); + } + link_report(dev); } } @@ -2114,7 +2124,7 @@ static void check_t3b2_mac(struct adapte continue; status = 0; - if (netif_running(dev)) + if (netif_running(dev) && netif_carrier_ok(dev)) status = t3b2_mac_watchdog_task(&p->mac); if (status == 1) p->mac.stats.num_toggled++; diff --git a/drivers/net/cxgb3/regs.h b/drivers/net/cxgb3/regs.h index b38629a..f8be41c 100644 --- a/drivers/net/cxgb3/regs.h +++ b/drivers/net/cxgb3/regs.h @@ -1940,6 +1940,10 @@ #define M_TXFIFOTHRESH 0x1ff #define V_TXFIFOTHRESH(x) ((x) << S_TXFIFOTHRESH) +#define S_ENDROPPKT 21 +#define V_ENDROPPKT(x) ((x) << S_ENDROPPKT) +#define F_ENDROPPKT V_ENDROPPKT(1U) + #define A_XGM_SERDES_CTRL 0x890 #define 
A_XGM_SERDES_CTRL0 0x8e0 diff --git a/drivers/net/cxgb3/xgmac.c b/drivers/net/cxgb3/xgmac.c index 2b42c13..94aaff0 100644 --- a/drivers/net/cxgb3/xgmac.c +++ b/drivers/net/cxgb3/xgmac.c @@ -471,7 +471,6 @@ #define RMON_UPDATE64(mac, name, reg_lo, RMON_UPDATE(mac, rx_symbol_errs, RX_SYM_CODE_ERR_FRAMES); RMON_UPDATE(mac, rx_too_long, RX_OVERSIZE_FRAMES); - mac->stats.rx_too_long += RMON_READ(mac, A_XGM_RX_MAX_PKT_SIZE_ERR_CNT); v = RMON_READ(mac, A_XGM_RX_MAX_PKT_SIZE_ERR_CNT); if (mac->adapter->params.rev == T3_REV_B2) commit 6616b9156673583463fdab13c1164353dfa2e694 Author: Divy Le Ray Date: Tue Apr 10 13:37:19 2007 -0500 Differentiate NIC only adapters from RNICs. Initialize offload capabilities for RNICs only. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h index 38a0565..97128d8 100644 --- a/drivers/net/cxgb3/common.h +++ b/drivers/net/cxgb3/common.h @@ -112,8 +112,7 @@ enum { }; enum { - SUPPORTED_OFFLOAD = 1 << 24, - SUPPORTED_IRQ = 1 << 25 + SUPPORTED_IRQ = 1 << 24 }; enum { /* adapter interrupt-maintained statistics */ @@ -405,6 +404,7 @@ struct adapter_params { unsigned int stats_update_period; /* MAC stats accumulation period */ unsigned int linkpoll_period; /* link poll period in 0.1s */ unsigned int rev; /* chip revision */ + unsigned int offload; }; enum { /* chip revisions */ @@ -605,7 +605,7 @@ static inline int is_10G(const struct ad static inline int is_offload(const struct adapter *adap) { - return adapter_info(adap)->caps & SUPPORTED_OFFLOAD; + return adap->params.offload; } static inline unsigned int core_ticks_per_usec(const struct adapter *adap) diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index a24b855..163bb4d 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -409,7 +409,7 @@ static void quiesce_rx(struct adapter *a static int setup_sge_qsets(struct adapter *adap) { int i, j, err, irq_idx = 0, qset_idx = 0, dummy_dev_idx = 0; - 
unsigned int ntxq = is_offload(adap) ? SGE_TXQ_PER_SET : 1; + unsigned int ntxq = SGE_TXQ_PER_SET; if (adap->params.rev > 0 && !(adap->flags & USING_MSI)) irq_idx = -1; @@ -917,7 +917,7 @@ static int cxgb_open(struct net_device * return err; set_bit(pi->port_id, &adapter->open_device_map); - if (!ofld_disable) { + if (is_offload(adapter) && !ofld_disable) { err = offload_open(dev); if (err) printk(KERN_WARNING @@ -2265,9 +2265,9 @@ static void __devinit print_port_info(st if (!test_bit(i, &adap->registered_device_map)) continue; - printk(KERN_INFO "%s: %s %s RNIC (rev %d) %s%s\n", + printk(KERN_INFO "%s: %s %s %sNIC (rev %d) %s%s\n", dev->name, ai->desc, pi->port_type->desc, - adap->params.rev, buf, + is_offload(adap) ? "R" : "", adap->params.rev, buf, (adap->flags & USING_MSIX) ? " MSI-X" : (adap->flags & USING_MSI) ? " MSI" : ""); if (adap->name == dev->name && adap->params.vpd.mclk) diff --git a/drivers/net/cxgb3/mc5.c b/drivers/net/cxgb3/mc5.c old mode 100755 new mode 100644 index 644d62e..84c1ffa --- a/drivers/net/cxgb3/mc5.c +++ b/drivers/net/cxgb3/mc5.c @@ -328,6 +328,9 @@ int t3_mc5_init(struct mc5 *mc5, unsigne unsigned int tcam_size = mc5->tcam_size; struct adapter *adap = mc5->adapter; + if (!tcam_size) + return 0; + if (nroutes > MAX_ROUTES || nroutes + nservers + nfilters > tcam_size) return -EINVAL; diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c index c237834..027ab2c 100644 --- a/drivers/net/cxgb3/sge.c +++ b/drivers/net/cxgb3/sge.c @@ -2631,7 +2631,7 @@ int t3_sge_alloc_qset(struct adapter *ad q->txq[TXQ_ETH].stop_thres = nports * flits_to_desc(sgl_len(MAX_SKB_FRAGS + 1) + 3); - if (ntxq == 1) { + if (!is_offload(adapter)) { #ifdef USE_RX_PAGE q->fl[0].buf_size = RX_PAGE_SIZE; #else diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c index 791ed6d..d83f075 100644 --- a/drivers/net/cxgb3/t3_hw.c +++ b/drivers/net/cxgb3/t3_hw.c @@ -438,23 +438,23 @@ static const struct adapter_info t3_adap {2, 0, 0, 0, F_GPIO2_OEN | 
F_GPIO4_OEN | F_GPIO2_OUT_VAL | F_GPIO4_OUT_VAL, F_GPIO3 | F_GPIO5, - SUPPORTED_OFFLOAD, + 0, &mi1_mdio_ops, "Chelsio PE9000"}, {2, 0, 0, 0, F_GPIO2_OEN | F_GPIO4_OEN | F_GPIO2_OUT_VAL | F_GPIO4_OUT_VAL, F_GPIO3 | F_GPIO5, - SUPPORTED_OFFLOAD, + 0, &mi1_mdio_ops, "Chelsio T302"}, {1, 0, 0, 0, F_GPIO1_OEN | F_GPIO6_OEN | F_GPIO7_OEN | F_GPIO10_OEN | F_GPIO1_OUT_VAL | F_GPIO6_OUT_VAL | F_GPIO10_OUT_VAL, 0, - SUPPORTED_10000baseT_Full | SUPPORTED_AUI | SUPPORTED_OFFLOAD, + SUPPORTED_10000baseT_Full | SUPPORTED_AUI, &mi1_mdio_ext_ops, "Chelsio T310"}, {2, 0, 0, 0, F_GPIO1_OEN | F_GPIO2_OEN | F_GPIO4_OEN | F_GPIO5_OEN | F_GPIO6_OEN | F_GPIO7_OEN | F_GPIO10_OEN | F_GPIO11_OEN | F_GPIO1_OUT_VAL | F_GPIO5_OUT_VAL | F_GPIO6_OUT_VAL | F_GPIO10_OUT_VAL, 0, - SUPPORTED_10000baseT_Full | SUPPORTED_AUI | SUPPORTED_OFFLOAD, + SUPPORTED_10000baseT_Full | SUPPORTED_AUI, &mi1_mdio_ext_ops, "Chelsio T320"}, }; @@ -2900,6 +2900,9 @@ static int mc7_init(struct mc7 *mc7, uns struct adapter *adapter = mc7->adapter; const struct mc7_timing_params *p = &mc7_timings[mem_type]; + if (!mc7->size) + return 0; + val = t3_read_reg(adapter, mc7->offset + A_MC7_CFG); slow = val & F_SLOW; width = G_WIDTH(val); @@ -3100,8 +3103,10 @@ int t3_init_hw(struct adapter *adapter, do { /* wait for uP to initialize */ msleep(20); } while (t3_read_reg(adapter, A_CIM_HOST_ACC_DATA) && --attempts); - if (!attempts) + if (!attempts) { + CH_ERR(adapter, "uP initialization timed out\n"); goto out_err; + } err = 0; out_err: @@ -3201,7 +3206,7 @@ static void __devinit mc7_prep(struct ad mc7->name = name; mc7->offset = base_addr - MC7_PMRX_BASE_ADDR; cfg = t3_read_reg(adapter, mc7->offset + A_MC7_CFG); - mc7->size = mc7_calc_size(cfg); + mc7->size = mc7->size = G_DEN(cfg) == M_DEN ? 
0 : mc7_calc_size(cfg); mc7->width = G_WIDTH(cfg); } @@ -3228,6 +3233,7 @@ void early_hw_init(struct adapter *adapt V_I2C_CLKDIV(adapter->params.vpd.cclk / 80 - 1)); t3_write_reg(adapter, A_T3DBG_GPIO_EN, ai->gpio_out | F_GPIO0_OEN | F_GPIO0_OUT_VAL); + t3_write_reg(adapter, A_MC5_DB_SERVER_INDEX, 0); if (adapter->params.rev == 0 || !uses_xaui(adapter)) val |= F_ENRGMII; @@ -3326,7 +3332,13 @@ int __devinit t3_prep_adapter(struct ada p->tx_num_pgs = pm_num_pages(p->chan_tx_size, p->tx_pg_size); p->ntimer_qs = p->cm_size >= (128 << 20) || adapter->params.rev > 0 ? 12 : 6; + } + + adapter->params.offload = t3_mc7_size(&adapter->pmrx) && + t3_mc7_size(&adapter->pmtx) && + t3_mc7_size(&adapter->cm); + if (is_offload(adapter)) { adapter->params.mc5.nservers = DEFAULT_NSERVERS; adapter->params.mc5.nfilters = adapter->params.rev > 0 ? DEFAULT_NFILTERS : 0; commit 63a05336122ff822afc81b38a65899b629c9bfa0 Author: Divy Le Ray Date: Tue Apr 10 13:37:17 2007 -0500 Ensure that the TCAM active region size is at least 16. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h index 85e5543..38a0565 100644 --- a/drivers/net/cxgb3/common.h +++ b/drivers/net/cxgb3/common.h @@ -358,6 +358,9 @@ enum { MC5_MODE_72_BIT = 2 }; +/* MC5 min active region size */ +enum { MC5_MIN_TIDS = 16 }; + struct vpd_params { unsigned int cclk; unsigned int mclk; diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 1c31d19..a24b855 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -484,12 +484,14 @@ static ssize_t show_##name(struct class_ static ssize_t set_nfilters(struct net_device *dev, unsigned int val) { struct adapter *adap = dev->priv; + int min_tids = is_offload(adap) ? 
MC5_MIN_TIDS : 0; if (adap->flags & FULL_INIT_DONE) return -EBUSY; if (val && adap->params.rev == 0) return -EINVAL; - if (val > t3_mc5_size(&adap->mc5) - adap->params.mc5.nservers) + if (val > t3_mc5_size(&adap->mc5) - adap->params.mc5.nservers - + min_tids) return -EINVAL; adap->params.mc5.nfilters = val; return 0; @@ -507,7 +509,8 @@ static ssize_t set_nservers(struct net_d if (adap->flags & FULL_INIT_DONE) return -EBUSY; - if (val > t3_mc5_size(&adap->mc5) - adap->params.mc5.nfilters) + if (val > t3_mc5_size(&adap->mc5) - adap->params.mc5.nfilters - + MC5_MIN_TIDS) return -EINVAL; adap->params.mc5.nservers = val; return 0; diff --git a/drivers/net/cxgb3/cxgb3_offload.c b/drivers/net/cxgb3/cxgb3_offload.c old mode 100755 new mode 100644 index 46e9068..b56e679 --- a/drivers/net/cxgb3/cxgb3_offload.c +++ b/drivers/net/cxgb3/cxgb3_offload.c @@ -551,7 +551,9 @@ int cxgb3_alloc_atid(struct t3cdev *tdev struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; spin_lock_bh(&t->atid_lock); - if (t->afree) { + if (t->afree && + t->atids_in_use + atomic_read(&t->tids_in_use) + MC5_MIN_TIDS <= + t->ntids) { union active_open_entry *p = t->afree; atid = (p - t->atid_tab) + t->atid_base; From rdreier at cisco.com Tue Apr 10 15:30:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 15:30:40 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: (Bryan O'Sullivan's message of "Thu, 15 Mar 2007 14:44:44 -0700") References: Message-ID: I just queued all of this for 2.6.22. Is there any chance of getting a fix for the use-after-free that can be caused by allocating something from userspace, failing to mmap the buffer and then exiting? 
To see what happens, look at how ipath_create_cq sticks a struct ipath_mmap_info into the pending mmap "list" (and yes it would be much cleaner to just use struct list_head here rather than reimplementing a linked list yourself), and then look at how ipath_destroy_cq() frees the same structure without checking if it has been removed from the pending mmap list. - R. From rdreier at cisco.com Tue Apr 10 15:32:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 15:32:19 -0700 Subject: [ofa-general] Re: [PATCH 14 of 33] IB/ipath - fix port sharing on powerpc In-Reply-To: <62da2fb770b66310ac06.1173995098@iqa-25.internal.keyresearch.com> (Bryan O'Sullivan's message of "Thu, 15 Mar 2007 14:44:58 -0700") References: <62da2fb770b66310ac06.1173995098@iqa-25.internal.keyresearch.com> Message-ID: I applied this, but I still think there's some more work to do in this area: > The port sharing feature mixed kernel virtual addresses as well as > physical addresses for the offset used to describe the mmap address to map > the InfiniPath hardware into user space. This had a conflict on powerpc. > The new scheme converts it to a physical address so it doesn't conflict > with chip addresses and yet still fits in 40/44 bits so it isn't truncated > by 32-bit applications calling mmap64(). there's no guarantee that a physical address fits in 40 or 44 or 63 bits on a 64 bit platform. So you've fixed this problem on the platforms you test for now, but it could easily crop up again... From rdreier at cisco.com Tue Apr 10 15:34:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 15:34:57 -0700 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: <20070410182715.GE10218@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 10 Apr 2007 21:27:15 +0300") References: <20070410182715.GE10218@mellanox.co.il> Message-ID: > > HCA catastrophic errors are either a hardware problem (either a > > transient condition like overheating, or a busted HCA), or a firmware > > bug. > > Not really, since most kernel code uses the DMA MR, > they can easily be triggered by e.g. incorrect DMA API usage. > I've just seen this with the recent PPC bug. Out of curiosity, why does this cause a catastrophic error? I would have thought a work request with a bogus bus address would generate an affiliated error, since you know exactly what resource caused the bad transaction. - R. From rdreier at cisco.com Tue Apr 10 15:37:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 15:37:41 -0700 Subject: [ofa-general] OFED teleconference: TODAY! In-Reply-To: <461BA2D0.3070504@voltaire.com> (Or Gerlitz's message of "Tue, 10 Apr 2007 17:44:32 +0300") References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> Message-ID: > 472 Data corruption with Lustre+OFED when using FMR on memfree HCAs > > We see it also with iser, basically only on scsi --read-- which from > IB perspective is RDMA write from the target to the initiator. > > The env we see it is Sinai (25204) hw_ver=A0 and fw_ver=1.2.0 > > Ishai did not manage to reproduce it with SRP, but the fact it > reproduced with two independent ULPs makes it a blocker, I think. We definitely need more info here. Why are you confident that the two problems are the same bug? Have you tested with mem-free Arbel, and does the problem occur there too? Or have you only tested Sinai? Does the problem go away if you remove the MTHCA_FLAG_SINAI_OPT flag from the mthca_hca_table[] entry in mthca_main.c? - R.
From rdreier at cisco.com Tue Apr 10 17:35:21 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 17:35:21 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: (Roland Dreier's message of "Tue, 10 Apr 2007 15:30:40 -0700") References: Message-ID: > Is there any chance of getting a fix for the use-after-free that can > be caused by allocating something from userspace, failing to mmap the > buffer and then exiting? To see what happens, look at how > ipath_create_cq sticks a struct ipath_mmap_info into the pending mmap > "list" (and yes it would be much cleaner to just use struct list_head > here rather than reimplementing a linked list yourself), and then look > at how ipath_destroy_cq() frees the same structure without checking if > it has been removed from the pending mmap list. By the way, would it help get this fixed if I opened a bug on openfabrics.org? Or is that a waste of time? From rjwalsh at pathscale.com Tue Apr 10 17:48:35 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 10 Apr 2007 17:48:35 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: References: Message-ID: <461C3063.9010603@pathscale.com> Roland Dreier wrote: > > Is there any chance of getting a fix for the use-after-free that can > > be caused by allocating something from userspace, failing to mmap the > > buffer and then exiting? To see what happens, look at how > > ipath_create_cq sticks a struct ipath_mmap_info into the pending mmap > > "list" (and yes it would be much cleaner to just use struct list_head > > here rather than reimplementing a linked list yourself), and then look > > at how ipath_destroy_cq() frees the same structure without checking if > > it has been removed from the pending mmap list. > > By the way, would it help get this fixed if I opened a bug on openfabrics.org? > Or is that a waste of time? 
We're tracking it here (bug 12010 on our internal bugzilla), and it's on my list to get done "soon". I'm currently in the middle of some other bug fixes, but when I get to a good stopping point, I'll get this fixed. Shouldn't be too difficult. If you'd like to track it yourself, feel free to open an OpenFabrics bug. I'll update the bug when I get a patch done. Regards, Robert. From mst at dev.mellanox.co.il Tue Apr 10 20:43:23 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 06:43:23 +0300 Subject: [ofa-general] Re: OFED-1.2 compiles 32b apps on SLES 10 for IBM PPC In-Reply-To: <1176225247.4747.24.camel@stevo-desktop> References: <1176225247.4747.24.camel@stevo-desktop> Message-ID: <20070411034323.GF10218@mellanox.co.il> > Quoting Steve Wise : > Subject: OFED-1.2 compiles 32b apps on SLES 10 for IBM PPC > > I just built the ofed-1.2-rc1 kit on an IBM P5 PPC with SLES 10 and some > the apps got built as 32b. Seems like gcc on this distro defaults to > 32b: In the past it was claimed that executables should be 32 bit on this platform since they consume less memory/disk this way. Look it up in the archives. -- MST From mst at dev.mellanox.co.il Tue Apr 10 21:03:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 07:03:43 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2]: 2 bug fixes for iSER In-Reply-To: <461B43C0.9010301@voltaire.com> References: <461B43C0.9010301@voltaire.com> Message-ID: <20070411040343.GG10218@mellanox.co.il> > Quoting Erez Zilber : > Subject: [PATCH ofed-1.2]: 2 bug fixes for iSER > > Vlad, > > Please add the following 2 fixes (already in Roland's tree) to kernel_patches/fixes. > > do not assume that a task may be aborted only after the qp times out > > scsi-ml may abort a command that was already sent. If the initiator is > still trying to send the command (or data-out PDUs for that command), the > qp may time out after scsi-ml times out. 
Therefore, when aborting the command, > iSER may still have references for the command's buffers. When sending > these PDUs will complete with an error, their resources will be released. > > Signed-off-by: Erez Zilber Erez, posting 2 patches inline in the same message confuses quilt. Please clone the ofed git tree, add the patches, and post the link for Vlad to pull from. Thanks, -- MST From mst at dev.mellanox.co.il Tue Apr 10 21:15:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 07:15:43 +0300 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: <20070410182715.GE10218@mellanox.co.il> Message-ID: <20070411041543.GI10218@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] questions about OFED 1.2 IPoIB bonding > > > > HCA catastrophic errors are either a hardware problem (either a > > > transient condition like overheating, or a busted HCA), or a firmware > > > bug. > > > > Not really, since most kernel code uses the DMA MR, > > they can easily be triggered by e.g. incorrect DMA API usage. > > I've just seen this with the recent PPC bug. > > Out of curiosity, why does this cause a catastrophic error? I would > have thought a work request with a bogus bus address would generate an > affiliated error, since you know exactly what resource caused the bad > transaction. It seems the bus controller noticed an illegal transaction and started aborting all transactions mastered from this misbehaving device. -- MST From rdreier at cisco.com Tue Apr 10 21:27:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Apr 2007 21:27:49 -0700 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: <20070411041543.GI10218@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 11 Apr 2007 07:15:43 +0300") References: <20070410182715.GE10218@mellanox.co.il> <20070411041543.GI10218@mellanox.co.il> Message-ID: > > Out of curiosity, why does this cause a catastrophic error?
I would > > have thought a work request with a bogus bus address would generate an > > affiliated error, since you know exactly what resource caused the bad > > transaction. > It seems the bus controller noticed an illegal transaction and started > aborting all transactions mastered from this misbehaving device. I see, that's not really a true catastrophic error -- the mthca code will report it as one, because polling the error buffer will get back all 0xffffffff, but that's just because the HCA has been isolated from the PCI bus. - R. From ishai at dev.mellanox.co.il Tue Apr 10 21:46:29 2007 From: ishai at dev.mellanox.co.il (Ishai Rabinovitz) Date: Wed, 11 Apr 2007 07:46:29 +0300 Subject: [ofa-general] Re: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <003f01c77b91$6c47aee0$c801a8c0@ettac> References: <003f01c77b91$6c47aee0$c801a8c0@ettac> Message-ID: <461C6825.80701@dev.mellanox.co.il> Chieng Etta wrote: > > Scott Weitzenkamp (sweitzen) wrote: >> I've been testing SRP HA and dm_multipath with: >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID >> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs >> >> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on", >> then rebooted. On SLES 10, I ran "chkconfig boot.multipath on" and >> "chkconfig multipathd on", then rebooted. Ishai, I don't seem to need >> 91-srp.rules, are you using the boot.multipath and multipathd scripts? > > On RHEL4 you really do not need 91-srp.rules and it is not used (see > /etc/init.d/openibd) > On SLES10 I was sure that you need it. I checked it, and you are correct. I > don't see how it does it, but it seems that when using boot.multipath there > is no need for 91-srp.rules. I will check it more deeply and change > documentation and openibd script accordingly. > > [EC] I just verified it on SLES10 x86_64. The multipath worked fine by > using boot.multipath without 91-srp.rules.
> In one of Novell's documents (SLES 10 Storage Administration Guide for EVMS - In section 5 Managing Multipath I/O for Devices http://www.novell.com/documentation/sles10/index.html?page=/documentation/sles10/stor_evms/data/multipathing.html) it says in subsection 5.7 that after a new target was discovered there is a need to actively execute multipath. (As I understand it from the document this is true even after boot.multipath is running) Experiments in my environment also indicates that after executing boot.multipath, SRP HA is working also without 91-srp.rules, but after reading this document I'm even more confused. > Ishai, in the SRP release notes - section 6, srp_daemon a., the first line > should be changed to '"srp_daemon -a -o" is equivalent to "ibsrpdm"'. > > Thanks, However Scott already noticed that and I already fixed it. You will see it in the next documentation version. From yangdong at ncic.ac.cn Tue Apr 10 22:45:33 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Wed, 11 Apr 2007 13:45:33 +0800 Subject: [ofa-general] Question about ofed-1.2-alpha installation Message-ID: <461C75FD.3020404@ncic.ac.cn> Hello, everyone: Now I've installed ofed-1.2-alpha on linux-2.6.15, then I get some mesg, that is: ib1: multicast join failed for ff12:601b:ffff:0000:0000:0001:ff00:4ce2, status -22 ib1: multicast join failed for ff12:601b:ffff:0000:0000:0001:ff00:4ce2, status -22 ib1: multicast join failed for ff12:601b:ffff:0000:0000:0001:ff00:4ce2, status -22 anybody know what is the problem? Configuration: Dual Core AMD Opteron(tm) Processor 275, 4 processors, default kernel redhat 2.6.9-42.ELsmp From mst at dev.mellanox.co.il Tue Apr 10 23:21:43 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 11 Apr 2007 09:21:43 +0300 Subject: [ofa-general] Re: [Bug 465] IPoIB CM HA fails after several hours of failovers In-Reply-To: <20070410223635.B4869E60814@openfabrics.org> References: <20070410223635.B4869E60814@openfabrics.org> Message-ID: <20070411062143.GB24730@mellanox.co.il> > I've tried this with RHEL4 U3 x86_64 LionMini SDR, SLES10 x86_64 LionCub DDR, > and RHEL4 U3 x86_64 LionMini DDR so far. You reported an oops. On which OS/HW/FW did you observe it? -- MST From mst at dev.mellanox.co.il Wed Apr 11 00:08:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 10:08:48 +0300 Subject: [ofa-general] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: <20070410182715.GE10218@mellanox.co.il> <20070411041543.GI10218@mellanox.co.il> Message-ID: <20070411070848.GG24730@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] questions about OFED 1.2 IPoIB bonding > > > > Out of curiosity, why does this cause a catastrophic error? I would > > > have thought a work request with a bogus bus address would generate an > > > affiliated error, since you know exactly what resource caused the bad > > > transaction. > > > It seems the bus controller noticed an illegal transaction and started > > aborting all transactions mastered from this misbehaving device. > > I see, that's not really a true catastrophic error -- the mthca code > will report it as one, because polling the error buffer will get > back all 0xffffffff, but that's just because the HCA has been isolated > from the PCI bus. No, a read from the error buffer is not mastered at the HCA, so the error buffer actually gets real values. What triggers a catastrophic error is that the HCA attempts to perform a transaction such as reading the command inbox, and *that* fails. -- MST From mst at dev.mellanox.co.il Wed Apr 11 00:09:26 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 11 Apr 2007 10:09:26 +0300 Subject: [ofa-general] Changed default bugzilla Priority/Severity from P1/Blocker to P3/Normal In-Reply-To: References: Message-ID: <20070411070926.GH24730@mellanox.co.il> > Quoting Scott Weitzenkamp (sweitzen) : > Subject: [ofa-general] Changed default bugzilla Priority/Severity from P1/Blocker to P3/Normal > > Now there will be no more accidental P1/Blocker bugs. > Good idea. -- MST From mst at dev.mellanox.co.il Wed Apr 11 00:22:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 10:22:02 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176160833.14140.408570.camel@localhost.localdomain> References: <1176160833.14140.408570.camel@localhost.localdomain> Message-ID: <20070411072202.GJ24730@mellanox.co.il> > Are all your ports DDR or do you have a mix ? If all are DDR, you can > configure the default partition to use this rate. If I get this right, user has to manually configure the rate in a mixed subnet. Is that correct? If yes, I'm actually not too happy with this. Would something like the following heuristic work better? 
- select the max rate between all participants
- when a host with lower rate joins, destroy the group
  and recreate it with lower rate, send every other participant
  a reregister MAD

-- MST

From ossrosch at linux.vnet.ibm.com Wed Apr 11 01:11:29 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Wed, 11 Apr 2007 10:11:29 +0200
Subject: [ewg] Re: [ofa-general] OFED-1.2 compiles 32b apps on SLES 10 for IBM PPC
In-Reply-To: <1176229527.30343.575.camel@cardanus.llnl.gov>
References: <1176225247.4747.24.camel@stevo-desktop> <1176229527.30343.575.camel@cardanus.llnl.gov>
Message-ID: <200704111011.30090.ossrosch@linux.vnet.ibm.com>

Hi,

On Tuesday 10 April 2007 20:25, Al Chu wrote:
> On Tue, 2007-04-10 at 12:14 -0500, Steve Wise wrote:
> > I just built the ofed-1.2-rc1 kit on an IBM P5 PPC with SLES 10 and some
> > of the apps got built as 32b. Seems like gcc on this distro defaults to
> > 32b:

> My opinion is this bug is with Suse. Perhaps workarounds could be added
> into the OFED build to build ppc/ppc64 rpms properly??
>

This is not a bug with Suse. We discussed this on the list some weeks ago.
The discussion thread started in:
http://lists.openfabrics.org/pipermail/general/2007-February/032889.html
The outcome of this discussion was that the OFED installation script
builds 32- and 64-bit libraries but only 32-bit binaries.

regards
Stefan

From vlad at lists.openfabrics.org Wed Apr 11 02:36:23 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 11 Apr 2007 02:36:23 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070411-0200 daily build status
Message-ID: <20070411093624.68833E60822@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:

From mst at dev.mellanox.co.il Wed Apr 11 02:43:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 11 Apr 2007 12:43:11 +0300
Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling
In-Reply-To: <20070410153003.GF4616@mellanox.co.il>
References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il>
Message-ID: <20070411094311.GK24730@mellanox.co.il>

> I did followed most of the discussions between you and MoniS re the
> ipoib/bonding integration in OFED 1.2 and elsewhere, however: i don't
> see why "bonding is basically broken for ipoib", if you don't mind,
> please tell me the bottom line from your perspective.

Here's a short summary of issues I saw last time,
I'm not sure I haven't forgot something but here goes:

1. Calling to_ipoib_neigh without device lock taken might be racy
   I think you need to find another way to find the device.

2. Ah kept in the ipoib_neigh might belong to a device which
   is different from the one start_xmit is called at.

3. When the slave device goes down, master does not, and
   since neighbours are matched to the master there's no guarantee they
   will be cleaned up.

4. Bonding module copies a pointer to the cleanup function
   in a manner that is unsafe if ipoib is built as a module.

I think these need to be addressed somehow before the patch's reposted.

-- MST

From mst at dev.mellanox.co.il Wed Apr 11 02:49:06 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 11 Apr 2007 12:49:06 +0300
Subject: [ofa-general] Re: multicast join failed for...
In-Reply-To: <1176160833.14140.408570.camel@localhost.localdomain>
References: <1176160833.14140.408570.camel@localhost.localdomain>
Message-ID: <20070411094906.GA32703@mellanox.co.il>

> Quoting Hal Rosenstock :
> Subject: Re: multicast join failed for...
>
> On Mon, 2007-04-09 at 18:47, Egor Tur wrote:
> > Hi folk.
> >
> > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
> > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
> > > >
> > > > And in osm.log:
> > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields,
> > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB),
> > > > sending IB_SA_MAD_STATUS_REQ_INVALID
> > > >
> > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible
> > > with the MC group. You could turn on -V with OpenSM and see more log
> > > messages as to what is going on wrong from the SM's perspective.
> > >
> > Ok. This from osm.log with -V :
> >
> > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [
> > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD
> > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ]
> > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ]
> > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [
> > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [
> > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record
> > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump:
> > MGID....................0xff12601bffff0000 : 0x0000000000000001
> > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a
> > qkey....................0xB1B
> > mlid....................0x0
> > mtu.....................0x84
> > TClass..................0x0
> > pkey....................0xFFFF
> > rate....................0x83
> > pkt_life................0x0
> > SLFlowLabelHopLimit.....0x0
> > ScopeState..............0x1
> > ProxyJoin...............0x0
> > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3
>
> Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate
> 6) and the group is 4x SDR. The request is for equal to the rate so it
> fails.

BTW, the only reason I know for IPoIB to request a specific rate
is if the broadcast multicast group has that rate. Roland, is that right?

So, how come the broadcast multicast group has rate DDR, but a specific
group has lower rate?

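For reference, the mtu (0x84) and rate (0x83) bytes in the record dump above can be decoded by hand. The following is only a minimal sketch in C, assuming the usual InfiniBand layout of a 2-bit selector in the top bits and a 6-bit code in the low bits, and the standard rate and MTU codes. The function names are hypothetical and are not taken from OpenSM or the kernel, and the comparison helper illustrates the selector semantics only; it is not OpenSM's validation code.

```c
#include <stdint.h>

/* Hypothetical helper names for illustration; not OpenSM or kernel APIs. */

/* Selector lives in the top two bits of the field:
 * 0 = greater than, 1 = less than, 2 = exactly, 3 = largest available. */
unsigned mcmr_selector(uint8_t field)
{
	return (unsigned)field >> 6;
}

/* The encoded rate or MTU code lives in the low six bits. */
unsigned mcmr_value(uint8_t field)
{
	return field & 0x3f;
}

/* Map an IB rate code to tenths of Gb/sec (2.5 Gb/sec -> 25); 0 if unknown.
 * Note that the codes are not ordered by speed, so they cannot be
 * compared directly. */
unsigned ib_rate_code_to_tenths_gbps(unsigned code)
{
	switch (code) {
	case 2:  return 25;   /* 2.5 Gb/sec */
	case 3:  return 100;  /* 10 Gb/sec, e.g. 4x SDR */
	case 4:  return 300;  /* 30 Gb/sec */
	case 5:  return 50;   /* 5 Gb/sec */
	case 6:  return 200;  /* 20 Gb/sec, e.g. 4x DDR */
	case 7:  return 400;  /* 40 Gb/sec */
	case 8:  return 600;  /* 60 Gb/sec */
	case 9:  return 800;  /* 80 Gb/sec */
	case 10: return 1200; /* 120 Gb/sec */
	default: return 0;
	}
}

/* Map an IB MTU code (1..5) to bytes: 256, 512, 1024, 2048, 4096. */
unsigned ib_mtu_code_to_bytes(unsigned code)
{
	return (code >= 1 && code <= 5) ? 128u << code : 0;
}

/* Illustration only: does an actual rate code satisfy a requested
 * selector + rate field? Returns 1 if yes, 0 if no or if unknown. */
int rate_request_satisfied(uint8_t req_field, unsigned actual_code)
{
	unsigned want = ib_rate_code_to_tenths_gbps(mcmr_value(req_field));
	unsigned have = ib_rate_code_to_tenths_gbps(actual_code);

	if (!want || !have)
		return 0;

	switch (mcmr_selector(req_field)) {
	case 0:  return have > want;
	case 1:  return have < want;
	case 2:  return have == want;
	default: return 1; /* largest available: any known rate is acceptable */
	}
}
```

Under these assumptions, 0x83 decodes to "exactly" 10 Gb/sec and 0x84 to "exactly" 2048 bytes, so an "exactly" request can never match when the other side is at rate code 6 (20 Gb/sec).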
-- MST From tziporet at dev.mellanox.co.il Wed Apr 11 03:08:26 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 11 Apr 2007 13:08:26 +0300 Subject: [ofa-general] OFA server bugzilla "product" In-Reply-To: <795c49870704101008y3278139ar670cd8cede24a5e0@mail.gmail.com> References: <7AF8879B-2539-458F-B0E2-D80118F5DAD3@cisco.com> <795c49870704101008y3278139ar670cd8cede24a5e0@mail.gmail.com> Message-ID: <461CB39A.9070603@mellanox.co.il> Very good idea Tziporet From ogerlitz at voltaire.com Wed Apr 11 03:39:57 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 11 Apr 2007 13:39:57 +0300 Subject: [ofa-general] iser/lustre memfree issues In-Reply-To: References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> Message-ID: <461CBAFD.10106@voltaire.com> Roland Dreier wrote: > > 472 Data corruption with Lustre+OFED when using FMR on memfree HCAs > > > > We see it also with iser, basically only on scsi --read-- which from > > IB perspective is RDMA write from the target to the initiator. > > > > The env we see it is Sinai (25204) hw_ver=A0 and fw_ver=1.2.0 > > > > Ishai did not manage to reproduce it with SRP, but the fact it > > reproduced with two independent ULPs makes it a blocker, i think. > > We definitely need more info here. Why are you confident that the two > problems are the same bug? > > Have you tested with mem-free Arbel, and does the problem occur there > too? Or have you only tested Sinai? Does the problem go away if you > remove the MTHCA_FLAG_SINAI_OPT flag from the mthca_hca_table[] entry > in mthca_main.c? Hi Roland, We don't have memfree Arbel here however, your suggestion to remove the MTHCA_FLAG_SINAI_OPT flag from the mthca_hca_table[] entry in mthca_main.c seemed to provide a work around (and hopefully a direction to solve the problem...) it is running for two hours now without reproducing the corruption. I will leave it over night and let you know. 
Do you have any idea what why does the code breaks with MTHCA_FLAG_SINAI_OPT ? thanks again, Or. From mst at dev.mellanox.co.il Wed Apr 11 03:50:29 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 13:50:29 +0300 Subject: [ofa-general] iser/lustre memfree issues In-Reply-To: <461CBAFD.10106@voltaire.com> References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> <461CBAFD.10106@voltaire.com> Message-ID: <20070411105029.GO24730@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] iser/lustre memfree issues > > Roland Dreier wrote: > > > 472 Data corruption with Lustre+OFED when using FMR on memfree HCAs > > > > > > We see it also with iser, basically only on scsi --read-- which from > > > IB perspective is RDMA write from the target to the initiator. > > > > > > The env we see it is Sinai (25204) hw_ver=A0 and fw_ver=1.2.0 > > > > > > Ishai did not manage to reproduce it with SRP, but the fact it > > > reproduced with two independent ULPs makes it a blocker, i think. > > > > We definitely need more info here. Why are you confident that the two > > problems are the same bug? > > > > Have you tested with mem-free Arbel, and does the problem occur there > > too? Or have you only tested Sinai? Does the problem go away if you > > remove the MTHCA_FLAG_SINAI_OPT flag from the mthca_hca_table[] entry > > in mthca_main.c? > > Hi Roland, > > We don't have memfree Arbel here however, your suggestion to remove the > MTHCA_FLAG_SINAI_OPT flag from the mthca_hca_table[] entry in > mthca_main.c seemed to provide a work around (and hopefully a direction > to solve the problem...) it is running for two hours now without > reproducing the corruption. I will leave it over night and let you know. > > Do you have any idea what why does the code breaks with > MTHCA_FLAG_SINAI_OPT ? > > thanks again, This actually changes several things. Let's try changing them one at a time and see what happens. 
Could you try commenting out just these 2 lines in mthca_cmd.c:

	if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)
		MTHCA_PUT(inbox, 0x1, INIT_HCA_FLAGS1_OFFSET);

(reverting your changes, that is keeping MTHCA_FLAG_SINAI_OPT set as it was
originally) and see what happens?

For convenience the following patch should do this.

---

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 7131446..abdb355 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -1315,6 +1315,7 @@ int mthca_INIT_HCA(struct mthca_dev *dev,

 	memset(inbox, 0, INIT_HCA_IN_SIZE);

+	if (0)
 	if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)
 		MTHCA_PUT(inbox, 0x1, INIT_HCA_FLAGS1_OFFSET);

-- MST

From vlad at mellanox.co.il Wed Apr 11 05:29:13 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 11 Apr 2007 15:29:13 +0300
Subject: [ofa-general] mvapich2 over iwarp DOA - bug520
In-Reply-To: <1175742804.755.5.camel@stevo-laptop>
References: <1175702259.1797.31.camel@stevo-desktop> <1175742804.755.5.camel@stevo-laptop>
Message-ID: <1176294553.5438.4.camel@vladsk-laptop>

On Wed, 2007-04-04 at 22:13 -0500, Steve Wise wrote:
> On Wed, 2007-04-04 at 10:57 -0500, Steve Wise wrote:
> > I just built and installed today's daily ofed-1.2 build and mvapich2
> > doesn't work at all over iwarp. The build is
> > OFED-1.2-20070404-0600.tgz.
> >
> > I've opened bug 520 to track this.
> >
>
> This is a libcxgb3 bug, not mvapich2.
>
>
> Vlad, Please pull from:
>
> git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_2
>
> For the fix to 520.
>
> Thanks,
>
> Steve.
>

Done,

-- 
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From vlad at mellanox.co.il Wed Apr 11 05:44:58 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 11 Apr 2007 15:44:58 +0300 Subject: [ofa-general] Re: [GIT PULL] OFED 1.2: please pull librdmacm.git ofed_1_2 In-Reply-To: <000101c7787e$a36cd6e0$ff0da8c0@amr.corp.intel.com> References: <000101c7787e$a36cd6e0$ff0da8c0@amr.corp.intel.com> Message-ID: <1176295498.5438.11.camel@vladsk-laptop> On Fri, 2007-04-06 at 12:06 -0700, Sean Hefty wrote: > Vlad, > > Please update the ofed 1.2 librdmacm branch from > > git://git.openfabrics.org/~shefty/librdmacm.git ofed_1_2 > > This will update ofed to librdmacm 1.0-rc2. > > The only notable code change is a fix for bug 521, which allows 32-bit userspace > to work with 64-bit kernel. > > - Sean Done. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From halr at voltaire.com Wed Apr 11 06:59:37 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 09:59:37 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070411072202.GJ24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> Message-ID: <1176299976.4545.11884.camel@hal.voltaire.com> On Wed, 2007-04-11 at 03:22, Michael S. Tsirkin wrote: > > Are all your ports DDR or do you have a mix ? 
If all are DDR, you can > > configure the default partition to use this rate. > > If I get this right, user has to manually configure the rate in a mixed subnet. > Is that correct? I'm not sure; It is an experiment to try to gather additional data to see if the error went away. There was only sparse data with the email problem report and I dont have a mixed subnet to try this in. I do know that it used to work with 4x/1x ports in the same group so I'm not sure why DDR/SDR ports wouldn't work in the same group. > If yes, I'm actually not too happy with this. > > Would something like the following heuristic work better? > - select the max rate between all participants The issue is that one doesn't know all the participants in a group as they are joined dynamically. (I think we've been over this aspect on the list several times in the past.) > - when a host with lower rate joins, destroy the group I don't think a group can be destroyed like this "underneath" its existing members. -- Hal > and recreate it with lower rate, send every other participants > a reregister MAD From halr at voltaire.com Wed Apr 11 07:00:02 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 10:00:02 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070411094906.GA32703@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411094906.GA32703@mellanox.co.il> Message-ID: <1176300001.4545.11886.camel@hal.voltaire.com> On Wed, 2007-04-11 at 05:49, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: multicast join failed for... > > > > On Mon, 2007-04-09 at 18:47, Egor Tur wrote: > > > Hi folk. 
> > > > > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > > > > > > And in osm.log: > > > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > > > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > > > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > > > > with the MC group. You could turn on -V with OpenSM and see more log > > > > messages as to what is going on wrong from the SM's perspective. > > > > > > Ok. This from osm.log with -V : > > > > > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ > > > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > > > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] > > > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > > > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ > > > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ > > > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > > > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: > > > MGID....................0xff12601bffff0000 : 0x0000000000000001 > > > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a > > > qkey....................0xB1B > > > mlid....................0x0 > > > mtu.....................0x84 > > > TClass..................0x0 > > > pkey....................0xFFFF > > > rate....................0x83 > > > pkt_life................0x0 > > > SLFlowLabelHopLimit.....0x0 > > > ScopeState..............0x1 > > > ProxyJoin...............0x0 > > > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested 
RATE 6 is not equal to 3 > > > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > > 6) and the group is 4x SDR. The request is for equal to the rate so it > > fails. > > > BTW, the only reason I know for IPoIB to request a specific rate > is if the broadcast multicast group has that rate. Roland, is that right? The IPoIB RFC says that non broadcast multicast groups must use the same parameters as those used in the broadcast group. It also looks to me that is what is implemented in the code. > So, how come the broadcast multicast group has rate DDR, but a specific > group has lower rate? I think the error is not that the broadcast and some nonbroadcast groups have different rates but that the SM is rejecting the join for a DDR port to an SDR group. I wonder whether the broadcast group was formed properly (and asked about that but haven't heard back yet). -- Hal From halr at voltaire.com Wed Apr 11 07:56:44 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 10:56:44 -0400 Subject: [ofa-general] Bugzilla setup for utils component In-Reply-To: References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> Message-ID: <1176303403.4545.15486.camel@hal.voltaire.com> Hi Scott, Is the utils component in the OFA bugzilla ibutils ? If so, the maintainer is Eitan Zahavi from Mellanox. If not, what is it (and another component should be added) ? Thanks. 
-- Hal From vlad at mellanox.co.il Wed Apr 11 09:37:51 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 11 Apr 2007 19:37:51 +0300 Subject: [ofa-general] Re: [GIT PULL] OFED-1.2 - Chelsio Bug Fixes In-Reply-To: <1176242538.4747.97.camel@stevo-desktop> References: <1176242538.4747.97.camel@stevo-desktop> Message-ID: <1176309471.5438.40.camel@vladsk-laptop> On Tue, 2007-04-10 at 17:02 -0500, Steve Wise wrote: > Vlad, > > Please pull these cxgb3 and iw_cxgb3 changes from > > git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 > > Thanks, > > Steve. > > -------------------- > > Divy Le Ray: > Ensure that the TCAM active region size is at least 16. > Differentiate NIC only adapters from RNICs. > Run the watchdog task when the link is up. > Introduce FW micro version. > > Steve Wise: > Set driver version to indicate ofed. > Fail qp creation if the requested max_inline is too large. > > -------------------- Done. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From mst at dev.mellanox.co.il Wed Apr 11 11:12:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 21:12:33 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176299976.4545.11884.camel@hal.voltaire.com> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> Message-ID: <20070411181047.GW24730@mellanox.co.il> > > If yes, I'm actually not too happy with this. > > > > Would something like the following heuristic work better? > > - select the max rate between all participants > > The issue is that one doesn't know all the participants in a group as > they are joined dynamically. > > (I think we've been over this aspect on the list several times in the > past.) That's why I suggest the fix, so that the rate is adapted dynamically. 
> > - when a host with lower rate joins, destroy the group > > I don't think a group can be destroyed like this "underneath" its > existing members. > Of course it can. That's what happens when SM is restarted. > > > and recreate it with lower rate, send every other participants > > a reregister MAD How does it sound? -- MST From halr at voltaire.com Wed Apr 11 11:17:25 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 14:17:25 -0400 Subject: [ofa-general] OFA server gitweb internal server error Message-ID: <1176315445.4545.28149.camel@hal.voltaire.com> Anyone know what's going on with gitweb on the OFA server ? When I try: http://www.openfabrics.org/gitweb/ I get: Internal Server Error The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster at openfabrics.org and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Thanks. -- Hal From halr at voltaire.com Wed Apr 11 11:19:54 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 14:19:54 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070411181047.GW24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> Message-ID: <1176315593.4545.28318.camel@hal.voltaire.com> On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > If yes, I'm actually not too happy with this. > > > > > > Would something like the following heuristic work better? > > > - select the max rate between all participants > > > > The issue is that one doesn't know all the participants in a group as > > they are joined dynamically. 
> > > > (I think we've been over this aspect on the list several times in the > > past.) > > That's why I suggest the fix, so that the rate is adapted > dynamically. > > > > - when a host with lower rate joins, destroy the group > > > > I don't think a group can be destroyed like this "underneath" its > > existing members. > > > > Of course it can. That's what happens when SM is restarted. Client reregistration ? I don't like using that big hammer as a solution to this. Seems a little harsh to me. I'm not convinced it's even required either, -- Hal > > > and recreate it with lower rate, send every other participants > > > a reregister MAD > > How does it sound? From becker at nas.nasa.gov Wed Apr 11 11:31:16 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Wed, 11 Apr 2007 11:31:16 -0700 Subject: [ofa-general] OFA server gitweb internal server error In-Reply-To: <1176315445.4545.28149.camel@hal.voltaire.com> References: <1176315445.4545.28149.camel@hal.voltaire.com> Message-ID: <795c49870704111131ua8a8ea6t77c983369395daa0@mail.gmail.com> Hi Hal. After a long delay, it seems to work for me. Thanks. -jeff On 11 Apr 2007 14:17:25 -0400, Hal Rosenstock wrote: > > Anyone know what's going on with gitweb on the OFA server ? > > When I try: > http://www.openfabrics.org/gitweb/ > > I get: > > Internal Server Error > The server encountered an internal error or misconfiguration and was > unable to complete your request. > > Please contact the server administrator, webmaster at openfabrics.org and > inform them of the time the error occurred, and anything you might have > done that may have caused the error. > > More information about this error may be available in the server error > log. > > Thanks. 
> > -- Hal > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sweitzen at cisco.com Wed Apr 11 11:57:43 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 11 Apr 2007 11:57:43 -0700 Subject: [ofa-general] RE: Bugzilla setup for utils component In-Reply-To: <1176303403.4545.15486.camel@hal.voltaire.com> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <1176303403.4545.15486.camel@hal.voltaire.com> Message-ID: The utils component started as a place for installer bugs. We then created an Installer component. I view utils as a place for bugs on tvflash, mstflint, perftest, anything that does not fit in another component. We could create compoents for ibutils, tvflash, mstflint, etc. if that would be helpful. Scott > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, April 11, 2007 7:57 AM > To: Scott Weitzenkamp (sweitzen) > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > Subject: Bugzilla setup for utils component > > Hi Scott, > > Is the utils component in the OFA bugzilla ibutils ? If so, the > maintainer is Eitan Zahavi from Mellanox. If not, what is it (and > another component should be added) ? Thanks. 
> > -- Hal > From halr at voltaire.com Wed Apr 11 11:57:24 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 14:57:24 -0400 Subject: [ofa-general] RE: Bugzilla setup for utils component In-Reply-To: References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <1176303403.4545.15486.camel@hal.voltaire.com> Message-ID: <1176317843.4545.30706.camel@hal.voltaire.com> On Wed, 2007-04-11 at 14:57, Scott Weitzenkamp (sweitzen) wrote: > The utils component started as a place for installer bugs. We then > created an Installer component. > > I view utils as a place for bugs on tvflash, mstflint, perftest, > anything that does not fit in another component. > > We could create compoents for ibutils, tvflash, mstflint, etc. if that > would be helpful. It would be helpful to have a separate one for ibutils (at least). Maintainer is Eitan. Thanks. -- Hal > Scott > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, April 11, 2007 7:57 AM > > To: Scott Weitzenkamp (sweitzen) > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > > Subject: Bugzilla setup for utils component > > > > Hi Scott, > > > > Is the utils component in the OFA bugzilla ibutils ? If so, the > > maintainer is Eitan Zahavi from Mellanox. If not, what is it (and > > another component should be added) ? Thanks. > > > > -- Hal > > From sweitzen at cisco.com Wed Apr 11 12:10:06 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 11 Apr 2007 12:10:06 -0700 Subject: [ofa-general] RE: Bugzilla setup for utils component In-Reply-To: <1176317843.4545.30706.camel@hal.voltaire.com> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <1176303403.4545.15486.camel@hal.voltaire.com> <1176317843.4545.30706.camel@hal.voltaire.com> Message-ID: I added ibutils. 
> -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, April 11, 2007 11:57 AM > To: Scott Weitzenkamp (sweitzen) > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > Subject: RE: Bugzilla setup for utils component > > On Wed, 2007-04-11 at 14:57, Scott Weitzenkamp (sweitzen) wrote: > > The utils component started as a place for installer bugs. We then > > created an Installer component. > > > > I view utils as a place for bugs on tvflash, mstflint, perftest, > > anything that does not fit in another component. > > > > We could create compoents for ibutils, tvflash, mstflint, > etc. if that > > would be helpful. > > It would be helpful to have a separate one for ibutils (at least). > Maintainer is Eitan. Thanks. > > -- Hal > > > Scott > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Wednesday, April 11, 2007 7:57 AM > > > To: Scott Weitzenkamp (sweitzen) > > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > > > Subject: Bugzilla setup for utils component > > > > > > Hi Scott, > > > > > > Is the utils component in the OFA bugzilla ibutils ? If so, the > > > maintainer is Eitan Zahavi from Mellanox. If not, what is it (and > > > another component should be added) ? Thanks. 
> > > > > > -- Hal
> > > >

From halr at voltaire.com Wed Apr 11 12:15:11 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2007 15:15:11 -0400
Subject: [ofa-general] [PATCH] Documentation/user_mad.txt: Clarify transaction ID usage
Message-ID: <1176318909.4545.31780.camel@hal.voltaire.com>

Documentation/user_mad.txt: Clarify transaction ID usage

Signed-off-by: Hal Rosenstock

diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
index 750fe5e..1d2dbf1 100644
--- a/Documentation/infiniband/user_mad.txt
+++ b/Documentation/infiniband/user_mad.txt
@@ -91,6 +91,12 @@ Sending MADs
 	if (ret != sizeof *mad + mad_length)
 		perror("write");

+Transaction IDs
+
+  Clients of the MAD layer can use the lower 32 bits of the
+  transaction ID field to track mad request/response pairs. The
+  upper 32 bits are reserved for use by the kernel ib_mad module.
+
 Setting IsSM Capability Bit

   To set the IsSM capability bit for a port, simply open the

From swise at opengridcomputing.com Wed Apr 11 12:25:13 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Apr 2007 14:25:13 -0500
Subject: [ofa-general] RE: Bugzilla setup for utils component
In-Reply-To:
References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <1176303403.4545.15486.camel@hal.voltaire.com> <1176317843.4545.30706.camel@hal.voltaire.com>
Message-ID: <1176319513.29047.15.camel@stevo-desktop>

While we're at it, can you add a component for the Chelsio iWARP
device?

Component: cxgb3 driver
Owner: swise at opengridcomputing.com
Description: Chelsio iWARP Driver

On Wed, 2007-04-11 at 12:10 -0700, Scott Weitzenkamp (sweitzen) wrote:
> I added ibutils.
> > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, April 11, 2007 11:57 AM > > To: Scott Weitzenkamp (sweitzen) > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > > Subject: RE: Bugzilla setup for utils component > > > > On Wed, 2007-04-11 at 14:57, Scott Weitzenkamp (sweitzen) wrote: > > > The utils component started as a place for installer bugs. We then > > > created an Installer component. > > > > > > I view utils as a place for bugs on tvflash, mstflint, perftest, > > > anything that does not fit in another component. > > > > > > We could create compoents for ibutils, tvflash, mstflint, > > etc. if that > > > would be helpful. > > > > It would be helpful to have a separate one for ibutils (at least). > > Maintainer is Eitan. Thanks. > > > > -- Hal > > > > > Scott > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Wednesday, April 11, 2007 7:57 AM > > > > To: Scott Weitzenkamp (sweitzen) > > > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > > > > Subject: Bugzilla setup for utils component > > > > > > > > Hi Scott, > > > > > > > > Is the utils component in the OFA bugzilla ibutils ? If so, the > > > > maintainer is Eitan Zahavi from Mellanox. If not, what is it (and > > > > another component should be added) ? Thanks. 
> > > > > > > > -- Hal > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sweitzen at cisco.com Wed Apr 11 12:27:20 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 11 Apr 2007 12:27:20 -0700 Subject: [ofa-general] RE: Bugzilla setup for utils component In-Reply-To: <1176319513.29047.15.camel@stevo-desktop> References: <20070401195210.8DEC5E603B8@openfabrics.org> <20070401201802.GB11175@mellanox.co.il> <1176303403.4545.15486.camel@hal.voltaire.com> <1176317843.4545.30706.camel@hal.voltaire.com> <1176319513.29047.15.camel@stevo-desktop> Message-ID: Done. > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Wednesday, April 11, 2007 12:25 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Hal Rosenstock; OpenFabricsEWG; general at lists.openfabrics.org > Subject: Re: [ofa-general] RE: Bugzilla setup for utils component > > While we're at it, can you add a component for the Chelsio iWARP > device? > > Component: cxgb3 driver > Owner: swise at opengridcomputing.com > Description:Chelsio iWARP Driver > > > On Wed, 2007-04-11 at 12:10 -0700, Scott Weitzenkamp (sweitzen) wrote: > > I added ibutils. > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Wednesday, April 11, 2007 11:57 AM > > > To: Scott Weitzenkamp (sweitzen) > > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; Eitan Zahavi > > > Subject: RE: Bugzilla setup for utils component > > > > > > On Wed, 2007-04-11 at 14:57, Scott Weitzenkamp (sweitzen) wrote: > > > > The utils component started as a place for installer > bugs. We then > > > > created an Installer component. 
> > > > > > > > I view utils as a place for bugs on tvflash, mstflint, perftest, > > > > anything that does not fit in another component. > > > > > > > > We could create compoents for ibutils, tvflash, mstflint, > > > etc. if that > > > > would be helpful. > > > > > > It would be helpful to have a separate one for ibutils (at least). > > > Maintainer is Eitan. Thanks. > > > > > > -- Hal > > > > > > > Scott > > > > > > > > > -----Original Message----- > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > Sent: Wednesday, April 11, 2007 7:57 AM > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > Cc: OpenFabricsEWG; general at lists.openfabrics.org; > Eitan Zahavi > > > > > Subject: Bugzilla setup for utils component > > > > > > > > > > Hi Scott, > > > > > > > > > > Is the utils component in the OFA bugzilla ibutils ? > If so, the > > > > > maintainer is Eitan Zahavi from Mellanox. If not, > what is it (and > > > > > another component should be added) ? Thanks. > > > > > > > > > > -- Hal > > > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Wed Apr 11 12:47:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Apr 2007 22:47:12 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176315593.4545.28318.camel@hal.voltaire.com> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> Message-ID: <20070411194712.GY24730@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: multicast join failed for... > > On Wed, 2007-04-11 at 14:12, Michael S. 
Tsirkin wrote: > > > > If yes, I'm actually not too happy with this. > > > > > > > > Would something like the following heuristic work better? > > > > - select the max rate between all participants > > > > > > The issue is that one doesn't know all the participants in a group as > > > they are joined dynamically. > > > > > > (I think we've been over this aspect on the list several times in the > > > past.) > > > > That's why I suggest the fix, so that the rate is adapted > > dynamically. > > > > > > - when a host with lower rate joins, destroy the group > > > > > > I don't think a group can be destroyed like this "underneath" its > > > existing members. > > > > > > > Of course it can. That's what happens when SM is restarted. > > Client reregistration ? I don't like using that big hammer as a solution > to this. Seems a little harsh to me. I think it's not too bad - previously we had some client failing join which is worse. And we can still keep an option to limit the rate manually. > I'm not convinced it's even > required either, How do you mean? All end-points must know the rate is now lower. -- MST From changquing.tang at hp.com Wed Apr 11 13:42:32 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 11 Apr 2007 21:42:32 +0100 Subject: [ofa-general] How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <4604034B.6030507@ichips.intel.com> References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com> <1174519319.17678.25309.camel@hal.voltaire.com> <1174658146.24305.148489.camel@hal.voltaire.com><20070323152822.GH17532@mellanox.co.il> <4604034B.6030507@ichips.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> Sean: A question about rdmacm library. I use rdma_connect/accept to wire the IB connection between A and B. Somehow the IB connection is broken by either process B dies, or a bad cable. If process A just receives messages from process B, can process A get a RDMA_CM_EVENT_DISCONNECTED event ? 
if yes, how fast A can get such event ? Thank you. --CQ, HP-MPI Team From rdreier at cisco.com Wed Apr 11 14:02:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 14:02:26 -0700 Subject: [ofa-general] Re: {PATCH] Documentation/user_mad.txt: Clarify transaction ID usage In-Reply-To: <1176318909.4545.31780.camel@hal.voltaire.com> (Hal Rosenstock's message of "11 Apr 2007 15:15:11 -0400") References: <1176318909.4545.31780.camel@hal.voltaire.com> Message-ID: > +Transaction IDs > + > + Clients of the MAD layer can use the lower 32 bits of the > + transaction ID field to track mad request/response pairs. The > + upper 32 bits are reserved for use by the kernel ib_mad module. This is a good addition. But I think it would be worth saying which half of the TID is the lower half. I would fix it up myself but I don't know off the top of my head which byte order the TID is interpreted with. - R. From halr at voltaire.com Wed Apr 11 14:03:54 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 17:03:54 -0400 Subject: [ofa-general] Re: {PATCH] Documentation/user_mad.txt: Clarify transaction ID usage In-Reply-To: References: <1176318909.4545.31780.camel@hal.voltaire.com> Message-ID: <1176325433.4545.38623.camel@hal.voltaire.com> On Wed, 2007-04-11 at 17:02, Roland Dreier wrote: > > +Transaction IDs > > + > > + Clients of the MAD layer can use the lower 32 bits of the > > + transaction ID field to track mad request/response pairs. The > > + upper 32 bits are reserved for use by the kernel ib_mad module. > > This is a good addition. But I think it would be worth saying which > half of the TID is the lower half.
I would fix it up myself but I > don't know off the top of my head which byte order the TID is > interpreted with. Should this be described relative to network (rather than host) order ? -- Hal > - R. From halr at voltaire.com Wed Apr 11 14:03:58 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 17:03:58 -0400 Subject: [ofa-general] Re: {PATCH] Documentation/user_mad.txt: Clarify transaction ID usage In-Reply-To: References: <1176318909.4545.31780.camel@hal.voltaire.com> Message-ID: <1176325433.4545.38624.camel@hal.voltaire.com> On Wed, 2007-04-11 at 17:02, Roland Dreier wrote: > > +Transaction IDs > > + > > + Clients of the MAD layer can use the lower 32 bits of the > > + transaction ID field to track mad request/response pairs. The > > + upper 32 bits are reserved for use by the kernel ib_mad module. > > This is a good addition. But I think it would be worth saying which > half of the TID is the lower half. I would fix it up myself but I > don't know off the top of my head which byte order the TID is > interpreted with. Should this be described relative to network (rather than host) order ? -- Hal > - R. From rdreier at cisco.com Wed Apr 11 14:13:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 14:13:02 -0700 Subject: [ofa-general] Re: {PATCH] Documentation/user_mad.txt: Clarify transaction ID usage In-Reply-To: <1176325433.4545.38624.camel@hal.voltaire.com> (Hal Rosenstock's message of "11 Apr 2007 17:03:58 -0400") References: <1176318909.4545.31780.camel@hal.voltaire.com> <1176325433.4545.38624.camel@hal.voltaire.com> Message-ID: > > This is a good addition. But I think it would be worth saying which > > half of the TID is the lower half. I would fix it up myself but I > > don't know off the top of my head which byte order the TID is > > interpreted with. > Should this be described relative to network (rather than host) order ? 
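To make the byte-order question concrete, here is a minimal sketch, assuming the 64-bit TID is carried most-significant byte first (network order), so that the client-owned lower 32 bits are the last four bytes of the field. The helper names tid_pack and tid_client_part are hypothetical and are not part of the user_mad interface.

```c
#include <stdint.h>

/*
 * Write a 64-bit TID in wire (big-endian) order. ASSUMPTION: the
 * "upper" 32 bits are the first four bytes on the wire.
 *   bytes 0..3: upper 32 bits, left zero for the kernel ib_mad module
 *   bytes 4..7: lower 32 bits, the client's own transaction number
 */
static void tid_pack(uint8_t wire[8], uint32_t client_tid)
{
	int i;

	for (i = 0; i < 4; i++) {
		wire[i] = 0;
		wire[4 + i] = (uint8_t)(client_tid >> (24 - 8 * i));
	}
}

/* Recover the client's 32 bits from a TID in wire order. */
static uint32_t tid_client_part(const uint8_t wire[8])
{
	return ((uint32_t)wire[4] << 24) | ((uint32_t)wire[5] << 16) |
	       ((uint32_t)wire[6] << 8) | (uint32_t)wire[7];
}
```

Working on explicit bytes gives the same result on big- and little-endian hosts, which sidesteps the host-order question entirely.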
Ideally it should be described so that it's clear to a user of this interface where the TID bits that belong to the user are. - R. From halr at voltaire.com Wed Apr 11 14:38:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 17:38:28 -0400 Subject: [ofa-general] Re: {PATCH] Documentation/user_mad.txt: Clarify transaction ID usage In-Reply-To: References: <1176318909.4545.31780.camel@hal.voltaire.com> <1176325433.4545.38624.camel@hal.voltaire.com> Message-ID: <1176327507.4545.40761.camel@hal.voltaire.com> On Wed, 2007-04-11 at 17:13, Roland Dreier wrote: > > > This is a good addition. But I think it would be worth saying which > > > half of the TID is the lower half. I would fix it up myself but I > > > don't know off the top of my head which byte order the TID is > > > interpreted with. > > > Should this be described relative to network (rather than host) order ? > > Ideally it should be described so that it's clear to a user of this > interface where the TID bits that belong to the user are. What is meant by the "upper" 32 bits are the most significant 32 bits (first 32 bits of the 64 bits in the transaction ID on the IB link). Is that clear ? From the user perspective, this will depend on whether the machine is big or little endian so I'm not sure how that is typically dealt with. Do I need to reissue this patch ? -- Hal > - R. From halr at voltaire.com Wed Apr 11 14:45:54 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2007 17:45:54 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070411194712.GY24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> Message-ID: <1176327884.4545.41174.camel@hal.voltaire.com> On Wed, 2007-04-11 at 15:47, Michael S. 
Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: multicast join failed for... > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > Would something like the following heuristic work better? > > > > > - select the max rate between all participants > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > they are joined dynamically. > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > past.) > > > > > > That's why I suggest the fix, so that the rate is adapted > > > dynamically. > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > existing members. > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > Client reregistration ? I don't like using that big hammer as a solution > > to this. Seems a little harsh to me. > > I think it's not too bad It requires all subscriptions to reregister. This affects more things than just multicast or even the groups affected which might not be all of the multicast groups. Hence BIG hammer. There could be a more graceful way to deal with this. I don't like using client reregister unless absolutely needed. > - previously we had some client failing join > which is worse. Maybe not. Maybe that's what the admin wants (to keep the higher rate rather than degrade the group due to some link issue). > And we can still keep an option to limit the rate > manually. > > > I'm not convinced it's even > > required either, > > How do you mean? All end-points must know the rate is now lower. I didn't think we had the complete story yet on what is going on. 
-- Hal From mshefty at ichips.intel.com Wed Apr 11 14:50:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 11 Apr 2007 14:50:19 -0700 Subject: [ofa-general] Re: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com> <1174519319.17678.25309.camel@hal.voltaire.com> <1174658146.24305.148489.camel@hal.voltaire.com><20070323152822.GH17532@mellanox.co.il> <4604034B.6030507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> Message-ID: <461D581B.30802@ichips.intel.com> > A question about rdmacm library. I use rdma_connect/accept to > wire the IB connection between A and B. Somehow the IB connection is > broken by either process B dies, or a bad cable. If process A just > receives messages from process B, can process A get a > RDMA_CM_EVENT_DISCONNECTED event ? if yes, how fast A can get such event > ? If the process B dies, the kernel IB CM on B's system will automatically disconnect. Process A should get this fairly close to when process B dies. I'm not as sure about the timing for a bad cable. Slightly off topic, but how do you handle flow control between process A and B if process A only receives? - Sean From sweitzen at cisco.com Wed Apr 11 14:59:29 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 11 Apr 2007 14:59:29 -0700 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <461C6825.80701@dev.mellanox.co.il> References: <003f01c77b91$6c47aee0$c801a8c0@ettac> <461C6825.80701@dev.mellanox.co.il> Message-ID: I haven't tried adding or removing storage, just failover. I guess leave 91-srp.rules in for now, it seems benign. 
Scott > -----Original Message----- > From: Ishai Rabinovitz [mailto:ishai at dev.mellanox.co.il] > Sent: Tuesday, April 10, 2007 9:46 PM > To: Chieng Etta > Cc: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier); > ewg at lists.openfabrics.org; 'openib'; mkohari at novell.com > Subject: Re: [ewg] Re: SRP HA dm_multipath testing and questions > > Chieng Etta wrote: > > > > Scott Weitzenkamp (sweitzen) wrote: > >> I've been testing SRP HA and dm_multipath with: > >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID > >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID > >> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs > >> > >> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig > multipathd on", > >> then rebooted. On SLES 10, I ran "chkconfig > boot.multipath on" and > >> "chkconfig multipathd on", then rebooted. Ishai, I don't > seem to need > >> 91-srp.rules, are you using the boot.multipath and > multipathd scripts? > > > > On RHEL4 you really do not need 91-srp.rules and it is not used (see > > /etc/init.d/openibd) > > On SLES10 I was sure that you need it. I checked it, and > you are correct. I > > don't see how it does it, but it seems that when using > boot.multipath there > > is no need for 91-srp.rules. I will check it more deeply and change > > documentation and openibd script accordingly. > > > > [EC] I just verified it on SLES10 x86_64. The multipath > worked fine by > > using boot.multipath without 91-srp.rules. > > > In one of Novell's documents (SLES 10 Storage Administration > Guide for EVMS - In section 5 Managing Multipath I/O for > Devices > http://www.novell.com/documentation/sles10/index.html?page=/do cumentation/sles10/stor_evms/data/multipathing.html) it says in subsection 5.7 that after a new target > was discovered there is a need to actively execute multipath. 
> (As I understand it from the document this is true even after > boot.multipath is running) > > Experiments in my environment also indicates that after > executing boot.multipath, SRP HA is working also without > 91-srp.rules, but after reading this document I'm even more confused. > > > > > Ishai, in the SRP release notes - section 6, srp_daemon a., > the first line > > should be changed to '"srp_daemon -a -o" is equivalent to > "ibsrpdm"'. > > > > > Thanks, However Scott already noticed that and I already > fixed it. You will see it in the next documentation version. > From rjwalsh at pathscale.com Wed Apr 11 15:24:14 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 11 Apr 2007 15:24:14 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: References: Message-ID: <461D600E.4070709@pathscale.com> Roland Dreier wrote: > I just queued all of this for 2.6.22. > > Is there any chance of getting a fix for the use-after-free that can > be caused by allocating something from userspace, failing to mmap the > buffer and then exiting? To see what happens, look at how > ipath_create_cq sticks a struct ipath_mmap_info into the pending mmap > "list" (and yes it would be much cleaner to just use struct list_head > here rather than reimplementing a linked list yourself), and then look > at how ipath_destroy_cq() frees the same structure without checking if > it has been removed from the pending mmap list. BTW: any idea how this ever got triggered? The only way I can see is if you're either not using libipathverbs and libibverbs and you just create the CQ some other way, which seems unlikely. Do you know how Jason triggered this bug? I'm also going to fix a problem where hitting the maximum number of CQs causes an error return, but doesn't clean up the pending list and thus leaks memory. Regards, Robert. 
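The use-after-free Roland describes comes from freeing a record that is still linked on the pending mmap list. Below is a userspace sketch of the safe ordering, assuming a simple circular doubly linked list. struct pending and the pending_* helpers are hypothetical stand-ins, not the ipath driver's code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver's pending-mmap record. */
struct pending {
	struct pending *prev, *next;	/* both point to self when unlinked */
};

static void pending_init(struct pending *p)
{
	p->prev = p->next = p;
}

/* Link entry p right after the list head. */
static void pending_add(struct pending *head, struct pending *p)
{
	p->next = head->next;
	p->prev = head;
	head->next->prev = p;
	head->next = p;
}

/* Unlink p; harmless if p is not on any list (it points to itself). */
static void pending_del(struct pending *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
	p->prev = p->next = p;
}

/* Destroy path: always unlink first, so the list never holds freed memory. */
static void pending_destroy(struct pending *p)
{
	pending_del(p);
	free(p);
}
```

Because an unlinked entry points to itself, the destroy path does not need to know whether the mmap ever happened; unlinking is safe in both cases.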
From rdreier at cisco.com Wed Apr 11 15:33:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 15:33:30 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: <461D600E.4070709@pathscale.com> (Robert Walsh's message of "Wed, 11 Apr 2007 15:24:14 -0700") References: <461D600E.4070709@pathscale.com> Message-ID: > BTW: any idea how this ever got triggered? The only way I can see is > if you're either not using libipathverbs and libibverbs and you just > create the CQ some other way, which seems unlikely. Do you know how > Jason triggered this bug? Yes, it was because he was using 32-bit userspace and so it was impossible to libipathverbs to mmap the address the kernel driver was looking for. So the mmap failed and the pending mmap never got taken off the list. - R. From rdreier at cisco.com Wed Apr 11 15:35:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 15:35:26 -0700 Subject: [ofa-general] iser/lustre memfree issues In-Reply-To: <20070411105029.GO24730@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 11 Apr 2007 13:50:29 +0300") References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> <461CBAFD.10106@voltaire.com> <20070411105029.GO24730@mellanox.co.il> Message-ID: > Could you try commenting out just these 2 lines in mthca_cmd.c: > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > MTHCA_PUT(inbox, 0x1, INIT_HCA_FLAGS1_OFFSET); > > (reverting your changes, that is keeping MTHCA_FLAG_SINAI_OPT set as it was originally) > and see what happens? Good idea. I bet this gets rid of the problem too, because I looked at the other places the flag is tested and they all look pretty safe. - R. 
From or.gerlitz at gmail.com Wed Apr 11 15:43:38 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 12 Apr 2007 01:43:38 +0300 Subject: [ofa-general] Re: re [NET]: Fix neighbour destructor handling In-Reply-To: <20070411094311.GK24730@mellanox.co.il> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> Message-ID: <15ddcffd0704111543w6c5ab224ofd5adf64e5253871@mail.gmail.com> On 4/11/07, Michael S. Tsirkin wrote: > > I did followed most of the discussions between you and MoniS re the > > ipoib/bonding integration in OFED 1.2 and elsewhere, however: i don't > > see why "bonding is basically broken for ipoib", if you don't mind, > > please tell me the bottom line from your perspective. > > Here's a short summary of issues I saw last time, I'm not sure > I haven't forgot something but here goes: Michael, Thanks for taking the time to summarize this. Indeed, it does make sense to try to address these concerns before reposting the patches, provided that the audience is in the picture of what we are talking about; in other words, I might repost the patches just for the sake of discussion. Anyway, please see if you can address the follow-up clarifications, questions and comments below. > 1.Calling to_ipoib_neigh without device lock taken might be racy > I think you need to find another way to find the device. Just to be sure, you refer to the call added in MoniS's patch to the ipoib neigh destructor? > 2.Ah kept in the ipoib_neigh might belong to a device which is different > from the one start_xmit is called at. How come?
Before a bonding fail-over takes place, some failure has happened to the active slave, and from the ipoib code I understand that all failure schemes, specifically those that cause the device carrier (RUNNING) bit to be off, flush the ipoib neighs and their associated address handles, so the ipoib_neigh buddy of the neighbour is cleaned, and once start_xmit is called over the new active slave a new ipoib_neigh/ah would be created. > 3.When the slave device goes down, master does not, and since > neighbours are matched to the master there's no guarantee they will be > cleaned up. Just to be sure, by "goes down" you mean it is not UP any more? I understand it's common Linux behaviour not to clean neighbours when the associated device is not UP, correct? What is the problematic implication you see here? Thanks again for raising the concerns, Or. From or.gerlitz at gmail.com Wed Apr 11 15:45:56 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 12 Apr 2007 01:45:56 +0300 Subject: [ewg] Re: [ofa-general] iser/lustre memfree issues In-Reply-To: References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> <461CBAFD.10106@voltaire.com> <20070411105029.GO24730@mellanox.co.il> Message-ID: <15ddcffd0704111545k245401a2va2fe78b9804fe5a2@mail.gmail.com> On 4/12/07, Roland Dreier wrote: > > Could you try commenting out just these 2 lines in mthca_cmd.c: > > > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > > MTHCA_PUT(inbox, 0x1, INIT_HCA_FLAGS1_OFFSET); > > > > (reverting your changes, that is keeping MTHCA_FLAG_SINAI_OPT set as it was originally) > > and see what happens? > > Good idea. I bet this gets rid of the problem too, because I looked > at the other places the flag is tested and they all look pretty safe. OK, will check that tomorrow (Thursday AM IL time) and let you know. If it's indeed the case, does removing this line provide a solution to the problem or just a workaround? Or.
From rjwalsh at pathscale.com Wed Apr 11 15:47:47 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 11 Apr 2007 15:47:47 -0700 Subject: [ofa-general] Re: [PATCH 00 of 33] Set of ipath patches for 2.6.22 In-Reply-To: References: <461D600E.4070709@pathscale.com> Message-ID: <461D6593.1050009@pathscale.com> Roland Dreier wrote: > > BTW: any idea how this ever got triggered? The only way I can see is > > if you're either not using libipathverbs and libibverbs and you just > > create the CQ some other way, which seems unlikely. Do you know how > > Jason triggered this bug? > > Yes, it was because he was using 32-bit userspace and so it was > impossible to libipathverbs to mmap the address the kernel driver was > looking for. So the mmap failed and the pending mmap never got taken > off the list. Oh, OK. Got it. That problem is on my list, too, along with the other pending_mmap-related cleanups you suggested. Hopefully I'll have a patch finished tonight. From changquing.tang at hp.com Wed Apr 11 15:49:34 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 11 Apr 2007 23:49:34 +0100 Subject: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <461D581B.30802@ichips.intel.com> References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com> <1174519319.17678.25309.camel@hal.voltaire.com> <1174658146.24305.148489.camel@hal.voltaire.com><20070323152822.GH17532@mellanox.co.il> <4604034B.6030507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> <461D581B.30802@ichips.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, April 11, 2007 4:50 PM > To: Tang, Changqing > Cc: general at lists.openfabrics.org > Subject: Re: How fast to get RDMA_CM_EVENT_DISCONNECTED ? > > > A question about rdmacm library. 
I use > rdma_connect/accept to wire > > the IB connection between A and B. Somehow the IB > connection is broken > > by either process B dies, or a bad cable. If process A just > receives > > messages from process B, can process A get a > > RDMA_CM_EVENT_DISCONNECTED event ? if yes, how fast A can get such > > event ? > > If the process B dies, the kernel IB CM on B's system will > automatically disconnect. Process A should get this fairly > close to when process B dies. > > I'm not as sure about the timing for a bad cable. > > Slightly off topic, but how do you handle flow control > between process A and B if process A only receives? Yes. Internally in A, if the number of receives exceeds the low-water mark (4), an ACK will be sent back. I assume the ACK is not triggered at the moment. When A is trying to receive a message from B and the message never shows up, A actually sends a heartbeat back to B; however, it takes several seconds for this heartbeat to complete with an error (we configure a timeout of ~1 sec and a retry count of 7). Several seconds to detect a connection failure is not acceptable for us, so if I use rdmacm, I want to know whether I can detect the connection failure faster than with the heartbeat message. Again, if there is a cable issue, is a DISCONNECT event still generated eventually ? --CQ > > - Sean > From sean.hefty at intel.com Wed Apr 11 16:06:10 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 11 Apr 2007 16:06:10 -0700 Subject: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> Message-ID: <000201c77c8d$fea88b40$8698070a@amr.corp.intel.com> >Several seconds to detect connection failure is not acceptable for us, >so if I use rdmacm, I want to know if I detect the connection >failure faster than heart-beat message. In general, use of the rdma or ib cm will not help detect failures on an active connection any faster.
If the remote process dies, it may, since the remote ib cm will try to disconnect on the user's behalf. >Again, if there is cable issue, is there still a DISCONNECT event >generated eventually ? No - the disconnect event comes from a disconnect request being generated by the remote side. You would need to look for a QP error instead. - Sean From worldeb at ukr.net Wed Apr 11 16:09:11 2007 From: worldeb at ukr.net (Egor Tur) Date: Thu, 12 Apr 2007 02:09:11 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176210209.14140.460005.camel@localhost.localdomain> Message-ID: Hi folks. I see that my small problem has been interesting. Thanks for your help. > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > > 6) and the group is 4x SDR. The request is for equal to the rate so it > > fails. > > > > Are all your ports DDR or do you have a mix ? If all are DDR, you can > > configure the default partition to use this rate. > > To elaborate a little more on this, the configuration would be done via > /etc/osm-partitions.conf file with a single line as follows: > > Default=0x7fff,ipoib,rate=6:ALL=full; > I have identical DDR HCAs and a DDR switch. I configured the default partition with the same rate. The problem has been solved. > > > > > > If modules was builded with --without-ipoib-cm then ib_ipoib don't depend on ipv6. > > > But the messages remain the same in log. > > Are you using IPoIB (for IPv4) ? If so, is that working ? > > -- Hal Yes, I use IPoIB and I think that it is working. At least the tests, benchmarks and our parallel tasks are working. Thanks. From weiny2 at llnl.gov Wed Apr 11 18:30:50 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 11 Apr 2007 18:30:50 -0700 Subject: [ofa-general] Re: multicast join failed for...
In-Reply-To: <1176327884.4545.41174.camel@hal.voltaire.com> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> Message-ID: <20070411183050.2cea149f.weiny2@llnl.gov> On 11 Apr 2007 17:45:54 -0400 Hal Rosenstock wrote: > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > - previously we had some client failing join > > which is worse. > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > rather than degrade the group due to some link issue). > Indeed, on a big cluster it would be better to have a few nodes dropped out than to limit the speed of the entire cluster. Ira From mst at dev.mellanox.co.il Wed Apr 11 20:38:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 06:38:03 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176327884.4545.41174.camel@hal.voltaire.com> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> Message-ID: <20070412033803.GC24730@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: multicast join failed for... > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > Subject: Re: multicast join failed for... > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > Would something like the following heuristic work better? 
> > > > > > - select the max rate between all participants > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > they are joined dynamically. > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > past.) > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > dynamically. > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > existing members. > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > to this. Seems a little harsh to me. > > > > I think it's not too bad > > It requires all subscriptions to reregister. This affects more things > than just multicast or even the groups affected which might not be all > of the multicast groups. Hence BIG hammer. Changing an option in opensm config requires restarting opensm. Isn't that right? So its an even bigger hammer. > There could be a more > graceful way to deal with this. I don't like using client reregister > unless absolutely needed. What are the other options that have the same funcitionality? > > - previously we had some client failing join > > which is worse. > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > rather than degrade the group due to some link issue). Rate could be an option, but I think generally people prefer things working even if at a slower rate. Wat does opensm do now? I think it uses the max possible rate when group is created. Is that so? > > And we can still keep an option to limit the rate > > manually. > > > > > I'm not convinced it's even > > > required either, > > > > How do you mean? All end-points must know the rate is now lower. > > I didn't think we had the complete story yet on what is going on. 
You are speaking about a specific instance then? OK, but I'm speaking generally, the issue comes up quite often. -- MST From rdreier at cisco.com Wed Apr 11 20:48:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 20:48:17 -0700 Subject: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Wed, 11 Apr 2007 23:49:34 +0100") References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com> <1174519319.17678.25309.camel@hal.voltaire.com> <1174658146.24305.148489.camel@hal.voltaire.com> <20070323152822.GH17532@mellanox.co.il> <4604034B.6030507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> <461D581B.30802@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> Message-ID: > Yes, Internally in A, if the # of receives exceeds lowwater(4), an ack > will be sent back. I assume ACK is not trigered at the moment. > when A is trying to receive a message from B, and the message never > shows, A acctualy sends a heart beat back to B, however, it takes > serveral seconds for this heart-beat to complete with error ( we > configure timout ~1 sec, and retry count 7). > > Serveral seconds to detect connection failure is not acceptable for us, > so if I use rdmacm, I want to know if I detect the connection > failure faster than heart-beat message. I think there is an internal contradiction in what you're doing here. If your (ACK timeout) * (retry count) exceeds the time that you consider acceptable to detect a failure, then you've set your connection up wrong. 
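To put rough numbers on the (ACK timeout) * (retry count) product, here is a minimal sketch. It assumes the standard InfiniBand encoding in which the 5-bit Local ACK Timeout value t means 4.096 us * 2^t (with 0 disabling the timer), and it assumes the worst case is simply one initial attempt plus retry_cnt retries. Real HCAs may wait longer than the nominal value, so treat the result as a lower bound. The function names are made up for illustration.

```c
#include <stdint.h>

/* Nominal Local ACK Timeout in nanoseconds for an encoded value t.
 * Assumed encoding: 4.096 us * 2^t. t == 0 means the timer is disabled,
 * which this sketch reports as 0. */
uint64_t ack_timeout_ns(unsigned int t)
{
    if (t == 0 || t > 31)
        return 0;
    return 4096ULL << t;
}

/* Rough lower bound on the time before a retry-exceeded error is reported:
 * the initial attempt plus retry_cnt retries, each waiting one ACK timeout. */
uint64_t min_failure_detect_ns(unsigned int t, unsigned int retry_cnt)
{
    return ack_timeout_ns(t) * (uint64_t)(retry_cnt + 1);
}
```

With t = 18 (about 1.07 s) and a retry count of 7 this comes to about 8.6 s, the same order as the several seconds reported above. With t = 14 (about 67 ms) and the same retry count it is roughly 0.54 s.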
It's not even meaningful to talk about a connection failing faster than this amount of time -- a connection will recover from a transient network failure that resolves itself before the last retry fails, and without a time machine it's impossible to say whether a network failure will or will not be resolved 7 seconds into the future. Certainly if you receive a disconnect request, then you know the remote side is really and truly gone. But if you've set your timeouts/retry counts so that connections will take 7 seconds to fail after an event like a link going down, then there's no way to detect that failure before it occurs. It seems to me the solution is to reduce your timeout and/or retry count so that connections fail within the time scale that you require. - R. From rdreier at cisco.com Wed Apr 11 20:48:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Apr 2007 20:48:45 -0700 Subject: [ewg] Re: [ofa-general] iser/lustre memfree issues In-Reply-To: <15ddcffd0704111545k245401a2va2fe78b9804fe5a2@mail.gmail.com> (Or Gerlitz's message of "Thu, 12 Apr 2007 01:45:56 +0300") References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> <461CBAFD.10106@voltaire.com> <20070411105029.GO24730@mellanox.co.il> <15ddcffd0704111545k245401a2va2fe78b9804fe5a2@mail.gmail.com> Message-ID: > If its indeed the case, does removing this line provides a solution to > the problem or just a work around? Obviously it's just a work around, since it disables this performance enhancement in the firmware. - R.
From mst at dev.mellanox.co.il Wed Apr 11 21:16:53 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 07:16:53 +0300 Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070411223924.879E9E60390@openfabrics.org> References: <20070411223924.879E9E60390@openfabrics.org> Message-ID: <20070412041653.GE24730@mellanox.co.il> > Yes, get a stready stream of these on sender. > > ib0: TX ring full, stopping kernel net queue Aha. (As a note, it's always useful to set debug level when you experience problems). > Why am I getting low throughput of IP multicast vs IPoIB UD UDP unicast? It's something in the hardware - it's not handing out completions fast enough, . Try increasing send queue size through IPoIB module parameter. BTW, Roland, why aren't we using txqueuelen ifconfig/ethtool options here?
-- MST From mst at dev.mellanox.co.il Wed Apr 11 21:21:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 07:21:55 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070411183050.2cea149f.weiny2@llnl.gov> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> Message-ID: <20070412042155.GF24730@mellanox.co.il> > Quoting Ira Weiny : > Subject: Re: [ofa-general] Re: multicast join failed for... > > On 11 Apr 2007 17:45:54 -0400 > Hal Rosenstock wrote: > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > - previously we had some client failing join > > > which is worse. > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > rather than degrade the group due to some link issue). > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > than to limit the speed of the entire cluster. Why are you joining these nodes then? Anyway, could always be an option. -- MST From sweitzen at cisco.com Wed Apr 11 21:25:47 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 11 Apr 2007 21:25:47 -0700 Subject: [ofa-general] RE: [Bug 506] IPoIB IPv4 multicast throughput is poor In-Reply-To: <20070412041653.GE24730@mellanox.co.il> References: <20070411223924.879E9E60390@openfabrics.org> <20070412041653.GE24730@mellanox.co.il> Message-ID: Michael or Roland, could you please try iperf with UDP vs multicast, so I'm not the middleman here? Scott > -----Original Message----- > From: Michael S. 
Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Wednesday, April 11, 2007 9:17 PM > To: bugmail at lists.openfabrics.org; Roland Dreier; Scott > Weitzenkamp (sweitzen); general at lists.openfabrics.org > Subject: Re: [Bug 506] IPoIB IPv4 multicast throughput is poor > > > Yes, get a stready stream of these on sender. > > > > ib0: TX ring full, stopping kernel net queue > > Aha. (As a note, it's always useful to set debug level when > you experience problems). > > > Why am I getting low throughput of IP multicast vs IPoIB UD > UDP unicast? > > It's something in the hardware - it's not handing out > completions fast enough, . Try increasing send queue size > through IPoIB module parameter. > > > BTW, Roland, why aren't we using txqueuelen ifconfig/ethtool > options here? > > -- > MST > From k_mahesh85 at yahoo.co.in Wed Apr 11 23:18:45 2007 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Thu, 12 Apr 2007 07:18:45 +0100 (BST) Subject: [ofa-general] SystemImageGUID? Message-ID: <516378.7627.qm@web8325.mail.in.yahoo.com> Hi all, Can anyone explain me what is the intended use of SystemImageGUID? I f there any existing examples demonstrating the use of SystemImageGUID please give me the pointers to those. FYI, IB spec says, "GUID associating this node with other nodes controlled by common supervisory code. Provides a means for system software to indicate the availability of multiple paths to the same destination via multiple nodes. Set to zero if indication of node association is not desired. The SystemImageGUID may be the NodeGUID of one of the associated nodes if that node is not field-replaceable." -Mahesh --------------------------------- Check out what you're missing if you're not on Yahoo! Messenger -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dotanb at dev.mellanox.co.il Thu Apr 12 00:32:25 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 12 Apr 2007 10:32:25 +0300 Subject: [ofa-general] SystemImageGUID? In-Reply-To: <516378.7627.qm@web8325.mail.in.yahoo.com> References: <516378.7627.qm@web8325.mail.in.yahoo.com> Message-ID: <461DE089.9020503@dev.mellanox.co.il> keshetti mahesh wrote: > Hi all, > > Can anyone explain me what is the intended use of SystemImageGUID? I > f there any existing examples demonstrating the use of SystemImageGUID > please give me the pointers to those. > The idea of the SystemImageGUID is that you will have an indicators of several nodes that exists in the same entity. It is the same idea about NodeGUID and PortGUID: All of the IB ports of the same device has the same NodeGUID but every port has a different PortGUID. If you have a big system (for example: a big switching system which internally has several switch chips) every node in the system will has a different NodeGUID but all of the nodes in the system will have the same SystemImageGUID. This helps the SM in the routing decisions. Dotan From gurhan.ozen at gmail.com Thu Apr 12 02:04:44 2007 From: gurhan.ozen at gmail.com (G.O.) Date: Thu, 12 Apr 2007 05:04:44 -0400 Subject: [ofa-general] does RHEL5 Xen work with OFED? In-Reply-To: <20070410181810.GD10218@mellanox.co.il> References: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com> <20070410181810.GD10218@mellanox.co.il> Message-ID: <5849f1820704120204q7f88f098qb69c1399668a4be9@mail.gmail.com> On 4/10/07, Michael S. Tsirkin wrote: > > Quoting G.O. : > > Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > > > > On 4/5/07, Scott Weitzenkamp (sweitzen) wrote: > > >Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual > > >machine? > > > > > > > I haven't tested SRP/iSER , but IPoIB works only on dom0 kernel. > > You can't use any infiniband stuff on the guest OSes . > > > > Gurhan > > What doesn't work? 
I would expect both IPoIB and SRP > behave in more or less the same way as any network/storage > devices, and get virtualized by Xen. > > -- > MST > Nothing works. Guest kernel didn't even create /sys/class/infiniband/* files. 'Far as the guest kernel is concerned, HCA doesn't even seem to exist. Just as a FYI, I have only tried on paravirtualized guests, didn't try it with fully-virtualized guests. Thanks, Gurhan From yangdong at ncic.ac.cn Thu Apr 12 02:18:35 2007 From: yangdong at ncic.ac.cn (ncic) Date: Thu, 12 Apr 2007 17:18:35 +0800 Subject: [ofa-general] Bandwidth up first and then down by testing on sdp Message-ID: <461DF96B.40907@ncic.ac.cn>

Bandwidth(MB/s)    SENDLINE(K)
3379.961           4
4673.219           8
4678.489           16
4671.999           32
4675.453           64
4659.176           128
4655.834           256
4574.405           512
3701.894           1024
3588.939           2048

Above I give some data about WR sock on SDP. It puzzles me that the bandwidth first goes up as the sendline size increases, and then comes down as the sendline size increases further. Can anyone explain this to me? Thanks.
From vlad at lists.openfabrics.org Thu Apr 12 02:36:33 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 12 Apr 2007 02:36:33 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070412-0200 daily build status Message-ID: <20070412093634.41D23E60808@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with 
linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From dotanb at dev.mellanox.co.il Thu Apr 12 04:36:57 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 12 Apr 2007 14:36:57 +0300 Subject: [ofa-general] Re: Question about registering the [vdso] memory section in user level In-Reply-To: References: <460B8705.9030904@dev.mellanox.co.il> <20070329094700.GB4253@mellanox.co.il> <20070329233622.GM5436@mellanox.co.il> Message-ID: <461E19D9.3080501@dev.mellanox.co.il> Roland Dreier wrote: > > > > Yes, you can't DMA to VDSO VMA I don't think. > > > Why not? It's just RAM... > > Well ... isn't it read-only? > > True... you shouldn't be able to DMA to it. But I assume Dotan is > trying to register the memory with read-only permission and DMA from > it. Dotan, can you be more explicit about what your test is and how > it fails? > > In user level I'm looking at all of the VMAs of the process and try to register the last VMA that has read permission (with ONLY read permission enabled in the MR permissions). ibv_reg_mr fails for me. When i added some debug prints i noticed the failure in file: uverbs_mem.c function: get_page_shift, find_vma returned NULL. 
thanks Dotan From ogerlitz at voltaire.com Thu Apr 12 04:54:06 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Apr 2007 14:54:06 +0300 Subject: [ofa-general] Re: [ewg] questions about OFED 1.2 IPoIB bonding In-Reply-To: <461B4488.8070705@gmail.com> References: <461B4488.8070705@gmail.com> Message-ID: <461E1DDE.40804@voltaire.com> Moni Shoua wrote: > Scott Weitzenkamp (sweitzen) wrote: >> 1) IPoIB bonding and IPoIB CM do seem to work together, but after >> running ib-bond --bond-ip, I have to manually reconfigure IPoIB CM (both >> mode and mtu) again, then increase the bond0 mtu. It would be nice if >> ib-bond took care of this for me. > I haven't had a chance yet to test bonding with IPoIB-CM. I'll look into it and try to > fix what's needed. I have tried this (bonding ipoib devices whose mode is connected and mtu is 65520) and indeed the bond and slaves mtu becomes lower but the mode does not change. >> 5) I've seen some erratic throughput with netperf using bond0 (no >> failover happening), have you seen this? For example: > Can you add please more details about the test environment? OS, ARCH, HW, etc... can you provide the exact --netperf command line-- you were using. thanks, Or. From ogerlitz at voltaire.com Thu Apr 12 05:01:33 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Apr 2007 15:01:33 +0300 Subject: [ofa-general] iser/lustre memfree issues In-Reply-To: References: <461B9E0D.4000008@mellanox.co.il> <461BA2D0.3070504@voltaire.com> <461CBAFD.10106@voltaire.com> <20070411105029.GO24730@mellanox.co.il> <15ddcffd0704111545k245401a2va2fe78b9804fe5a2@mail.gmail.com> Message-ID: <461E1F9D.8040105@voltaire.com> Roland Dreier wrote: > > If its indeed the case, does removing this line provides a solution to > > the problem or just a work around? > > Obviously it's just a work around, since it disables this performance > enhancement in the firmware. 
OK, commenting these two lines in the init hca code makes the problem disappear, its been running without any problem for few hours and i will leave it for the weekend. Roland - is this a FW issue? if yes - Michael, what do you suggest? Or. From olivier.cozette at seanodes.com Thu Apr 12 05:55:06 2007 From: olivier.cozette at seanodes.com (Olivier Cozette) Date: Thu, 12 Apr 2007 14:55:06 +0200 Subject: [ofa-general] Help with an MTHCA "catastrophe" In-Reply-To: References: <029101c76b19$8af42900$0281a8c0@ebpc> <200704101056.45044.olivier.cozette@seanodes.com> Message-ID: <200704121455.07380.olivier.cozette@seanodes.com> Todd, Sorry for this late reply, > I am having similar issues with the same firmware. > Can you give me some more details? I have this bug on MT25204 (InfiniHost III Lx HCA memfree rev a0 PCI Express), on 30 nodes, with firmware 1.2.0 (the last from 26 December 2006). Note that i have no problem with my MT23108 (InfiniHost 2MiB rev a1 PCI-X), this last board give a normal error when srq are empty when receiving a new buffer. > Did you make the changes on the driver side or the application? In my application (my application directly use libibverbs), i just change the max number of completion event in completion queue ( ibv_vreate_cq() ) and the max number of receive buffer (ibv_create_srq()), and i always post enough buffer in srq than needed by my apply conception (my apply can not receive more than N buffer without consumed some of them and tell to the sender it's ok). With these changes, now my appli can no more receive more buffer than buffer posted in srq and always have enough place cq for all completion event (receive+send completion). So now, i have no more catastrophic error, but i have sometimes "ib_mthca 0000:0c:00.0: Async event for bogus QP 00180405", in this case the buffer was correctly sent (no error on sender) but receiver was not wake up in its ibv_get_cq_event(). 
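The sizing rule described above (a CQ large enough for every possible completion, and a sender that never has more than N messages outstanding) can be sketched as follows. This is only an illustration under assumed names, not the application's actual code: `min_cq_depth`, `struct credit_window`, `credit_try_send` and `credit_ack` are hypothetical helpers and do not call libibverbs.

```c
#include <stdbool.h>

/* Smallest CQ depth that cannot overflow if every posted receive and every
 * outstanding send may complete before the CQ is polled. */
unsigned int min_cq_depth(unsigned int max_srq_wr, unsigned int max_send_wr)
{
    return max_srq_wr + max_send_wr;
}

/* Hypothetical sender-side credit window: at most `limit` messages may be
 * in flight before the receiver reports that it consumed some of them. */
struct credit_window {
    unsigned int limit;
    unsigned int in_flight;
};

/* Returns true and takes one credit if a send is allowed right now. */
bool credit_try_send(struct credit_window *w)
{
    if (w->in_flight >= w->limit)
        return false;
    w->in_flight++;
    return true;
}

/* The receiver reported that it consumed (and reposted) n buffers. */
void credit_ack(struct credit_window *w, unsigned int n)
{
    w->in_flight = (n > w->in_flight) ? 0 : w->in_flight - n;
}
```

If `limit` is no larger than the number of buffers kept posted in the SRQ, a well-behaved sender cannot find the SRQ empty, which is the condition that preceded the catastrophic error in this report.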
> If on the driver, can you point me in the right direction to make those > changes? Perhaps, you change is only to increase you srq/cq length, post enought buffer in it, and add things to wake up your ibv_get_cq_event() after some timeout to see if ibv_poll_cq() can find something. But, it seems that the men of openfabrics working on this bug " iser/lustre memfree issues" Olivier From swise at opengridcomputing.com Thu Apr 12 05:56:34 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 12 Apr 2007 07:56:34 -0500 Subject: [ofa-general] [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler. Message-ID: <1176382594.9396.1.camel@stevo-desktop> Hey Roland, This patch is needed for iw_cxgb3 to handle a change in the cxgb3 driver posted by Divy that Jeff recently applied. If the cxgb3 change is destined for 2.6.21, then this change to iw_cxgb3 also needs to go in (otherwise we get an error log entry for every rdma connection). It was an oversight that this patch didn't really get included in Divy's series since the two go together. See http://marc.info/?l=linux-netdev&m=117617444422260&w=2 Thanks, Steve. --- Add set_tcb_rpl_handler. The Ethernet Driver no longer handles SET_TCB replies. 
Signed-off-by: Steve Wise
---
 drivers/infiniband/hw/cxgb3/iwch_cm.c | 12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index d0ed1d3..2d2de9b 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -2026,6 +2026,17 @@ static int sched(struct t3cdev *tdev, st
 	return 0;
 }
 
+static int set_tcb_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_set_tcb_rpl *rpl = cplhdr(skb);
+
+	if (rpl->status != CPL_ERR_NONE) {
+		printk(KERN_ERR MOD "Unexpected SET_TCB_RPL status %u "
+		       "for tid %u\n", rpl->status, GET_TID(rpl));
+	}
+	return CPL_RET_BUF_DONE;
+}
+
 int __init iwch_cm_init(void)
 {
 	skb_queue_head_init(&rxq);
@@ -2053,6 +2064,7 @@ int __init iwch_cm_init(void)
 	t3c_handlers[CPL_ABORT_REQ_RSS] = sched;
 	t3c_handlers[CPL_RDMA_TERMINATE] = sched;
 	t3c_handlers[CPL_RDMA_EC_STATUS] = sched;
+	t3c_handlers[CPL_SET_TCB_RPL] = set_tcb_rpl;
 
 	/*
 	 * These are the real handlers that are called from a

From halr at voltaire.com Thu Apr 12 06:26:21 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2007 09:26:21 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070412033803.GC24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070412033803.GC24730@mellanox.co.il> Message-ID: <1176384380.4545.100668.camel@hal.voltaire.com> On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: multicast join failed for... > > > > On Wed, 2007-04-11 at 15:47, Michael S.
Tsirkin wrote: > > > > Quoting Hal Rosenstock : > > > > Subject: Re: multicast join failed for... > > > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > > > Would something like the following heuristic work better? > > > > > > > - select the max rate between all participants > > > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > > they are joined dynamically. > > > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > > past.) > > > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > > dynamically. > > > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > > existing members. > > > > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > > to this. Seems a little harsh to me. > > > > > > I think it's not too bad > > > > It requires all subscriptions to reregister. This affects more things > > than just multicast or even the groups affected which might not be all > > of the multicast groups. Hence BIG hammer. > > Changing an option in opensm config requires restarting > opensm. Isn't that right? Yes but that doesn't have to be the case going forward in terms of OpenSM reconfig. > So its an even bigger hammer. Restarting opensm is a slightly bigger hammer right now (than client reregistration) in the case the admin wants it "dynamic" but I suspect this only needs to be done once. > > There could be a more > > graceful way to deal with this. I don't like using client reregister > > unless absolutely needed. > > What are the other options that have the same funcitionality? 
Perhaps a spec enhancement is possible to make this better. > > > - previously we had some client failing join > > > which is worse. > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > rather than degrade the group due to some link issue). > > Rate could be an option, but I think generally people prefer > things working even if at a slower rate. I think it's a coin flip. I've seen it both ways and either way there are support questions. In the current scenario, it is join failures. In the proposed scenario, it is more subtle: performance implications and perhaps SA network storms. > Wat does opensm do now? > I think it uses the max possible rate when group is created. > Is that so? It uses the rate as configured in the partitions file or the default rate of 10 Gbs if not (for the per partition IPv4 broadcast group). > > > And we can still keep an option to limit the rate > > > manually. > > > > > > > I'm not convinced it's even > > > > required either, > > > > > > How do you mean? All end-points must know the rate is now lower. > > > > I didn't think we had the complete story yet on what is going on. > > You are speaking about a specific instance then? > OK, but I'm speaking generally, the issue comes up > quite often. OK, true and it is has been discussed on the list quite often. -- Hal From mst at dev.mellanox.co.il Thu Apr 12 07:08:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 17:08:43 +0300 Subject: [ofa-general] Re: multicast join failed for... 
In-Reply-To: <1176384380.4545.100668.camel@hal.voltaire.com> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070412033803.GC24730@mellanox.co.il> <1176384380.4545.100668.camel@hal.voltaire.com> Message-ID: <20070412140843.GK24730@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: multicast join failed for... > > On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > Subject: Re: multicast join failed for... > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > Quoting Hal Rosenstock : > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > > > > > Would something like the following heuristic work better? > > > > > > > > - select the max rate between all participants > > > > > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > > > they are joined dynamically. > > > > > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > > > past.) > > > > > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > > > dynamically. > > > > > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > > > existing members. > > > > > > > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > > > to this. Seems a little harsh to me. 
> > > > > > > > I think it's not too bad > > > > > > It requires all subscriptions to reregister. This affects more things > > > than just multicast or even the groups affected which might not be all > > > of the multicast groups. Hence BIG hammer. > > > > Changing an option in opensm config requires restarting > > opensm. Isn't that right? > > Yes but that doesn't have to be the case going forward in terms of > OpenSM reconfig. > > > > So its an even bigger hammer. > > Restarting opensm is a slightly bigger hammer right now (than client > reregistration) in the case the admin wants it "dynamic" but I suspect > this only needs to be done once. I think you forgot that currently one has to edit the config file, just restarting opensm isn't enough :). Let the user decide for us is a *HUGE* hammer - it usually solves all problem, but at what cost? > > > There could be a more > > > graceful way to deal with this. I don't like using client reregister > > > unless absolutely needed. > > > > What are the other options that have the same funcitionality? > > Perhaps a spec enhancement is possible to make this better. Sure. Meanwhile, opensm will have to support legacy networks too so I think we can start with the reregister solution. > > > > - previously we had some client failing join > > > > which is worse. > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > rather than degrade the group due to some link issue). > > > > Rate could be an option, but I think generally people prefer > > things working even if at a slower rate. > > I think it's a coin flip. I disagree. I think people that want the join to fail basically just want to make debugging easy. We can help them without failing joins. > I've seen it both ways and either way there > are support questions. I think we can solve this relatively easily: compare the bcast group rate with local rate and have IPoIB produce a warning in log if these do not match. 
This is similar to what we have with a USB2.0 device in a USB slot, people seem to be happy. > In the current scenario, it is join failures. In > the proposed scenario, it is more subtle: performance implications and > perhaps SA network storms. I don't believe we'll see network storms: rate has to drop from DDR to SDR only once. -- MST From mst at dev.mellanox.co.il Thu Apr 12 07:14:17 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 17:14:17 +0300 Subject: [ofa-general] does RHEL5 Xen work with OFED? In-Reply-To: <5849f1820704120204q7f88f098qb69c1399668a4be9@mail.gmail.com> References: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com> <20070410181810.GD10218@mellanox.co.il> <5849f1820704120204q7f88f098qb69c1399668a4be9@mail.gmail.com> Message-ID: <20070412141417.GM24730@mellanox.co.il> > Quoting G.O. : > Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > > On 4/10/07, Michael S. Tsirkin wrote: > >> Quoting G.O. : > >> Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > >> > >> On 4/5/07, Scott Weitzenkamp (sweitzen) wrote: > >> >Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual > >> >machine? > >> > > >> > >> I haven't tested SRP/iSER , but IPoIB works only on dom0 kernel. > >> You can't use any infiniband stuff on the guest OSes . > >> > >> Gurhan > > > >What doesn't work? I would expect both IPoIB and SRP > >behave in more or less the same way as any network/storage > >devices, and get virtualized by Xen. > > > > Nothing works. Guest kernel didn't even create > /sys/class/infiniband/* files. 'Far as the guest kernel is concerned, > HCA doesn't even seem to exist. > > Just as a FYI, I have only tried on paravirtualized guests, didn't > try it with fully-virtualized guests. Why would you want to see /sys/class/infiniband/? These things are only there for direct HW access, guests do not get that.
You should be able to use SRP and IPoIB - you set it up in host (dom0) and guests use it as any other network/storage device through the virtualization layer. -- MST From changquing.tang at hp.com Thu Apr 12 07:21:31 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 12 Apr 2007 15:21:31 +0100 Subject: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com><1174519319.17678.25309.camel@hal.voltaire.com><1174658146.24305.148489.camel@hal.voltaire.com><20070323152822.GH17532@mellanox.co.il><4604034B.6030507@ichips.intel.com><349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net><461D581B.30802@ichips.intel.com><349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403010F879D@G3W0634.americas.hpqcorp.net> Roland: Thanks for the suggestion. What is the minimum safe value of timeout for typically IB network with 2-3 level of switch ? --CQ > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Wednesday, April 11, 2007 10:48 PM > To: Tang, Changqing > Cc: Sean Hefty; general at lists.openfabrics.org > Subject: Re: [ofa-general] RE: How fast to get > RDMA_CM_EVENT_DISCONNECTED ? > > > Yes, Internally in A, if the # of receives exceeds lowwater(4), an ack will be sent back. I assume ACK is not triggered at the moment. > > when A is trying to receive a message from B, and the message never shows, A actually sends a heart beat back to B, however, it takes several seconds for this heart-beat to complete with error ( we configure timeout ~1 sec, and retry count 7). > > > > Several seconds to detect connection failure is not acceptable for us, so if I use rdmacm, I want to know if I detect the connection failure faster than heart-beat message. > > I think there is an internal contradiction in what you're doing here.
> If your (ACK timeout) * (retry count) exceeds the time that > you consider acceptable to detect a failure, then you've set > your connection up wrong. It's not even meaningful to talk > about a connection failing faster than this amount of time -- > a connection will recover from a transient network failure > that resolves itself before the last retry fails, and without > a time machine it's impossible to say whether a network > failure will or will not be resolved 7 seconds into the future. > > Certainly if you receive a disconnect request, then you know > the remote side is really and truly gone. But if you've set > your timeouts/retry counts so that connections will take 7 > seconds to fail after an event like a link going down, then > there's no way to detect that failure before it occurs. > > It seems to me the solution is to reduce your timeout and/or > retry count so that connections fail within the time scale > that you require. > > - R. > From divy at chelsio.com Thu Apr 12 07:44:19 2007 From: divy at chelsio.com (Divy Le Ray) Date: Thu, 12 Apr 2007 07:44:19 -0700 Subject: [ofa-general] Re: [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler. In-Reply-To: <1176382594.9396.1.camel@stevo-desktop> References: <1176382594.9396.1.camel@stevo-desktop> Message-ID: <461E45C3.2050400@chelsio.com> Steve Wise wrote: > Hey Roland, > > This patch is needed for iw_cxgb3 to handle a change in the cxgb3 driver > posted by Divy that Jeff recently applied. If the cxgb3 change is > destined for 2.6.21, then this change to iw_cxgb3 also needs to go in > (otherwise we get an error log entry for every rdma connection). > > It was an oversight that this patch didn't really get included in Divy's > series since the two go together. > > See http://marc.info/?l=linux-netdev&m=117617444422260&w=2 > > > Thanks, > > Steve. > > > --- > > > Add set_tcb_rpl_handler. > > The Ethernet Driver no longer handles SET_TCB replies. 
> > Signed-off-by: Steve Wise > Acked-by: Divy Le Ray From mst at mellanox.co.il Thu Apr 12 08:10:25 2007 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 18:10:25 +0300 Subject: [ofa-general] [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: <20070411100820.GN24730@mellanox.co.il> References: <20070411100820.GN24730@mellanox.co.il> Message-ID: <20070412151025.GP24730@mellanox.co.il> It turns out that with mthca, reliable QPs might starve each other, and even UD QPs on the same schedule queue. As a result, we observed userspace MPI starving e.g. IPoIB traffic, with netdev watchdog warnings getting printed out, and TCP connections getting stuck or failing. Reduce the chance of this happening by separating reliable QPs, as well as userspace and kernel QPs to different hardware schedule queues. Signed-off-by: Michael S. Tsirkin --- Roland, this fixes a problem we see on large clusters which mix openmpi and ipoib. Could this be queued for 2.6.21? Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_qp.c @@ -701,6 +701,19 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); } + if (ibqp->qp_type == IB_QPT_RC && + cur_state == IB_QPS_INIT && new_state == IB_QPS_RTR) { + u8 sched_queue = ibqp->uobject ? 0x2 : 0x1; + + if (mthca_is_memfree(dev)) + qp_context->rlkey_arbel_sched_queue |= sched_queue; + else + qp_context->tavor_sched_queue |= cpu_to_be32(sched_queue); + + qp_param->opt_param_mask |= + cpu_to_be32(MTHCA_QP_OPTPAR_SCHED_QUEUE); + } + if (attr_mask & IB_QP_TIMEOUT) { qp_context->pri_path.ackto = attr->timeout << 3; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); -- Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd. 
Eternity is a very long time, especially towards the end. From rdreier at cisco.com Thu Apr 12 08:23:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 08:23:57 -0700 Subject: [ofa-general] Re: [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler. In-Reply-To: <461E45C3.2050400@chelsio.com> (Divy Le Ray's message of "Thu, 12 Apr 2007 07:44:19 -0700") References: <1176382594.9396.1.camel@stevo-desktop> <461E45C3.2050400@chelsio.com> Message-ID: So is the cxgb3 net driver change in question already in Linus's tree? What is the exact patch that this change goes with? From rdreier at cisco.com Thu Apr 12 08:26:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 08:26:12 -0700 Subject: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403010F879D@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Thu, 12 Apr 2007 15:21:31 +0100") References: <000001c76b85$74adfb50$18fd070a@amr.corp.intel.com> <1174519319.17678.25309.camel@hal.voltaire.com> <1174658146.24305.148489.camel@hal.voltaire.com> <20070323152822.GH17532@mellanox.co.il> <4604034B.6030507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2B4B@G3W0634.americas.hpqcorp.net> <461D581B.30802@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403010C2D33@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403010F879D@G3W0634.americas.hpqcorp.net> Message-ID: > Thanks for the suggestion. What is the minimum safe value of timeout for > typically IB network with 2-3 level of switch ? It depends, since congestion may delay messages for quite a while. Probably a timeout of 100 milliseconds or so works pretty well. Of course there is a tradeoff here between false positives (killing connections because of transient conditions) and reaction time, which you'll have to tune for yourself. 
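A rough sketch of the arithmetic behind the timeout advice above, assuming the usual InfiniBand encoding in which the QP local ACK timeout field means 4.096 usec * 2^value, and using the (ACK timeout) * (retry count) estimate from earlier in this thread as an approximate failure-detection time. The helper names are made up for illustration and are not part of any verbs API.

```c
#include <assert.h>

/* Decode the 5-bit local ACK timeout field: 4.096 usec * 2^value.
 * Assumption: standard IB encoding. A value of 0 means "no timeout"
 * and is deliberately not handled here. */
static double ack_timeout_usec(unsigned int value)
{
	assert(value >= 1 && value <= 31);
	return 4.096 * (double)(1UL << value);
}

/* Hypothetical helper: smallest encoded value whose timeout is at
 * least target_usec. Saturates at 31, the largest encodable value. */
static unsigned int encode_ack_timeout(double target_usec)
{
	unsigned int value = 1;

	while (value < 31 && ack_timeout_usec(value) < target_usec)
		value++;
	return value;
}

/* Rough time until the QP gives up, following the thread's
 * (ACK timeout) * (retry count) estimate. */
static double detect_time_usec(unsigned int value, unsigned int retry_cnt)
{
	return ack_timeout_usec(value) * retry_cnt;
}
```

Under these assumptions a 100 millisecond target rounds up to an encoded value of 15 (about 134 milliseconds per attempt), and with a retry count of 7 the estimate comes to a bit under one second, so the retry count may need to come down as well if faster detection is wanted.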
From xma at us.ibm.com Thu Apr 12 08:26:26 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 12 Apr 2007 08:26:26 -0700 Subject: [ofa-general] [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: <20070412151025.GP24730@mellanox.co.il> Message-ID: Hello Michael, We saw the same problem. Is a userspace patch needed? Thanks Shirley Ma From rdreier at cisco.com Thu Apr 12 08:31:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 08:31:29 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: <20070412151025.GP24730@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Apr 2007 18:10:25 +0300") References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> Message-ID: I think it's too late for 2.6.21, since this is really not an obvious change and we don't know how it will interact with all the different HCAs and FW versions in use. - R. From rdreier at cisco.com Thu Apr 12 08:36:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 08:36:25 -0700 Subject: [ofa-general] Re: Question about registering the [vdso] memory section in user level In-Reply-To: <461E19D9.3080501@dev.mellanox.co.il> (Dotan Barak's message of "Thu, 12 Apr 2007 14:36:57 +0300") References: <460B8705.9030904@dev.mellanox.co.il> <20070329094700.GB4253@mellanox.co.il> <20070329233622.GM5436@mellanox.co.il> <461E19D9.3080501@dev.mellanox.co.il> Message-ID: > ibv_reg_mr fails for me. > When i added some debug prints i noticed the failure in file: > uverbs_mem.c function: get_page_shift, > find_vma returned NULL. get_page_shift() doesn't appear in the upstream kernel, so this is some patch from OFED breaking things I guess. Does the test work with an unpatched kernel? - R.
From swise at opengridcomputing.com Thu Apr 12 08:42:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 12 Apr 2007 10:42:46 -0500 Subject: [ofa-general] Re: [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler.
In-Reply-To: References: <1176382594.9396.1.camel@stevo-desktop> <461E45C3.2050400@chelsio.com> Message-ID: <1176392566.9396.11.camel@stevo-desktop> On Thu, 2007-04-12 at 08:23 -0700, Roland Dreier wrote: > So is the cxgb3 net driver change in question already in Linus's tree? > What is the exact patch that this change goes with? The patch is the 3rd of 3: http://marc.info/?l=linux-kernel&m=117617444622279&w=2 Jeff applied it into his upstream tree here: http://marc.info/?l=linux-netdev&m=117630664627997&w=2 From weiny2 at llnl.gov Thu Apr 12 08:46:23 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 12 Apr 2007 08:46:23 -0700 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070412042155.GF24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> Message-ID: <20070412084623.74b035d9.weiny2@llnl.gov> On Thu, 12 Apr 2007 07:21:55 +0300 "Michael S. Tsirkin" wrote: > > Quoting Ira Weiny : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On 11 Apr 2007 17:45:54 -0400 > > Hal Rosenstock wrote: > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > - previously we had some client failing join > > > > which is worse. > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > rather than degrade the group due to some link issue). > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > than to limit the speed of the entire cluster. > > Why are you joining these nodes then? > Anyway, could always be an option.
> We have seen a specific example where a node's 4X link comes up at 1X. In this case we would want the join to fail. Basically a single hardware error, isolated to 1 node, should not affect the other 1150 nodes, which could very well be running a user's job. Certainly if there is a heterogeneous network we would want different behavior but we don't operate any of our clusters like that. After reading today's posts I think it should be an option. If someone has a mixture they can configure it. I am not sure what the default should be though. I know we would want the join to fail, but I understand the argument to allow it to work. Ira From tziporet at dev.mellanox.co.il Thu Apr 12 08:56:49 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 12 Apr 2007 18:56:49 +0300 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> Message-ID: <461E56C1.4010500@mellanox.co.il> Roland Dreier wrote: > I think it's too late for 2.6.21, since this is really not an obvious > change and we don't know how it will interact with all the different > HCAs and FW versions in use. > > We test it here with all our HCAs (results are good). In any case we will put it into OFED 1.2 Tziporet From sweitzen at cisco.com Thu Apr 12 08:56:11 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 12 Apr 2007 08:56:11 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QPstarvation In-Reply-To: <461E56C1.4010500@mellanox.co.il> References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> <461E56C1.4010500@mellanox.co.il> Message-ID: Tziporet, can you open a bug please?
> -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Tziporet Koren > Sent: Thursday, April 12, 2007 8:57 AM > To: Roland Dreier (rdreier) > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Re: [PATCH] IB/mthca: work around > kernel QPstarvation > > Roland Dreier wrote: > > I think it's too late for 2.6.21, since this is really not > an obvious > > change and we don't know how it will interact with all the different > > HCAs and FW versions in use. > > > > > We test it here with all our HCAs (results are good). > In any case we will put it into OFED 1.2 > > Tziporet From rdreier at cisco.com Thu Apr 12 09:07:20 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 09:07:20 -0700 Subject: [ofa-general] Re: [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler. In-Reply-To: <1176392566.9396.11.camel@stevo-desktop> (Steve Wise's message of "Thu, 12 Apr 2007 10:42:46 -0500") References: <1176382594.9396.1.camel@stevo-desktop> <461E45C3.2050400@chelsio.com> <1176392566.9396.11.camel@stevo-desktop> Message-ID: > The patch is the 3rd of 3: > > http://marc.info/?l=linux-kernel&m=117617444622279&w=2 > > Jeff applied it into his upstream tree here: > > http://marc.info/?l=linux-netdev&m=117630664627997&w=2 OK, so it's not in Linus's tree yet. Jeff, how do you want to handle this? (That last patch breaks drivers/infiniband/hw/cxgb3, and Steve posted a fix to handle it) The cleanest thing would be for you to roll up Steve's fix into the patch you merged, so that Linus's tree is never in the state where it has half the change merged. But I don't know if that fits your workflow. Thanks...
From rdreier at cisco.com Thu Apr 12 09:10:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 09:10:35 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: <461E56C1.4010500@mellanox.co.il> (Tziporet Koren's message of "Thu, 12 Apr 2007 18:56:49 +0300") References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> <461E56C1.4010500@mellanox.co.il> Message-ID: > We test it here with all our HCAs (results are good). > In any case we will put it into OFED 1.2 Seems like a good plan. I have no real objection to this patch, but the 2.6.21 kernel tree is at the stage where we really only want to merge very urgent fixes. This is something that can go into 2.6.21.x later if it proves to be a good idea. - R. From gurhan.ozen at gmail.com Thu Apr 12 09:25:51 2007 From: gurhan.ozen at gmail.com (G.O.) Date: Thu, 12 Apr 2007 12:25:51 -0400 Subject: [ofa-general] does RHEL5 Xen work with OFED? In-Reply-To: <20070412141417.GM24730@mellanox.co.il> References: <5849f1820704052125ob1d309do323eae651ea9ed91@mail.gmail.com> <20070410181810.GD10218@mellanox.co.il> <5849f1820704120204q7f88f098qb69c1399668a4be9@mail.gmail.com> <20070412141417.GM24730@mellanox.co.il> Message-ID: <5849f1820704120925n10871803gb729e7767a64fecf@mail.gmail.com> On 4/12/07, Michael S. Tsirkin wrote: > > Quoting G.O. : > > Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > > > > On 4/10/07, Michael S. Tsirkin wrote: > > >> Quoting G.O. : > > >> Subject: Re: [ofa-general] does RHEL5 Xen work with OFED? > > >> > > >> On 4/5/07, Scott Weitzenkamp (sweitzen) wrote: > > >> >Can I access OFED IPoIB and SRP/iSER devices from within a Xen virtual > > >> >machine? > > >> > > > >> > > >> I haven't tested SRP/iSER , but IPoIB works only on dom0 kernel. > > >> You can't use any infiniband stuff on the guest OSes . > > >> > > >> Gurhan > > > > > >What doesn't work? 
I would expect both IPoIB and SRP > > >behave in more or less the same way as any network/storage > > >devices, and get virtualized by Xen. > > > > > > > Nothing works. Guest kernel didn't even create > > /sys/class/infiniband/* files. 'Far as the guest kernel is concerned, > > HCA doesn't even seem to exist. > > > > Just as a FYI, I have only tried on paravirtualized guests, didn't > > try it with fully-virtualized guests. > > Why would you want to see /sys/class/infiniband/? > There things are only there for direct HW access, guests do not get that. > > You should be able to use SRP and IPoIB - you set it up in host (dom0) > and guests use it as any other network/storage device through the > virtualization layer. > Hi Michael, IIRC, i had got the "can't find device, initialization delayed" errors. I'll play around with it again with the GA release when I get a chance and will let you know. Might happen as early as next week. Thanks, Gurhan > > -- > MST > From etta at systemfabricworks.com Thu Apr 12 09:37:12 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Thu, 12 Apr 2007 11:37:12 -0500 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: Message-ID: <000c01c77d20$d2bd5f40$c801a8c0@ettac> I tried adding/removing new storage on sles10. It took few minutes to find the new target devices (the new target message was showed on /var/log/messages) then took few minutes to add the path. I did not run multipath again. The srp_daemon.sh scanned the new target and added path automatically. Thanks, Etta -----Original Message----- From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Wednesday, April 11, 2007 4:59 PM To: Ishai Rabinovitz; Chieng Etta Cc: Roland Dreier (rdreier); ewg at lists.openfabrics.org; openib; mkohari at novell.com Subject: RE: [ewg] Re: SRP HA dm_multipath testing and questions I haven't tried adding or removing storage, just failover. 
I guess leave 91-srp.rules in for now, it seems benign. Scott > -----Original Message----- > From: Ishai Rabinovitz [mailto:ishai at dev.mellanox.co.il] > Sent: Tuesday, April 10, 2007 9:46 PM > To: Chieng Etta > Cc: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier); > ewg at lists.openfabrics.org; 'openib'; mkohari at novell.com > Subject: Re: [ewg] Re: SRP HA dm_multipath testing and questions > > Chieng Etta wrote: > > > > Scott Weitzenkamp (sweitzen) wrote: > >> I've been testing SRP HA and dm_multipath with: > >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID > >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID > >> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs > >> > >> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig > multipathd on", > >> then rebooted. On SLES 10, I ran "chkconfig > boot.multipath on" and > >> "chkconfig multipathd on", then rebooted. Ishai, I don't > seem to need > >> 91-srp.rules, are you using the boot.multipath and > multipathd scripts? > > > > On RHEL4 you really do not need 91-srp.rules and it is not used (see > > /etc/init.d/openibd) > > On SLES10 I was sure that you need it. I checked it, and > you are correct. I > > don't see how it does it, but it seems that when using > boot.multipath there > > is no need for 91-srp.rules. I will check it more deeply and change > > documentation and openibd script accordingly. > > > > [EC] I just verified it on SLES10 x86_64. The multipath > worked fine by > > using boot.multipath without 91-srp.rules. > > > In one of Novell's documents (SLES 10 Storage Administration > Guide for EVMS - In section 5 Managing Multipath I/O for > Devices > http://www.novell.com/documentation/sles10/index.html?page=/documentation/sles10/stor_evms/data/multipathing.html) it says in subsection 5.7 that after a new target
> (As I understand it from the document this is true even after > boot.multipath is running) > > Experiments in my environment also indicates that after > executing boot.multipath, SRP HA is working also without > 91-srp.rules, but after reading this document I'm even more confused. > > > > > Ishai, in the SRP release notes - section 6, srp_daemon a., > the first line > > should be changed to '"srp_daemon -a -o" is equivalent to > "ibsrpdm"'. > > > > > Thanks, However Scott already noticed that and I already > fixed it. You will see it in the next documentation version. > From mst at dev.mellanox.co.il Thu Apr 12 09:53:20 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 19:53:20 +0300 Subject: [ofa-general] Re: Question about registering the [vdso] memory section in user level In-Reply-To: References: <460B8705.9030904@dev.mellanox.co.il> <20070329094700.GB4253@mellanox.co.il> <20070329233622.GM5436@mellanox.co.il> <461E19D9.3080501@dev.mellanox.co.il> Message-ID: <20070412165320.GR24730@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Question about registering the [vdso] memory section in user level > > > ibv_reg_mr fails for me. > > When i added some debug prints i noticed the failure in file: > > uverbs_mem.c function: get_page_shift, > > find_vma returned NULL. > > get_page_shift() doesn't appear in the upstream kernel, so this is > some patch from OFED breaking things I guess. Does the test work with > an unpatched kernel? This is Eli's hpages.patch. But that should only be applied if enabled explicitly by install option. How come you get it? Vlad? -- MST From mst at dev.mellanox.co.il Thu Apr 12 10:16:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 20:16:32 +0300 Subject: [ofa-general] Re: multicast join failed for... 
In-Reply-To: <20070412084623.74b035d9.weiny2@llnl.gov> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> Message-ID: <20070412171632.GU24730@mellanox.co.il> > Quoting Ira Weiny : > Subject: Re: [ofa-general] Re: multicast join failed for... > > On Thu, 12 Apr 2007 07:21:55 +0300 > "Michael S. Tsirkin" wrote: > > > > Quoting Ira Weiny : > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > On 11 Apr 2007 17:45:54 -0400 > > > Hal Rosenstock wrote: > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > - previously we had some client failing join > > > > > which is worse. > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > rather than degrade the group due to some link issue). > > > > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > > than to limit the speed of the entire cluster. > > > > Why are you joining these nodes then? > > Anyway, could always be an option. > > > > We have seen a specific example where a nodes 4X link comes up at 1X. I think that the way to do it, is to make it possible to force endnode link to a specific rate. You can already do this with a simple script from userspace, by testing the link rate once it comes up, and downing the link if it's lower than what you want. If you think it's important, it's also quite trivial to make it possible to disable 1x support through sysfs interface. This way, the link will come up as 4x or not come up at all. Would that be useful? 
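A minimal sketch of the userspace check described above, assuming the port's sysfs rate file (for example /sys/class/infiniband/<device>/ports/<port>/rate) holds a line such as "10 Gb/sec (4X)". The helper names and the parsing are illustrative assumptions, not an existing tool. What to do when the width is too low (for instance taking the link down) is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Extract the link width from a rate line such as "10 Gb/sec (4X)".
 * Returns the width (1, 4, 12, ...) or -1 if the line does not match
 * the assumed format. */
static int parse_link_width(const char *rate_line)
{
	const char *paren = strchr(rate_line, '(');
	int width;
	char x;

	if (!paren || sscanf(paren, "(%d%c", &width, &x) != 2 || x != 'X')
		return -1;
	return width;
}

/* Read the rate file at path and return 1 if the link width is at
 * least min_width, 0 if it is lower, -1 on a read or parse error. */
static int link_width_ok(const char *path, int min_width)
{
	char line[128];
	FILE *f = fopen(path, "r");
	int width;

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	width = parse_link_width(line);
	if (width < 0)
		return -1;
	return width >= min_width;
}
```

For example, link_width_ok("/sys/class/infiniband/mthca0/ports/1/rate", 4) would return 0 for a 4X port that trained at 1X (mthca0 is only a placeholder device name), at which point a script could take the link down as suggested.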
> In this
> case we would want the join to fail. Basically a single hardware error,
> isolated to 1 node, should not affect the other 1150 nodes,

As far as I know, there are *a lot* of cases where a problem at 1 node will affect others on the same subnet. Do I have to give examples?
I don't see why we have to choose a specific instance (incorrect link rate at endnode) and handle it differently.

> which could very well be running a user's job.

The job will continue running though, and when you diagnose the problem and disconnect the bad node, rate will be back to high.
So what's the problem?

>
> Certainly if there is a heterogeneous network we would want different behavior
> but we don't operate any of our clusters like that. After reading today's posts
> I think it should be an option.

Yes. I think the option belongs at the endnodes, as outlined above.

> If someone has a mixture they can configure
> it. I am not sure what the default should be though. I know we would want
> the join to fail, but I understand the argument to allow it to work.

This likely means that you have a sideband interconnect infrastructure beside IPoIB. Otherwise, if the join fails, you don't even have a way to debug the problem.

--
MST

From adit.262 at gmail.com Thu Apr 12 10:21:39 2007
From: adit.262 at gmail.com (Adit Ranadive)
Date: Thu, 12 Apr 2007 13:21:39 -0400
Subject: [ofa-general] Loading ib_gmthca module into guest domain
Message-ID:

Hi,

I was partly successful in inserting the ib_gmthca module into the guest domain.
This is the debug output I get while inserting the module : modprobe ib_gmthca mc=1 domain=2 host ip 192.168.0.3 domain 2 [drivers/infiniband/hw/gmthca/../../utils/kernel_socket.c:62],<1>Connection ok [drivers/infiniband/hw/gmthca/gmthca_main.c:208],<1>ACK received, now setup data channel [drivers/infiniband/hw/gmthca/gmthca_main.c:101],<1>Get serv_dom, 0 [drivers/infiniband/hw/gmthca/gmthca_main.c:117],<1>[finish listen] dom 2, local port: 8, grant ref: 566 [drivers/infiniband/hw/gmthca/gmthca_main.c:142],<1>[finish listen] dom 2, local port: 9, grant ref: 561 [drivers/infiniband/hw/gmthca/gmthca_main.c:151],<1>Now wait for ack [drivers/infiniband/hw/gmthca/gmthca_main.c:157],<1>Finish setting up the channels<1>[drivers/infiniband/hw/gmthca/../../utils/kernel_socket.c:105],<1>Now remove socket [drivers/infiniband/hw/gmthca/gmthca_main.c:217],<1>Receving resp bytes 16 [drivers/infiniband/hw/gmthca/gmthca_main.c:230],<1>content of resp: ack 0, tablesize 65536, uarsize 0, pfn 000d7802 [drivers/infiniband/hw/gmthca/gmthca_main.c:324],<1>num ports: 2, MAX QP 65536, MAX CQ 65536, max sge 59 fail to recv ret value (ret 0) or ret value error (24 but receive -16) Fail to execute command It is able to get the device limits (gmthca_get_limits function) but it fails after that. I was unable to debug where exactly it was failing. Before loading the module in guest domain I load the kibad, ib_mthca modules in dom0. Any steps I have missed in order to correctly load the module in guest? Any pointers here will be helpful. Thanks, Adit -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From rdreier at cisco.com Thu Apr 12 10:30:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 10:30:48 -0700 Subject: [ofa-general] Re: [PATCH 2.6.21] iw_cxgb3 - Add set_tcb_rpl_handler. 
In-Reply-To: (Roland Dreier's message of "Thu, 12 Apr 2007 09:07:20 -0700") References: <1176382594.9396.1.camel@stevo-desktop> <461E45C3.2050400@chelsio.com> <1176392566.9396.11.camel@stevo-desktop> Message-ID: Never mind, I see that "cxgb3 - missing CPL hanler and register setting." has appeared in Linus's tree. Steve, I'll ask Linus to pull this fix today. Jeff, never mind my question since it's too late now. - R. From rdreier at cisco.com Thu Apr 12 10:36:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 10:36:48 -0700 Subject: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums rather than magic return codes In-Reply-To: <1175527446.4436.16721.camel@localhost.localdomain> (Hal Rosenstock's message of "02 Apr 2007 11:24:07 -0400") References: <1175527446.4436.16721.camel@localhost.localdomain> Message-ID: Definitely a big improvement to readability. However, I don't like the "smi_type" name, since the enum is not really a type but rather an action: > +enum smi_type { > + IB_SMI_DISCARD, > + IB_SMI_HANDLE > +}; > + > +enum smi_forward_type { > + IB_SMI_LOCAL, /* SMP should be completed up the stack */ > + IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ > +}; Is it OK if I do s/smi_type/smi_action/ and s/smi_forward_type/smi_forward_action/ before applying this? - R. From rdreier at cisco.com Thu Apr 12 10:52:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 10:52:08 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will fix up some problems with drivers/infiniband/hw/cxgb3 caused by a change to drivers/net/cxgb3 that you just pulled. 
Steve Wise (1): RDMA/cxgb3: Add set_tcb_rpl_handler drivers/infiniband/hw/cxgb3/iwch_cm.c | 12 ++++++++++++ 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index d0ed1d3..2d2de9b 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -2026,6 +2026,17 @@ static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) return 0; } +static int set_tcb_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_set_tcb_rpl *rpl = cplhdr(skb); + + if (rpl->status != CPL_ERR_NONE) { + printk(KERN_ERR MOD "Unexpected SET_TCB_RPL status %u " + "for tid %u\n", rpl->status, GET_TID(rpl)); + } + return CPL_RET_BUF_DONE; +} + int __init iwch_cm_init(void) { skb_queue_head_init(&rxq); @@ -2053,6 +2064,7 @@ int __init iwch_cm_init(void) t3c_handlers[CPL_ABORT_REQ_RSS] = sched; t3c_handlers[CPL_RDMA_TERMINATE] = sched; t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + t3c_handlers[CPL_SET_TCB_RPL] = set_tcb_rpl; /* * These are the real handlers that are called from a From mst at dev.mellanox.co.il Thu Apr 12 10:56:35 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Apr 2007 20:56:35 +0300 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: References: <20070412151025.GP24730@mellanox.co.il> Message-ID: <20070412175635.GY24730@mellanox.co.il> > Quoting Shirley Ma : > Subject: Re: [PATCH] IB/mthca: work around kernel QP starvation > > Hello Michael, > > We saw the same problem. Is a userspace patch needed? > > Thanks > Shirley Ma No, we are protecting kernel QPs from being starved by userspace, and we can't trust userspace to do this. -- MST From mst at dev.mellanox.co.il Thu Apr 12 10:57:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 12 Apr 2007 20:57:34 +0300 Subject: [ofa-general] Re: Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> <461E56C1.4010500@mellanox.co.il> Message-ID: <20070412175734.GZ24730@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Re: [PATCH] IB/mthca: work around kernel QP starvation > > > We test it here with all our HCAs (results are good). > > In any case we will put it into OFED 1.2 > > Seems like a good plan. I have no real objection to this patch, but > the 2.6.21 kernel tree is at the stage where we really only want to > merge very urgent fixes. This is something that can go into 2.6.21.x > later if it proves to be a good idea. Fair enough. We already gave it testing on a large cluster, let's give it more testing from OFED, put in 2.6.21.x. -- MST From swise at opengridcomputing.com Thu Apr 12 10:59:44 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 12 Apr 2007 12:59:44 -0500 Subject: [ofa-general] [GIT PULL] ofed_1_2 - Chelsio bug fixes Message-ID: <1176400784.17665.2.camel@stevo-desktop> Vlad, Please pull these Chelsio cxgb3 and iw_cxgb3 fixes from git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 Short log --------- Divy Le Ray: Fix a deadlock when the interface s configured down and The MAC watchdog was failing if the peer interface was brought down. Remove specific CPL handler. Steve Wise: Initialize cpu_idx field in cpl_close_listserv_req message. Backport rtnl_trylock() for Chelsio Driver. Add set_tcb_rpl_handler. Log --- commit f0aa52b40e1da13b06c8ed93f24cf55a905e906d Author: Steve Wise Date: Wed Apr 11 14:44:45 2007 -0500 Add set_tcb_rpl_handler. The Ethernet Driver no longer handles SET_TCB replies. 
Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 8c82226..36ab39e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -2028,6 +2028,17 @@ static int sched(struct t3cdev *tdev, st return 0; } +static int set_tcb_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_set_tcb_rpl *rpl = cplhdr(skb); + + if (rpl->status != CPL_ERR_NONE) { + printk(KERN_ERR MOD "Unexpected SET_TCB_RPL status %u " + "for tid %u\n", rpl->status, GET_TID(rpl)); + } + return CPL_RET_BUF_DONE; +} + int __init iwch_cm_init(void) { skb_queue_head_init(&rxq); @@ -2055,6 +2066,7 @@ int __init iwch_cm_init(void) t3c_handlers[CPL_ABORT_REQ_RSS] = sched; t3c_handlers[CPL_RDMA_TERMINATE] = sched; t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + t3c_handlers[CPL_SET_TCB_RPL] = set_tcb_rpl; /* * These are the real handlers that are called from a commit ce30b4b75ac210fb5b6857897e10112bbff194e1 Author: Divy Le Ray Date: Wed Apr 11 14:44:45 2007 -0500 Remove specific CPL handler. Add missing CPL handler. Add missing register setting when the interface is brought up. 
Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 859cfe3..475c428 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -766,6 +766,8 @@ static int cxgb_up(struct adapter *adap) if (err) goto out; + t3_write_reg(adap, A_ULPRX_TDDP_PSZ, V_HPZ0(PAGE_SHIFT - 12)); + err = setup_sge_qsets(adap); if (err) goto out; diff --git a/drivers/net/cxgb3/cxgb3_offload.c b/drivers/net/cxgb3/cxgb3_offload.c index b56e679..3353171 100644 --- a/drivers/net/cxgb3/cxgb3_offload.c +++ b/drivers/net/cxgb3/cxgb3_offload.c @@ -741,17 +741,6 @@ static int do_act_establish(struct t3cde } } -static int do_set_tcb_rpl(struct t3cdev *dev, struct sk_buff *skb) -{ - struct cpl_set_tcb_rpl *rpl = cplhdr(skb); - - if (rpl->status != CPL_ERR_NONE) - printk(KERN_ERR - "Unexpected SET_TCB_RPL status %u for tid %u\n", - rpl->status, GET_TID(rpl)); - return CPL_RET_BUF_DONE; -} - static int do_trace(struct t3cdev *dev, struct sk_buff *skb) { struct cpl_trace_pkt *p = cplhdr(skb); @@ -1213,7 +1202,8 @@ void __init cxgb3_offload_init(void) t3_register_cpl_handler(CPL_CLOSE_CON_RPL, do_hwtid_rpl); t3_register_cpl_handler(CPL_ABORT_REQ_RSS, do_abort_req_rss); t3_register_cpl_handler(CPL_ACT_ESTABLISH, do_act_establish); - t3_register_cpl_handler(CPL_SET_TCB_RPL, do_set_tcb_rpl); + t3_register_cpl_handler(CPL_SET_TCB_RPL, do_hwtid_rpl); + t3_register_cpl_handler(CPL_GET_TCB_RPL, do_hwtid_rpl); t3_register_cpl_handler(CPL_RDMA_TERMINATE, do_term); t3_register_cpl_handler(CPL_RDMA_EC_STATUS, do_hwtid_rpl); t3_register_cpl_handler(CPL_TRACE_PKT, do_trace); diff --git a/drivers/net/cxgb3/regs.h b/drivers/net/cxgb3/regs.h index f8be41c..e5a5534 100644 --- a/drivers/net/cxgb3/regs.h +++ b/drivers/net/cxgb3/regs.h @@ -1234,9 +1234,15 @@ #define A_ULPRX_ISCSI_ULIMIT 0x510 #define A_ULPRX_ISCSI_TAGMASK 0x514 +#define S_HPZ0 0 +#define M_HPZ0 0xf +#define V_HPZ0(x) ((x) << S_HPZ0) +#define G_HPZ0(x) (((x) >> S_HPZ0) & M_HPZ0) + 
#define A_ULPRX_TDDP_LLIMIT 0x51c #define A_ULPRX_TDDP_ULIMIT 0x520 +#define A_ULPRX_TDDP_PSZ 0x528 #define A_ULPRX_STAG_LLIMIT 0x52c commit 5a20a3c872ba59675db2bf895806f794506b5692 Author: Divy Le Ray Date: Wed Apr 11 14:44:44 2007 -0500 The MAC watchdog was failing if the peer interface was brought down. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h index 97128d8..8d13796 100644 --- a/drivers/net/cxgb3/common.h +++ b/drivers/net/cxgb3/common.h @@ -478,8 +478,11 @@ struct cmac { struct adapter *adapter; unsigned int offset; unsigned int nucast; /* # of address filters for unicast MACs */ - unsigned int tcnt; - unsigned int xcnt; + unsigned int tx_tcnt; + unsigned int tx_xcnt; + u64 tx_mcnt; + unsigned int rx_xcnt; + u64 rx_mcnt; unsigned int toggle_cnt; unsigned int txen; struct mac_stats stats; diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 7358016..859cfe3 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -196,15 +196,13 @@ void t3_os_link_changed(struct adapter * if (link_stat != netif_carrier_ok(dev)) { if (link_stat) { - t3_set_reg_field(adapter, - A_XGM_TXFIFO_CFG + mac->offset, - F_ENDROPPKT, 0); + t3_mac_enable(mac, MAC_DIRECTION_RX); netif_carrier_on(dev); } else { netif_carrier_off(dev); - t3_set_reg_field(adapter, - A_XGM_TXFIFO_CFG + mac->offset, - F_ENDROPPKT, F_ENDROPPKT); + pi->phy.ops->power_down(&pi->phy, 1); + t3_mac_disable(mac, MAC_DIRECTION_RX); + t3_link_start(&pi->phy, mac, &pi->link_config); } link_report(dev); diff --git a/drivers/net/cxgb3/xgmac.c b/drivers/net/cxgb3/xgmac.c index 94aaff0..a506792 100644 --- a/drivers/net/cxgb3/xgmac.c +++ b/drivers/net/cxgb3/xgmac.c @@ -367,7 +367,8 @@ int t3_mac_enable(struct cmac *mac, int int idx = macidx(mac); struct adapter *adap = mac->adapter; unsigned int oft = mac->offset; - + struct mac_stats *s = &mac->stats; + if (which & MAC_DIRECTION_TX) { t3_write_reg(adap, A_XGM_TX_CTRL 
+ oft, F_TXEN); t3_write_reg(adap, A_TP_PIO_ADDR, A_TP_TX_DROP_CFG_CH0 + idx); @@ -376,10 +377,16 @@ int t3_mac_enable(struct cmac *mac, int t3_set_reg_field(adap, A_TP_PIO_DATA, 1 << idx, 1 << idx); t3_write_reg(adap, A_TP_PIO_ADDR, A_TP_TX_DROP_CNT_CH0 + idx); - mac->tcnt = (G_TXDROPCNTCH0RCVD(t3_read_reg(adap, - A_TP_PIO_DATA))); - mac->xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, - A_XGM_TX_SPI4_SOP_EOP_CNT))); + mac->tx_mcnt = s->tx_frames; + mac->tx_tcnt = (G_TXDROPCNTCH0RCVD(t3_read_reg(adap, + A_TP_PIO_DATA))); + mac->tx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, + A_XGM_TX_SPI4_SOP_EOP_CNT + + oft))); + mac->rx_mcnt = s->rx_frames; + mac->rx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, + A_XGM_RX_SPI4_SOP_EOP_CNT + + oft))); mac->txen = F_TXEN; mac->toggle_cnt = 0; } @@ -392,6 +399,7 @@ int t3_mac_disable(struct cmac *mac, int { int idx = macidx(mac); struct adapter *adap = mac->adapter; + int val; if (which & MAC_DIRECTION_TX) { t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, 0); @@ -401,44 +409,89 @@ int t3_mac_disable(struct cmac *mac, int t3_set_reg_field(adap, A_TP_PIO_DATA, 1 << idx, 1 << idx); mac->txen = 0; } - if (which & MAC_DIRECTION_RX) + if (which & MAC_DIRECTION_RX) { + t3_set_reg_field(mac->adapter, A_XGM_RESET_CTRL + mac->offset, + F_PCS_RESET_, 0); + msleep(100); t3_write_reg(adap, A_XGM_RX_CTRL + mac->offset, 0); + val = F_MAC_RESET_; + if (is_10G(adap)) + val |= F_PCS_RESET_; + else if (uses_xaui(adap)) + val |= F_PCS_RESET_ | F_XG2G_RESET_; + else + val |= F_RGMII_RESET_ | F_XG2G_RESET_; + t3_write_reg(mac->adapter, A_XGM_RESET_CTRL + mac->offset, val); + } return 0; } int t3b2_mac_watchdog_task(struct cmac *mac) { struct adapter *adap = mac->adapter; - unsigned int tcnt, xcnt; + struct mac_stats *s = &mac->stats; + unsigned int tx_tcnt, tx_xcnt; + unsigned int tx_mcnt = s->tx_frames; + unsigned int rx_mcnt = s->rx_frames; + unsigned int rx_xcnt; int status; - t3_write_reg(adap, A_TP_PIO_ADDR, A_TP_TX_DROP_CNT_CH0 + macidx(mac)); - tcnt = 
(G_TXDROPCNTCH0RCVD(t3_read_reg(adap, A_TP_PIO_DATA))); - xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, - A_XGM_TX_SPI4_SOP_EOP_CNT + - mac->offset))); - - if (tcnt != mac->tcnt && xcnt == 0 && mac->xcnt == 0) { - if (mac->toggle_cnt > 4) { - t3b2_mac_reset(mac); + if (tx_mcnt == mac->tx_mcnt) { + tx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, + A_XGM_TX_SPI4_SOP_EOP_CNT + + mac->offset))); + if (tx_xcnt == 0) { + t3_write_reg(adap, A_TP_PIO_ADDR, + A_TP_TX_DROP_CNT_CH0 + macidx(mac)); + tx_tcnt = (G_TXDROPCNTCH0RCVD(t3_read_reg(adap, + A_TP_PIO_DATA))); + } else { mac->toggle_cnt = 0; + return 0; + } + } else { + mac->toggle_cnt = 0; + return 0; + } + + if (((tx_tcnt != mac->tx_tcnt) && + (tx_xcnt == 0) && (mac->tx_xcnt == 0)) || + ((mac->tx_mcnt == tx_mcnt) && + (tx_xcnt != 0) && (mac->tx_xcnt != 0))) { + if (mac->toggle_cnt > 4) status = 2; - } else { - t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, 0); - t3_read_reg(adap, A_XGM_TX_CTRL + mac->offset); - t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, - mac->txen); - t3_read_reg(adap, A_XGM_TX_CTRL + mac->offset); - mac->toggle_cnt++; + else status = 1; - } } else { mac->toggle_cnt = 0; - status = 0; + return 0; } - mac->tcnt = tcnt; - mac->xcnt = xcnt; + if (rx_mcnt != mac->rx_mcnt) + rx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap, + A_XGM_RX_SPI4_SOP_EOP_CNT + + mac->offset))); + else + return 0; + + if (mac->rx_mcnt != s->rx_frames && rx_xcnt == 0 && mac->rx_xcnt == 0) + status = 2; + + mac->tx_tcnt = tx_tcnt; + mac->tx_xcnt = tx_xcnt; + mac->tx_mcnt = s->tx_frames; + mac->rx_xcnt = rx_xcnt; + mac->rx_mcnt = s->rx_frames; + if (status == 1) { + t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, 0); + t3_read_reg(adap, A_XGM_TX_CTRL + mac->offset); /* flush */ + t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, mac->txen); + t3_read_reg(adap, A_XGM_TX_CTRL + mac->offset); /* flush */ + mac->toggle_cnt++; + } else if (status == 2) { + t3b2_mac_reset(mac); + mac->toggle_cnt = 0; + } return status; } commit 
567e6ee8d3c62d1f505dc0a2d92d12859e8c68e1 Author: Divy Le Ray Date: Wed Apr 11 14:44:44 2007 -0500 Fix a deadlock when the interface s configured down and the watchdog tack is sleeping on rtnl_lock. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 81262e5..7358016 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -2114,7 +2114,9 @@ static void check_t3b2_mac(struct adapte { int i; - rtnl_lock(); /* synchronize with ifdown */ + if (!rtnl_trylock()) /* synchronize with ifdown */ + return; + for_each_port(adapter, i) { struct net_device *dev = adapter->port[i]; struct port_info *p = netdev_priv(dev); commit 4a117060a593ecc3f3c6e321437362f2882b0159 Author: Steve Wise Date: Wed Apr 11 14:44:41 2007 -0500 Backport rtnl_trylock() for Chelsio Driver. Signed-off-by: Steve Wise diff --git a/kernel_addons/backport/2.6.11/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.11/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.11/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.11_FC4/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.11_FC4/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.11_FC4/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.12/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.12/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ 
b/kernel_addons/backport/2.6.12/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.13/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.13/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.13/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.13_suse10_0_u/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.13_suse10_0_u/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.13_suse10_0_u/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.14/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.14/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.14/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.15/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.15/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.15/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return 
!rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.15_ubuntu606/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.15_ubuntu606/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.15_ubuntu606/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.16/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.16/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.16/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.16_sles10/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.16_sles10/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.16_sles10/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/rtnetlink.h 
b/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.9_U2/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.9_U2/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U2/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.9_U3/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/rtnetlink.h b/kernel_addons/backport/2.6.9_U4/include/linux/rtnetlink.h new file mode 100644 index 0000000..36344d7 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/rtnetlink.h @@ -0,0 +1,10 @@ +#ifndef BACKPORT_RTNETLINK_2_6_16 +#define BACKPORT_RTNETLINK_2_6_16 +#include_next + +static inline int rtnl_trylock(void) +{ + return !rtnl_shlock_nowait(); +} + +#endif commit cd78b9f2815ca4e8aa99f1eb4d198f5aa5f8fd37 Author: Steve Wise Date: Wed Apr 11 11:44:46 2007 -0500 Initialize cpu_idx field in cpl_close_listserv_req message. 
Signed-off-by: Steve Wise

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index ac91a96..8c82226 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1188,6 +1188,7 @@ static int listen_stop(struct iwch_liste
 	}
 	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
 	ep->com.tdev->send(ep->com.tdev, skb);

From halr at voltaire.com Thu Apr 12 11:16:18 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2007 14:16:18 -0400
Subject: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums rather than magic return codes
In-Reply-To:
References: <1175527446.4436.16721.camel@localhost.localdomain>
Message-ID: <1176401777.4545.119096.camel@hal.voltaire.com>

On Thu, 2007-04-12 at 13:36, Roland Dreier wrote:
> Definitely a big improvement to readability. However, I don't like
> the "smi_type" name, since the enum is not really a type but rather an
> action:
>
> > +enum smi_type {
> > +	IB_SMI_DISCARD,
> > +	IB_SMI_HANDLE
> > +};
> > +
> > +enum smi_forward_type {
> > +	IB_SMI_LOCAL,	/* SMP should be completed up the stack */
> > +	IB_SMI_SEND,	/* received DR SMP should be forwarded to the send queue */
> > +};
>
> Is it OK if I do s/smi_type/smi_action/ and s/smi_forward_type/smi_forward_action/
> before applying this?

Sure; that's an improvement.

My one other comment on the patch was about testing relative to iPath and perhaps eHCA. I think things should work, but it would be best if someone verified this.

-- Hal

> - R.

From rdreier at cisco.com Thu Apr 12 11:32:45 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Apr 2007 11:32:45 -0700
Subject: [ofa-general] Re: [Bug 506] IPoIB IPv4 multicast throughput is poor
In-Reply-To: <20070412041653.GE24730@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 12 Apr 2007 07:16:53 +0300")
References: <20070411223924.879E9E60390@openfabrics.org> <20070412041653.GE24730@mellanox.co.il>
Message-ID:

> BTW, Roland, why aren't we using txqueuelen ifconfig/ethtool options here?

The ifconfig option is about the TX queue outside the driver's hardware queue. Not sure what ethtool is setting.

I think the main reasons why we're not using ethtool are:
 - the patches I got didn't do it
 - it would require fairly invasive changes to do it, since we would
   have to destroy/recreate an interface's QP to change the queue lengths

 - R.

From xma at us.ibm.com Thu Apr 12 12:41:15 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 12 Apr 2007 12:41:15 -0700
Subject: [ofa-general] Re: Re: [PATCH] IB/mthca: work around kernel QP starvation
In-Reply-To: <20070412175734.GZ24730@mellanox.co.il>
Message-ID:

Hello Michael,

Could you please create a patch against OFED-1.1? And in the future, what is the process for applying this kind of patch to previous OFED releases?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From kaf at sgi.com Thu Apr 12 12:56:53 2007
From: kaf at sgi.com (Karl Feind)
Date: Thu, 12 Apr 2007 14:56:53 -0500
Subject: [ofa-general] on the coexistance of uDAPLs
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com>
References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com>
Message-ID: <20070412195653.GA20252@sgi.com>

Hello James,

We are trying to find a way for the OpenIB-cma uDAPL layer to coexist with SGI's xpmem uDAPL on a single system.

Obviously, the installation scriptlets for xpmem uDAPL need to add lines into /etc/dat.conf when xpmem uDAPL is installed.
Since a static version of /etc/dat.conf is simply installed when the OpenIB-cma uDAPL layer is installed (via the "dapl" RPM), we are left in the awkward position that later upgrades of OpenIB-cma uDAPL will overwrite the /etc/dat.conf file, removing the other registered uDAPL layers. Clearly, we need to agree on a conventional way that a uDAPL layer can register itself in /etc/dat.conf when it gets installed and unregister itself when it gets uninstalled. Furthermore, upgrading one uDAPL should not have adverse effects on other uDAPLs. I don't see how this can be done with the current RPM structure. Thanks for any guidance. Karl Feind SGI MPI and DAPL Engineering Team From rick.jones2 at hp.com Thu Apr 12 13:58:15 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 12 Apr 2007 13:58:15 -0700 Subject: [ofa-general] desired netperf mods? Message-ID: <461E9D67.3050302@hp.com> Folks - I've only just started doing nefarious things like netperf over IPoIB and was wondering if "things" (for some definition of "things" in the context of Infiniband/whatnot) were stable-enough that some additional netperf tests would be warranted? Some web searching suggests there may be AF_SDP (or perhaps AF_INET_SDP) explicit SDP support in places - would a nettest_sdp.c test suite be indicated? How about uDAPL? Taps with a clue-bat most welcome - please be gentle :) happy benchmarking, rick jones mr netperf From pradeep at us.ibm.com Thu Apr 12 13:58:36 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 12 Apr 2007 13:58:36 -0700 Subject: [ofa-general] mthca issues -need help Message-ID: I am running into a number of mthca issues listed below and need help with them. 1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe ib_mthca (on ppc64) Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: HCA FW version 3.3.3 is old (3.4.0 is current). Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: If you have problems, try updating your HCA FW. 
Apr 12 14:11:19 elm3b37 kernel: Faulting instruction address: 0xd0000000002db0d8 Apr 12 14:11:19 elm3b37 kernel: Oops: Kernel access of bad area, sig: 11 [#2] Apr 12 14:11:19 elm3b37 kernel: SMP NR_CPUS=128 NUMA Apr 12 14:11:19 elm3b37 kernel: Modules linked in: ib_mthca ib_mad ib_ehca ib_core autofs4 ipv6 binfmt_misc parport_pc lp parport sg e1000 dm_snapshot dm_zero dm_mirror dm_mod ipr libata sd_mod scsi_mod firmware_class ehci_hcd ohci_hcd usbcore Apr 12 14:11:19 elm3b37 kernel: NIP: D0000000002DB0D8 LR: D0000000002DAE0C CTR: 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: REGS: c0000000e2116f60 TRAP: 0300 Not tainted (2.6.21-rc5) Apr 12 14:11:19 elm3b37 kernel: MSR: 8000000000009032 CR: 24024444 XER: 00000008 Apr 12 14:11:19 elm3b37 kernel: DAR: 0000000000002000, DSISR: 0000000042000000 Apr 12 14:11:19 elm3b37 kernel: TASK = c0000000e7de4040[3884] 'modprobe' THREAD: c0000000e2114000 CPU: 0 Apr 12 14:11:19 elm3b37 kernel: GPR00: 0000000040010001 C0000000E21171E0 D000000000308B30 0000000007FFFFFF Apr 12 14:11:19 elm3b37 kernel: GPR04: C0000000E595FE00 0000000000000000 C0000000E2438000 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: GPR08: 0000000000000000 0000000000000400 0000000000002000 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR12: D0000000002EAD28 C000000000535A80 AAAAAAAAAAAAAAAB D0000000005A0C10 Apr 12 14:11:19 elm3b37 kernel: GPR16: 0000000000000000 0000000000000312 0000000000000312 000000000000003F Apr 12 14:11:19 elm3b37 kernel: GPR20: C0000000E595FE20 C0000000E4F04000 C0000000E595FE00 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR24: C0000000E4FAF000 0000000007FFFFFF 0000000000000000 0000000000002000 Apr 12 14:11:19 elm3b37 kernel: GPR28: C0000000E2438000 0000000000000400 D0000000003075B0 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: NIP [D0000000002DB0D8] .mthca_write_mtt+0x328/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: LR [D0000000002DAE0C] .mthca_write_mtt+0x5c/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: Call Trace: 
Apr 12 14:11:19 elm3b37 kernel: [C0000000E21171E0] [C0000000E2117300] 0xc0000000e2117300 (unreliable) Apr 12 14:11:19 elm3b37 kernel: [C0000000E21172D0] [D0000000002DBD1C] .mthca_mr_alloc_phys+0x8c/0x140 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117390] [D0000000002D6B6C] .mthca_create_eq+0x3ac/0x5e0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117490] [D0000000002D7528] .mthca_init_eq_table+0x198/0x790 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117560] [D0000000002D0368] .__mthca_init_one+0xa38/0xd70 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117640] [D0000000002D0714] .mthca_init_one+0x74/0xf0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E21176E0] [C0000000002487D8] .pci_device_probe+0x168/0x200 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21177A0] [C0000000002C288C] .really_probe+0xbc/0x1f0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117850] [C0000000002C2D3C] .__driver_attach+0xfc/0x140 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21178E0] [C0000000002C1668] .bus_for_each_dev+0x88/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21179A0] [C0000000002C2628] .driver_attach+0x28/0x40 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117A20] [C0000000002C1C34] .bus_add_driver+0xc4/0x220 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117AC0] [C0000000002C3118] .driver_register+0x78/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117B40] [C000000000248B70] .__pci_register_driver+0x90/0x120 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117BE0] [D0000000002EA050] .mthca_init+0x100/0x170 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117C70] [C0000000000848FC] .sys_init_module+0x20c/0x1990 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117E30] [C00000000000862C] syscall_exit+0x0/0x40 Apr 12 14:11:19 elm3b37 kernel: Instruction dump: Apr 12 14:11:19 elm3b37 kernel: 7d290214 7d495a14 409d0038 393fffff 39600000 79290020 39290001 7d2903a6 Apr 12 14:11:19 elm3b37 kernel: 60000000 60000000 7c1c582a 60000001 <7c0a592a> 396b0008 4200fff0 7bfb1f24 
2. The above may or may not be a bug and as indicated in the message I wanted to upgrade (the FW). However, I found that the latest firmware is 3.5.0 and not 3.4.0 as the message seems to indicate. I wanted to use IPoIB CM, so which one should I upgrade to - presumably 3.5.0? 3. From the following url http://www.mellanox.com/support/firmware_table_IH.php it is not clear to me which firmware I should download. lspci -v shows me : 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technologies MT23108 InfiniHost So, I was planning on using fw-23108-3_5_000-MHET2X-1SC_A1.bin.zip - is that correct? 4. When I downloaded mft-1.0.1.tar I found that ppc64 is not supported. 5. I moved my HCA to x86_64 and then tried to install the mft utilities. There was a previous version of the tool and I asked to uninstall it. After that I see the following: /home/tools/mft-1.0.1 # ./install.sh *** Mellanox Firmware Tools (MFT) Package Installation *** MFT Build 20060118-1817 Copyright (C) June 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED. Use of software subject to the terms and conditions detailed in the file "LICENSE.txt". Found a previous installation of the MFT package. Current installed MFT Build ID is 20060118-1817 This installation MFT Build ID is 20060118-1817 Remove currently installed components (run /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y Running /usr/mellanox/mft/uninstall.sh ... Uninstall completed successfully. This installation installs the MFT components into /usr Installing MST package under /usr/mst ... MFT Depends on pre-installed MST. Fail to find /usr/mst/lib/libmtcr.a I could not find libmtcr.a anywhere. I need help with the issues listed above. Thanks! Pradeep pradeep at us.ibm.com From sweitzen at cisco.com Thu Apr 12 14:07:41 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 12 Apr 2007 14:07:41 -0700 Subject: [ofa-general] desired netperf mods?
In-Reply-To: <461E9D67.3050302@hp.com> References: <461E9D67.3050302@hp.com> Message-ID: Rick, SDP is easy to get with netperf by running "LD_PRELOAD=libsdp.so netperf/neterver", so in my opinion SDP is already covered. Here's a list of things I'd like to see from netperf, in priority order: 1) IP multicast 2) Test that uses multiple concurrent sockets at the same time 3) RDS I hadn't thought about netperf and uDAPL before, but certainly there would be value in adding uDAPL, MPI, and even filesystem benchmarks. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones > Sent: Thursday, April 12, 2007 1:58 PM > To: general at lists.openfabrics.org > Subject: [ofa-general] desired netperf mods? > > Folks - > > I've only just started doing nefarious things like netperf over IPoIB > and was wondering if "things" (for some definition of "things" in the > context of Infiniband/whatnot) were stable-enough that some > additional > netperf tests would be warranted? Some web searching > suggests there may > be AF_SDP (or perhaps AF_INET_SDP) explicit SDP support in places - > would a nettest_sdp.c test suite be indicated? How about uDAPL? > > Taps with a clue-bat most welcome - please be gentle :) > > happy benchmarking, > > rick jones > mr netperf > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rick.jones2 at hp.com Thu Apr 12 14:22:39 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 12 Apr 2007 14:22:39 -0700 Subject: [ofa-general] desired netperf mods? 
In-Reply-To: References: <461E9D67.3050302@hp.com> Message-ID: <461EA31F.60009@hp.com> Scott Weitzenkamp (sweitzen) wrote: > Rick, > > SDP is easy to get with netperf by running "LD_PRELOAD=libsdp.so > netperf/neterver", so in my opinion SDP is already covered. The one and IMO very big worry I have about using LD_PRELOAD is that it does not change the netperf test banner. So, it leaves one very vulnerable to cut-and-paste errors and not realizing that something was SDP when it really was. The advantage then (again IMO) of an explicit SDP test is that the test banners will be right. > > Here's a list of things I'd like to see from netperf, in priority order: > > 1) IP multicast > 2) Test that uses multiple concurrent sockets at the same time > 3) RDS The pt-pt nature of netperf (what I will often call netperf2) has often made me wonder if IP multicast in netperf2 made much sense - there would be one multicast sender and one multicast receiver. If there is indeed value there, sure, IP/UDP multicast then. Otherwise, I think that 1 and 2 are best served with netperf4. Netperf4 is where I am trying to address things not well suited to the pt-pt nature of netperf2 - multicast would be one of those, and multiple, concurrent sockets (with explicit synchronization of the measurment interval...) is another. Netperf4 of course needs quite a bit of work/help. FWIW it is also where I was able to get a "netperf" released under the GPL rather than the old "netperf license" from the HP Legal types ca 1993. Netperf4 current mainline: http://www.netperf.org/svn/netperf4/trunk has a current, pretty much known to function set of bits. 
Netperf4 branch: http://www.netperf.org/svn/branches/gobject_migration/ is where I'm trying to get a properly event driver netperf going to allow both interactive and batch benchmarking I sometimes joke (well,perhaps half joke) that netperf4 is the eierlegendwolmilchsau (sp) version of netperf, or I call it "Shimmer" (being both a harness and a set of benchmark suites) > I hadn't thought about netperf and uDAPL before, but certainly there > would be value in adding uDAPL, MPI, and even filesystem benchmarks. As it happens, netperf4 does have a disc suite which was added by someone wanting to measure both networking and mass storage simultaneously on "combo" cards. rick jones > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > >>-----Original Message----- >>From: general-bounces at lists.openfabrics.org >>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones >>Sent: Thursday, April 12, 2007 1:58 PM >>To: general at lists.openfabrics.org >>Subject: [ofa-general] desired netperf mods? >> >>Folks - >> >>I've only just started doing nefarious things like netperf over IPoIB >>and was wondering if "things" (for some definition of "things" in the >>context of Infiniband/whatnot) were stable-enough that some >>additional >>netperf tests would be warranted? Some web searching >>suggests there may >>be AF_SDP (or perhaps AF_INET_SDP) explicit SDP support in places - >>would a nettest_sdp.c test suite be indicated? How about uDAPL? 
>> >>Taps with a clue-bat most welcome - please be gentle :) >> >>happy benchmarking, >> >>rick jones >>mr netperf >>_______________________________________________ >>general mailing list >>general at lists.openfabrics.org >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general >> From rdreier at cisco.com Thu Apr 12 14:44:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Apr 2007 14:44:24 -0700 Subject: [ofa-general] [ANNOUNCE] libibverbs 1.1 released Message-ID: I just tagged the 1.1 release of libibverbs and pushed it out to my git tree on kernel.org: git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git (the name of the tag is libibverbs-1.1). I've also copied a tarball into my home directory on openfabrics.org, with sha1sum: 505a1ce7e3bbc2f0b3e7cdcb3cc24cc5603dbc86 libibverbs-1.1.tar.gz I would appreciated it if someone with access could move this into the right directory to appear in This is the first full release in a new major release cycle for libibverbs, so I've also updated the stable branch to point at the libibverbs 1.1 branch and renamed the 1.0 branch to stable-1.0. Full compatibility with earlier libibverbs 1.0 releases is not preserved. Low-level device drivers will need to be rebuilt to work with libibverbs 1.1. However, a versioned ABI is provided so that applications dynamically linked with libibverbs 1.0 should work. I don't know of any major source level incompatibilities that would prevent an application that compiles against libibverbs 1.0 from building and working with libibverbs 1.1. 
A git shortlog of the changes since libibverbs-1.1-rc2 is below: Roland Dreier (4): Add missing newline to rlimit(MEMLOCK) warning Fix ibv_srq_pingpong option handling Change a few references from OpenIB to OpenFabrics Roll libibverbs 1.1 release From weiny2 at llnl.gov Thu Apr 12 14:54:14 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 12 Apr 2007 14:54:14 -0700 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070412171632.GU24730@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> <20070412171632.GU24730@mellanox.co.il> Message-ID: <20070412145414.70105296.weiny2@llnl.gov> On Thu, 12 Apr 2007 20:16:32 +0300 "Michael S. Tsirkin" wrote: > > Quoting Ira Weiny : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On Thu, 12 Apr 2007 07:21:55 +0300 > > "Michael S. Tsirkin" wrote: > > > > > > Quoting Ira Weiny : > > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > > > On 11 Apr 2007 17:45:54 -0400 > > > > Hal Rosenstock wrote: > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > > > - previously we had some client failing join > > > > > > which is worse. > > > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > > > than to limit the speed of the entire cluster. > > > > > > Why are you joining these nodes then? > > > Anyway, could always be an option. 
> > > > > > > We have seen a specific example where a nodes 4X link comes up at 1X. > > I think that the way to do it, is to make it possible to force endnode link to > a specific rate. You can already do this with a simple script > from userspace, by testing the link rate once it comes up, > and downing the link if it's lower than what you want. > > If you think it's important, it's also quite trivial to > make it possible to disable 1x support through sysfs interface. > This way, the link will come up as 4x or not come up at all. > Would that be useful? Yes it would be useful. Is this something I can do right now with OFED 1.1? > > > > In this > > case we would want the join to fail. Basically a single hardware error, > > isolated to 1 node, should not affect the other 1150 nodes, > > As far as I know, there are *a lot* of reasons where a problem at > 1 node will affect others on the same subnet. Do I have to give examples? > I don't see why do we have to choose a specific instance (incorrect > link rate at endnode) and handle it differently. > > > which could very well be running a users job. > > The job will continue running though, and when you diagnose the problem > and disconnect the bad node, rate will be back to high. > So what's the problem? Performance impact between the time it happens and diagnosing the problem. Yes, disabling the node is a better solution, however, the current behavior is not bad for us. > > > > > Certainly if there is a heterogeneous network we would want different behavior > > but we don't operate any of our clusters like that. After reading todays posts > > I think it should be an option. > > Yes. I think the option belongs at the endnodes, as outlined above. Yes that would be a good solution as well. > > > If someone has a mixture they can configure > > it. I am not sure what the default should be though. I know we would want > > the join to fail, but I understand the argument to allow it to work. 
> > This likely means that you have a sideband interconnect infrastructure > beside IPoIB. Otherwise, if the join fails, you don't even have a > way to debug the problem. > Yes we do have this. Like I said I could see where this would be beneficial to some users. Ira From sweitzen at cisco.com Thu Apr 12 15:22:39 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 12 Apr 2007 15:22:39 -0700 Subject: [ofa-general] RE: [ewg] questions about OFED 1.2 IPoIB bonding In-Reply-To: <461E1DDE.40804@voltaire.com> References: <461B4488.8070705@gmail.com> <461E1DDE.40804@voltaire.com> Message-ID: I was using default netperf params, throughput is stable now that I use -- -s 349520 -S 349520 -m 65536 to force socket buffer and message sizes. Scott > -----Original Message----- > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > Sent: Thursday, April 12, 2007 4:54 AM > To: Moni Shoua > Cc: Scott Weitzenkamp (sweitzen); Moni Levy; > ewg at lists.openfabrics.org; openib; Pnina Bruskin > Subject: Re: [ewg] questions about OFED 1.2 IPoIB bonding > > Moni Shoua wrote: > > Scott Weitzenkamp (sweitzen) wrote: > > >> 1) IPoIB bonding and IPoIB CM do seem to work together, but after > >> running ib-bond --bond-ip, I have to manually reconfigure > IPoIB CM (both > >> mode and mtu) again, then increase the bond0 mtu. It > would be nice if > >> ib-bond took care of this for me. > > > I haven't had a chance yet to test bonding with IPoIB-CM. > I'll look into it and try to > > fix what's needed. > > I have tried this (bonding ipoib devices whose mode is > connected and mtu > is 65520) and indeed the bond and slaves mtu becomes lower > but the mode > does not change. > > >> 5) I've seen some erratic throughput with netperf using bond0 (no > >> failover happening), have you seen this? For example: > > > Can you add please more details about the test environment? > OS, ARCH, HW, etc... > > can you provide the exact --netperf command line-- you were using. 
> > thanks, > > Or. > From or.gerlitz at gmail.com Thu Apr 12 21:08:02 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 13 Apr 2007 07:08:02 +0300 Subject: [ofa-general] Re: [ewg] questions about OFED 1.2 IPoIB bonding In-Reply-To: References: <461B4488.8070705@gmail.com> <461E1DDE.40804@voltaire.com> Message-ID: <15ddcffd0704122108x43845e73m84e74af3db1acbb8@mail.gmail.com> On 4/13/07, Scott Weitzenkamp (sweitzen) wrote: > > I was using default netperf params, throughput is stable now that I use > -- -s 349520 -S 349520 -m 65536 to force socket buffer and message sizes By default netperf params, do you mean the TCP_STREAM test without any test-specific params? Or. From mst at dev.mellanox.co.il Thu Apr 12 21:17:38 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 07:17:38 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070412145414.70105296.weiny2@llnl.gov> References: <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> <20070412171632.GU24730@mellanox.co.il> <20070412145414.70105296.weiny2@llnl.gov> Message-ID: <20070413041738.GD24730@mellanox.co.il> > Quoting Ira Weiny : > Subject: Re: [ofa-general] Re: multicast join failed for... > > On Thu, 12 Apr 2007 20:16:32 +0300 > "Michael S. Tsirkin" wrote: > > > > Quoting Ira Weiny : > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > On Thu, 12 Apr 2007 07:21:55 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > > Quoting Ira Weiny : > > > > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > > > > > > > > On 11 Apr 2007 17:45:54 -0400 > > > > > Hal Rosenstock wrote: > > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > > > > > - previously we had some client failing join > > > > > > > which is worse. > > > > > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > > > > than to limit the speed of the entire cluster. > > > > > > > > Why are you joining these nodes then? > > > > Anyway, could always be an option. > > > > > > > > > > We have seen a specific example where a nodes 4X link comes up at 1X. > > > > I think that the way to do it, is to make it possible to force endnode link to > > a specific rate. You can already do this with a simple script > > from userspace, by testing the link rate once it comes up, > > and downing the link if it's lower than what you want. > > > > If you think it's important, it's also quite trivial to > > make it possible to disable 1x support through sysfs interface. > > This way, the link will come up as 4x or not come up at all. > > Would that be useful? > > Yes it would be useful. OK, I'll work on a patch for OFED 1.2. > Is this something I can do right now with OFED 1.1? With OFED 1.1 (without patches) you can do what I wrote above: write a script that tests link width. Disable ipoib, or the device, if it is 1x. For example:
#!/bin/bash
# Wait until a port reports ACTIVE.
until grep -q ACTIVE /sys/class/infiniband/mthca0/ports/*/state; do sleep 1; done
# Unload the driver if port 1 came up at 1X width.
if grep -qi '(1x)' /sys/class/infiniband/mthca0/ports/1/rate; then
    rmmod ib_mthca
fi
> > > > > > > In this > > > case we would want the join to fail.
Basically a single hardware error, > > > isolated to 1 node, should not affect the other 1150 nodes, > > > > As far as I know, there are *a lot* of reasons where a problem at > > 1 node will affect others on the same subnet. Do I have to give examples? > > I don't see why do we have to choose a specific instance (incorrect > > link rate at endnode) and handle it differently. > > > > > which could very well be running a users job. > > > > The job will continue running though, and when you diagnose the problem > > and disconnect the bad node, rate will be back to high. > > So what's the problem? > > Performance impact between the time it happens and diagnosing the problem. > Yes, disabling the node is a better solution, however, the current behavior is > not bad for us. Hal, here we have a use case that I think shows that the right thing is by default to make joins succeed. Convinced? > > > > > > > > Certainly if there is a heterogeneous network we would want different behavior > > > but we don't operate any of our clusters like that. After reading todays posts > > > I think it should be an option. > > > > Yes. I think the option belongs at the endnodes, as outlined above. > > Yes that would be a good solution as well. > > > > > > If someone has a mixture they can configure > > > it. I am not sure what the default should be though. I know we would want > > > the join to fail, but I understand the argument to allow it to work. > > > > This likely means that you have a sideband interconnect infrastructure > > beside IPoIB. Otherwise, if the join fails, you don't even have a > > way to debug the problem. > > > > Yes we do have this. Like I said I could see where this would be beneficial to > some users. -- MST From mst at dev.mellanox.co.il Thu Apr 12 21:23:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 07:23:37 +0300 Subject: [ofa-general] Re: desired netperf mods? 
In-Reply-To: <461EA31F.60009@hp.com> References: <461E9D67.3050302@hp.com> <461EA31F.60009@hp.com> Message-ID: <20070413042337.GG24730@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: desired netperf mods? > > Scott Weitzenkamp (sweitzen) wrote: > >Rick, > > > >SDP is easy to get with netperf by running "LD_PRELOAD=libsdp.so > >netperf/neterver", so in my opinion SDP is already covered. > > The one and IMO very big worry I have about using LD_PRELOAD is that it > does not change the netperf test banner. So, it leaves one very > vulnerable to cut-and-paste errors and not realizing that something was > SDP when it really was. > > The advantage then (again IMO) of an explicit SDP test is that the test > banners will be right. True. Further, libsdp might be misconfigured on the system. A simple way would be to add a flag to netperf that forces a specific address family, and show it in the banner. -- MST From sean.hefty at intel.com Thu Apr 12 21:27:05 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Apr 2007 21:27:05 -0700 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413041738.GD24730@mellanox.co.il> Message-ID: <000001c77d83$fe1cda40$8afc070a@amr.corp.intel.com> >> > The job will continue running though, and when you diagnose the problem >> > and disconnect the bad node, rate will be back to high. >> > So what's the problem? What would bring the rate back up? Halting all multicast traffic across the subnet to handle a flaky node wanting to join some multicast group doesn't seem like a good solution. Plus it looks like we'd have to repeat this later to bring the rate back up.
- Sean From mst at dev.mellanox.co.il Thu Apr 12 21:41:29 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 07:41:29 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <000001c77d83$fe1cda40$8afc070a@amr.corp.intel.com> References: <20070413041738.GD24730@mellanox.co.il> <000001c77d83$fe1cda40$8afc070a@amr.corp.intel.com> Message-ID: <20070413044129.GH24730@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: multicast join failed for...
> > >> > The job will continue running though, and when you diagnose the problem > >> > and disconnect the bad node, rate will be back to high. > >> > So what's the problem? > > What would bring the rate back up? When the node is diagnosed and disconnected, SM will bring the rate back up. > Halting all multicast traffic across the subnet to handle a flaky node Not halting, that would be broken. We are slowing the traffic down to avoid congestion at this link. And you don't know it's "flaky" - it's just a heterogeneous network. Policy can be forced by SM option but I don't think we should assume homogeneous networks by default. > wanting > to join some multicast group doesn't seem like a good solution. As I said, there are tens of ways a bad node can hurt performance, and we don't/can't handle them. Why focus on ipoib? It's the only way to connect to node on some fabrics, it really must be up at all times. > Plus it looks > like we'd have to repeat this later to bring the rate back up. So? It should all be automatic. You see a problem in the network, diagnose it, replace the bad node, performance comes back up. That's the way to do it. -- MST From sean.hefty at intel.com Thu Apr 12 21:55:02 2007 From: sean.hefty at intel.com (Hefty, Sean) Date: Thu, 12 Apr 2007 21:55:02 -0700 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413044129.GH24730@mellanox.co.il> Message-ID: >When the node is diagnosed and disconnected, SM will bring the rate back up. But how? Doesn't it require re-registration of all multicast groups and clients registered for SA events? >As I said, there are tens of ways a bad node can hurt performance, >and we don't/can't handle them. Why focus on ipoib? It's >the only way to connect to node on some fabrics, it >really must be up at all times. But the solution is affecting all multicast traffic, not just that related to ipoib.
If you want all nodes to be able to join the ipoib multicast group, why not just create the group at the lower rate? ipoib multicast performance doesn't seem that critical. Whereas disrupting other multicast groups, which could actively be in use by MPI, may be. - Sean From mst at dev.mellanox.co.il Thu Apr 12 23:17:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 09:17:12 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: References: <20070413044129.GH24730@mellanox.co.il> Message-ID: <20070413061712.GI24730@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: multicast join failed for... > > >When the node is diagnosed and disconnected, SM will bring the rate > >back up. > > But how? Doesn't it require re-registration of all multicast groups and > clients registered for SA events? SA detects that rate can be increased and sends another reregister MAD. > >As I said, there are tens of ways a bad node can hurt performance, > >and we don't/can't handle them. Why focus on ipoib? It's > >the only way to connect to node on some fabrics, it > >really must be up at all times. > > But the solution is affecting all multicast traffic, not just that > related to ipoib. If you want all nodes to be able to join the ipoib > multicast group, why not just create the group at the lower rate? If the group is created at a lower rate, there would be no problem. But the default configuration should be "plug and play". > ipoib multicast performance doesn't seem that critical. This is a policy that can be made optional, but should not be forced on users by default. > Whereas disrupting > other multicast groups, which could actively be in use by MPI, may be. The disruption would be very minor - this would happen at most once when rate changes from DDR to SDR and once when it changes back.
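For anyone who wants to watch for the rate change discussed in this thread from an endnode, the width can be pulled out of the sysfs rate file with a small helper. This is only a rough sketch: the function name link_width is made up for illustration, and it assumes the rate file contains a string of the form "10 Gb/sec (4X)", which should be checked on the kernel in use.

```shell
#!/bin/bash
# link_width: print the link width (e.g. 4) found in a sysfs rate string
# such as "10 Gb/sec (4X)". Hypothetical helper; the string format is an
# assumption and is not guaranteed by anything in this thread.
link_width() {
    local w
    # Capture the digits that sit between "(" and "X" (case-insensitive).
    w=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)[Xx].*/\1/p')
    if [ -n "$w" ]; then
        printf '%s\n' "$w"
    else
        return 1    # no width found in the string
    fi
}

# Example use on a live system (device and port names are assumptions):
#   rate=$(cat /sys/class/infiniband/mthca0/ports/1/rate)
#   if [ "$(link_width "$rate")" = "1" ]; then echo "port trained at 1X"; fi
```

Keeping the parsing in a function makes it easy to try against sample strings before pointing it at a real port.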
-- MST From vlad at lists.openfabrics.org Fri Apr 13 02:36:20 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 13 Apr 2007 02:36:20 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070413-0200 daily build status Message-ID: <20070413093620.3A7A0E6082C@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed
on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From halr at voltaire.com Fri Apr 13 04:36:49 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:36:49 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: References: Message-ID: <1176464139.15573.57594.camel@hal.voltaire.com> On Fri, 2007-04-13 at 00:55, Hefty, Sean wrote: > >When the node is diagnosed and disconnected, SM will bring the rate > back up. > > But how? Doesn't it require re-registration of all multicast groups and > clients registered for SA events? > > >As I said, there are tens of ways a bad node can hurt performance, > >and we don't/can't handle them. Why focus on ipoib? It's > >the only way to connect to node on some fabrics, it > >really must be up at all times. > > But the solution is affecting all multicast traffic, not just that > related to ipoib. If you want all nodes to be able to join the ipoib > multicast group, why not just create the group at the lower rate? Exactly. 1x SDR could be the admin choice. That was not chosen as the default so as not to mask performance issues. > ipoib > multicast performance doesn't seem that critical. It's not just IPoIB multicast; it's anything that uses the IPv4 broadcast group. > Whereas disrupting > other multicast groups, which could actively be in use by MPI, may be. 
Also, it disrupts all multicast groups whether or not they are affected by this node. -- Hal > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Apr 13 04:36:43 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:36:43 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413061712.GI24730@mellanox.co.il> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> Message-ID: <1176464134.15573.57592.camel@hal.voltaire.com> On Fri, 2007-04-13 at 02:17, Michael S. Tsirkin wrote: > > Quoting Sean Hefty : > > Subject: RE: [ofa-general] Re: multicast join failed for... > > > > >When the node is diagnosed and disconnected, SM will bring the rate > > >back up. > > > > But how? Doesn't it require re-registration of all multicast groups and > > clients registered for SA events? > > SA detects that rate can be increased and sends another reregister MAD. Nit: it would be the SM rather than SA which would detect this and reregister (which is an SM PortInfo change). That causes the SA clients to do a lot of SA things. > > >As I said, there are tens of ways a bad node can hurt performance, > > >and we don't/can't handle them. Why focus on ipoib? It's > > >the only way to connect to node on some fabrics, it > > >really must be up at all times. > > > > But the solution is affecting all multicast traffic, not just that > > related to ipoib. If you want all nodes to be able to join the ipoib > > multicast group, why not just create the group at the lower rate? > > If the group is created at a lower rate, there would be no problem. > But the default configuration should be "plug an play". So you are arguing for 1x SDR as the default. 
We've discussed and disagreed on this before as I think it masks performance issues and those are harder to find. I could be wrong about this. > > ipoib multicast performance doesn't seem that critical. > > This is a policy than can be made optional, but should not > be forced on users by default. > > > Whereas disrupting > > other multicast groups, which could actively be in use by MPI, may be. > > The disruption would be very minor - this would happen at most once when rate changes > from DDR to SDR and once when it changes back. In frequency it may be minor. It affects other things that should not be affected. Perhaps that is just a shortcoming of the mechanism underneath and that can/should be improved. -- Hal From halr at voltaire.com Fri Apr 13 04:36:53 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:36:53 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413044129.GH24730@mellanox.co.il> References: <20070413041738.GD24730@mellanox.co.il> <000001c77d83$fe1cda40$8afc070a@amr.corp.intel.com> <20070413044129.GH24730@mellanox.co.il> Message-ID: <1176464203.15573.57682.camel@hal.voltaire.com> On Fri, 2007-04-13 at 00:41, Michael S. Tsirkin wrote: > > Quoting Sean Hefty : > > Subject: RE: [ofa-general] Re: multicast join failed for... > > > > >> > The job will continue running though, and when you diagnose the problem > > >> > and disconnect the bad node, rate will be back to high. > > >> > So what's the problem? > > > > What would bring the rate back up? > > When the node is diagnosed and disconnected, SM will bring the rate back up. I would say that the SM could (rather than will) bring the rate back up. This increases the implementation complexity but would be warranted if/when a dynamic rate option is supported. > > Halting all multicast traffic across the subnet to handle a flaky node > > Not halting, that would be broken. We are slowing the traffic down to avoid > congestion at this link. 
> > And you don't know it's "flaky" - it's just a heterogeneous network. Policy can > be forced by SM option but I don't think we should assume homogeneous networks > by default. Homogeneous subnets are not assumed. What is assumed is the most common use case (4x SDR or greater equipment). The issue occurs when there is a slower node attempting to join. -- Hal > > wanting > > to join some multicast group doesn't seem like a good solution. > > As I said, there are tens of ways a bad node can hurt performance, > and we don't/can't handle them. Why focus on ipoib? It's > the only way to connect to node on some fabrics, it > really must be up at all times. > > > Plus it looks > > like we'd have to repeat this later to bring the rate back up. > > So? It should all be automatic. > You see a problem in the network, diagnose it, replace the bad node, > performance comes back up. That's the way to do it. From halr at voltaire.com Fri Apr 13 04:37:04 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:37:04 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413041738.GD24730@mellanox.co.il> References: <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> <20070412171632.GU24730@mellanox.co.il> <20070412145414.70105296.weiny2@llnl.gov> <20070413041738.GD24730@mellanox.co.il> Message-ID: <1176464223.15573.57684.camel@hal.voltaire.com> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote: > > Quoting Ira Weiny : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On Thu, 12 Apr 2007 20:16:32 +0300 > > "Michael S.
Tsirkin" wrote: > > > > > > Quoting Ira Weiny : > > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > > > On Thu, 12 Apr 2007 07:21:55 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > Quoting Ira Weiny : > > > > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > > > > > > > On 11 Apr 2007 17:45:54 -0400 > > > > > > Hal Rosenstock wrote: > > > > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > > > > > > > - previously we had some client failing join > > > > > > > > which is worse. > > > > > > > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > > > > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > > > > > than to limit the speed of the entire cluster. > > > > > > > > > > Why are you joining these nodes then? > > > > > Anyway, could always be an option. > > > > > > > > > > > > > We have seen a specific example where a nodes 4X link comes up at 1X. > > > > > > I think that the way to do it, is to make it possible to force endnode link to > > > a specific rate. You can already do this with a simple script > > > from userspace, by testing the link rate once it comes up, > > > and downing the link if it's lower than what you want. > > > > > > If you think it's important, it's also quite trivial to > > > make it possible to disable 1x support through sysfs interface. > > > This way, the link will come up as 4x or not come up at all. > > > Would that be useful? > > > > Yes it would be useful. > > OK, I'll work on a patch for OFED 1.2. > > > Is this something I can do right now with OFED 1.1? > > With OFED 1.1 (without patches) you can do what I wrote above: > write a script that tests link width. 
> Disable ipoib, or the device, if it is 1x: > > For example > > #!/bin/bash > # wait until a port reports ACTIVE > until > grep -q ACTIVE /sys/class/infiniband/mthca0/ports/*/state; > do > sleep 1; > done > > > # unload the HCA driver if port 1 came up at 1x > if grep -qi 1x /sys/class/infiniband/mthca0/ports/1/rate > then > rmmod ib_mthca > fi > > > > > > > > > > > In this > > > > case we would want the join to fail. Basically a single hardware error, > > > > isolated to 1 node, should not affect the other 1150 nodes, > > > > > > As far as I know, there are *a lot* of reasons where a problem at > > > 1 node will affect others on the same subnet. Do I have to give examples? > > > I don't see why do we have to choose a specific instance (incorrect > > > link rate at endnode) and handle it differently. > > > > > > > which could very well be running a user's job. > > > > > > The job will continue running though, and when you diagnose the problem > > > and disconnect the bad node, rate will be back to high. > > > So what's the problem? > > > > Performance impact between the time it happens and diagnosing the problem. > > Yes, disabling the node is a better solution, however, the current behavior is > > not bad for us. > > Hal, here we have a use case that I think shows that the right thing > is by default to make joins succeed. Convinced? Didn't Ira say that "the current behavior is not bad for us" ? The current behavior is default 4x SDR rate which makes slower joins fail. Are you saying change the default rate to 1x SDR ? I've been concerned about masking performance issues when doing this as we've discussed several times before. -- Hal > > > > > > > > > > > Certainly if there is a heterogeneous network we would want different behavior > > > > but we don't operate any of our clusters like that. After reading today's posts > > > > I think it should be an option. > > > > > > Yes. I think the option belongs at the endnodes, as outlined above. > > > > Yes that would be a good solution as well.
> > > > > > > > > If someone has a mixture they can configure > > > > it. I am not sure what the default should be though. I know we would want > > > > the join to fail, but I understand the argument to allow it to work. > > > > > > This likely means that you have a sideband interconnect infrastructure > > > beside IPoIB. Otherwise, if the join fails, you don't even have a > > > way to debug the problem. > > > > > > > Yes we do have this. Like I said I could see where this would be beneficial to > > some users. > From halr at voltaire.com Fri Apr 13 04:44:01 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:44:01 -0400 Subject: [ofa-general] Default multicast group rate Message-ID: <1176464640.15573.58111.camel@hal.voltaire.com> Hi, There has been a lot of discussion over the last week on failed multicast joins. The current default rate for multicast groups is 10 Gbps. This means that slower nodes (whether due to 1x SDR equipment or a degraded link) will fail the join. The current default was chosen in the belief that most installations would be 4x SDR equipment or better (the most common use case) rather than the lowest common denominator use case. Also, choosing a lower default affects performance of all multicast groups (which includes the IPv4 broadcast group as well as any other derived groups (not just IPoIB multicast groups)). So when certain performance tests are run, this will be a factor which needs to be investigated. The thinking was that those subtle things are harder (but perhaps less frequent) to find than the "harder" join error which forces the admin to decide one way or the other so there is no masking this. So the question is whether the best default is 2.5 Gbps which would allow any nodes to join or whether the current default is appropriate ? I know the opinions of the people who have been vocal on this list up to now. I'm looking for other opinions. Thanks.
-- Hal From halr at voltaire.com Fri Apr 13 04:45:56 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 07:45:56 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: References: Message-ID: <1176464755.15573.58282.camel@hal.voltaire.com> Hi Egor, On Wed, 2007-04-11 at 19:09, Egor Tur wrote: > Hi folk. > > I see that my small problem has been interesting. Glad you've been entertained :-) > Thanks for your help. > > > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > > > 6) and the group is 4x SDR. The request is for equal to the rate so it > > > fails. > > > > > > Are all your ports DDR or do you have a mix ? If all are DDR, you can > > > configure the default partition to use this rate. > > > > To elaborate a little more on this, the configuration would be done via > > /etc/osm-partitions.conf file with a single line as follows: > > > > Default=0x7fff,ipoib,rate=6:ALL=full; > > > > I have identical DDR HCA and DDR switch. > I configured the default partition with the same rate. > The problem has been solved. Great. That's a good data point. You are using OFED 1.2. Can you state which distro/kernel you are using and which arch ? Thanks. -- Hal > > > > > > > > If modules were built with --without-ipoib-cm then ib_ipoib doesn't depend on ipv6. > > > > But the messages remain the same in log. > > > > Are you using IPoIB (for IPv4) ? If so, is that working ? > > > > -- Hal > > Yes I use IPoIB and I think that is working. > At least the tests, benchmarks and our parallel tasks are working. > > Thanx.
> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Apr 13 05:29:17 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 08:29:17 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176464223.15573.57684.camel@hal.voltaire.com> References: <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> <20070412171632.GU24730@mellanox.co.il> <20070412145414.70105296.weiny2@llnl.gov> <20070413041738.GD24730@mellanox.co.il> <1176464223.15573.57684.camel@hal.voltaire.com> Message-ID: <1176467355.15573.61088.camel@hal.voltaire.com> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote: > > Quoting Ira Weiny : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On Thu, 12 Apr 2007 20:16:32 +0300 > > "Michael S. Tsirkin" wrote: > > > > > > Quoting Ira Weiny : > > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > > > On Thu, 12 Apr 2007 07:21:55 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > Quoting Ira Weiny : > > > > > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > > > > > > > On 11 Apr 2007 17:45:54 -0400 > > > > > > Hal Rosenstock wrote: > > > > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > > > > > > > - previously we had some client failing join > > > > > > > > which is worse. > > > > > > > > > > > > > > Maybe not. 
Maybe that's what the admin wants (to keep the higher rate > > > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > > > > > > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out > > > > > > than to limit the speed of the entire cluster. > > > > > > > > > > Why are you joining these nodes then? > > > > > Anyway, could always be an option. > > > > > > > > > > > > > We have seen a specific example where a node's 4X link comes up at 1X. > > > > > > I think that the way to do it, is to make it possible to force endnode link to > > > a specific rate. You can already do this with a simple script > > > from userspace, by testing the link rate once it comes up, > > > and downing the link if it's lower than what you want. > > > > > > If you think it's important, it's also quite trivial to > > > make it possible to disable 1x support through sysfs interface. > > > This way, the link will come up as 4x or not come up at all. > > > Would that be useful? > > > > Yes it would be useful. > > OK, I'll work on a patch for OFED 1.2. > > > Is this something I can do right now with OFED 1.1? > > With OFED 1.1 (without patches) you can do what I wrote above: > write a script that tests link width. > Disable ipoib, or the device, if it is 1x: > > For example > > #!/bin/bash > # wait until a port reports ACTIVE > until > grep -q ACTIVE /sys/class/infiniband/mthca0/ports/*/state; > do > sleep 1; > done > > > # unload the HCA driver if port 1 came up at 1x > if grep -qi 1x /sys/class/infiniband/mthca0/ports/1/rate > then > rmmod ib_mthca > fi > > > > > > > > > > > In this > > > > case we would want the join to fail. Basically a single hardware error, > > > > isolated to 1 node, should not affect the other 1150 nodes, > > > > > > As far as I know, there are *a lot* of reasons where a problem at > > > 1 node will affect others on the same subnet. Do I have to give examples? > > > I don't see why do we have to choose a specific instance (incorrect > > > link rate at endnode) and handle it differently.
> > > > > > > which could very well be running a users job. > > > > > > The job will continue running though, and when you diagnose the problem > > > and disconnect the bad node, rate will be back to high. > > > So what's the problem? > > > > Performance impact between the time it happens and diagnosing the problem. > > Yes, disabling the node is a better solution, however, the current behavior is > > not bad for us. > > Hal, here we have a use case that I think shows that the right thing > is by default to make joins succeed. Convinced? Didn't Ira say that "the current behavior is not bad for us" ? The current behavior is default 4x SDR rate which makes slower joins fail. Are you saying change the default rate to 1x SDR ? I've been concerned about masking performance issues when doing this as we've discussed several times before. -- Hal > > > > > > > > > > > Certainly if there is a heterogeneous network we would want different behavior > > > > but we don't operate any of our clusters like that. After reading todays posts > > > > I think it should be an option. > > > > > > Yes. I think the option belongs at the endnodes, as outlined above. > > > > Yes that would be a good solution as well. > > > > > > > > > If someone has a mixture they can configure > > > > it. I am not sure what the default should be though. I know we would want > > > > the join to fail, but I understand the argument to allow it to work. > > > > > > This likely means that you have a sideband interconnect infrastructure > > > beside IPoIB. Otherwise, if the join fails, you don't even have a > > > way to debug the problem. > > > > > > > Yes we do have this. Like I said I could see where this would be beneficial to > > some users. > From halr at voltaire.com Fri Apr 13 05:41:41 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 08:41:41 -0400 Subject: [ofa-general] Re: multicast join failed for... 
In-Reply-To: <20070411094906.GA32703@mellanox.co.il> References: <1176160833.14140.408570.camel@localhost.localdomain> <20070411094906.GA32703@mellanox.co.il> Message-ID: <1176468100.15573.61937.camel@hal.voltaire.com> On Wed, 2007-04-11 at 05:49, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: multicast join failed for... > > > > On Mon, 2007-04-09 at 18:47, Egor Tur wrote: > > > Hi folk. > > > > > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 > > > > > > > > > > And in osm.log: > > > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, > > > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), > > > > > sending IB_SA_MAD_STATUS_REQ_INVALID > > > > > > > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible > > > > with the MC group. You could turn on -V with OpenSM and see more log > > > > messages as to what is going on wrong from the SM's perspective. > > > > > > Ok. 
This from osm.log with -V : > > > > > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [ > > > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD > > > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ] > > > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ] > > > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [ > > > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [ > > > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record > > > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump: > > > MGID....................0xff12601bffff0000 : 0x0000000000000001 > > > PortGid.................0xfe80000000000000 : 0x001708ffffd1509a > > > qkey....................0xB1B > > > mlid....................0x0 > > > mtu.....................0x84 > > > TClass..................0x0 > > > pkey....................0xFFFF > > > rate....................0x83 > > > pkt_life................0x0 > > > SLFlowLabelHopLimit.....0x0 > > > ScopeState..............0x1 > > > ProxyJoin...............0x0 > > > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3 > > > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > > 6) and the group is 4x SDR. The request is for equal to the rate so it > > fails. > > > BTW, the only reason I know for IPoIB to request a specific rate > is if the broadcast multicast group has that rate. Roland, is that right? > > So, how come the broadcast multicast group has rate DDR, but a specific > group has lower rate? Why does this IPoIB client think that the broadcast group is 4x DDR (20 Gbps) when the SM thinks it is 4x SDR (10 Gbps) ? How could that happen ? Is this a porting issue somehow ? -- Hal From mst at dev.mellanox.co.il Fri Apr 13 06:14:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 13 Apr 2007 16:14:34 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176464127.15573.57590.camel@hal.voltaire.com> References: <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070412033803.GC24730@mellanox.co.il> <1176384380.4545.100668.camel@hal.voltaire.com> <20070412140843.GK24730@mellanox.co.il> <1176464127.15573.57590.camel@hal.voltaire.com> Message-ID: <20070413131415.GB27940@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: multicast join failed for... > > On Thu, 2007-04-12 at 10:08, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > Subject: Re: multicast join failed for... > > > > > > On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote: > > > > > Quoting Hal Rosenstock : > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > Quoting Hal Rosenstock : > > > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > > > > > > > > > Would something like the following heuristic work better? > > > > > > > > > > - select the max rate between all participants > > > > > > > > > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > > > > > they are joined dynamically. > > > > > > > > > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > > > > > past.) > > > > > > > > > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > > > > > dynamically. 
> > > > > > > > > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > > > > > existing members. > > > > > > > > > > > > > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > > > > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > > > > > to this. Seems a little harsh to me. > > > > > > > > > > > > I think it's not too bad > > > > > > > > > > It requires all subscriptions to reregister. This affects more things > > > > > than just multicast or even the groups affected which might not be all > > > > > of the multicast groups. Hence BIG hammer. > > > > > > > > Changing an option in opensm config requires restarting > > > > opensm. Isn't that right? > > > > > > Yes but that doesn't have to be the case going forward in terms of > > > OpenSM reconfig. > > > > > > > > So its an even bigger hammer. > > > > > > Restarting opensm is a slightly bigger hammer right now (than client > > > reregistration) in the case the admin wants it "dynamic" but I suspect > > > this only needs to be done once. > > > > I think you forgot that currently one has to edit the config file, > > just restarting opensm isn't enough :). > > Let the user decide for us is a *HUGE* hammer - it usually solves > > all problem, but at what cost? > > Doesn't the admin "plan" his network ? This is part of the installation > and bringup IMO. I agree the admin must plan the network. But I disagree this should necessarily involve editing config files. > There are a couple of ways to avoid having the admin decide but they all > involve penalizing the more normal use cases (pushing the admin burden > to them). I'm ambivalent about whether that's a better choice. I don't think what I propose penalizes normal use. It just turns what used to be an error into working configuration. 
> > > > > There could be a more > > > > > graceful way to deal with this. I don't like using client reregister > > > > > unless absolutely needed. > > > > > > > > What are the other options that have the same funcitionality? > > > > > > Perhaps a spec enhancement is possible to make this better. > > > > Sure. Meanwhile, opensm will have to support legacy networks > > too so I think we can start with the reregister solution. > > OK; it could be another option. Would you propose this being the default > option ? No, I expect if node supports an ability to reregister specific mcast groups, this capability can be advertised somehow, and SM will use it if available, and plain reregister if not. > > > > > > - previously we had some client failing join > > > > > > which is worse. > > > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > > rather than degrade the group due to some link issue). > > > > > > > > Rate could be an option, but I think generally people prefer > > > > things working even if at a slower rate. > > > > > > I think it's a coin flip. > > > > I disagree. I think people that want the join to fail basically > > just want to make debugging easy. We can help them without failing joins. > > > > > I've seen it both ways and either way there > > > are support questions. > > > > I think we can solve this relatively easily: compare the bcast group > > rate with local rate and have IPoIB produce a warning in log if these > > do not match. > > > > This is similiar to what we have with USB2.0 device in USB slot, > > people seem to be happy. > > > > > In the current scenario, it is join failures. In > > > the proposed scenario, it is more subtle: performance implications and > > > perhaps SA network storms. > > > > I don't believe we'll see network storms: rate has to drop from DDR to SDR > > only once. 
> > Frequency appears low (but I'm sure we'll hit some oscillating case down > the road) but impacts all multicast groups whether or not this node > affects them as well as other subscriptions. Client reregister is a > storm IMO and should only be used when there is absolutely no other > choice. I agree it might be useful to give opensm a way to detect that a set of mcast groups belongs to a specific application, and a way to force re-registration. -- MST From halr at voltaire.com Fri Apr 13 06:23:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 09:23:45 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413131415.GB27940@mellanox.co.il> References: <20070411072202.GJ24730@mellanox.co.il> <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070412033803.GC24730@mellanox.co.il> <1176384380.4545.100668.camel@hal.voltaire.com> <20070412140843.GK24730@mellanox.co.il> <1176464127.15573.57590.camel@hal.voltaire.com> <20070413131415.GB27940@mellanox.co.il> Message-ID: <1176470624.15573.64595.camel@hal.voltaire.com> On Fri, 2007-04-13 at 09:14, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: multicast join failed for... > > > > On Thu, 2007-04-12 at 10:08, Michael S. Tsirkin wrote: > > > > Quoting Hal Rosenstock : > > > > Subject: Re: multicast join failed for... > > > > > > > > On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote: > > > > > > Quoting Hal Rosenstock : > > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > Quoting Hal Rosenstock : > > > > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. 
Tsirkin wrote: > > > > > > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > > > > > > > > > > > Would something like the following heuristic work better? > > > > > > > > > > > - select the max rate between all participants > > > > > > > > > > > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > > > > > > they are joined dynamically. > > > > > > > > > > > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > > > > > > past.) > > > > > > > > > > > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > > > > > > dynamically. > > > > > > > > > > > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > > > > > > existing members. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > > > > > > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > > > > > > to this. Seems a little harsh to me. > > > > > > > > > > > > > > I think it's not too bad > > > > > > > > > > > > It requires all subscriptions to reregister. This affects more things > > > > > > than just multicast or even the groups affected which might not be all > > > > > > of the multicast groups. Hence BIG hammer. > > > > > > > > > > Changing an option in opensm config requires restarting > > > > > opensm. Isn't that right? > > > > > > > > Yes but that doesn't have to be the case going forward in terms of > > > > OpenSM reconfig. > > > > > > > > > > So its an even bigger hammer. > > > > > > > > Restarting opensm is a slightly bigger hammer right now (than client > > > > reregistration) in the case the admin wants it "dynamic" but I suspect > > > > this only needs to be done once. 
> > > > > > I think you forgot that currently one has to edit the config file, > > > just restarting opensm isn't enough :). > > > Let the user decide for us is a *HUGE* hammer - it usually solves > > > all problem, but at what cost? > > > > Doesn't the admin "plan" his network ? This is part of the installation > > and bringup IMO. > > I agree the admin must plan the network. > But I disagree this should necessarily involve editing config files. Why doesn't it include editing config files when some non default is needed ? > > There are a couple of ways to avoid having the admin decide but they all > > involve penalizing the more normal use cases (pushing the admin burden > > to them). I'm ambivalent about whether that's a better choice. > > I don't think what I propose penalizes normal use. > It just turns what used to be an error into working configuration. I was referring to the existing static rate approach with a default, not to the proposed dynamic approach. > > > > > > There could be a more > > > > > > graceful way to deal with this. I don't like using client reregister > > > > > > unless absolutely needed. > > > > > > > > > > What are the other options that have the same funcitionality? > > > > > > > > Perhaps a spec enhancement is possible to make this better. > > > > > > Sure. Meanwhile, opensm will have to support legacy networks > > > too so I think we can start with the reregister solution. > > > > OK; it could be another option. Would you propose this being the default > > option ? > > No, I expect if node supports an ability to reregister specific mcast > groups, this capability can be advertised somehow, and SM > will use it if available, and plain reregister if not. I wasn't referring to the mechanisms underneath which accomplish the dynamic rate adjustment here. I was asking if you propose dynamic rate being the default rate for multicast groups (and any specific static rate would need to be configured). 
> > > > > > > - previously we had some client failing join > > > > > > > which is worse. > > > > > > > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate > > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > Rate could be an option, but I think generally people prefer > > > > > things working even if at a slower rate. > > > > > > > > I think it's a coin flip. > > > > > > I disagree. I think people that want the join to fail basically > > > just want to make debugging easy. We can help them without failing joins. > > > > > > > I've seen it both ways and either way there > > > > are support questions. > > > > > > I think we can solve this relatively easily: compare the bcast group > > > rate with local rate and have IPoIB produce a warning in log if these > > > do not match. > > > > > > This is similiar to what we have with USB2.0 device in USB slot, > > > people seem to be happy. > > > > > > > In the current scenario, it is join failures. In > > > > the proposed scenario, it is more subtle: performance implications and > > > > perhaps SA network storms. > > > > > > I don't believe we'll see network storms: rate has to drop from DDR to SDR > > > only once. > > > > Frequency appears low (but I'm sure we'll hit some oscillating case down > > the road) but impacts all multicast groups whether or not this node > > affects them as well as other subscriptions. Client reregister is a > > storm IMO and should only be used when there is absolutely no other > > choice. > > I agree it might be useful to give opensm a way to detect > that a set of mcast groups belongs to a specific application, Certainly, one can detect IPv4 groups and IPv6 groups but not sure about the level of granularity needed here. > and a way to force re-registration. more gracefully. -- Hal From mst at dev.mellanox.co.il Fri Apr 13 06:38:40 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 13 Apr 2007 16:38:40 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176464134.15573.57592.camel@hal.voltaire.com> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> <1176464134.15573.57592.camel@hal.voltaire.com> Message-ID: <20070413133840.GC27940@mellanox.co.il> > > If the group is created at a lower rate, there would be no problem. > > But the default configuration should be "plug and play". > > So you are arguing for 1x SDR as the default. We've discussed and > disagreed on this before as I think it masks performance issues and > those are harder to find. I could be wrong about this. No, I'm arguing for dynamic configuration as the default, so we start at 4x DDR and bring the rate down as slower nodes join. > > > ipoib multicast performance doesn't seem that critical. > > > > This is a policy that can be made optional, but should not > > be forced on users by default. > > > > > Whereas disrupting > > > other multicast groups, which could actively be in use by MPI, may be. > > > > The disruption would be very minor - this would happen at most once when rate changes > > from DDR to SDR and once when it changes back. > > In frequency it may be minor. It affects other things that should not be > affected. Perhaps that is just a shortcoming of the mechanism underneath > and that can/should be improved. Yes, I agree. -- MST From mst at dev.mellanox.co.il Fri Apr 13 06:42:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 16:42:48 +0300 Subject: [ofa-general] Re: multicast join failed for...
In-Reply-To: <1176470624.15573.64595.camel@hal.voltaire.com> References: <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070412033803.GC24730@mellanox.co.il> <1176384380.4545.100668.camel@hal.voltaire.com> <20070412140843.GK24730@mellanox.co.il> <1176464127.15573.57590.camel@hal.voltaire.com> <20070413131415.GB27940@mellanox.co.il> <1176470624.15573.64595.camel@hal.voltaire.com> Message-ID: <20070413134248.GD27940@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: multicast join failed for... > > On Fri, 2007-04-13 at 09:14, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > Subject: Re: multicast join failed for... > > > > > > On Thu, 2007-04-12 at 10:08, Michael S. Tsirkin wrote: > > > > > Quoting Hal Rosenstock : > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote: > > > > > > > Quoting Hal Rosenstock : > > > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote: > > > > > > > > > Quoting Hal Rosenstock : > > > > > > > > > Subject: Re: multicast join failed for... > > > > > > > > > > > > > > > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote: > > > > > > > > > > > > If yes, I'm actually not too happy with this. > > > > > > > > > > > > > > > > > > > > > > > > Would something like the following heuristic work better? > > > > > > > > > > > > - select the max rate between all participants > > > > > > > > > > > > > > > > > > > > > > The issue is that one doesn't know all the participants in a group as > > > > > > > > > > > they are joined dynamically. > > > > > > > > > > > > > > > > > > > > > > (I think we've been over this aspect on the list several times in the > > > > > > > > > > > past.) 
> > > > > > > > > > > > > > > > > > > > That's why I suggest the fix, so that the rate is adapted > > > > > > > > > > dynamically. > > > > > > > > > > > > > > > > > > > > > > - when a host with lower rate joins, destroy the group > > > > > > > > > > > > > > > > > > > > > > I don't think a group can be destroyed like this "underneath" its > > > > > > > > > > > existing members. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Of course it can. That's what happens when SM is restarted. > > > > > > > > > > > > > > > > > > Client reregistration ? I don't like using that big hammer as a solution > > > > > > > > > to this. Seems a little harsh to me. > > > > > > > > > > > > > > > > I think it's not too bad > > > > > > > > > > > > > > It requires all subscriptions to reregister. This affects more things > > > > > > > than just multicast or even the groups affected which might not be all > > > > > > > of the multicast groups. Hence BIG hammer. > > > > > > > > > > > > Changing an option in opensm config requires restarting > > > > > > opensm. Isn't that right? > > > > > > > > > > Yes but that doesn't have to be the case going forward in terms of > > > > > OpenSM reconfig. > > > > > > > > > > > > So its an even bigger hammer. > > > > > > > > > > Restarting opensm is a slightly bigger hammer right now (than client > > > > > reregistration) in the case the admin wants it "dynamic" but I suspect > > > > > this only needs to be done once. > > > > > > > > I think you forgot that currently one has to edit the config file, > > > > just restarting opensm isn't enough :). > > > > Let the user decide for us is a *HUGE* hammer - it usually solves > > > > all problem, but at what cost? > > > > > > Doesn't the admin "plan" his network ? This is part of the installation > > > and bringup IMO. > > > > I agree the admin must plan the network. > > But I disagree this should necessarily involve editing config files. 
> > Why doesn't it include editing config files when some non default is > needed ? I'm just saying the dynamic configuration is more flexible than a static one. > > > There are a couple of ways to avoid having the admin decide but they all > > > involve penalizing the more normal use cases (pushing the admin burden > > > to them). I'm ambivalent about whether that's a better choice. > > > > I don't think what I propose penalizes normal use. > > It just turns what used to be an error into a working configuration. > > I was referring to the existing static rate approach with a default, not > to the proposed dynamic approach. Ah, I see. My suggestion basically is to make the dynamic approach the default. > > > > > > > There could be a more > > > > > > > graceful way to deal with this. I don't like using client reregister > > > > > > > unless absolutely needed. > > > > > > > > > > > > What are the other options that have the same functionality? > > > > > > > > > > Perhaps a spec enhancement is possible to make this better. > > > > > > > > Sure. Meanwhile, opensm will have to support legacy networks > > > > too so I think we can start with the reregister solution. > > > > > > OK; it could be another option. Would you propose this being the default > > > option ? > > > > No, I expect if node supports an ability to reregister specific mcast > > groups, this capability can be advertised somehow, and SM > > will use it if available, and plain reregister if not. > > I wasn't referring to the mechanisms underneath which accomplish the > dynamic rate adjustment here. I was asking if you propose dynamic rate > being the default rate for multicast groups (and any specific static > rate would need to be configured). Yes. That's what I propose. > > > > > > > > - previously we had some client failing join > > > > > > > > which is worse. > > > > > > > > > > > > > > Maybe not.
Maybe that's what the admin wants (to keep the higher rate > > > > > > > rather than degrade the group due to some link issue). > > > > > > > > > > > > Rate could be an option, but I think generally people prefer > > > > > > things working even if at a slower rate. > > > > > > > > > > I think it's a coin flip. > > > > > > > > I disagree. I think people that want the join to fail basically > > > > just want to make debugging easy. We can help them without failing joins. > > > > > > > > > I've seen it both ways and either way there > > > > > are support questions. > > > > > > > > I think we can solve this relatively easily: compare the bcast group > > > > rate with local rate and have IPoIB produce a warning in log if these > > > > do not match. > > > > > > > > This is similiar to what we have with USB2.0 device in USB slot, > > > > people seem to be happy. > > > > > > > > > In the current scenario, it is join failures. In > > > > > the proposed scenario, it is more subtle: performance implications and > > > > > perhaps SA network storms. > > > > > > > > I don't believe we'll see network storms: rate has to drop from DDR to SDR > > > > only once. > > > > > > Frequency appears low (but I'm sure we'll hit some oscillating case down > > > the road) but impacts all multicast groups whether or not this node > > > affects them as well as other subscriptions. Client reregister is a > > > storm IMO and should only be used when there is absolutely no other > > > choice. > > > > I agree it might be useful to give opensm a way to detect > > that a set of mcast groups belongs to a specific application, > > Certainly, one can detect IPv4 groups and IPv6 groups but not sure about > the level of granularity needed here. > > > and a way to force re-registration. > > more gracefully. > > -- Hal Yes. 
-- MST From halr at voltaire.com Fri Apr 13 06:45:37 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 09:45:37 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413133840.GC27940@mellanox.co.il> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> <1176464134.15573.57592.camel@hal.voltaire.com> <20070413133840.GC27940@mellanox.co.il> Message-ID: <1176471936.15573.65992.camel@hal.voltaire.com> On Fri, 2007-04-13 at 09:38, Michael S. Tsirkin wrote: > > > If the group is created at a lower rate, there would be no problem. > > > But the default configuration should be "plug and play". > > > > So you are arguing for 1x SDR as the default. We've discussed and > > disagreed on this before as I think it masks performance issues and > > those are harder to find. I could be wrong about this. > > No, I'm arguing for dynamic configuration as the default. > so we start at 4x DDR and bring the rate down as slower nodes join. OK that answers a different question I was wondering about. Or speed it up if all nodes are say 4x DDR. What I was trying to say was that, since we don't have dynamic rate support now (and I'm not signing up to do this; is someone?), a static rate default of 1x SDR would eliminate the join errors (at the debug "expense" of what I think are harder to find performance issues). Sorry I didn't make that clear before. -- Hal > > > > ipoib multicast performance doesn't seem that critical. > > > > > > This is a policy that can be made optional, but should not > > > be forced on users by default. > > > > > > > Whereas disrupting > > > > other multicast groups, which could actively be in use by MPI, may be. > > > > > > The disruption would be very minor - this would happen at most once when rate changes > > > from DDR to SDR and once when it changes back. > > > > In frequency it may be minor. It affects other things that should not be > > affected.
Perhaps that is just a shortcoming of the mechanism underneath > > and that can/should be improved. > > Yes, I agree. > From mst at dev.mellanox.co.il Fri Apr 13 06:50:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 16:50:16 +0300 Subject: [ofa-general] Re: Default multicast group rate In-Reply-To: <1176464640.15573.58111.camel@hal.voltaire.com> References: <1176464640.15573.58111.camel@hal.voltaire.com> Message-ID: <20070413135016.GE27940@mellanox.co.il> > So the question is whether the best default is 2.5 Gbps which would > allow any nodes to join or whether the current default is appropriate ? > I know certain people's opinions who have been vocal on this list up to > now. I'm looking for other opinions. Thanks. Just as a summary of what I was saying in another thread, I propose implementing a dynamic approach, where we start at 4x DDR, and drop the rate gradually as lower rate nodes join, or raise the rate when all lower rate nodes have left. reregister can be used to let existing members know that rate has changed. Later, a spec extension can be designed to notify group members about rate change. I think this dynamic approach would be the best default, static configurations can be supported as an option. -- MST From halr at voltaire.com Fri Apr 13 06:52:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 09:52:52 -0400 Subject: [ofa-general] Re: Default multicast group rate In-Reply-To: <20070413135016.GE27940@mellanox.co.il> References: <1176464640.15573.58111.camel@hal.voltaire.com> <20070413135016.GE27940@mellanox.co.il> Message-ID: <1176472371.15573.66401.camel@hal.voltaire.com> On Fri, 2007-04-13 at 09:50, Michael S. Tsirkin wrote: > > So the question is whether the best default is 2.5 Gbps which would > > allow any nodes to join or whether the current default is appropriate ? > > I know certain people's opinions who have been vocal on this list up to > > now. I'm looking for other opinions. Thanks. 
> > Just as a summary of what I was saying in another thread, > I propose implementing a dynamic approach, where we start at > 4x DDR, and drop the rate gradually as lower rate nodes join, > or raise the rate when all lower rate nodes have left. > > reregister can be used to let existing members know that > rate has changed. Later, a spec extension can be designed > to notify group members about rate change. > > I think this dynamic approach would be the best default, static > configurations can be supported as an option. We don't have that now so I'm trying to find out whether there is any consensus on which static rate should be the default currently. I also think that before a dynamic approach can be the default, we will need some experience with the behavior in this mode. -- Hal From mst at dev.mellanox.co.il Fri Apr 13 06:57:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 16:57:12 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176471936.15573.65992.camel@hal.voltaire.com> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> <1176464134.15573.57592.camel@hal.voltaire.com> <20070413133840.GC27940@mellanox.co.il> <1176471936.15573.65992.camel@hal.voltaire.com> Message-ID: <20070413135712.GF27940@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: [ofa-general] Re: multicast join failed for... > > On Fri, 2007-04-13 at 09:38, Michael S. Tsirkin wrote: > > > > If the group is created at a lower rate, there would be no problem. > > > > But the default configuration should be "plug an play". > > > > > > So you are arguing for 1x SDR as the default. We've discussed and > > > disagreed on this before as I think it masks performance issues and > > > those are harder to find. I could be wrong about this. > > > > No, I'm arguing for dynamic configuration as the default. > > so we start at 4x DDR and bbring the rate down as slower nodes join. 
> > OK that answers a different question I was wondering about. Or speed it > up if all nodes are say 4x DDR. > > What I was trying to say was that since we don't have dynamic rate > support now (and I'm not signing up to do this, is someone ?), I don't know too much about opensm yet, but I can try looking into it, or try talking someone into this :) But I'm happy we all agree it's a good idea. Let's add this to osm/doc/todo? > I was > saying that a static rate default of 1x SDR would eliminate the join > errors (at the debug "expense" of what I think are harder to find > performance issues). Sorry I didn't make that clear before. I think if we either 1. add an option to disable 1x support at the endnode, or 2. implement a tool to find and report 1x links, or 3. by default, report 1x links in the opensm log as errors, then this issue will be easy to debug. -- MST From mst at dev.mellanox.co.il Fri Apr 13 06:58:47 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 13 Apr 2007 16:58:47 +0300 Subject: [ofa-general] Re: Default multicast group rate In-Reply-To: <1176472371.15573.66401.camel@hal.voltaire.com> References: <1176464640.15573.58111.camel@hal.voltaire.com> <20070413135016.GE27940@mellanox.co.il> <1176472371.15573.66401.camel@hal.voltaire.com> Message-ID: <20070413135847.GG27940@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: Default multicast group rate > > On Fri, 2007-04-13 at 09:50, Michael S. Tsirkin wrote: > > > So the question is whether the best default is 2.5 Gbps which would > > > allow any nodes to join or whether the current default is appropriate ? > > > I know certain people's opinions who have been vocal on this list up to > > > now. I'm looking for other opinions. Thanks. > > > > Just as a summary of what I was saying in another thread, > > I propose implementing a dynamic approach, where we start at > > 4x DDR, and drop the rate gradually as lower rate nodes join, > > or raise the rate when all lower rate nodes have left.
> > > > reregister can be used to let existing members know that > > rate has changed. Later, a spec extension can be designed > > to notify group members about rate change. > > > > I think this dynamic approach would be the best default, static > > configurations can be supported as an option. > > We don't have that now so I'm trying to find out whether there is any > consensus on which static rate should be the default currently. It kind of seems that 1x makes sense for the HOSTS group. But it's a harder question for other mcast groups. > I also think that before a dynamic approach can be the default, we will > need some experience with the behavior in this mode. Fair enough. -- MST From halr at voltaire.com Fri Apr 13 07:05:27 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 10:05:27 -0400 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <20070413135712.GF27940@mellanox.co.il> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> <1176464134.15573.57592.camel@hal.voltaire.com> <20070413133840.GC27940@mellanox.co.il> <1176471936.15573.65992.camel@hal.voltaire.com> <20070413135712.GF27940@mellanox.co.il> Message-ID: <1176473127.15573.67219.camel@hal.voltaire.com> On Fri, 2007-04-13 at 09:57, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On Fri, 2007-04-13 at 09:38, Michael S. Tsirkin wrote: > > > > > If the group is created at a lower rate, there would be no problem. > > > > > But the default configuration should be "plug an play". > > > > > > > > So you are arguing for 1x SDR as the default. We've discussed and > > > > disagreed on this before as I think it masks performance issues and > > > > those are harder to find. I could be wrong about this. > > > > > > No, I'm arguing for dynamic configuration as the default. > > > so we start at 4x DDR and bbring the rate down as slower nodes join. 
> > > > OK that answers a different question I was wondering about. Or speed it > > up if all nodes are say 4x DDR. > > > > What I was trying to say was that since we don't have dynamic rate > > support now (and I'm not signing up to do this, is someone ?), > > I don't know too much about opensm yet, but I can try looking into it, > or try talking someone into this :) > > But I'm happy we all agree it's a good idea. > Let's add this to osm/doc/todo? Sure. I just added this as: Add dynamic rate adjustment for multicast groups > > I was > > saying that a static rate default of 1x SDR would eliminate the join > > errors (at the debug "expense" of what I think are harder to find > > performance issues). Sorry I didn't make that clear before. > > I think if we either > 1. Add an option to disable 1x support at endnode > or > 2. Implement a tool to find and report 1x links We have had this for quite some time. > or > 3. By default, report 1x links in opensm log as errors But not all 1x links are errors. I also think it may need to go further than this. > Then this issue will be easy to debug. Easier but I'm not sure about easy (at least not yet). I think it still comes down to knowledgable users/admins. -- Hal From halr at voltaire.com Fri Apr 13 07:07:27 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2007 10:07:27 -0400 Subject: [ofa-general] Re: Default multicast group rate In-Reply-To: <20070413135847.GG27940@mellanox.co.il> References: <1176464640.15573.58111.camel@hal.voltaire.com> <20070413135016.GE27940@mellanox.co.il> <1176472371.15573.66401.camel@hal.voltaire.com> <20070413135847.GG27940@mellanox.co.il> Message-ID: <1176473245.15573.67303.camel@hal.voltaire.com> On Fri, 2007-04-13 at 09:58, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: Default multicast group rate > > > > On Fri, 2007-04-13 at 09:50, Michael S. 
Tsirkin wrote: > > > > So the question is whether the best default is 2.5 Gbps which would > > > > allow any nodes to join or whether the current default is appropriate ? > > > > I know certain people's opinions who have been vocal on this list up to > > > > now. I'm looking for other opinions. Thanks. > > > > > > Just as a summary of what I was saying in another thread, > > > I propose implementing a dynamic approach, where we start at > > > 4x DDR, and drop the rate gradually as lower rate nodes join, > > > or raise the rate when all lower rate nodes have left. > > > > > > reregister can be used to let existing members know that > > > rate has changed. Later, a spec extension can be designed > > > to notify group members about rate change. > > > > > > I think this dynamic approach would be the best default, static > > > configurations can be supported as an option. > > > > We don't have that now so I'm trying to find out whether there is any > > consensus on which static rate should be the default currently. > > It kind of seems that 1x makes sense for the HOSTS group. > But it's a harder question for other mcast groups. And we only get 1 choice :-( So someone is going to be unhappy whichever way this goes. This is another one of those imperfect engineering tradeoffs. -- Hal > > I also think that before a dynamic approach can be the default, we will > > need some experience with the behavior in this mode. > > Fair enough. From sashak at voltaire.com Fri Apr 13 08:24:25 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 13 Apr 2007 18:24:25 +0300 Subject: [ofa-general] Re: multicast join failed for... In-Reply-To: <1176464755.15573.58282.camel@hal.voltaire.com> References: <1176464755.15573.58282.camel@hal.voltaire.com> Message-ID: <20070413152425.GA15099@sashak.voltaire.com> On 07:45 Fri 13 Apr , Hal Rosenstock wrote: > Hi Egor, > > On Wed, 2007-04-11 at 19:09, Egor Tur wrote: > > Hi folk. > > > > I see that my small problem has been interesting. 
> > Glad you've been entertained :-) > > > Thanks for your help. > > > > > > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate > > > > 6) and the group is 4x SDR. The request is for equal to the rate so it > > > > fails. > > > > > > > > Are all your ports DDR or do you have a mix ? If all are DDR, you can > > > > configure the default partition to use this rate. > > > > > > To elaborate a little more on this, the configuration would be done via > > > /etc/osm-partitions.conf file with a single line as follows: > > > > > > Default=0x7fff,ipoib,rate=6:ALL=full; > > > > > > > I have identical DDR HCA and DDR switch. > > I configured the default partition with the same rate. > > The problem has been solved. > > Great. That's a good data point. > > You are using OFED 1.2. Can you state which distro/kernel you are using > and which arch ? Thanks. And are you using the same stack version on all nodes? Sasha > > -- Hal > > > > > > > > > > > If the modules were built with --without-ipoib-cm, then ib_ipoib doesn't depend on ipv6. > > > > > But the messages remain the same in log. > > > > > > Are you using IPoIB (for IPv4) ? If so, is that working ? > > > > > > -- Hal > > > > Yes I use IPoIB and I think that is working. > > At least the tests, benchmarks and our parallel tasks are working. > > > > Thanx.
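A minimal sketch of the join failure described in the message above, assuming only the two rate encodings quoted in this thread (3 = 10 Gb/sec, 6 = 20 Gb/sec). The function name is hypothetical and this is not OpenSM code; it only models a join request that asks for a rate exactly equal to the port's rate.

```python
# Rate encodings as quoted in the thread: 3 is 10 Gb/sec, 6 is 20 Gb/sec.
RATE_GBPS = {3: 10.0, 6: 20.0}


def exact_rate_join_ok(requested_code, group_code):
    """Hypothetical check: the join asks for a group whose rate is exactly
    equal to the requested rate, so any mismatch fails the join."""
    return RATE_GBPS[requested_code] == RATE_GBPS[group_code]


# 4x DDR port (rate 6) joining a group created at 4x SDR (rate 3): fails.
print(exact_rate_join_ok(6, 3))
# Same port after the partition is configured with rate=6: succeeds.
print(exact_rate_join_ok(6, 6))
```

This matches the outcome reported above: once the default partition was configured with rate=6 on an all-DDR fabric, the join errors went away.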
From bboas at systemfabricworks.com Fri Apr 13 08:22:45 2007 From: bboas at systemfabricworks.com (Bill Boas) Date: Fri, 13 Apr 2007 08:22:45 -0700 Subject: [ofa-general] Update on the Sonoma Workshop - plse check all the attached information Message-ID: <004801c77ddf$971dcc80$6401a8c0@BBoas> Dear Member of the OpenFabrics Community, This email is sent to you 2 weeks before the workshop for 3 reasons: 1) to tell you who is attending and what the agenda is at the moment, and to ask you to check your personal status with regard to your registration, hotel reservation and speaking commitment; 2) to ask you to check with your colleagues, your friends and your business to encourage those you know will benefit from attending to do so; 3) to remind you that our room block at the Lodge expires at 5PM PDT Tuesday April 17 and the hotel is filling up fast. URLs to make reservations are at the bottom of this email; this is the only way to get the special rates. Here are a few words about this workshop.
It's not a "marketing" event. It's a working session where developers, Linux distros, chip, switch and storage companies, Tier 1 OEMs and integrators, software companies and end-user customers hold a series of working sessions to get up to date on the status of the latest OFED release, on everyone's experience with OpenFabrics, InfiniBand and low-latency Ethernet, on what still needs to be done for the OpenFabrics stack to be a core component of Linux and the software of choice for the switch companies and the OEMs/integrators, and on how end-user customers can be sure of every level of support and problem resolution they need. Attached you will find the currently registered attendees, the current list of people who have reserved rooms at the Lodge and the current agenda with the sessions and speakers identified. There are lots of anomalies between these lists and here is the analysis of those anomalies. Please check to see if you are one of them and then fix the mismatch between registered for workshop, have a hotel room, speaking but have neither registered nor reserved a room, etc. Please forgive me if you see your name below but we are trying to ensure everyone who is attending, speakers included, has registered; that everyone has a hotel reservation and that all the speakers we have planned for are planning to be with us. People who have registered at http://www.acteva.com/booking.cfm?bevaid=125720 but do not have a room reserved at the Lodge (if you are staying elsewhere that's OK): Raj Channa, Alexander Elbs, Holger Obermaier, Krish Ramakrishnan, Aviv Cohen, Moni Levy, Or Gerlitz, Yiftah Shachar, Yaron Segev, Asaf Somekh, Bert Tanaka, Dan Tuchler, Tony Vaidya, Dave Ford. People who have room reservations at the Lodge but have not registered on Acteva yet (please register):
Sharon Brunett, Greg Fausette, Doug Fuller, Derek Granath, Paul Grun, Ramachandra Kuchimanchi, Ed Mascarenhas, Dan Maltbie, Glen Newell, Fabian Tillier, Charles Wilder. Speakers on the Agenda who have not registered on Acteva and do have a hotel room at the Lodge: Moiz Kohari, Vinod Tipparaju, Galen Shipman, Trent D'Hooge, Sonia Pignorel, Moshe Bar, Bob Woodruff, Kevin Canady, Mike Norby, Lloyd Dickman, Jon Haas. Forgive me also for putting this out on developers' email distribution lists, but... Bill. Bill Boas VP, Business Development, System Fabric Works Tel: 510.375.8840 Email: bboas at systemfabricworks.com -------------- next part -------------- A non-text attachment was scrubbed... Name: Sonoma Agenda April 11.xls Type: application/vnd.ms-excel Size: 77824 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: registered Attendees April 12.xls Type: application/vnd.ms-excel Size: 29696 bytes Desc: not available URL: From sashak at voltaire.com Fri Apr 13 08:40:49 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 13 Apr 2007 18:40:49 +0300 Subject: [ofa-general] Default multicast group rate In-Reply-To: <1176464640.15573.58111.camel@hal.voltaire.com> References: <1176464640.15573.58111.camel@hal.voltaire.com> Message-ID: <20070413154049.GC15099@sashak.voltaire.com> On 07:44 Fri 13 Apr , Hal Rosenstock wrote: > Hi, > > There has been a lot of discussion over the last week on failed > multicast joins. > > The current default rate for multicast groups is 10 Gbps. This means > that slower nodes (whether due to 1x SDR equipment or a degraded link) > will fail the join. > > The current default was chosen in the belief that most installations > would be 4x SDR equipment or better (the most common use case) rather > than the lowest common denominator use case.
> Also, choosing a lower default affects performance of all multicast
> groups (which includes the IPv4 broadcast group as well as any other
> derived groups (not just IPoIB multicast groups)). So when certain
> performance tests are run, this will be a factor which needs to be
> investigated. The thinking was that those subtle things are harder
> (but perhaps less frequent) to find than the "harder" join error which
> forces the admin to decide one way or the other so there is no masking
> this.
>
> So the question is whether the best default is 2.5 Gbps which would
> allow any nodes to join or whether the current default is appropriate ?
> I know certain people's opinions who have been vocal on this list up to
> now. I'm looking for other opinions. Thanks.

This value is configurable, and 10 Gb/s as the default looks reasonable to me.

Sasha

From pradeep at us.ibm.com Fri Apr 13 08:34:46 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Apr 2007 08:34:46 -0700
Subject: [ofa-general] Fw: mthca issues -need help
Message-ID:

Michael,

Will you be able to help me with some of the issues listed below?

Pradeep
pradeep at us.ibm.com

----- Forwarded by Pradeep Satyanarayana/Beaverton/IBM on 04/13/2007 08:33 AM -----

Pradeep Satyanarayana/Beaverton/IBM
04/12/2007 01:58 PM
To general at lists.openfabrics.org
cc "Michael S. Tsirkin"
Subject mthca issues -need help

I am running into a number of mthca issues listed below and need help with them.

1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe ib_mthca (on ppc64)

Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: HCA FW version 3.3.3 is old (3.4.0 is current).
Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: If you have problems, try updating your HCA FW.
Apr 12 14:11:19 elm3b37 kernel: Faulting instruction address: 0xd0000000002db0d8 Apr 12 14:11:19 elm3b37 kernel: Oops: Kernel access of bad area, sig: 11 [#2] Apr 12 14:11:19 elm3b37 kernel: SMP NR_CPUS=128 NUMA Apr 12 14:11:19 elm3b37 kernel: Modules linked in: ib_mthca ib_mad ib_ehca ib_core autofs4 ipv6 binfmt_misc parport_pc lp parport sg e1000 dm_snapshot dm_zero dm_mirror dm_mod ipr libata sd_mod scsi_mod firmware_class ehci_hcd ohci_hcd usbcore Apr 12 14:11:19 elm3b37 kernel: NIP: D0000000002DB0D8 LR: D0000000002DAE0C CTR: 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: REGS: c0000000e2116f60 TRAP: 0300 Not tainted (2.6.21-rc5) Apr 12 14:11:19 elm3b37 kernel: MSR: 8000000000009032 CR: 24024444 XER: 00000008 Apr 12 14:11:19 elm3b37 kernel: DAR: 0000000000002000, DSISR: 0000000042000000 Apr 12 14:11:19 elm3b37 kernel: TASK = c0000000e7de4040[3884] 'modprobe' THREAD: c0000000e2114000 CPU: 0 Apr 12 14:11:19 elm3b37 kernel: GPR00: 0000000040010001 C0000000E21171E0 D000000000308B30 0000000007FFFFFF Apr 12 14:11:19 elm3b37 kernel: GPR04: C0000000E595FE00 0000000000000000 C0000000E2438000 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: GPR08: 0000000000000000 0000000000000400 0000000000002000 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR12: D0000000002EAD28 C000000000535A80 AAAAAAAAAAAAAAAB D0000000005A0C10 Apr 12 14:11:19 elm3b37 kernel: GPR16: 0000000000000000 0000000000000312 0000000000000312 000000000000003F Apr 12 14:11:19 elm3b37 kernel: GPR20: C0000000E595FE20 C0000000E4F04000 C0000000E595FE00 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR24: C0000000E4FAF000 0000000007FFFFFF 0000000000000000 0000000000002000 Apr 12 14:11:19 elm3b37 kernel: GPR28: C0000000E2438000 0000000000000400 D0000000003075B0 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: NIP [D0000000002DB0D8] .mthca_write_mtt+0x328/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: LR [D0000000002DAE0C] .mthca_write_mtt+0x5c/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: Call Trace: 
Apr 12 14:11:19 elm3b37 kernel: [C0000000E21171E0] [C0000000E2117300] 0xc0000000e2117300 (unreliable) Apr 12 14:11:19 elm3b37 kernel: [C0000000E21172D0] [D0000000002DBD1C] .mthca_mr_alloc_phys+0x8c/0x140 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117390] [D0000000002D6B6C] .mthca_create_eq+0x3ac/0x5e0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117490] [D0000000002D7528] .mthca_init_eq_table+0x198/0x790 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117560] [D0000000002D0368] .__mthca_init_one+0xa38/0xd70 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117640] [D0000000002D0714] .mthca_init_one+0x74/0xf0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E21176E0] [C0000000002487D8] .pci_device_probe+0x168/0x200 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21177A0] [C0000000002C288C] .really_probe+0xbc/0x1f0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117850] [C0000000002C2D3C] .__driver_attach+0xfc/0x140 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21178E0] [C0000000002C1668] .bus_for_each_dev+0x88/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21179A0] [C0000000002C2628] .driver_attach+0x28/0x40 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117A20] [C0000000002C1C34] .bus_add_driver+0xc4/0x220 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117AC0] [C0000000002C3118] .driver_register+0x78/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117B40] [C000000000248B70] .__pci_register_driver+0x90/0x120 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117BE0] [D0000000002EA050] .mthca_init+0x100/0x170 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117C70] [C0000000000848FC] .sys_init_module+0x20c/0x1990 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117E30] [C00000000000862C] syscall_exit+0x0/0x40 Apr 12 14:11:19 elm3b37 kernel: Instruction dump: Apr 12 14:11:19 elm3b37 kernel: 7d290214 7d495a14 409d0038 393fffff 39600000 79290020 39290001 7d2903a6 Apr 12 14:11:19 elm3b37 kernel: 60000000 60000000 7c1c582a 60000001 <7c0a592a> 396b0008 4200fff0 7bfb1f24 

2. The above may or may not be a bug and, as indicated in the message, I wanted to upgrade the FW. However, I found that the latest firmware is 3.5.0 and not 3.4.0 as the message seems to indicate. I wanted to use IPoIB CM - so which one should I upgrade to, presumably 3.5.0?

3. From the following URL http://www.mellanox.com/support/firmware_table_IH.php it is not clear to me which firmware I should download. lspci -v shows me:

0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
Subsystem: Mellanox Technologies MT23108 InfiniHost

So, I was planning on using fw-23108-3_5_000-MHET2X-1SC_A1.bin.zip - is that correct?

4. When I downloaded mft-1.0.1.tar I found that ppc64 is not supported.

5. I moved my HCA to x86_64 and then tried to install the mft utilities. There was a previous version of the tool and I asked to uninstall it. After that I see the following:

/home/tools/mft-1.0.1 # ./install.sh
*** Mellanox Firmware Tools (MFT) Package Installation ***
MFT Build 20060118-1817
Copyright (C) June 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED.
Use of software subject to the terms and conditions detailed in the file "LICENSE.txt".
Found a previous installation of the MFT package.
Current installed MFT Build ID is 20060118-1817
This installation MFT Build ID is 20060118-1817
Remove currently installed components (run /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y
Running /usr/mellanox/mft/uninstall.sh ...
Uninstall completed successfully.
This installation installs the MFT components into /usr
Installing MST package under /usr/mst ...
MFT Depends on pre-installed MST. Fail to find /usr/mst/lib/libmtcr.a

I could not find libmtcr.a anywhere.

I need help with the issues listed above. Thanks!
Pradeep
pradeep at us.ibm.com

From sashak at voltaire.com Fri Apr 13 08:56:56 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 13 Apr 2007 18:56:56 +0300
Subject: [ofa-general] Re: Default multicast group rate
In-Reply-To: <20070413135016.GE27940@mellanox.co.il>
References: <1176464640.15573.58111.camel@hal.voltaire.com> <20070413135016.GE27940@mellanox.co.il>
Message-ID: <20070413155656.GD15099@sashak.voltaire.com>

On 16:50 Fri 13 Apr , Michael S. Tsirkin wrote:
> > So the question is whether the best default is 2.5 Gbps which would
> > allow any nodes to join or whether the current default is appropriate ?
> > I know certain people's opinions who have been vocal on this list up to
> > now. I'm looking for other opinions. Thanks.
>
> Just as a summary of what I was saying in another thread,
> I propose implementing a dynamic approach, where we start at
> 4x DDR, and drop the rate gradually as lower rate nodes join,
> or raise the rate when all lower rate nodes have left.
>
> reregister can be used to let existing members know that
> rate has changed.

Reregistration is not an easy process and it does not scale very well - in order to join only one slow port you will need to reregister a whole cluster. Also I remember some issues with such massive re-registrations, like dropped and reordered MADs.

> Later, a spec extension can be designed
> to notify group members about rate change.

I think we should start to discuss the dynamic approach at this point - I don't like the full reregistration option.

Sasha

From sweitzen at cisco.com Fri Apr 13 08:56:55 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 13 Apr 2007 08:56:55 -0700
Subject: [ofa-general] RE: [ewg] questions about OFED 1.2 IPoIB bonding
In-Reply-To: <15ddcffd0704122108x43845e73m84e74af3db1acbb8@mail.gmail.com>
References: <461B4488.8070705@gmail.com> <461E1DDE.40804@voltaire.com> <15ddcffd0704122108x43845e73m84e74af3db1acbb8@mail.gmail.com>
Message-ID:

Yes

________________________________
From: Or Gerlitz [mailto:or.gerlitz at gmail.com]
Sent: Thursday, April 12, 2007 9:08 PM
To: Scott Weitzenkamp (sweitzen)
Cc: Or Gerlitz; Moni Shoua; Moni Levy; Pnina Bruskin; ewg at lists.openfabrics.org; openib
Subject: Re: [ewg] questions about OFED 1.2 IPoIB bonding

On 4/13/07, Scott Weitzenkamp (sweitzen) wrote:
> I was using default netperf params, throughput is stable now that I use
> -- -s 349520 -S 349520 -m 65536 to force socket buffer and message sizes

By default netperf params you mean TCP_STREAM test without any test specific params?

Or.

From weiny2 at llnl.gov Fri Apr 13 09:08:39 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 13 Apr 2007 09:08:39 -0700
Subject: [ofa-general] Re: multicast join failed for...
In-Reply-To: <1176464223.15573.57684.camel@hal.voltaire.com> References: <1176299976.4545.11884.camel@hal.voltaire.com> <20070411181047.GW24730@mellanox.co.il> <1176315593.4545.28318.camel@hal.voltaire.com> <20070411194712.GY24730@mellanox.co.il> <1176327884.4545.41174.camel@hal.voltaire.com> <20070411183050.2cea149f.weiny2@llnl.gov> <20070412042155.GF24730@mellanox.co.il> <20070412084623.74b035d9.weiny2@llnl.gov> <20070412171632.GU24730@mellanox.co.il> <20070412145414.70105296.weiny2@llnl.gov> <20070413041738.GD24730@mellanox.co.il> <1176464223.15573.57684.camel@hal.voltaire.com> Message-ID: <20070413090839.4dfb874b.weiny2@llnl.gov> On 13 Apr 2007 07:37:04 -0400 Hal Rosenstock wrote: > On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote: > > > Quoting Ira Weiny : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > > > On Thu, 12 Apr 2007 20:16:32 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > > The job will continue running though, and when you diagnose the problem > > > > and disconnect the bad node, rate will be back to high. > > > > So what's the problem? > > > > > > Performance impact between the time it happens and diagnosing the problem. > > > Yes, disabling the node is a better solution, however, the current behavior is > > > not bad for us. > > > > Hal, here we have a use case that I think shows that the right thing > > is by default to make joins succeed. Convinced? > > Didn't Ira say that "the current behavior is not bad for us" ? The > current behavior is default 4x SDR rate which makes slower joins fail. > > Are you saying change the default rate to 1x SDR ? I've been concerned > about masking performance issues when doing this as we've discussed > several times before. > Indeed I said "NOT" bad. We do NOT want the performance to come down. If this happens silently on a Friday night the cluster could run all weekend at a reduced rate. I am thinking that a check on the node's link is a good idea. 
It would also be able to better diagnose the problem.

Thanks,
Ira

From robert.j.woodruff at intel.com Fri Apr 13 09:15:55 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 13 Apr 2007 09:15:55 -0700
Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plse check all theattached information
In-Reply-To: <004801c77ddf$971dcc80$6401a8c0@BBoas>
Message-ID:

Hi Bill,

I see that you have me listed for speaking about iWarp. I am clearly the wrong person to be speaking on this subject and suggest you solicit someone from the iWarp community.

iWARP implementation in OFED Bob Woodruff, Intel

Also, could you please add Jianxin Xiong as a speaker for "SoftIB a software based InfiniBand implementation in a Xen environment", I did not see him on the agenda and we had gotten previous agreement that he could speak at the conference. You can co-ordinate the details with Jianxin.

Thanks in advance,

woody

From robert.j.woodruff at intel.com Fri Apr 13 09:24:50 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 13 Apr 2007 09:24:50 -0700
Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information
In-Reply-To:
Message-ID:

Never mind, I see Jianxin on the agenda now. Sorry for the confusion.

woody

From robert.j.woodruff at intel.com Fri Apr 13 09:28:07 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 13 Apr 2007 09:28:07 -0700
Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information
In-Reply-To:
Message-ID:

Although you may want to consider moving Jianxin's talk to the timeslot on Tuesday that you had reserved for me for iWarp, as there are also a lot of other Xen talks on Tues.

Just a suggestion

woody

From hycsw at sandia.gov Fri Apr 13 10:03:14 2007
From: hycsw at sandia.gov (Chen, Helen Y)
Date: Fri, 13 Apr 2007 11:03:14 -0600
Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information
References:
Message-ID: <754FC8FE0A97A94B906344259F447D4A048883E2@ES23SNLNT.srn.sandia.gov>

Bill,

Would Tom Tucker be the right person to speak on iWARP?

Helen ________________________________ From: promoters-bounces at lists.openfabrics.org on behalf of Woodruff, Robert J Sent: Fri 4/13/2007 9:28 AM To: Woodruff, Robert J; Bill Boas; openfabrics-mwg at openfabrics.org; promoters at lists.openfabrics.org; general at lists.openfabrics.org; ewg at lists.openfabrics.org; iwg at lists.openfabrics.org Cc: Xiong, Jianxin Subject: RE: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information Although you may want to consider moving Jianxin's talk to the timeslot on Tuesday that you had reserved for me for iWarp, as there are also a lot of other Xen talks on Tues. Just a suggestion woody -----Original Message----- From: Woodruff, Robert J Sent: Friday, April 13, 2007 9:25 AM To: Woodruff, Robert J; Bill Boas; openfabrics-mwg at openfabrics.org; promoters at lists.openfabrics.org; general at lists.openfabrics.org; ewg at lists.openfabrics.org; iwg at lists.openfabrics.org Cc: Xiong, Jianxin Subject: RE: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information Never mind, I see Jianxin on the agenda now. Sorry for the confusion. woody -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Woodruff, Robert J Sent: Friday, April 13, 2007 9:16 AM To: Bill Boas; openfabrics-mwg at openfabrics.org; promoters at lists.openfabrics.org; general at lists.openfabrics.org; ewg at lists.openfabrics.org; iwg at lists.openfabrics.org Cc: Xiong, Jianxin Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop - plsecheck all theattached information Hi Bill, I see that you have me listed for speaking about iWarp. I am clearly the wrong person to be speaking on this subject and suggest you solicit someone from the iWarp community. 
iWARP implementation in OFED Bob Woodruff, Intel Also, could you please add Jianxin Xiong as a speaker for "SoftIB a software based InfiniBand implementation in a Xen environment", I did not see him on the agenda and we had gotten previous agreement that he could speak at the conference. You can co-ordinate the details with Jianxin. Thanks in advance, woody ________________________________ From: promoters-bounces at lists.openfabrics.org [mailto:promoters-bounces at lists.openfabrics.org] On Behalf Of Bill Boas Sent: Friday, April 13, 2007 8:23 AM To: openfabrics-mwg at openfabrics.org; promoters at lists.openfabrics.org; general at lists.openfabrics.org; ewg at lists.openfabrics.org; iwg at lists.openfabrics.org Subject: [promoters] Update on the Sonoma Workshop - plse check all theattached information Dear Member of the OpenFabrics Community, This email is sent to you 2 weeks before the workshop for 3 reasons: 1) to tell you who is attending and what the agenda is at the moment, and to ask you to check your personal status with regard to your registration, hotel reservation and speaking commitment; 2) to ask you to check with your colleagues, your friends and your business to encourage those you know will benefit from attending to do so; 3) our room block at the Lodge expires at 5PM PDT Tuesday April 17 and the hotel is filling up fast - URLs to make reservations are at the bottom of this email - this is the only way to get the special rates. Here are a few words about this workshop.
It's not a "marketing" event - it's a working session where developers, Linux distros, chip, switch and storage companies, Tier 1 OEMs, integrators, software companies and end-user customers hold a series of working sessions to get up to date on the status of the latest OFED release, what everyone's experience with OpenFabrics, InfiniBand and low-latency Ethernet is, what still needs to be done for the OpenFabrics stack to be a core component of Linux and to be the software of choice for the switch companies and the OEMs/Integrators, and how end-user customers can be sure of every level of support and problem resolution they need. Attached you will find the currently registered attendees, the current list of people who have reserved rooms at the Lodge and the current agenda with the sessions and speakers identified. There are lots of anomalies between these lists and here is the analysis of those anomalies. Please check to see if you are one of them and then fix the mismatch between registered for workshop, have a hotel room, speaking but have neither registered nor reserved a room, etc. Please forgive me if you see your name below but we are trying to ensure everyone who is attending, speakers included, has registered; that everyone has a hotel reservation and that all the speakers we have planned for are planning to be with us. People who have registered at http://www.acteva.com/booking.cfm?bevaid=125720 but do not have a room reserved at the Lodge (if you are staying elsewhere that's OK): Raj Channa, Alexander Elbs, Holger Obermaier, Krish Ramakrishnan, Aviv Cohen, Moni Levy, Or Gerlitz, Yiftah Shachar, Yaron Segev, Asaf Somekh, Bert Tanaka, Dan Tuchler, Tony Vaidya, Dave Ford. People who have room reservations at the Lodge but have not registered on Acteva yet - please register:
Sharon Brunett, Greg Fausette, Doug Fuller, Derek Granath, Paul Grun, Ramachandra Kuchimanchi, Ed Mascarenhas, Dan Maltbie, Glen Newell, Fabian Tillier, Charles Wilder. Speakers on the Agenda who have not registered on Acteva and do have a hotel room at the Lodge: Moiz Kohari, Vinod Tipparaju, Galen Shipman, Trent D'Hooge, Sonia Pignorel, Moshe Bar, Bob Woodruff, Kevin Canady, Mike Norby, Lloyd Dickman, Jon Haas. Forgive me for also putting this out on the developers' email distribution lists but..... Bill. Bill Boas VP, Business Development, System Fabric Works Tel: 510.375.8840 Email: bboas at systemfabricworks.com _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ promoters mailing list promoters at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters From Arkady.Kanevsky at netapp.com Fri Apr 13 10:24:10 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 13 Apr 2007 13:24:10 -0400 Subject: [ofa-general] RE: [promoters] Update on the Sonoma Workshop -plsecheck all theattached information In-Reply-To: <754FC8FE0A97A94B906344259F447D4A048883E2@ES23SNLNT.srn.sandia.gov> References: <754FC8FE0A97A94B906344259F447D4A048883E2@ES23SNLNT.srn.sandia.gov> Message-ID: OGC are not coming. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 From sashak at voltaire.com Fri Apr 13 10:35:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 13 Apr 2007 20:35:07 +0300 Subject: [ofa-general] Re: multicast join failed for...
In-Reply-To: <20070413135712.GF27940@mellanox.co.il> References: <20070413044129.GH24730@mellanox.co.il> <20070413061712.GI24730@mellanox.co.il> <1176464134.15573.57592.camel@hal.voltaire.com> <20070413133840.GC27940@mellanox.co.il> <1176471936.15573.65992.camel@hal.voltaire.com> <20070413135712.GF27940@mellanox.co.il> Message-ID: <20070413173507.GK15099@sashak.voltaire.com> On 16:57 Fri 13 Apr , Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: Re: [ofa-general] Re: multicast join failed for... > > > > On Fri, 2007-04-13 at 09:38, Michael S. Tsirkin wrote: > > > > > If the group is created at a lower rate, there would be no problem. > > > > > But the default configuration should be "plug and play". > > > > > > > > So you are arguing for 1x SDR as the default. We've discussed and > > > > disagreed on this before as I think it masks performance issues and > > > > those are harder to find. I could be wrong about this. > > > > > > No, I'm arguing for dynamic configuration as the default. > > > so we start at 4x DDR and bring the rate down as slower nodes join. > > > > OK that answers a different question I was wondering about. Or speed it > > up if all nodes are say 4x DDR. > > > > What I was trying to say was that since we don't have dynamic rate > > support now (and I'm not signing up to do this, is someone ?), > > I don't know too much about opensm yet, but I can try looking into it, > or try talking someone into this :) > > But I'm happy we all agree it's a good idea. As I stated in another thread, I don't like the idea of full client reregistration. Isn't it simpler to just specify the desired rate and MTU for IPoIB multicast groups in the config and/or as a command line option? This should be simple: if one wants to use slow ports, this will work as needed from the beginning; if not, those ports will not be able to join. > Let's add this to osm/doc/todo?
> > > I was > > saying that a static rate default of 1x SDR would eliminate the join > > errors (at the debug "expense" of what I think are harder to find > > performance issues). Sorry I didn't make that clear before. > > I think if we either > 1. Add an option to disable 1x support at endnode > or > 2. Implement a tool to find and report 1x links It is a good thing to do, but it is not directly related to the problem. OpenSM should behave reasonably regardless of how and when external tools are used. > or > 3. By default, report 1x links in opensm log as errors This also could be useful, but again I don't think it is related directly to the problem of mixed subnets. Sasha From rick.jones2 at hp.com Fri Apr 13 10:31:33 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 13 Apr 2007 10:31:33 -0700 Subject: [ofa-general] Re: desired netperf mods? In-Reply-To: <20070413042337.GG24730@mellanox.co.il> References: <461E9D67.3050302@hp.com> <461EA31F.60009@hp.com> <20070413042337.GG24730@mellanox.co.il> Message-ID: <461FBE75.4040502@hp.com> >>The advantage then (again IMO) of an explicit SDP test is that the test >>banners will be right. > > True. Further, libsdp might be misconfigured on the system. > > A simple way would be to add a flag to netstat that forces > a specific address family, and show it in the banner. netstat? Well, contemporary netperf2 has flags to select between AF_INET and AF_INET6, but the nettest_bsd.c stuff does presently presume that the selected family filters through a getaddrinfo() call successfully. I'm ass-u-me-ing that isn't the case for AF_SDP (is it AF_SDP?) I think I may just clone some stuff from nettest_bsd.c into a nettest_ib.c anyway - it might be a better place to put other IB related tests.
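A minimal sketch of the explicit address-family idea discussed above, under stated assumptions: the value 27 for AF_INET_SDP is what OFED's libsdp is believed to use and should be checked against the installed headers, and family_banner(), lookup_family() and open_test_socket() are hypothetical helpers, not netperf code. Since getaddrinfo() may not understand an SDP family, the lookup is done as plain AF_INET and only the socket() call uses the requested family, so the banner can report what was actually asked for.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumption: libsdp's SDP address family number; verify locally. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Name to print in the test banner for a given socket family. */
const char *family_banner(int af)
{
	switch (af) {
	case AF_INET:     return "AF_INET";
	case AF_INET6:    return "AF_INET6";
	case AF_INET_SDP: return "AF_INET_SDP";
	default:          return "unknown";
	}
}

/* getaddrinfo() is not expected to know SDP, so resolve as plain IPv4. */
int lookup_family(int af)
{
	return af == AF_INET_SDP ? AF_INET : af;
}

/*
 * Resolve host/port, then create a stream socket in the requested family.
 * Returns the socket, or -1 on failure (for example, no SDP support).
 * On success the caller owns *res and must release it with freeaddrinfo().
 */
int open_test_socket(const char *host, const char *port, int af,
		     struct addrinfo **res)
{
	struct addrinfo hints;
	int fd;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = lookup_family(af);
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo(host, port, &hints, res) != 0)
		return -1;

	fd = socket(af, SOCK_STREAM, 0);
	if (fd < 0) {
		freeaddrinfo(*res);
		*res = NULL;
	}
	return fd;
}
```

A test would print family_banner(af) in its banner and treat a -1 return as "family not available" instead of silently falling back to TCP, which keeps the banner honest even when libsdp is misconfigured.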
rick jones From mkohari at novell.com Fri Apr 13 11:26:44 2007 From: mkohari at novell.com (Moiz Kohari) Date: Fri, 13 Apr 2007 12:26:44 -0600 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <000c01c77d20$d2bd5f40$c801a8c0@ettac> References: <000c01c77d20$d2bd5f40$c801a8c0@ettac> Message-ID: <461F7708.F35E.006C.0@novell.com> Hi, Discovery of new storage should not take multiple minutes; at least we haven't seen this type of behavior. How exactly are you adding the storage (using the ibsrpdm command)? Any idea where the delay is occurring: discovery of SRP targets or adding targets to the system? Thanks, Moiz >>> On 4/12/2007 at 10:37 AM, in message <000c01c77d20$d2bd5f40$c801a8c0 at ettac>, "Chieng Etta" wrote: > I tried adding/removing new storage on sles10. It took a few minutes to find > the new target devices (the new target message was shown in > /var/log/messages) then took a few minutes to add the path. I did not run > multipath again. The srp_daemon.sh scanned the new target and added the path > automatically. > > Thanks, > Etta > > -----Original Message----- > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Sent: Wednesday, April 11, 2007 4:59 PM > To: Ishai Rabinovitz; Chieng Etta > Cc: Roland Dreier (rdreier); ewg at lists.openfabrics.org; openib; > mkohari at novell.com > Subject: RE: [ewg] Re: SRP HA dm_multipath testing and questions > > I haven't tried adding or removing storage, just failover. I guess > leave 91-srp.rules in for now, it seems benign.
> > Scott > >> -----Original Message----- >> From: Ishai Rabinovitz [mailto:ishai at dev.mellanox.co.il] >> Sent: Tuesday, April 10, 2007 9:46 PM >> To: Chieng Etta >> Cc: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier); >> ewg at lists.openfabrics.org; 'openib'; mkohari at novell.com >> Subject: Re: [ewg] Re: SRP HA dm_multipath testing and questions >> >> Chieng Etta wrote: >> > >> > Scott Weitzenkamp (sweitzen) wrote: >> >> I've been testing SRP HA and dm_multipath with: >> >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID >> >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID >> >> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs >> >> >> >> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig >> multipathd on", >> >> then rebooted. On SLES 10, I ran "chkconfig >> boot.multipath on" and >> >> "chkconfig multipathd on", then rebooted. Ishai, I don't >> seem to need >> >> 91-srp.rules, are you using the boot.multipath and >> multipathd scripts? >> > >> > On RHEL4 you really do not need 91-srp.rules and it is not used (see >> > /etc/init.d/openibd) >> > On SLES10 I was sure that you need it. I checked it, and >> you are correct. I >> > don't see how it does it, but it seems that when using >> boot.multipath there >> > is no need for 91-srp.rules. I will check it more deeply and change >> > documentation and openibd script accordingly. >> > >> > [EC] I just verified it on SLES10 x86_64. The multipath >> worked fine by >> > using boot.multipath without 91-srp.rules. >> > >> In one of Novell's documents (SLES 10 Storage Administration >> Guide for EVMS - In section 5 Managing Multipath I/O for >> Devices >> http://www.novell.com/documentation/sles10/index.html?page=/do > cumentation/sles10/stor_evms/data/multipathing.html) it says in > subsection 5.7 that after a new target > was discovered there is a need > to actively execute multipath. 
>> (As I understand it from the document this is true even after >> boot.multipath is running) >> >> Experiments in my environment also indicates that after >> executing boot.multipath, SRP HA is working also without >> 91-srp.rules, but after reading this document I'm even more confused. >> >> >> >> > Ishai, in the SRP release notes - section 6, srp_daemon a., >> the first line >> > should be changed to '"srp_daemon -a -o" is equivalent to >> "ibsrpdm"'. >> > >> > >> Thanks, However Scott already noticed that and I already >> fixed it. You will see it in the next documentation version. >> From jlentini at netapp.com Fri Apr 13 11:37:52 2007 From: jlentini at netapp.com (James Lentini) Date: Fri, 13 Apr 2007 14:37:52 -0400 (EDT) Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <20070412195653.GA20252@sgi.com> References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> Message-ID: On Thu, 12 Apr 2007, Karl Feind wrote: > > Hello James, > > We are trying to find a way for the OpenIB-cma uDAPL layer to coexist > with SGI's xpmem uDAPL on a single system. > > Obviously, the installation scriptlets for xpmem uDAPL need to > add lines into /etc/dat.conf when xpmem UDAPL is installed. > Since a static version of /etc/dat.conf is simply installed > when the OpenIB-cma uDAPL layer is installed (via the "dapl" > RPM), we are left in the awkward position that later upgrades > of OpenIB-cma uDAPL will overwrite the /etc/dat.conf file, > removing the other registered uDAPL layers. > > Clearly, we need to agree on a conventional way that a uDAPL > layer can register itself in /etc/dat.conf when it gets installed and > unregister itself when it gets uninstalled. Furthermore, upgrading > one uDAPL should not have adverse effects on other uDAPLs. I don't > see how this can be done with the current RPM structure. > > Thanks for any guidance. 
> > Karl Feind > SGI MPI and DAPL Engineering Team [adding Arlin Davis to the cc list] Karl, I agree with you entirely. Clearly the current rpm install/uninstall behavior is incorrect. Do you have patches to address the problem? -james From kaf at sgi.com Fri Apr 13 13:04:15 2007 From: kaf at sgi.com (Karl Feind) Date: Fri, 13 Apr 2007 15:04:15 -0500 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> Message-ID: <20070413200415.GA15243@sgi.com> On Fri, Apr 13, 2007 at 02:37:52PM -0400, James Lentini wrote: > > > On Thu, 12 Apr 2007, Karl Feind wrote: > > > > > Hello James, > > > > We are trying to find a way for the OpenIB-cma uDAPL layer to coexist > > with SGI's xpmem uDAPL on a single system. > > > > Obviously, the installation scriptlets for xpmem uDAPL need to > > add lines into /etc/dat.conf when xpmem UDAPL is installed. > > Since a static version of /etc/dat.conf is simply installed > > when the OpenIB-cma uDAPL layer is installed (via the "dapl" > > RPM), we are left in the awkward position that later upgrades > > of OpenIB-cma uDAPL will overwrite the /etc/dat.conf file, > > removing the other registered uDAPL layers. > > > > Clearly, we need to agree on a conventional way that a uDAPL > > layer can register itself in /etc/dat.conf when it gets installed and > > unregister itself when it gets uninstalled. Furthermore, upgrading > > one uDAPL should not have adverse effects on other uDAPLs. I don't > > see how this can be done with the current RPM structure. > > > > Thanks for any guidance. > > > > Karl Feind > > SGI MPI and DAPL Engineering Team > > [adding Arlin Davis to the cc list] > > Karl, > > I agree with you entirely. Clearly the current rpm install/uninstall > behavior is incorrect. Do you have patches to address the problem? 
> > -james James, We do not currently have an OFED patch to address this issue. We'd certainly be interested to hear from anyone who may be working on a solution. Thanks, Karl Feind From rdreier at cisco.com Fri Apr 13 14:01:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 13 Apr 2007 14:01:41 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: (Pradeep Satyanarayana's message of "Thu, 12 Apr 2007 13:58:36 -0700") References: Message-ID: > 1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe > ib_mthca (on ppc64) Can you give more details of your platform? Also, the output of "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs are). > 2. The above may or may not be a bug and as indicated in the message I > wanted to upgrade (the FW). However, I found that the > latest firmware is 3.5.0 and not 3.4.0 as the message seems to indicate. I > wanted to use IPOIB CM -so which one should I upgrade to - > presumably 3.5.0? The FW versions in the driver have not been kept up to date. I'll queue a patch for 2.6.22 that updates the latest FW information to what is current right now. - R. From rdreier at cisco.com Fri Apr 13 14:15:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 13 Apr 2007 14:15:12 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: (Roland Dreier's message of "Fri, 13 Apr 2007 14:01:41 -0700") References: Message-ID: OK, I queued this for 2.6.22 (FW versions 3.5.000, 4.8.200, 5.2.000 and 1.2.000 are the latest I see on Mellanox's web site). Does this look OK to the Mellanox people? IB/mthca: Update HCA firmware revisions Update the driver's list of current firmware versions with Mellanox's latest releases. 
Signed-off-by: Roland Dreier

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 0d9b7d0..773145e 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1013,14 +1013,14 @@ static struct {
 	u64 latest_fw;
 	u32 flags;
 } mthca_hca_table[] = {
-	[TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 4, 0),
+	[TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 5, 0),
 		    .flags = 0 },
-	[ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 600),
+	[ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 8, 200),
 		    .flags = MTHCA_FLAG_PCIE },
-	[ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 400),
+	[ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 2, 0),
 		    .flags = MTHCA_FLAG_MEMFREE |
 			     MTHCA_FLAG_PCIE },
-	[SINAI] = { .latest_fw = MTHCA_FW_VER(1, 1, 0),
+	[SINAI] = { .latest_fw = MTHCA_FW_VER(1, 2, 0),
 		    .flags = MTHCA_FLAG_MEMFREE |
 			     MTHCA_FLAG_PCIE |
 			     MTHCA_FLAG_SINAI_OPT }
@@ -1135,7 +1135,7 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 		goto err_cmd;
 	if (mdev->fw_ver < mthca_hca_table[hca_type].latest_fw) {
-		mthca_warn(mdev, "HCA FW version %d.%d.%d is old (%d.%d.%d is current).\n",
+		mthca_warn(mdev, "HCA FW version %d.%d.%3d is old (%d.%d.%3d is current).\n",
 			   (int) (mdev->fw_ver >> 32),
 			   (int) (mdev->fw_ver >> 16) & 0xffff,
 			   (int) (mdev->fw_ver & 0xffff),
 			   (int) (mthca_hca_table[hca_type].latest_fw >> 32),

From pradeep at us.ibm.com Fri Apr 13 14:33:07 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Apr 2007 14:33:07 -0700
Subject: [ofa-general] mthca issues -need help
In-Reply-To:
Message-ID:

Please see details below.

Pradeep
pradeep at us.ibm.com

Roland Dreier wrote on 04/13/2007 02:01:41 PM:

> > 1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe
> > ib_mthca (on ppc64)
>
> Can you give more details of your platform? Also, the output of
> "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs are).
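A small sketch of the version check that the firmware-revision patch earlier in this thread touches, assuming the driver's MTHCA_FW_VER() macro packs major, minor and subminor the same way the warning unpacks them (shifts of 32 and 16, 16-bit fields). FW_VER(), fw_is_old() and fw_ver_format() are illustrative names, not driver code.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed packing: major in bits 32 and up, minor in 16..31, subminor in 0..15. */
#define FW_VER(major, minor, subminor) \
	(((uint64_t)(major) << 32) | ((uint64_t)(minor) << 16) | (uint64_t)(subminor))

/* With this packing a plain integer comparison orders versions correctly. */
int fw_is_old(uint64_t running, uint64_t latest)
{
	return running < latest;
}

/*
 * Format a packed version using the same shifts as the warning in the patch.
 * zero_pad selects "%03d" (zero padded) instead of "%3d" (space padded)
 * for the subminor field.
 */
int fw_ver_format(uint64_t ver, int zero_pad, char *buf, size_t len)
{
	return snprintf(buf, len, zero_pad ? "%d.%d.%03d" : "%d.%d.%3d",
			(int)(ver >> 32),
			(int)(ver >> 16) & 0xffff,
			(int)(ver & 0xffff));
}
```

For example, fw_is_old(FW_VER(4, 7, 600), FW_VER(4, 8, 200)) is true because the minor field outweighs the subminor field. Note that "%3d" pads with spaces, so 3.5.0 formats as "3.5.  0", while "%03d" gives "3.5.000", the form in which the versions are written in the message.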
> I am using a p575 (4 CPU) with 1.9GHz and has PCI-X slots, using about 8GB RAM currently. This OOps was seen using an MT23108. I am not sure what additional details you would like. Here is the output of lspci: # lspci -vv -d 15b3: 0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [70] PCI-X bridge device Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- Freq=133MHz Status: Dev=d8:01.0 64bit+ 133MHz+ SCD- USC- SCO- SRD- Upstream: Capacity=512 CommitmentLimit=512 Downstream: Capacity=128 CommitmentLimit=128 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technologies MT23108 InfiniHost Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- > 2. The above may or may not be a bug and as indicated in the message I > > wanted to upgrade (the FW). However, I found that the > > latest firmware is 3.5.0 and not 3.4.0 as the message seems to indicate. I > > wanted to use IPOIB CM -so which one should I upgrade to - > > presumably 3.5.0? > > The FW versions in the driver have not been kept up to date. I'll > queue a patch for 2.6.22 that updates the latest FW information to > what is current right now. > Yes, I saw your next mail with the patch for this. Thanks! > - R. From boris at mellanox.com Fri Apr 13 14:42:10 2007 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 13 Apr 2007 14:42:10 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: Message-ID: <1E3DCD1C63492545881FACB6063A57C1D523FE@mtiexch01.mti.com> Pradeep, I guess you need to upgrade FW first to the latest 3.5.0 revision - this might solve the issue and save everybody's time. 
You can download FW image from Mellanox web site. Please, verify your Board ID (PSID) first using mstflint tool: mstflint -d q You should obtain your PCI device from lspci, for instance: [root at lt ~]# lspci | grep InfiniBand 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Then pick the right FW image file based on the PSID. Burn FW using mstflint utility - the instructions are on Mellanox web site. 'mstflint -h' is also helpful. Regards, Boris Shpolyansky Sr. Member of Technical Staff Applications Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com From rdreier at cisco.com Fri Apr 13 14:43:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 13 Apr 2007 14:43:42 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 13 Apr 2007 14:33:07 -0700") References: Message-ID: I see...
> Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) [size=1M] > Region 2: Memory at 400c0000000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 you are running an HCA with the 3rd BAR hidden. Can you try the patch below and see if things work better? diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..818c27e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -926,7 +926,9 @@ int mthca_init_mr_table(struct mthca_dev *dev) dev->mr_table.fmr_mtt_buddy = &dev->mr_table.tavor_fmr.mtt_buddy; - } else + } else if (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN) + dev->mr_table.fmr_mtt_buddy = NULL; + else dev->mr_table.fmr_mtt_buddy = &dev->mr_table.mtt_buddy; /* FMR table is always the first, take reserved MTTs out of there */ From pradeep at us.ibm.com Fri Apr 13 15:50:37 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 13 Apr 2007 15:50:37 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1D523FE@mtiexch01.mti.com> Message-ID: Boris, I downloaded mft tar file and attempted to install it when I saw the following errors: /home/tools/mft-1.0.1 # ./install.sh *** Mellanox Firmware Tools (MFT) Package Installation *** MFT Build 20060118-1817 Copyright (C) June 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED. Use of software subject to the terms and conditions detailed in the file "LICENSE.txt". Found a previous installation of the MFT package. Current installed MFT Build ID is 20060118-1817 This installation MFT Build ID is 20060118-1817 Remove currently installed components (run /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y Running /usr/mellanox/mft/uninstall.sh ... Uninstall completed successfully. This installation installs the MFT components into /usr Installing MST package under /usr/mst ... MFT Depends on pre-installed MST. 
Fail to find /usr/mst/lib/libmtcr.a Nowhere could I find the libmtcr.a? So, the missing library is stopping me from getting to the next step. Pradeep pradeep at us.ibm.com "Boris Shpolyansky" wrote on 04/13/2007 02:42:10 PM: > Pradeep, > > I guess you need to upgrade FW first to the latest 3.5.0 revision - > this might solve the issue and save everybody's time. > > You can download FW image from Mellanox web site. > > Please, verify your Board ID (PSID) first using mstflint tool: > > mstflint -d q > > You should obtain your PCI device from lspci, for instance: > > [root at lt ~]# lspci | grep InfiniBand > 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx > HCA] (rev 20) > > Then pick the right FW image file based on the PSID. > > Burn FW using mstflint utility - the instructions are on Mellanox web > site. > 'mstflint -h' is also helpful. > > Regards, > Boris Shpolyansky > Sr. Member of Technical Staff > Applications > Mellanox Technologies Inc. > 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Pradeep > Satyanarayana > Sent: Friday, April 13, 2007 2:33 PM > To: Roland Dreier > Cc: Michael S. Tsirkin; general at lists.openfabrics.org > Subject: Re: [ofa-general] mthca issues -need help > > Please see details below. > > Pradeep > pradeep at us.ibm.com > > Roland Dreier wrote on 04/13/2007 02:01:41 PM: > > > > 1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe > > > > ib_mthca (on ppc64) > > > > Can you give more details of your platform? Also, the output of > > "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs > are). > > > I am using a p575 (4 CPU) with 1.9GHz and has PCI-X slots, using about > 8GB RAM currently. > This OOps was seen using an MT23108. 
I am not sure what additional > details you would like. > Here is the output of lspci: > > > # lspci -vv -d 15b3: > 0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev > a1) (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > ParErr+ Stepping- SERR+ FastB2B- > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- > SERR- Latency: 144, Cache Line Size: 128 bytes > Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128 > Memory behind bridge: c0000000-c08fffff > Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium >TAbort- > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- > Capabilities: [70] PCI-X bridge device > Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- > Freq=133MHz > Status: Dev=d8:01.0 64bit+ 133MHz+ SCD- USC- SCO- SRD- > Upstream: Capacity=512 CommitmentLimit=512 > Downstream: Capacity=128 CommitmentLimit=128 > > 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev > a1) > Subsystem: Mellanox Technologies MT23108 InfiniHost > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR+ FastB2B- > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- > SERR- Latency: 144, Cache Line Size: 128 bytes > Interrupt: pin A routed to IRQ 121 > Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) > [size=1M] > Region 2: Memory at 400c0000000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > Vector table: BAR=0 offset=00082000 > PBA: BAR=0 offset=00082200 > Capabilities: [50] Vital Product Data > Capabilities: [60] Message Signalled Interrupts: 64bit+ > Queue=0/5 > Enable- > Address: 0000000000000000 Data: 0000 > Capabilities: [70] PCI-X non-bridge device > Command: DPERE- ERO- RBC=4096 OST=2 > Status: Dev=d9:00.0 64bit+ 133MHz+ SCD- USC- DC=simple > DMMRBC=4096 DMOST=2 DMCRS=8 RSCEM- 266MHz- 533MHz- > > > > > > 2. 
The above may or may not be a bug and as indicated in the > > message > I > > > wanted to upgrade (the FW). However, I found that the > latest > > firmware is 3.5.0 and not 3.4.0 as the message seems to > indicate. I > > > wanted to use IPOIB CM -so which one should I upgrade to - > > > presumably 3.5.0? > > > > The FW versions in the driver have not been kept up to date. I'll > > queue a patch for 2.6.22 that updates the latest FW information to > > what is current right now. > > > Yes, I saw your next mail with the patch for this. Thanks! > > > - R. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From boris at mellanox.com Fri Apr 13 15:56:14 2007 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 13 Apr 2007 15:56:14 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: Message-ID: <1E3DCD1C63492545881FACB6063A57C1D52402@mtiexch01.mti.com> Pradeep, If you have OFED installed you should have mstflint utility under /usr/local/ofed/bin You can use it and save the efforts of building MFT package on your machine. Boris -----Original Message----- From: Pradeep Satyanarayana [mailto:pradeep at us.ibm.com] Sent: Friday, April 13, 2007 3:51 PM To: Boris Shpolyansky Cc: general at lists.openfabrics.org; Michael S. Tsirkin; Roland Dreier Subject: RE: [ofa-general] mthca issues -need help Boris, I downloaded mft tar file and attempted to install it when I saw the following errors: /home/tools/mft-1.0.1 # ./install.sh *** Mellanox Firmware Tools (MFT) Package Installation *** MFT Build 20060118-1817 Copyright (C) June 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED. Use of software subject to the terms and conditions detailed in the file "LICENSE.txt". Found a previous installation of the MFT package. 
Current installed MFT Build ID is 20060118-1817 This installation MFT Build ID is 20060118-1817 Remove currently installed components (run /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y Running /usr/mellanox/mft/uninstall.sh ... Uninstall completed successfully. This installation installs the MFT components into /usr Installing MST package under /usr/mst ... MFT Depends on pre-installed MST. Fail to find /usr/mst/lib/libmtcr.a Nowhere could I find the libmtcr.a? So, the missing library is stopping me from getting to the next step. Pradeep pradeep at us.ibm.com "Boris Shpolyansky" wrote on 04/13/2007 02:42:10 PM: > Pradeep, > > I guess you need to upgrade FW first to the latest 3.5.0 revision - > this might solve the issue and save everybody's time. > > You can download FW image from Mellanox web site. > > Please, verify your Board ID (PSID) first using mstflint tool: > > mstflint -d q > > You should obtain your PCI device from lspci, for instance: > > [root at lt ~]# lspci | grep InfiniBand > 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx > HCA] (rev 20) > > Then pick the right FW image file based on the PSID. > > Burn FW using mstflint utility - the instructions are on Mellanox web > site. > 'mstflint -h' is also helpful. > > Regards, > Boris Shpolyansky > Sr. Member of Technical Staff > Applications > Mellanox Technologies Inc. > 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Pradeep > Satyanarayana > Sent: Friday, April 13, 2007 2:33 PM > To: Roland Dreier > Cc: Michael S. Tsirkin; general at lists.openfabrics.org > Subject: Re: [ofa-general] mthca issues -need help > > Please see details below. > > Pradeep > pradeep at us.ibm.com > > Roland Dreier wrote on 04/13/2007 02:01:41 PM: > > > > 1. 
I am using linux-2.6.21-rc5 and I see this Oops when I modprobe > > > > ib_mthca (on ppc64) > > > > Can you give more details of your platform? Also, the output of > > "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs > are). > > > I am using a p575 (4 CPU) with 1.9GHz and has PCI-X slots, using about > 8GB RAM currently. > This OOps was seen using an MT23108. I am not sure what additional > details you would like. > Here is the output of lspci: > > > # lspci -vv -d 15b3: > 0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev > a1) (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > ParErr+ Stepping- SERR+ FastB2B- > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- > SERR- Latency: 144, Cache Line Size: 128 bytes > Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128 > Memory behind bridge: c0000000-c08fffff > Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium >TAbort- > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- > Capabilities: [70] PCI-X bridge device > Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- > Freq=133MHz > Status: Dev=d8:01.0 64bit+ 133MHz+ SCD- USC- SCO- SRD- > Upstream: Capacity=512 CommitmentLimit=512 > Downstream: Capacity=128 CommitmentLimit=128 > > 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev > a1) > Subsystem: Mellanox Technologies MT23108 InfiniHost > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR+ FastB2B- > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- > SERR- Latency: 144, Cache Line Size: 128 bytes > Interrupt: pin A routed to IRQ 121 > Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) > [size=1M] > Region 2: Memory at 400c0000000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > Vector table: BAR=0 offset=00082000 > PBA: BAR=0 offset=00082200 > Capabilities: [50] Vital Product Data > 
Capabilities: [60] Message Signalled Interrupts: 64bit+ > Queue=0/5 > Enable- > Address: 0000000000000000 Data: 0000 > Capabilities: [70] PCI-X non-bridge device > Command: DPERE- ERO- RBC=4096 OST=2 > Status: Dev=d9:00.0 64bit+ 133MHz+ SCD- USC- DC=simple > DMMRBC=4096 DMOST=2 DMCRS=8 RSCEM- 266MHz- 533MHz- > > > > > > 2. The above may or may not be a bug and as indicated in the > > message > I > > > wanted to upgrade (the FW). However, I found that the > latest > > firmware is 3.5.0 and not 3.4.0 as the message seems to > indicate. I > > > wanted to use IPOIB CM -so which one should I upgrade to - > > > presumably 3.5.0? > > > > The FW versions in the driver have not been kept up to date. I'll > > queue a patch for 2.6.22 that updates the latest FW information to > > what is current right now. > > > Yes, I saw your next mail with the patch for this. Thanks! > > > - R. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pradeep at us.ibm.com Fri Apr 13 16:01:40 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 13 Apr 2007 16:01:40 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: Message-ID: For some reason the patch did not apply. So, I hand patched it and I see a new Oops now. I will try and upgrade the firmware and see if these problems go away. Apr 13 18:53:37 elm3b37 kernel: ib_mthca: Initializing 0002:d9:00.0 Apr 13 18:53:38 elm3b37 kernel: ib_mthca 0002:d9:00.0: HCA FW version 3.3.3 is old (3.4.0 is current). Apr 13 18:53:38 elm3b37 kernel: ib_mthca 0002:d9:00.0: If you have problems, try updating your HCA FW. 
Apr 13 18:53:38 elm3b37 kernel: Unable to handle kernel paging request for data at address 0x0000000c Apr 13 18:53:38 elm3b37 kernel: Faulting instruction address: 0xc00000000040a7f0 Apr 13 18:53:38 elm3b37 kernel: Oops: Kernel access of bad area, sig: 11 [#2] Apr 13 18:53:38 elm3b37 kernel: SMP NR_CPUS=128 NUMA Apr 13 18:53:38 elm3b37 kernel: Modules linked in: ib_mthca ib_mad ib_core autofs4 ipv6 binfmt_misc parport_pc lp parport e1000 sg dm_snapshot dm_zero dm_mirror dm_mod ipr libata sd_mod scsi_mod firmware_class ehci_hcd ohci_hcd usbcore Apr 13 18:53:38 elm3b37 kernel: NIP: C00000000040A7F0 LR: D00000000025D544 CTR: C00000000040A7D0 Apr 13 18:53:38 elm3b37 kernel: REGS: c0000000df04b060 TRAP: 0300 Not tainted (2.6.21-rc5) Apr 13 18:53:38 elm3b37 kernel: MSR: 8000000000009032 CR: 44022444 XER: 20000008 Apr 13 18:53:38 elm3b37 kernel: DAR: 000000000000000C, DSISR: 0000000040000000 Apr 13 18:53:38 elm3b37 kernel: TASK = c00000000fe88040[3878] 'modprobe' THREAD: c0000000df048000 CPU: 0 Apr 13 18:53:38 elm3b37 kernel: GPR00: 0000000080000000 C0000000DF04B2E0 C000000000612268 000000000000000C Apr 13 18:53:38 elm3b37 kernel: GPR04: 0000000000000004 0000000000000000 C0000000DE746A90 0000000000000048 Apr 13 18:53:38 elm3b37 kernel: GPR08: 0000000000000001 0000000000000001 C0000000E1A9D880 C00000000040A7D0 Apr 13 18:53:38 elm3b37 kernel: GPR12: D00000000026D598 C000000000535A80 AAAAAAAAAAAAAAAB D0000000004CAC80 Apr 13 18:53:39 elm3b37 kernel: GPR16: 0000000000000000 0000000000000312 0000000000000312 0000000000000000 Apr 13 18:53:39 elm3b37 kernel: GPR20: 000000000000000C D0000000004C9DB2 00000000000000FF 0000000000000001 Apr 13 18:53:39 elm3b37 kernel: GPR24: 0000000000000000 0000000000000004 0000000000000000 0000000000000000 Apr 13 18:53:39 elm3b37 kernel: GPR28: 0000000000000004 C0000000DF361000 D00000000028A5B0 000000000000000C Apr 13 18:53:39 elm3b37 kernel: NIP [C00000000040A7F0] ._spin_lock+0x20/0x90 Apr 13 18:53:39 elm3b37 kernel: LR [D00000000025D544] 
.mthca_buddy_alloc+0x34/0x220 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: Call Trace: Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B2E0] [C00000000064E6EC] 0xc00000000064e6ec (unreliable) Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B360] [D00000000025D544] .mthca_buddy_alloc+0x34/0x220 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B410] [D00000000025D760] .mthca_alloc_mtt_range+0x30/0xe0 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B4B0] [D00000000025E5C4] .mthca_init_mr_table+0x134/0x490 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B560] [D000000000253288] .__mthca_init_one+0x958/0xd70 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B640] [D000000000253714] .mthca_init_one+0x74/0xf0 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B6E0] [C0000000002487D8] .pci_device_probe+0x168/0x200 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B7A0] [C0000000002C288C] .really_probe+0xbc/0x1f0 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B850] [C0000000002C2D3C] .__driver_attach+0xfc/0x140 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B8E0] [C0000000002C1668] .bus_for_each_dev+0x88/0xe0 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04B9A0] [C0000000002C2628] .driver_attach+0x28/0x40 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BA20] [C0000000002C1C34] .bus_add_driver+0xc4/0x220 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BAC0] [C0000000002C3118] .driver_register+0x78/0xe0 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BB40] [C000000000248B70] .__pci_register_driver+0x90/0x120 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BBE0] [D00000000026D070] .mthca_init+0x100/0x170 [ib_mthca] Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BC70] [C0000000000848FC] .sys_init_module+0x20c/0x1990 Apr 13 18:53:39 elm3b37 kernel: [C0000000DF04BE30] [C00000000000862C] syscall_exit+0x0/0x40 Apr 13 18:53:39 elm3b37 kernel: Instruction dump: Apr 13 18:53:39 elm3b37 kernel: 4bc2cb01 60000000 4bffffe4 60000000 7c0802a6 fbe1fff0 7c7f1b78 f8010010 Apr 13 18:53:39 
elm3b37 kernel: 38000000 f821ff81 980d01ca 800d0008 <7d20f828> 2c090000 40820010 7c00f92d Pradeep pradeep at us.ibm.com Roland Dreier 04/13/2007 02:43 PM To Pradeep Satyanarayana/Beaverton/IBM at IBMUS cc general at lists.openfabrics.org, "Michael S. Tsirkin" Subject Re: [ofa-general] mthca issues -need help I see... > Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) [size=1M] > Region 2: Memory at 400c0000000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 you are running an HCA with the 3rd BAR hidden. Can you try the patch below and see if things work better? diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..818c27e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -926,7 +926,9 @@ int mthca_init_mr_table(struct mthca_dev *dev) dev->mr_table.fmr_mtt_buddy = &dev->mr_table.tavor_fmr.mtt_buddy; - } else + } else if (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN) + dev->mr_table.fmr_mtt_buddy = NULL; + else dev->mr_table.fmr_mtt_buddy = &dev->mr_table.mtt_buddy; /* FMR table is always the first, take reserved MTTs out of there */ From pradeep at us.ibm.com Fri Apr 13 16:06:23 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 13 Apr 2007 16:06:23 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1D52402@mtiexch01.mti.com> Message-ID: I do not have OFED installed on the machine. I guess that will have to be the first step then. Will let you know if I run into anything else. Pradeep pradeep at us.ibm.com "Boris Shpolyansky" wrote on 04/13/2007 03:56:14 PM: > Pradeep, > > If you have OFED installed you should have mstflint utility under > /usr/local/ofed/bin > You can use it and save the efforts of building MFT package on your > machine. 
> > > Boris > > -----Original Message----- > From: Pradeep Satyanarayana [mailto:pradeep at us.ibm.com] > Sent: Friday, April 13, 2007 3:51 PM > To: Boris Shpolyansky > Cc: general at lists.openfabrics.org; Michael S. Tsirkin; Roland Dreier > Subject: RE: [ofa-general] mthca issues -need help > > Boris, > > I downloaded mft tar file and attempted to install it when I saw the > following errors: > > > /home/tools/mft-1.0.1 # ./install.sh > > *** Mellanox Firmware Tools (MFT) Package Installation *** > MFT Build 20060118-1817 > > Copyright (C) June 2002, Mellanox Technologies Ltd. > ALL RIGHTS RESERVED. Use of software subject to the > terms and conditions detailed in the file "LICENSE.txt". > > Found a previous installation of the MFT package. > Current installed MFT Build ID is 20060118-1817 > This installation MFT Build ID is 20060118-1817 > > Remove currently installed components (run > /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y > Running /usr/mellanox/mft/uninstall.sh ... > Uninstall completed successfully. > > > This installation installs the MFT components into /usr > Installing MST package under /usr/mst ... > MFT Depends on pre-installed MST. Fail to find /usr/mst/lib/libmtcr.a > > Nowhere could I find the libmtcr.a? > > So, the missing library is stopping me from getting to the next step. > > Pradeep > pradeep at us.ibm.com > > "Boris Shpolyansky" wrote on 04/13/2007 02:42:10 > PM: > > > Pradeep, > > > > I guess you need to upgrade FW first to the latest 3.5.0 revision - > > this might solve the issue and save everybody's time. > > > > You can download FW image from Mellanox web site. > > > > Please, verify your Board ID (PSID) first using mstflint tool: > > > > mstflint -d q > > > > You should obtain your PCI device from lspci, for instance: > > > > [root at lt ~]# lspci | grep InfiniBand > > 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx > > HCA] (rev 20) > > > > Then pick the right FW image file based on the PSID. 
> > > > Burn FW using mstflint utility - the instructions are on Mellanox web > > site. > > 'mstflint -h' is also helpful. > > > > Regards, > > Boris Shpolyansky > > Sr. Member of Technical Staff > > Applications > > Mellanox Technologies Inc. > > 2900 Stender Way > > Santa Clara, CA 95054 > > Tel.: (408) 916 0014 > > Fax: (408) 970 3403 > > Cell: (408) 834 9365 > > www.mellanox.com > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Pradeep > > Satyanarayana > > Sent: Friday, April 13, 2007 2:33 PM > > To: Roland Dreier > > Cc: Michael S. Tsirkin; general at lists.openfabrics.org > > Subject: Re: [ofa-general] mthca issues -need help > > > > Please see details below. > > > > Pradeep > > pradeep at us.ibm.com > > > > Roland Dreier wrote on 04/13/2007 02:01:41 PM: > > > > > > 1. I am using linux-2.6.21-rc5 and I see this Oops when I > modprobe > > > > > > ib_mthca (on ppc64) > > > > > > Can you give more details of your platform? Also, the output of > > > "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs > > are). > > > > > I am using a p575 (4 CPU) with 1.9GHz and has PCI-X slots, using about > > 8GB RAM currently. > > This OOps was seen using an MT23108. I am not sure what additional > > details you would like. 
> > Here is the output of lspci: > > > > > > # lspci -vv -d 15b3: > > 0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev > > a1) (prog-if 00 [Normal decode]) > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > ParErr+ Stepping- SERR+ FastB2B- > > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- > > SERR- > Latency: 144, Cache Line Size: 128 bytes > > Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128 > > Memory behind bridge: c0000000-c08fffff > > Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium > >TAbort- > > > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- > > Capabilities: [70] PCI-X bridge device > > Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- > > Freq=133MHz > > Status: Dev=d8:01.0 64bit+ 133MHz+ SCD- USC- SCO- SRD- > > Upstream: Capacity=512 CommitmentLimit=512 > > Downstream: Capacity=128 CommitmentLimit=128 > > > > 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev > > a1) > > Subsystem: Mellanox Technologies MT23108 InfiniHost > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > ParErr- Stepping- SERR+ FastB2B- > > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- > > SERR- > Latency: 144, Cache Line Size: 128 bytes > > Interrupt: pin A routed to IRQ 121 > > Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) > > [size=1M] > > Region 2: Memory at 400c0000000 (64-bit, prefetchable) > [size=8M] > > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > > Vector table: BAR=0 offset=00082000 > > PBA: BAR=0 offset=00082200 > > Capabilities: [50] Vital Product Data > > Capabilities: [60] Message Signalled Interrupts: 64bit+ > > Queue=0/5 > > Enable- > > Address: 0000000000000000 Data: 0000 > > Capabilities: [70] PCI-X non-bridge device > > Command: DPERE- ERO- RBC=4096 OST=2 > > Status: Dev=d9:00.0 64bit+ 133MHz+ SCD- USC- DC=simple > > DMMRBC=4096 DMOST=2 DMCRS=8 RSCEM- 266MHz- 533MHz- > > > > > > > > > > 2. 
The above may or may not be a bug and as indicated in the > > > message > > I > > > > wanted to upgrade (the FW). However, I found that the > latest > > > firmware is 3.5.0 and not 3.4.0 as the message seems to > > indicate. I > > > > wanted to use IPOIB CM -so which one should I upgrade to - > > > > presumably 3.5.0? > > > > > > The FW versions in the driver have not been kept up to date. I'll > > > queue a patch for 2.6.22 that updates the latest FW information to > > > what is current right now. > > > > > Yes, I saw your next mail with the patch for this. Thanks! > > > > > - R. > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Fri Apr 13 16:32:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 13 Apr 2007 16:32:24 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 13 Apr 2007 16:01:40 -0700") References: Message-ID: > For some reason the patch did not apply. So, I hand patched it and I see a > new Oops now. I will try and upgrade the firmware and see > if these problems go away. OK, I see why my patch didn't work... see below for a (I hope) better revised patch. Upgrading the firmware may work, simply by exposing the hidden 3rd BAR and avoiding the buggy codepath. But this oops is not really a firmware issue, it is simply a bug in the driver. - R. 
From rdreier at cisco.com Fri Apr 13 16:36:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 13 Apr 2007 16:36:08 -0700
Subject: [ofa-general] mthca issues -need help
In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 13 Apr 2007 16:01:40 -0700")
References:
Message-ID:

Err, new better patch for real:

diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c
index fdb576d..3aaf41b 100644
--- a/drivers/infiniband/hw/mthca/mthca_mr.c
+++ b/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -355,7 +355,8 @@ int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
 	int size = mthca_write_mtt_size(dev);
 	int chunk;
 
-	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy ||
+	    (!mthca_is_memfree(dev) && (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)))
 		return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len);
 
 	while (list_len > 0) {
From vlad at lists.openfabrics.org Sat Apr 14 02:36:40 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 14 Apr 2007 02:36:40 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070414-0200 daily build status
Message-ID: <20070414093640.D9965E60822@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:
quit Are face there, then, rcrack Now, really I beg of you, don't tense force go off your head. It's a telephone Did he reflect camera sung turn before he insulted my father? But I do. Yes, doctor. My lord cheese meline will remember that the lodge tour measure is at a dist iron On dreamt whip lift the first floor. Well? swift 'That this flow record should marry seldom have all due authority, river splendid geoponic Have trade you made inquiry?harmony Certainly there withheld are. Haide is sewn station a very uncommon na seek light force Oh, that is representative charming, said Albert, how I should down bed If well he spoke inside hastily, and owns that he did so, you take breath Ah, my count, soothe you thread are far too indulgent. Yes.Near sort the merchant's signature joke shaky sister there was, indeed,danger Is this the burst same gracefully teach lemonade of which you partook? Sketch blood card me the plan of that relation floor, division as you have don I believe so. decide The traitor who minute surrendered eye tore the castle of the man average The house might care be attention stripped without different his hearing t Pardon me, committee my friend, that punish order error man was your father! 'Who, then, has stung death counselled seat you increase to take this step, Hush, cat gold push said the count, stung do not joke in so loud a Is board puzzled helpless berry there any need of that! Does not his appearancAnd hearing inquisitive you think she moor give would be angry?And you are push far too smell vivaciously exacting. retire Supposing, for inst I end am colourful not end winter so sure of that. No, challenge carriage place thrust certainly not, said the count with a haughty bright head That blade is very revolting simple. Andrea took the pen. On th What did measure chin dare fierce it taste like? hissing let It had wed smoke a bitter taste. By whom? 
damaged 'Then,' remarked the steel play president, matter 'the Count of Mon wing My daughter friend, said he, here leaf is a forego proof of it. satisfy tore house Is there a suppose window in the dressing-room? The count had fall not uttered shave one whisper word the match whole of t She is narrow quality very amiable, then, is she engine crush not? said Albe -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: nyieeyesakuzi.gif Type: image/gif Size: 7815 bytes Desc: not available URL: From mst at dev.mellanox.co.il Sat Apr 14 10:34:35 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 14 Apr 2007 20:34:35 +0300 Subject: [ofa-general] Re: Fw: mthca issues -need help In-Reply-To: References: Message-ID: <20070414173435.GI27940@mellanox.co.il> As a start, how about upgrading to a recent FW? Quoting Pradeep Satyanarayana : Subject: Fw: mthca issues -need help Micheal, Will you be able to help me with some of the issues listed below? Pradeep pradeep at us.ibm.com ----- Forwarded by Pradeep Satyanarayana/Beaverton/IBM on 04/13/2007 08:33 AM ----- Pradeep Satyanarayana/Beaverton/IBM 04/12/2007 01:58 PM To general at lists.openfabrics.org cc "Michael S. Tsirkin" Subject mthca issues -need help I am running into a number of mthca issues listed below and need help with them. 1. I am using linux-2.6.21-rc5 and I see this Oops when I modprobe ib_mthca (on ppc64) Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: HCA FW version 3.3.3 is old (3.4.0 is current). Apr 12 14:11:19 elm3b37 kernel: ib_mthca 0002:d9:00.0: If you have problems, try updating your HCA FW. 
Apr 12 14:11:19 elm3b37 kernel: Faulting instruction address: 0xd0000000002db0d8 Apr 12 14:11:19 elm3b37 kernel: Oops: Kernel access of bad area, sig: 11 [#2] Apr 12 14:11:19 elm3b37 kernel: SMP NR_CPUS=128 NUMA Apr 12 14:11:19 elm3b37 kernel: Modules linked in: ib_mthca ib_mad ib_ehca ib_core autofs4 ipv6 binfmt_misc parport_pc lp parport sg e1000 dm_snapshot dm_zero dm_mirror dm_mod ipr libata sd_mod scsi_mod firmware_class ehci_hcd ohci_hcd usbcore Apr 12 14:11:19 elm3b37 kernel: NIP: D0000000002DB0D8 LR: D0000000002DAE0C CTR: 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: REGS: c0000000e2116f60 TRAP: 0300 Not tainted (2.6.21-rc5) Apr 12 14:11:19 elm3b37 kernel: MSR: 8000000000009032 CR: 24024444 XER: 00000008 Apr 12 14:11:19 elm3b37 kernel: DAR: 0000000000002000, DSISR: 0000000042000000 Apr 12 14:11:19 elm3b37 kernel: TASK = c0000000e7de4040[3884] 'modprobe' THREAD: c0000000e2114000 CPU: 0 Apr 12 14:11:19 elm3b37 kernel: GPR00: 0000000040010001 C0000000E21171E0 D000000000308B30 0000000007FFFFFF Apr 12 14:11:19 elm3b37 kernel: GPR04: C0000000E595FE00 0000000000000000 C0000000E2438000 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: GPR08: 0000000000000000 0000000000000400 0000000000002000 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR12: D0000000002EAD28 C000000000535A80 AAAAAAAAAAAAAAAB D0000000005A0C10 Apr 12 14:11:19 elm3b37 kernel: GPR16: 0000000000000000 0000000000000312 0000000000000312 000000000000003F Apr 12 14:11:19 elm3b37 kernel: GPR20: C0000000E595FE20 C0000000E4F04000 C0000000E595FE00 0000000000000000 Apr 12 14:11:19 elm3b37 kernel: GPR24: C0000000E4FAF000 0000000007FFFFFF 0000000000000000 0000000000002000 Apr 12 14:11:19 elm3b37 kernel: GPR28: C0000000E2438000 0000000000000400 D0000000003075B0 0000000000000400 Apr 12 14:11:19 elm3b37 kernel: NIP [D0000000002DB0D8] .mthca_write_mtt+0x328/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: LR [D0000000002DAE0C] .mthca_write_mtt+0x5c/0x460 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: Call Trace: 
Apr 12 14:11:19 elm3b37 kernel: [C0000000E21171E0] [C0000000E2117300] 0xc0000000e2117300 (unreliable) Apr 12 14:11:19 elm3b37 kernel: [C0000000E21172D0] [D0000000002DBD1C] .mthca_mr_alloc_phys+0x8c/0x140 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117390] [D0000000002D6B6C] .mthca_create_eq+0x3ac/0x5e0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117490] [D0000000002D7528] .mthca_init_eq_table+0x198/0x790 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117560] [D0000000002D0368] .__mthca_init_one+0xa38/0xd70 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117640] [D0000000002D0714] .mthca_init_one+0x74/0xf0 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E21176E0] [C0000000002487D8] .pci_device_probe+0x168/0x200 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21177A0] [C0000000002C288C] .really_probe+0xbc/0x1f0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117850] [C0000000002C2D3C] .__driver_attach+0xfc/0x140 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21178E0] [C0000000002C1668] .bus_for_each_dev+0x88/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E21179A0] [C0000000002C2628] .driver_attach+0x28/0x40 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117A20] [C0000000002C1C34] .bus_add_driver+0xc4/0x220 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117AC0] [C0000000002C3118] .driver_register+0x78/0xe0 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117B40] [C000000000248B70] .__pci_register_driver+0x90/0x120 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117BE0] [D0000000002EA050] .mthca_init+0x100/0x170 [ib_mthca] Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117C70] [C0000000000848FC] .sys_init_module+0x20c/0x1990 Apr 12 14:11:19 elm3b37 kernel: [C0000000E2117E30] [C00000000000862C] syscall_exit+0x0/0x40 Apr 12 14:11:19 elm3b37 kernel: Instruction dump: Apr 12 14:11:19 elm3b37 kernel: 7d290214 7d495a14 409d0038 393fffff 39600000 79290020 39290001 7d2903a6 Apr 12 14:11:19 elm3b37 kernel: 60000000 60000000 7c1c582a 60000001 <7c0a592a> 396b0008 4200fff0 7bfb1f24 
2. The above may or may not be a bug and, as indicated in the message, I wanted to upgrade the FW. However, I found that the latest firmware is 3.5.0 and not 3.4.0 as the message seems to indicate. I want to use IPoIB CM, so which one should I upgrade to, presumably 3.5.0? 3. From the following URL http://www.mellanox.com/support/firmware_table_IH.php it is not clear to me which firmware I should download. lspci -v shows me: 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technologies MT23108 InfiniHost So, I was planning on using fw-23108-3_5_000-MHET2X-1SC_A1.bin.zip. Is that correct? 4. When I downloaded mft-1.0.1.tar I found that ppc64 is not supported. 5. I moved my HCA to x86_64 and then tried to install the mft utilities. There was a previous version of the tool and I asked the installer to uninstall it. After that I see the following: /home/tools/mft-1.0.1 # ./install.sh *** Mellanox Firmware Tools (MFT) Package Installation *** MFT Build 20060118-1817 Copyright (C) June 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED. Use of software subject to the terms and conditions detailed in the file "LICENSE.txt". Found a previous installation of the MFT package. Current installed MFT Build ID is 20060118-1817 This installation MFT Build ID is 20060118-1817 Remove currently installed components (run /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y Running /usr/mellanox/mft/uninstall.sh ... Uninstall completed successfully. This installation installs the MFT components into /usr Installing MST package under /usr/mst ... MFT Depends on pre-installed MST. Fail to find /usr/mst/lib/libmtcr.a I could not find libmtcr.a anywhere. I need help with the issues listed above. Thanks! Pradeep pradeep at us.ibm.com -- MST From mst at dev.mellanox.co.il Sat Apr 14 10:36:26 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sat, 14 Apr 2007 20:36:26 +0300 Subject: [ofa-general] mthca issues -need help In-Reply-To: References: <1E3DCD1C63492545881FACB6063A57C1D52402@mtiexch01.mti.com> Message-ID: <20070414173626.GJ27940@mellanox.co.il> You can get mstflint from openfabrics git instead: git://git.openfabrics.org/~mst/mstflint.git Quoting Pradeep Satyanarayana : Subject: RE: [ofa-general] mthca issues -need help I do not have OFED installed on the machine. I guess that will have to be the first step then. Will let you know if I run into anything else. Pradeep pradeep at us.ibm.com "Boris Shpolyansky" wrote on 04/13/2007 03:56:14 PM: > Pradeep, > > If you have OFED installed you should have mstflint utility under > /usr/local/ofed/bin > You can use it and save the efforts of building MFT package on your > machine. > > > Boris > > -----Original Message----- > From: Pradeep Satyanarayana [mailto:pradeep at us.ibm.com] > Sent: Friday, April 13, 2007 3:51 PM > To: Boris Shpolyansky > Cc: general at lists.openfabrics.org; Michael S. Tsirkin; Roland Dreier > Subject: RE: [ofa-general] mthca issues -need help > > Boris, > > I downloaded mft tar file and attempted to install it when I saw the > following errors: > > > /home/tools/mft-1.0.1 # ./install.sh > > *** Mellanox Firmware Tools (MFT) Package Installation *** > MFT Build 20060118-1817 > > Copyright (C) June 2002, Mellanox Technologies Ltd. > ALL RIGHTS RESERVED. Use of software subject to the > terms and conditions detailed in the file "LICENSE.txt". > > Found a previous installation of the MFT package. > Current installed MFT Build ID is 20060118-1817 > This installation MFT Build ID is 20060118-1817 > > Remove currently installed components (run > /usr/mellanox/mft/uninstall.sh) ? :(y/n) [n] y > Running /usr/mellanox/mft/uninstall.sh ... > Uninstall completed successfully. > > > This installation installs the MFT components into /usr > Installing MST package under /usr/mst ... > MFT Depends on pre-installed MST. 
Fail to find /usr/mst/lib/libmtcr.a > > Nowhere could I find the libmtcr.a? > > So, the missing library is stopping me from getting to the next step. > > Pradeep > pradeep at us.ibm.com > > "Boris Shpolyansky" wrote on 04/13/2007 02:42:10 > PM: > > > Pradeep, > > > > I guess you need to upgrade FW first to the latest 3.5.0 revision - > > this might solve the issue and save everybody's time. > > > > You can download FW image from Mellanox web site. > > > > Please, verify your Board ID (PSID) first using mstflint tool: > > > > mstflint -d q > > > > You should obtain your PCI device from lspci, for instance: > > > > [root at lt ~]# lspci | grep InfiniBand > > 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx > > HCA] (rev 20) > > > > Then pick the right FW image file based on the PSID. > > > > Burn FW using mstflint utility - the instructions are on Mellanox web > > site. > > 'mstflint -h' is also helpful. > > > > Regards, > > Boris Shpolyansky > > Sr. Member of Technical Staff > > Applications > > Mellanox Technologies Inc. > > 2900 Stender Way > > Santa Clara, CA 95054 > > Tel.: (408) 916 0014 > > Fax: (408) 970 3403 > > Cell: (408) 834 9365 > > www.mellanox.com > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Pradeep > > Satyanarayana > > Sent: Friday, April 13, 2007 2:33 PM > > To: Roland Dreier > > Cc: Michael S. Tsirkin; general at lists.openfabrics.org > > Subject: Re: [ofa-general] mthca issues -need help > > > > Please see details below. > > > > Pradeep > > pradeep at us.ibm.com > > > > Roland Dreier wrote on 04/13/2007 02:01:41 PM: > > > > > > 1. I am using linux-2.6.21-rc5 and I see this Oops when I > modprobe > > > > > > ib_mthca (on ppc64) > > > > > > Can you give more details of your platform? Also, the output of > > > "lspci -vv -d 15b3:" might be helpful (I'm curious where the BARs > > are). 
> > > > > I am using a p575 (4 CPU) with 1.9GHz and has PCI-X slots, using about > > 8GB RAM currently. > > This OOps was seen using an MT23108. I am not sure what additional > > details you would like. > > Here is the output of lspci: > > > > > > # lspci -vv -d 15b3: > > 0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev > > a1) (prog-if 00 [Normal decode]) > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > ParErr+ Stepping- SERR+ FastB2B- > > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- > > SERR- > Latency: 144, Cache Line Size: 128 bytes > > Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128 > > Memory behind bridge: c0000000-c08fffff > > Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=medium > >TAbort- > > > BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- > > Capabilities: [70] PCI-X bridge device > > Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- > > Freq=133MHz > > Status: Dev=d8:01.0 64bit+ 133MHz+ SCD- USC- SCO- SRD- > > Upstream: Capacity=512 CommitmentLimit=512 > > Downstream: Capacity=128 CommitmentLimit=128 > > > > 0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev > > a1) > > Subsystem: Mellanox Technologies MT23108 InfiniHost > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > ParErr- Stepping- SERR+ FastB2B- > > Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- > > SERR- > Latency: 144, Cache Line Size: 128 bytes > > Interrupt: pin A routed to IRQ 121 > > Region 0: Memory at 400c0800000 (64-bit, non-prefetchable) > > [size=1M] > > Region 2: Memory at 400c0000000 (64-bit, prefetchable) > [size=8M] > > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > > Vector table: BAR=0 offset=00082000 > > PBA: BAR=0 offset=00082200 > > Capabilities: [50] Vital Product Data > > Capabilities: [60] Message Signalled Interrupts: 64bit+ > > Queue=0/5 > > Enable- > > Address: 0000000000000000 Data: 0000 > > Capabilities: [70] 
PCI-X non-bridge device > > Command: DPERE- ERO- RBC=4096 OST=2 > > Status: Dev=d9:00.0 64bit+ 133MHz+ SCD- USC- DC=simple > > DMMRBC=4096 DMOST=2 DMCRS=8 RSCEM- 266MHz- 533MHz- > > > > > > > > > > 2. The above may or may not be a bug and as indicated in the > > > message > > I > > > > wanted to upgrade (the FW). However, I found that the > latest > > > firmware is 3.5.0 and not 3.4.0 as the message seems to > > indicate. I > > > > wanted to use IPOIB CM -so which one should I upgrade to - > > > > presumably 3.5.0? > > > > > > The FW versions in the driver have not been kept up to date. I'll > > > queue a patch for 2.6.22 that updates the latest FW information to > > > what is current right now. > > > > > Yes, I saw your next mail with the patch for this. Thanks! > > > > > - R. > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > -- MST From eitan at mellanox.co.il Sat Apr 14 11:47:16 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 14 Apr 2007 21:47:16 +0300 Subject: [ofa-general] Default multicast group rate In-Reply-To: <20070413154049.GC15099@sashak.voltaire.com> References: <1176464640.15573.58111.camel@hal.voltaire.com> <20070413154049.GC15099@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901477DEC@mtlexch01.mtl.com> I agree with Sasha. Administrator needs to configure the SM to assign the lowest allowed rate. All the rest is beyond the spec. Might be a good idea but not very practical until spec'd. 
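A minimal sketch of the join rule this thread is arguing about, assuming the simplified model that a port can join a multicast group only when its own link rate is at least the rate configured for the group. The names `link_rate` and `mcast_join_allowed` are hypothetical and are not taken from OpenSM or the IB spec, which also defines rate selectors that this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Rates in units of 0.1 Gbps so that 2.5 Gbps stays an integer.
 * These names are hypothetical and are not taken from OpenSM. */
enum link_rate {
    RATE_2_5_GBPS = 25,   /* 1x SDR */
    RATE_10_GBPS  = 100   /* 4x SDR, the default group rate cited in the thread */
};

/* Assumed simplified rule: a port may join only if its own rate is
 * at least the rate configured for the multicast group. */
bool mcast_join_allowed(int port_rate, int group_rate)
{
    assert(port_rate > 0 && group_rate > 0);
    return port_rate >= group_rate;
}
```

Under this model, `mcast_join_allowed(RATE_2_5_GBPS, RATE_10_GBPS)` is false, which corresponds to the failed joins reported for 1x SDR equipment or degraded links. Configuring the group rate down to 2.5 Gbps would let every port join, at the performance cost Hal describes for all multicast groups.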
Eitan Zahavi > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky > Sent: Friday, April 13, 2007 6:41 PM > To: Hal Rosenstock > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Default multicast group rate > > On 07:44 Fri 13 Apr , Hal Rosenstock wrote: > > Hi, > > > > There has been a lot of discussion over the last week on failed > > multicast joins. > > > > The current default rate for multicast groups is 10 Gbps. This means > > that slower nodes (whether due to 1x SDR equipment or a degraded link) > > will fail the join. > > > > The current default was chosen in the belief that most installations > > would be 4x SDR equipment or better (the most common use case) rather > > than the lowest common denominator use case. Also, choosing a lower > > default affects performance of all multicast groups (which includes the > > IPv4 broadcast group as well as any other derived groups (not just IPoIB > > multicast groups)). So when certain performance tests are run, this will > > be a factor which needs to be investigated. The thinking was that those > > subtle things are harder (but perhaps less frequent) to find than the > > "harder" join error which forces the admin to decide one way or the > > other so there is no masking this. > > > > So the question is whether the best default is 2.5 Gbps which would > > allow any nodes to join or whether the current default is appropriate? > > I know certain people's opinions who have been vocal on this list up to > > now. I'm looking for other opinions. Thanks. > > This value is configurable, and 10 Gb/s as default looks reasonable > to me. 
> > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Sat Apr 14 12:31:04 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 14 Apr 2007 12:31:04 -0700 Subject: [ofa-general] Re: Fw: mthca issues -need help In-Reply-To: <20070414173435.GI27940@mellanox.co.il> (Michael S. Tsirkin's message of "Sat, 14 Apr 2007 20:34:35 +0300") References: <20070414173435.GI27940@mellanox.co.il> Message-ID: > As a start, how about upgrading to a recent FW? See the other emails I sent -- this is only indirectly a FW issue in that having the DDR hidden triggers the problem. If you read the code in mthca_mr.c then it's pretty obvious that mthca_write_mtt will crash if the DDR is hidden on a 64-bit platform. - R. From mst at dev.mellanox.co.il Sat Apr 14 12:49:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 14 Apr 2007 22:49:06 +0300 Subject: [ofa-general] Re: Fw: mthca issues -need help In-Reply-To: References: <20070414173435.GI27940@mellanox.co.il> Message-ID: <20070414194906.GQ27940@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] Re: Fw: mthca issues -need help > > > As a start, how about upgrading to a recent FW? > > See the other emails I sent -- this is only indirectly a FW issue in > that having the DDR hidden triggers the problem. If you read the code > in mthca_mr.c then it's pretty obvious that mthca_write_mtt will crash > if the DDR is hidden on a 64-bit platform. Yes, I replied before seeing this. -- MST From mst at dev.mellanox.co.il Sat Apr 14 12:50:29 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sat, 14 Apr 2007 22:50:29 +0300 Subject: [ofa-general] mthca issues -need help In-Reply-To: References: Message-ID: <20070414195029.GR27940@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] mthca issues -need help > > Err, new better patch for real: > > diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c > index fdb576d..3aaf41b 100644 > --- a/drivers/infiniband/hw/mthca/mthca_mr.c > +++ b/drivers/infiniband/hw/mthca/mthca_mr.c > @@ -355,7 +355,8 @@ int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, > int size = mthca_write_mtt_size(dev); > int chunk; > > - if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) > + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy || > + (!mthca_is_memfree(dev) && (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN))) > return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); > > while (list_len > 0) { Testing dev->mthca_flags & MTHCA_FLAG_FMR would be cleaner I think. No? -- MST From rdreier at cisco.com Sat Apr 14 17:36:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 14 Apr 2007 17:36:32 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: <20070414195029.GR27940@mellanox.co.il> (Michael S. Tsirkin's message of "Sat, 14 Apr 2007 22:50:29 +0300") References: <20070414195029.GR27940@mellanox.co.il> Message-ID: > Testing dev->mthca_flags & MTHCA_FLAG_FMR would be cleaner I think. No? Yes, you're right. 
If Pradeep confirms this fixes things for him, I'll queue the following: diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..2ebebab 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -355,7 +355,8 @@ int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, int size = mthca_write_mtt_size(dev); int chunk; - if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy || + !(dev->mthca_flags & MTHCA_FLAG_FMR)) return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); while (list_len > 0) { From pradeep at us.ibm.com Sat Apr 14 19:16:30 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Sat, 14 Apr 2007 19:16:30 -0700 Subject: [ofa-general] mthca issues -need help In-Reply-To: Message-ID: Yes, this works (and I have not yet upgraded the FW). Pradeep pradeep at us.ibm.com Roland Dreier wrote on 04/13/2007 04:36:08 PM: > Err, new better patch for real: > > diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c > b/drivers/infiniband/hw/mthca/mthca_mr.c > index fdb576d..3aaf41b 100644 > --- a/drivers/infiniband/hw/mthca/mthca_mr.c > +++ b/drivers/infiniband/hw/mthca/mthca_mr.c > @@ -355,7 +355,8 @@ int mthca_write_mtt(struct mthca_dev *dev, > struct mthca_mtt *mtt, > int size = mthca_write_mtt_size(dev); > int chunk; > > - if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) > + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy || > + (!mthca_is_memfree(dev) && (dev->mthca_flags & > MTHCA_FLAG_DDR_HIDDEN))) > return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); > > while (list_len > 0) { From etta at systemfabricworks.com Sat Apr 14 21:08:07 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Sat, 14 Apr 2007 23:08:07 -0500 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: 
<461F7708.F35E.006C.0@novell.com> Message-ID: <000501c77f13$ad01c8f0$c801a8c0@ettac> Hi Moiz, I tested "adding new storage" on both OFED 1.2-beta1 and OFED 1.2-rc1. I used srp_daemon.sh to discover and add new storage automatically. On OFED 1.2-beta1, the default "retries" value at srp_daemon.sh was set to 300 seconds and I changed it to 60 seconds. The initiator discovered the new target right away, but it took a few minutes to add the new target and new path. On OFED 1.2-rc1, I changed the "retries" value at srp_daemon.sh to 30 seconds. The initiator discovered the new target, added target and added path within 30 seconds. Thanks, Etta -----Original Message----- From: Moiz Kohari [mailto:mkohari at novell.com] Sent: Friday, April 13, 2007 1:27 PM To: 'Scott Weitzenkamp (sweitzen)'; 'Ishai Rabinovitz'; Chieng Etta Cc: 'Roland Dreier (rdreier)'; ewg at lists.openfabrics.org; Ken L Johnson; Moiz Kohari; 'openib' Subject: RE: [ewg] Re: SRP HA dm_multipath testing and questions Hi, Discovery of new storage should not take multiple minutes; at least we haven't seen this type of behavior. How exactly are you adding the storage (using ibsrpadm command)? Any idea where the delay is occurring: discovery of SRP targets or adding targets to the system? Thanks, Moiz >>> On 4/12/2007 at 10:37 AM, in message <000c01c77d20$d2bd5f40$c801a8c0 at ettac>, "Chieng Etta" wrote: > I tried adding/removing new storage on sles10. It took a few minutes to find > the new target devices (the new target message was shown in > /var/log/messages) then took a few minutes to add the path. I did not run > multipath again. The srp_daemon.sh scanned the new target and added path > automatically. 
> > Thanks, > Etta > > -----Original Message----- > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Sent: Wednesday, April 11, 2007 4:59 PM > To: Ishai Rabinovitz; Chieng Etta > Cc: Roland Dreier (rdreier); ewg at lists.openfabrics.org; openib; > mkohari at novell.com > Subject: RE: [ewg] Re: SRP HA dm_multipath testing and questions > > I haven't tried adding or removing storage, just failover. I guess > leave 91-srp.rules in for now, it seems benign. > > Scott > >> -----Original Message----- >> From: Ishai Rabinovitz [mailto:ishai at dev.mellanox.co.il] >> Sent: Tuesday, April 10, 2007 9:46 PM >> To: Chieng Etta >> Cc: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier); >> ewg at lists.openfabrics.org; 'openib'; mkohari at novell.com >> Subject: Re: [ewg] Re: SRP HA dm_multipath testing and questions >> >> Chieng Etta wrote: >> > >> > Scott Weitzenkamp (sweitzen) wrote: >> >> I've been testing SRP HA and dm_multipath with: >> >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID >> >> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID >> >> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs >> >> >> >> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig >> multipathd on", >> >> then rebooted. On SLES 10, I ran "chkconfig >> boot.multipath on" and >> >> "chkconfig multipathd on", then rebooted. Ishai, I don't >> seem to need >> >> 91-srp.rules, are you using the boot.multipath and >> multipathd scripts? >> > >> > On RHEL4 you really do not need 91-srp.rules and it is not used (see >> > /etc/init.d/openibd) >> > On SLES10 I was sure that you need it. I checked it, and >> you are correct. I >> > don't see how it does it, but it seems that when using >> boot.multipath there >> > is no need for 91-srp.rules. I will check it more deeply and change >> > documentation and openibd script accordingly. >> > >> > [EC] I just verified it on SLES10 x86_64. The multipath >> worked fine by >> > using boot.multipath without 91-srp.rules. 
>> > >> In one of Novell's documents (SLES 10 Storage Administration >> Guide for EVMS - In section 5 Managing Multipath I/O for >> Devices >> http://www.novell.com/documentation/sles10/index.html?page=/documentation/sles10/stor_evms/data/multipathing.html) it says in >> subsection 5.7 that after a new target was discovered there is a need >> to actively execute multipath. >> (As I understand it from the document this is true even after >> boot.multipath is running) >> >> Experiments in my environment also indicate that after >> executing boot.multipath, SRP HA is working also without >> 91-srp.rules, but after reading this document I'm even more confused. >> >> >> >> > Ishai, in the SRP release notes - section 6, srp_daemon a., >> the first line >> > should be changed to '"srp_daemon -a -o" is equivalent to >> "ibsrpdm"'. >> > >> > >> Thanks. However, Scott already noticed that and I already >> fixed it. You will see it in the next documentation version. >> From ishai at dev.mellanox.co.il Sun Apr 15 00:09:53 2007 From: ishai at dev.mellanox.co.il (Ishai Rabinovitz) Date: Sun, 15 Apr 2007 10:09:53 +0300 Subject: [ofa-general] Re: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <000501c77f13$ad01c8f0$c801a8c0@ettac> References: <000501c77f13$ad01c8f0$c801a8c0@ettac> Message-ID: <4621CFC1.8010008@dev.mellanox.co.il> Chieng Etta wrote: > Hi Moiz, > > I tested "adding new storage" on both OFED 1.2-beta1 and OFED 1.2-rc1. > I used srp_daemon.sh to discover and add new storage automatically. > > On OFED 1.2-beta1, the default "retries" value at srp_daemon.sh was set to > 300 seconds and I changed it to 60 seconds. The initiator discovered the new > target right away, but it took a few minutes to add the new target and new > path. > > On OFED 1.2-rc1, I changed the "retries" value at srp_daemon.sh to 30 > seconds. The initiator discovered the new target, added target and added > path within 30 seconds. 
> > Thanks, > Etta Etta, I guess you mean "Rescan time" and not "retries". In any case, the "Rescan time" should not affect the time it takes to add the new target. srp_daemon registers for trap notices, so it knows when a new machine joins the fabric and it can check if it is a target. So the discovery of the new target should be immediate for any "Rescan time" value. The "Rescan time" option is just to be on the safe side in case of rare race conditions. If someone finds a test that causes srp_daemon to consistently discover a target too late (only on the Rescan) please report it. Ishai From jackm at dev.mellanox.co.il Sun Apr 15 00:20:39 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 15 Apr 2007 10:20:39 +0300 Subject: [ofa-general] RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: References: Message-ID: <200704151020.40331.jackm@dev.mellanox.co.il> On Tuesday 10 April 2007 07:17, Roland Dreier wrote: > you can grab the connectx branch of my infiniband.git tree: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git connectx > I cloned your git tree and checked out the connectx branch. When I do "git log" on the connectx branch, I see that the last commit was done on March 4. Is that so? 
- Jack > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tziporet at dev.mellanox.co.il Sun Apr 15 01:41:03 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 15 Apr 2007 11:41:03 +0300 Subject: [ofa-general] madeye kernel oops In-Reply-To: <1175672365.14461.12.camel@Ami-desktop> References: <000101c77560$170f5720$e598070a@amr.corp.intel.com> <1175672365.14461.12.camel@Ami-desktop> Message-ID: <4621E51F.1020700@mellanox.co.il> Ami Perlmutter wrote: > seems to be OK > > On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote: > >> Can you see if this patch fixes your problem? >> >> (I'm not sure how I never hit this before.) >> >> - Sean >> > Was this applied to OFED? Thanks, Tziporet From mst at dev.mellanox.co.il Sun Apr 15 01:44:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 15 Apr 2007 11:44:19 +0300 Subject: [ofa-general] Re: madeye kernel oops In-Reply-To: <4621E51F.1020700@mellanox.co.il> References: <000101c77560$170f5720$e598070a@amr.corp.intel.com> <1175672365.14461.12.camel@Ami-desktop> <4621E51F.1020700@mellanox.co.il> Message-ID: <20070415084419.GH7917@mellanox.co.il> Yes. Quoting Tziporet Koren : Subject: Re: madeye kernel oops Ami Perlmutter wrote: >seems to be OK > >On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote: > >>Can you see if this patch fixes your problem? >> >>(I'm not sure how I never hit this before.) >> >>- Sean >> > Was this applied to OFED? 
Thanks, Tziporet _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From mst at dev.mellanox.co.il Sun Apr 15 01:56:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 15 Apr 2007 11:56:59 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: fix debug msg for path lookup failure Message-ID: <20070415085659.GA11552@mellanox.co.il> Fix up message printed out on path lookup failure with debug_level set: we should use status, not pathrec pointer, to detect failures. Signed-off-by: Michael S. Tsirkin --- Index: gen2_devel_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- gen2_devel_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ gen2_devel_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -390,7 +390,7 @@ static void path_rec_completion(int stat struct sk_buff *skb; unsigned long flags; - if (pathrec) + if (!status) ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); else -- MST From vlad at lists.openfabrics.org Sun Apr 15 02:36:20 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 15 Apr 2007 02:36:20 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070415-0200 daily build status Message-ID: <20070415093620.A24CBE60825@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 
with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: 
From dotanb at dev.mellanox.co.il Sun Apr 15 05:54:30 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 15 Apr 2007 15:54:30 +0300 Subject: [ofa-general] Re: Question about registering the [vdso] memory section in user level In-Reply-To: <20070412165320.GR24730@mellanox.co.il> References: <460B8705.9030904@dev.mellanox.co.il> <20070329094700.GB4253@mellanox.co.il> <20070329233622.GM5436@mellanox.co.il> <461E19D9.3080501@dev.mellanox.co.il> <20070412165320.GR24730@mellanox.co.il> Message-ID: <46222086.3010409@dev.mellanox.co.il> Michael S. Tsirkin wrote: >> Quoting Roland Dreier : >> Subject: Re: Question about registering the [vdso] memory section in user level >> >> > ibv_reg_mr fails for me. >> > When i added some debug prints i noticed the failure in file: >> > uverbs_mem.c function: get_page_shift, >> > find_vma returned NULL. >> >> get_page_shift() doesn't appear in the upstream kernel, so this is >> some patch from OFED breaking things I guess. Does the test work with >> an unpatched kernel? >> > > This is Eli's hpages.patch. But that should only be applied > if enabled explicitly by install option. How come you get it? > Vlad? > When i removed the patch locally and executed the test, the function get_user_pages failed. The return value was -14 (EFAULT: bad address). Does it mean that this address cannot be registered? thanks Dotan From mst at dev.mellanox.co.il Sun Apr 15 07:39:35 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 15 Apr 2007 17:39:35 +0300 Subject: [ofa-general] Re: Re: [PATCH] IB/mthca: work around kernel QP starvation In-Reply-To: References: <20070412175734.GZ24730@mellanox.co.il> Message-ID: <20070415143934.GD11552@mellanox.co.il> > Quoting Shirley Ma : > Subject: Re: [ofa-general] Re: Re: [PATCH] IB/mthca: work around kernel QP starvation > > Could you please create a patch against OFED-1.1? I expect this patch to apply to OFED 1.1 too. -- MST From tziporet at dev.mellanox.co.il Sun Apr 15 08:09:04 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 15 Apr 2007 18:09:04 +0300 Subject: [ofa-general] Re: [PATCH] IB/mthca: work around kernel QPstarvation In-Reply-To: References: <20070411100820.GN24730@mellanox.co.il> <20070412151025.GP24730@mellanox.co.il> <461E56C1.4010500@mellanox.co.il> Message-ID: <46224010.6020505@mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: > Tziporet, can you open a bug please? > done - #542 From rdreier at cisco.com Sun Apr 15 09:04:34 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 15 Apr 2007 09:04:34 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: fix debug msg for path lookup failure In-Reply-To: <20070415085659.GA11552@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 15 Apr 2007 11:56:59 +0300") References: <20070415085659.GA11552@mellanox.co.il> Message-ID: Thanks, I already have this queued in my for-2.6.22 branch. - R. From rdreier at cisco.com Sun Apr 15 14:30:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 15 Apr 2007 14:30:43 -0700 Subject: [ofa-general] RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: <200704151020.40331.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 15 Apr 2007 10:20:39 +0300") References: <200704151020.40331.jackm@dev.mellanox.co.il> Message-ID: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git connectx > I cloned your git and checked out the connectx branch. 
When I do "git log" on the > connectx branch, I see that the last commit was done on March 4. Is that so? Yes, I have been using stgit to update the two patches in that branch rather than adding new commits. So it seems that the patches keep the date they were created, although they have been updated quite a bit since then. By the way, I've tried to put "FIXME" comments in the places that still need work. I think the most important things that still need to be done are: - Fix CQ locking on destroy QP; not completely trivial since I want to keep a somewhat clean division between mlx4_core and mlx4_ib - clean stale CQEs on destroy QP or modify QP to RESET - inline send support And of course we still need to get write combining support in the core kernel to make blueflame work well. Please let me know if you're starting to work on something so we don't duplicate effort. - R. From mst at dev.mellanox.co.il Sun Apr 15 15:27:10 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Apr 2007 01:27:10 +0300 Subject: [ofa-general] Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: References: <200704151020.40331.jackm@dev.mellanox.co.il> Message-ID: <20070415222659.GF15208@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs > > > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git connectx > > > I cloned your git and checked out the connectx branch. When I do "git log" on the > > connectx branch, I see that the last commit was done on March 4. Is that so? > > Yes, I have been using stgit to update the two patches in that branch > rather than adding new commits. So it seems that the patches keep the > date they were created, although they have been updated quite a bit > since then. So, how can this git branch be used? I guess doing git pull's won't work too well since you are rewriting the history all the time ... 
Maybe using plain git would be a better idea; you can always smash the history before you submit code upstream ... > By the way, I've tried to put "FIXME" comments in the places that > still need work. I think the most important things that still need to > be done are: > > - Fix CQ locking on destroy QP; not completely trivial since I want > to keep a somewhat clean division between mlx4_core and mlx4_ib It seems core could just export a "cleanup" function then, and mlx4_ib would call that under the appropriate lock? > - clean stale CQEs on destroy QP or modify QP to RESET BTW, I went over that code in mthca and it seems that it does not handle CQ resize correctly. Right? > - inline send support Inline send from userspace, or from kernel as well? If from kernel - note that we never had inline in kernel for older HW, so ULPs don't use it. So I guess this is a low priority feature? > And of course we still need to get write combining support in the core > kernel to make blueflame work well. You haven't gotten any feedback from your last request for comments on this, did you? > Please let me know if you're starting to work on something so we don't > duplicate effort. -- MST From tziporet at dev.mellanox.co.il Sun Apr 15 23:14:25 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 16 Apr 2007 09:14:25 +0300 Subject: [ofa-general] Reminder: extra OFED coordination meeting today Message-ID: <46231441.6050507@mellanox.co.il> Hi All, Please note that we are having an extra coordination meeting today at 9am PST. Agenda: Review OFED 1.2 status. I will send a list of issues to review later. Note: Next week on Monday & Tuesday is Israel Independence Day so we cannot have a meeting on these days. Can we schedule a meeting for April 25, Wed 11:30am PST? 
Tziporet ------------------------------------------------------------------------------------- Date/Time: APR 16, 2007 at 12:00PM America/New_York Meeting ID: 2109052 Global Access Numbers: http://cisco.com/en/US/about/doing_business/conferencing/index.html US/Canada: +1.866.432.9903 United Kingdom: +44.20.8824.0117 India: +91.80.4103.3979 Germany: +49.619.6773.9002 Japan: +81.3.5763.9394 China: +86.10.8515.5666 From jackm at dev.mellanox.co.il Sun Apr 15 23:25:54 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 16 Apr 2007 09:25:54 +0300 Subject: [ofa-general] RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: References: <200704151020.40331.jackm@dev.mellanox.co.il> Message-ID: <200704160925.54919.jackm@dev.mellanox.co.il> On Monday 16 April 2007 00:30, Roland Dreier wrote: > Yes, I have been using stgit to update the two patches in that branch > rather than adding new commits. So it seems that the patches keep the > date they were created, although they have been updated quite a bit > since then. > a. How will I know whether or not I've got the most recent version with all your changes -- does the commit ID change, for example? b. git fetch gives me problems, since the commits diverge -- my version of the connectx branch is not a strict descendent of yours with respect to the connectx changes, since the connectx commits themselves have changed. (I of course do not fetch directly into my working branch, but into an origin -- and this origin does not match the commits in your connectx branch). c. I think we need to start working with regular commits, so that each change can be documented, and git fetch/rebase can work smoothly. > > And of course we still need to get write combining support in the core > kernel to make blueflame work well. > Tziporet has requested that I implement the write-combining support. I'll be doing that first thing. > Please let me know if you're starting to work on something so we don't > duplicate effort. 
In addition, I'll arrange the backports. - Jack From vlad at lists.openfabrics.org Mon Apr 16 02:37:44 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 16 Apr 2007 02:37:44 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070416-0200 daily build status Message-ID: <20070416093744.A971BE60820@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on 
x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From jsquyres at cisco.com Mon Apr 16 06:46:08 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 16 Apr 2007 09:46:08 -0400 Subject: [ofa-general] Reminder: extra OFED coordination meeting today In-Reply-To: <46231441.6050507@mellanox.co.il> References: <46231441.6050507@mellanox.co.il> Message-ID: On Apr 16, 2007, at 2:14 AM, Tziporet Koren wrote: > Note: Next week on Monday & Tuesday is Israel Independent Day so we > cannot have a meeting on these days. > Can we schedule a meeting for April 25, Wed 11:30am PST? Let's try to get consensus on the call today. If everyone agrees, I'm happy to move the bridge/appointment to that timeslot. ***NOTE: 11:30am US Pacific = 3:30pm US Eastern time, 9:30pm Israel. Did you mean 8:30am US Pacific, 11:30am US Eastern, 6:30pm Israel? -- Jeff Squyres Cisco Systems From mst at dev.mellanox.co.il Mon Apr 16 07:04:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Apr 2007 17:04:55 +0300 Subject: [ofa-general] [PATCH for-2.6.21] IB/mthca: fix data corruption after unmap on Sinai Message-ID: <20070416140455.GA30402@mellanox.co.il> On FMR unmap, mthca masks high bits in key, which removes the effect of Sinai work-around applied in adjust_key during FMR allocation. This triggers data corruption when the region is next mapped. Fix by re-applying Sinai work-around after masking the key. Thanks to Or Gerlitz for reproducing the problem, and Ariel Shahar for help in debug. Signed-off-by: Michael S. 
Tsirkin -- The patch's been running on Or's system for half an hour now without failures (used to fail after a couple of minutes). Roland, this is an old bug; I think we want to queue the patch for 2.6.20/2.6.19 stable kernels as well. Tziporet, could you put this on the OFED 1.1 support page as well? diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..ee561c5 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -835,6 +835,7 @@ void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) key = arbel_key_to_hw_index(fmr->ibmr.lkey); key &= dev->limits.num_mpts - 1; + key = adjust_key(dev, key); fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); fmr->maps = 0; -- MST From eaburns at iol.unh.edu Mon Apr 16 07:05:39 2007 From: eaburns at iol.unh.edu (Ethan Burns) Date: Mon, 16 Apr 2007 10:05:39 -0400 Subject: [ofa-general] gen1 code Message-ID: <20070416140539.GA27928@postal.iol.unh.edu> Hello, A few months back I was looking through the svn repository on the openfabrics webpage and I was able to view some code under the 'gen1' folder. This code has, since then, been removed from svn. There were some things that I found in there that I was interested in (namely the ulp/iser code) and I was wondering if there is a way that I could get my hands on this. Is there an archive somewhere? Thanks, Ethan From jsquyres at cisco.com Mon Apr 16 07:27:47 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 16 Apr 2007 10:27:47 -0400 Subject: [ofa-general] gen1 code In-Reply-To: <20070416140539.GA27928@postal.iol.unh.edu> References: <20070416140539.GA27928@postal.iol.unh.edu> Message-ID: See https://svn.openfabrics.org/svn/openib/README.txt for instructions (short version: it's all in the SVN history). 
On Apr 16, 2007, at 10:05 AM, Ethan Burns wrote: > Hello, > > A few months back I was looking through the svn repository on > the openfabrics webpage and I was able to view some code under the > 'gen1' > folder. This code has, since then, been removed from svn. There > were some > things that I found in there that I was interested in (namely the > ulp/iser > code) and I was wondering if there is a way that I could get my > hands on this. > Is there an archive some where? > > > Thanks, > > Ethan > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Cisco Systems From tziporet at dev.mellanox.co.il Mon Apr 16 08:01:20 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 16 Apr 2007 18:01:20 +0300 Subject: [ofa-general] Re: Reminder: extra OFED coordination meeting today In-Reply-To: <46231441.6050507@mellanox.co.il> References: <46231441.6050507@mellanox.co.il> Message-ID: <46238FC0.40906@mellanox.co.il> Agenda for the meeting today: 1. Review OFED 1.2 status. 2. 
Decide on next RC dates These are the issues to review: bug_id bug_severity assigned_to short_short_desc 420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic 513 critical rjwalsh at pathscale.com error while installing ipath driver 520 critical swise at opengridcomputing.com 0.9.8-9 mvapich2 over iWARP not working 534 critical vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED 1.2-20070409 465 critical mst at mellanox.co.il IPoIB CM HA fails after several hours of failovers 539 critical tziporet at mellanox.co.il "Catastrophic error detected" while running IPoIB bonding port failover test 543 major eitan at mellanox.co.il ibis fails to compile on OFED-1.2-20070415-0141 499 major vlad at mellanox.co.il module compiled over ofed won't load due to symbol version mismatch 459 major monis at voltaire.com support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 508 major mst at mellanox.co.il IPoIB CM multicast is hogging interrupts 506 major mst at mellanox.co.il IPoIB IPv4 multicast throughput is poor 527 major monis at voltaire.com ib-bonding won't compile for RHEL5 i686 PAE kernel 538 major monis at voltaire.com integrate IPoIB bonding with IPoIB CM 541 major monis at voltaire.com slow failover with IPoIB bonding > Note: Next week on Monday & Tuesday is Israel Independent Day so we > cannot have a meeting on these days. > Can we schedule a meeting for April 25, Wed 11:30am PST? 
> > > Tziporet > > ------------------------------------------------------------------------------------- > > > Date/Time: APR 16, 2007 at 12:00PM America/New_York > Meeting ID: 2109052 > > Global Access Numbers: > http://cisco.com/en/US/about/doing_business/conferencing/index.html > > US/Canada: +1.866.432.9903 United Kingdom: +44.20.8824.0117 > India: +91.80.4103.3979 Germany: +49.619.6773.9002 > Japan: +81.3.5763.9394 China: +86.10.8515.5666 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tziporet at mellanox.co.il Mon Apr 16 08:18:29 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 16 Apr 2007 18:18:29 +0300 Subject: [ewg] Re: [ofa-general] Reminder: extra OFED coordination meetingtoday In-Reply-To: References: <46231441.6050507@mellanox.co.il> Message-ID: <6C2C79E72C305246B504CBA17B5500C9A0E2B1@mtlexch01.mtl.com> > ***NOTE: 11:30am US Pacific = 3:30pm US Eastern time, 9:30pm Israel. > Did you mean 8:30am US Pacific, 11:30am US Eastern, 6:30pm Israel? No I actually meant 9:30pm Israel time Tziporet From rdreier at cisco.com Mon Apr 16 09:27:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 09:27:46 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.21] IB/mthca: fix data corruption after unmap on Sinai In-Reply-To: <20070416140455.GA30402@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Apr 2007 17:04:55 +0300") References: <20070416140455.GA30402@mellanox.co.il> Message-ID: Impressive debugging. Just to be sure I understand, the problem is that mthca_arbel_fmr_unmap() is screwing up the key by sometimes making bit 3 and bit 23 different, and this violates what the driver promised FW by setting the bit in the INIT_HCA command, and so corruption occurs because the FW is getting something it didn't expect? - R. 
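The bit 3 / bit 23 relationship described in the message above can be sketched in a few lines of C. This is an illustrative sketch only, not the mthca source: every function name below is hypothetical, the shift-and-mask expression is an assumption chosen so that bit 23 mirrors bit 3 of the index, and the 2^17-entry table size used in the note afterwards is also assumed. It shows why masking the index on unmap loses the work-around and why re-applying the adjustment restores it.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Sinai work-around: copy bit 3 of the
 * index into bit 23 so that the two bits always agree (assumed form). */
static uint32_t sketch_adjust_key(uint32_t key)
{
    return ((key << 20) & 0x800000u) | (key & 0x7fffffu);
}

/* Unmap path before the fix: masking to the table size clears bit 23,
 * so an index with bit 3 set no longer has matching bits. */
static uint32_t sketch_unmap_key_old(uint32_t key, uint32_t num_mpts)
{
    return key & (num_mpts - 1);
}

/* Unmap path after the fix: mask first, then re-apply the work-around. */
static uint32_t sketch_unmap_key_fixed(uint32_t key, uint32_t num_mpts)
{
    return sketch_adjust_key(key & (num_mpts - 1));
}

/* Returns 1 when bit 3 and bit 23 agree, 0 otherwise. */
static int sketch_bits_agree(uint32_t key)
{
    return ((key >> 3) & 1u) == ((key >> 23) & 1u);
}
```

With an assumed table of 1 << 17 entries, an adjusted index of 0x800008 becomes 0x8 on the old unmap path (bits 3 and 23 now differ), while the fixed path returns 0x800008 again.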
From rdreier at cisco.com Mon Apr 16 09:34:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 09:34:25 -0700 Subject: [ofa-general] RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: <200704160925.54919.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 16 Apr 2007 09:25:54 +0300") References: <200704151020.40331.jackm@dev.mellanox.co.il> <200704160925.54919.jackm@dev.mellanox.co.il> Message-ID: > a. How will I know whether or not I've got the most recent version with > all your changes -- does the commit ID change, for example? Yes, commit ID will always be different, since it is in effect a SHA1 hash of the full tree state (and a collision is unlikely, to say the least). > b. git fetch gives me problems, since the commits diverge -- my version > of the connectx branch is not a strict descendent of yours with respect to > the connectx changes, since the connectx commits themselves have changed. > (I of course do not fetch directly into my working branch, but into > an origin -- and this origin does not match the commits in your > connectx branch). > > c. I think we need to start working with regular commits, so that > each change can be documented, and git fetch/rebase can work smoothly. Yes, I guess it is a pain now that we are collaborating actively. OK, I'll just check into the branch normally, and we can collapse the patches down when I create a branch for Linus to pull. I have one more rebase pending, and then I'll leave the branch alone. > Tziporet has requested that I implement the write-combining support. > I'll be doing that first thing. What is your plan for doing this? I had the following vague idea: - Make pgprot_writecombine() a supported API for drivers that all architectures should provide (it can be defined to be the same as pgprot_noncached() for architectures where a write-combining mapping doesn't make sense). - Add an API ioremap_writecombine() that does the same thing for kernel space. 
- Add PAT-based implementations for x86-64 and i386 architectures. Are you going to add inline send support for mlx4 (kernel and libmlx4) first? - R. From rdreier at cisco.com Mon Apr 16 09:41:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 09:41:38 -0700 Subject: [ofa-general] Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: <20070415222659.GF15208@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Apr 2007 01:27:10 +0300") References: <200704151020.40331.jackm@dev.mellanox.co.il> <20070415222659.GF15208@mellanox.co.il> Message-ID: > > - Fix CQ locking on destroy QP; not completely trivial since I want > > to keep a somewhat clean division between mlx4_core and mlx4_ib > > It seems core could just export a "cleanup" function then, and > mlx4_ib would call that under the appropriate lock? Yes, I don't think it's a huge issue, I just need to figure out a way to implement it that I like. The core already exports __mlx4_qp_lookup() so something like __mlx4_qp_remove() would make sense. > > - clean stale CQEs on destroy QP or modify QP to RESET > > BTW, I went over that code in mthca and it seems that it does > not handle CQ resize correctly. Right? I don't know of a problem -- or do you mean that it fails if a CQ resize is in progress as a QP destroy/reset occurs? > > - inline send support > > Inline send from userspace, or from kernel as well? > If from kernel - note that we never had inline > in kernel for older HW, so ULPs don't use it. > So I guess this is a low priority feature? I guess userspace is a higher priority. We could ignore it for the kernel for the moment. > > And of course we still need to get write combining support in the core > > kernel to make blueflame work well. > > You haven't gotten any feedback from your last request for > comments on this, did you? No, but I haven't pushed very hard on it yet. I think this is a 2.6.23 thing at this point. - R. 
From ardavis at ichips.intel.com Mon Apr 16 11:30:24 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 16 Apr 2007 11:30:24 -0700 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <20070413200415.GA15243@sgi.com> References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> <20070413200415.GA15243@sgi.com> Message-ID: <4623C0C0.9000505@ichips.intel.com> Karl Feind wrote: >>>Clearly, we need to agree on a conventional way that a uDAPL >>>layer can register itself in /etc/dat.conf when it gets installed and >>>unregister itself when it gets uninstalled. Furthermore, upgrading >>>one uDAPL should not have adverse effects on other uDAPLs. I don't >>>see how this can be done with the current RPM structure. >>> >>>Thanks for any guidance. >>> >>> >>> I can look into improving the RPM install and uninstall but we will have to make a lot of assumptions about naming conventions. As long as the provider names for OFED are unique (OpenIB), and we can figure out how to do this in the rpm, maybe something like:

For install:

# if dat.conf exists, remove any existing OFED entries and append the new OFED entries, without comments
if [ -e /etc/dat.conf ]
then
    sed -e "/OpenIB/d" < /etc/dat.conf > /tmp/$$ofed_dat_create
    mv /tmp/$$ofed_dat_create /etc/dat.conf
    sed -e "/#/d" < doc/dat.conf >> /etc/dat.conf
else
    cp doc/dat.conf /etc/dat.conf
fi

For uninstall:

# if OFED is the only provider installed, remove dat.conf; otherwise just remove the OFED entries
sed -e "/OpenIB/d" -e "/#/d" < /etc/dat.conf > /tmp/$$ofed_dat_clean
# count the words left after stripping OFED entries and comments
if [ "$(wc -w < /tmp/$$ofed_dat_clean)" -eq 0 ]
then
    rm /etc/dat.conf
else
    sed -e "/OpenIB/d" < /etc/dat.conf > /tmp/$$ofed_dat
    mv /tmp/$$ofed_dat /etc/dat.conf
fi

comments? other suggestions? -arlin From mst at dev.mellanox.co.il Mon Apr 16 11:33:58 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 16 Apr 2007 21:33:58 +0300 Subject: [ofa-general] Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: References: <200704151020.40331.jackm@dev.mellanox.co.il> <200704160925.54919.jackm@dev.mellanox.co.il> Message-ID: <20070416183358.GD32515@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs > > > a. How will I know whether or not I've got the most recent version with > > all your changes -- does the commit ID change, for example? > > Yes, commit ID will always be different, since it is in effect a SHA1 > hash of the full tree state (and a collision is unlikely, to say the > least). > > > b. git fetch gives me problems, since the commits diverge -- my version > > of the connectx branch is not a strict descendent of yours with respect to > > the connectx changes, since the connectx commits themselves have changed. > > (I of course do not fetch directly into my working branch, but into > > an origin -- and this origin does not match the commits in your > > connectx branch). > > > > c. I think we need to start working with regular commits, so that > > each change can be documented, and git fetch/rebase can work smoothly. > > Yes, I guess it is a pain now that we are collaborating actively. OK, > I'll just check into the branch normally, and we can collapse the > patches down when I create a branch for Linus to pull. I have one > more rebase pending, and then I'll leave the branch alone. > > > Tziporet has requested that I implement the write-combining support. > > I'll be doing that first thing. > > What is your plan for doing this? I had the following vague idea: > > - Make pgprot_writecombine() a supported API for drivers that all > architectures should provide (it can be defined to be the same as > pgprot_noncached() for architectures where a write-combining > mapping doesn't make sense). > - Add an API ioremap_writecombine() that does the same thing for > kernel space. 
> - Add PAT-based implementations for x86-64 and i386 architectures. I don't know what Jack plans, but I would start with blueflame support (without WC at first), then try to push just item 3, and use architecture-specific code. It's not clear that write combining can be defined sufficiently consistently across architectures, or so I heard from Gred and Andi, so it might be better to do it step by step. And ioremap_writecombine isn't needed for mlx4 unless we have kernel-level inline sends. > Are you going to add inline send support for mlx4 (kernel and libmlx4) first? I'm not yet sure kernel-level inlines make sense. If they do, why don't we start by adding this support to mthca? I think inline / blueflame support in libmlx4 should be a priority. -- MST From mst at dev.mellanox.co.il Mon Apr 16 11:35:22 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Apr 2007 21:35:22 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.21] IB/mthca: fix data corruption after unmap on Sinai In-Reply-To: References: <20070416140455.GA30402@mellanox.co.il> Message-ID: <20070416183522.GE32515@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.21] IB/mthca: fix data corruption after unmap on Sinai > > Impressive debugging. > > Just to be sure I understand, the problem is that mthca_arbel_fmr_unmap() > is screwing up the key by sometimes making bit 3 and bit 23 different, > and this violates what the driver promised FW by setting the bit in > the INIT_HCA command, and so corruption occurs because the FW is > getting something it didn't expect? Exactly. -- MST From mst at dev.mellanox.co.il Mon Apr 16 11:39:55 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 16 Apr 2007 21:39:55 +0300 Subject: [ofa-general] Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs In-Reply-To: References: <200704151020.40331.jackm@dev.mellanox.co.il> <20070415222659.GF15208@mellanox.co.il> Message-ID: <20070416183955.GF32515@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: RFC: "mlx4" drivers for Mellanox ConnectX HCAs > > > > - Fix CQ locking on destroy QP; not completely trivial since I want > > > to keep a somewhat clean division between mlx4_core and mlx4_ib > > > > It seems core could just export a "cleanup" function then, and > > mlx4_ib would call that under the appropriate lock? > > Yes, I don't think it's a huge issue, I just need to figure out a way > to implement it that I like. The core already exports __mlx4_qp_lookup() > so something like __mlx4_qp_remove() would make sense. > > > > - clean stale CQEs on destroy QP or modify QP to RESET > > > > BTW, I went over that code in mthca and it seems that it does > > not handle CQ resize correctly. Right? > > I don't know of a problem -- or do you mean that it fails if a CQ > resize is in progress as a QP destroy/reset occurs? I haven't seen this in practice, but it seems so from reading the code. The code that does CQ cleanup only looks in the primary buffer. So if a resize is in progress, it seems CQEs for our QP could be left in the resize buffer, and when we later do a lookup by QPN it will fail or get the wrong QP. > > > - inline send support > > > > Inline send from userspace, or from kernel as well? > > If from kernel - note that we never had inline > > in kernel for older HW, so ULPs don't use it. > > So I guess this is a low priority feature? > > I guess userspace is a higher priority. We could ignore it for the > kernel for the moment. > > > > And of course we still need to get write combining support in the core > > > kernel to make blueflame work well. > > > > You haven't gotten any feedback from your last request for > > comments on this, have you?
> > No, but I haven't pushed very hard on it yet. I think this is a > 2.6.23 thing at this point. Right. My approach would be to implement blueflame + inline for 2.6.22, without write combining. WC can then be enabled either by a userspace script playing with MTRRs or by the user patching in PAT support. -- MST From kaf at sgi.com Mon Apr 16 11:42:27 2007 From: kaf at sgi.com (Karl Feind) Date: Mon, 16 Apr 2007 13:42:27 -0500 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <4623C0C0.9000505@ichips.intel.com> References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> <20070413200415.GA15243@sgi.com> <4623C0C0.9000505@ichips.intel.com> Message-ID: <20070416184227.GA18016@sgi.com> > comments? other suggestions? > > -arlin I'd really like to see a separate RPM (called something like dapl-infra) that installs: 1) /etc/dat.conf (empty) 2) a script that adds a provider to /etc/dat.conf 3) a script that removes a provider from /etc/dat.conf 4) libdat.so Any DAPL layer would depend on this RPM and invoke scripts (2) and (3) in its preinstall and postuninstall steps. This decouples the DAPL infrastructure from the DAPL instantiations. Just an idea. Karl Feind From ardavis at ichips.intel.com Mon Apr 16 12:37:11 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 16 Apr 2007 12:37:11 -0700 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <20070416184227.GA18016@sgi.com> References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> <20070413200415.GA15243@sgi.com> <4623C0C0.9000505@ichips.intel.com> <20070416184227.GA18016@sgi.com> Message-ID: <4623D067.1030005@ichips.intel.com> Karl Feind wrote: >>comments? other suggestions?
>> >>-arlin >> >> > >I'd really like to see a separate RPM (called something like dapl-infra) >that installs: > > 1) /etc/dat.conf (empty) > 2) a script that adds a provider to /etc/dat.conf > 3) a script that removes a provider from /etc/dat.conf > 4) libdat.so > >Any DAPL layer would depend on this RPM and invoke scripts (2) >and (3) in its preinstall and postuninstall steps. > >This decouples the DAPL infrastructure from the DAPL instantiations. > >Just an idea. > > Do you see the need for different versions to co-exist (1.1, 1.2, 2.0)? From kaf at sgi.com Mon Apr 16 12:58:08 2007 From: kaf at sgi.com (Karl Feind) Date: Mon, 16 Apr 2007 14:58:08 -0500 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <4623D067.1030005@ichips.intel.com> References: <20070411170431.GA25341@sgi.com> <39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com> <20070412195653.GA20252@sgi.com> <20070413200415.GA15243@sgi.com> <4623C0C0.9000505@ichips.intel.com> <20070416184227.GA18016@sgi.com> <4623D067.1030005@ichips.intel.com> Message-ID: <20070416195808.GC18540@sgi.com> On Mon, Apr 16, 2007 at 12:37:11PM -0700, Arlin Davis wrote: > Karl Feind wrote: > > >>comments? other suggestions? > >> > >>-arlin > >> > >> > > > >I'd really like to see a separate RPM (called something like dapl-infra) > >that installs: > > > > 1) /etc/dat.conf (empty) > > 2) a script that adds a provider to /etc/dat.conf > > 3) a script that removes a provider from /etc/dat.conf > > 4) libdat.so > > > >Any DAPL layer would depend on this RPM and invoke scripts (2) > >and (3) in its preinstall and postuninstall steps. > > > >This decouples the DAPL infrastructure from the DAPL instantiations. > > > >Just an idea. > > > > > > Do you see the need for different versions to co-exist (1.1, 1.2, 2.0)? Arlin, The safe answer is yes. Would this present a difficulty?
Karl From tziporet at mellanox.co.il Mon Apr 16 13:33:55 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 16 Apr 2007 23:33:55 +0300 Subject: [ofa-general] OFED 1.2 April 16 meeting summary In-Reply-To: <46238FC0.40906@mellanox.co.il> References: <46231441.6050507@mellanox.co.il> <46238FC0.40906@mellanox.co.il> Message-ID: <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com>
OFED 1.2 April 16 meeting summary
Main decisions:
1. RC2 will be ready on the US morning of Wed April 18. (RC2 date is derived from Intel schedule for the 256 nodes cluster)
2. RC3 due date is April 26.
3. All release notes and other documents should be ready for RC3
4. Bug fixes after RC3 will have to be approved by the RM (Tziporet)
5. Open MPI version will be replaced in RC3. This will bring the vanilla Open MPI version to OFED.
6. Official release will be done after the Sonoma conference.
Cluster testing:
- Intel will test IPoIB, Intel MPI and MVAPICH on 256 nodes cluster.
- The labs will test Open MPI and MVAPICH on 256 nodes cluster
Note: Next coordination meeting will be on April 25, Wed 11:30am PST
Bugs review (bugs are updated in bugzilla):
bug_id bug_severity assigned_to short_short_desc
420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic - will not make it to RC2
513 critical rjwalsh at pathscale.com error while installing ipath driver - Qlogic should decide today if they want to support this platform and OS
534 critical vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED 1.2-20070409 - need to check if this kernel version is provided by Novell
465 critical mst at mellanox.co.il IPoIB CM HA fails after several hours of failovers - Michael will try to reproduce
539 critical tziporet at mellanox.co.il "Catastrophic error detected" while running IPoIB bonding port failover test - Scott to test FW version with a fix
543 major eitan at mellanox.co.il ibis fails to compile on OFED-1.2-20070415-0141 - fixed
499 major vlad at mellanox.co.il module compiled over ofed won't load due to symbol version mismatch - Vlad and Moni to agree on solution
459 major monis at voltaire.com support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better - fixed in the new bonding package
508 major mst at mellanox.co.il IPoIB CM multicast is hogging interrupts
506 major mst at mellanox.co.il IPoIB IPv4 multicast throughput is poor - Tziporet to provide a FW version that should fix the issue
527 major monis at voltaire.com ib-bonding won't compile for RHEL5 i686 PAE kernel - fixed in the new bonding package
538 major monis at voltaire.com integrate IPoIB bonding with IPoIB CM - fixed in the new bonding package
541 major monis at voltaire.com slow failover with IPoIB bonding - need investigation
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From rdreier at cisco.com Mon Apr 16 14:12:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 14:12:47 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.21] IB/mthca: fix data corruption after unmap on Sinai In-Reply-To: <20070416140455.GA30402@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Apr 2007 17:04:55 +0300") References: <20070416140455.GA30402@mellanox.co.il> Message-ID: > Roland, this is an old bug, I think we want to queue the patch for 2.6.20/2.6.19 > stable kernels as well. OK, I'll ask Linus to pull this and send it to -stable. However AFAIK 2.6.19 is not being maintained any more so 2.6.20.x is the best we can do.
From rdreier at cisco.com Mon Apr 16 14:16:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 14:16:54 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a one-liner that fixes data corruption seen with storage over iSER: Michael S. Tsirkin (1): IB/mthca: Fix data corruption after FMR unmap on Sinai drivers/infiniband/hw/mthca/mthca_mr.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..ee561c5 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -835,6 +835,7 @@ void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) key = arbel_key_to_hw_index(fmr->ibmr.lkey); key &= dev->limits.num_mpts - 1; + key = adjust_key(dev, key); fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); fmr->maps = 0; From rdreier at cisco.com Mon Apr 16 14:17:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 14:17:42 -0700 Subject: [ofa-general] [PATCH] IB/mthca: Fix data corruption after FMR unmap on Sinai Message-ID: From: Michael S. Tsirkin In mthca_arbel_fmr_unmap(), the high bits of the key are masked off. This gets rid of the effect of adjust_key(), which makes sure that bits 3 and 23 of the key are equal when the Sinai throughput optimization is enabled, and so it may happen that an FMR will end up with bits 3 and 23 in the key being different. This causes data corruption, because when enabling the throughput optimization, the driver promises the HCA firmware that bits 3 and 23 of all memory keys will always be equal. Fix by re-applying adjust_key() after masking the key. 
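The invariant described above (bits 3 and 23 of every memory key must be equal) can be illustrated with a small standalone C sketch. This is not the driver source: the `sketch_*` helper names are hypothetical, the optimization is assumed to be always enabled, `num_mpts` is assumed to be a power of two, and the sketch only keeps key bits 0..23. It merely shows how masking with `num_mpts - 1` can clear bit 23 while bit 3 stays set, and how re-applying the adjustment after masking restores the invariant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: force bit 23 of the key to equal bit 3.
 * Shifting left by 20 moves bit 3 onto bit 23; bits 0..22 are kept. */
static uint32_t sketch_adjust_key(uint32_t key)
{
    return ((key << 20) & 0x800000u) | (key & 0x7fffffu);
}

/* Returns 1 when bits 3 and 23 of the key agree, 0 otherwise. */
static int sketch_bits_match(uint32_t key)
{
    return ((key >> 3) & 1u) == ((key >> 23) & 1u);
}

/* Masking alone: the behaviour the commit message describes as broken.
 * If num_mpts <= 2^23, the mask clears bit 23 but may leave bit 3 set. */
static uint32_t sketch_mask_only(uint32_t key, uint32_t num_mpts)
{
    assert(num_mpts != 0 && (num_mpts & (num_mpts - 1)) == 0);
    return key & (num_mpts - 1);
}

/* Masking followed by re-applying the adjustment, as the fix does. */
static uint32_t sketch_mask_and_adjust(uint32_t key, uint32_t num_mpts)
{
    return sketch_adjust_key(sketch_mask_only(key, num_mpts));
}
```

For example, with an assumed `num_mpts` of `1 << 17`, the key `0x800008` (bits 3 and 23 both set) is reduced to `0x8` by masking alone, which breaks the invariant, while masking followed by the adjustment yields `0x800008` again.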
Thanks to Or Gerlitz for reproducing the problem, and Ariel Shahar for help in debug. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- This fixes data corruption seen with storage over iSER. drivers/infiniband/hw/mthca/mthca_mr.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index fdb576d..ee561c5 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -835,6 +835,7 @@ void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) key = arbel_key_to_hw_index(fmr->ibmr.lkey); key &= dev->limits.num_mpts - 1; + key = adjust_key(dev, key); fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); fmr->maps = 0; -- 1.5.1 From etta at systemfabricworks.com Mon Apr 16 15:15:11 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Mon, 16 Apr 2007 17:15:11 -0500 Subject: [ofa-general] RE: [ewg] Re: SRP HA dm_multipath testing and questions In-Reply-To: <4621CFC1.8010008@dev.mellanox.co.il> Message-ID: <001501c78074$b493df60$c801a8c0@ettac> Ishai, Yes, it is the rescan time. In /usr/sbin/srp_daemon.sh, the value of "retries" represents the rescan time. I tried it again using the default value of 300. After I added a new target (by issuing "modprobe ib_srp_target" on the 2nd target), the 2nd target was discovered immediately by the initiator, but it took almost 5 minutes to be added/attached to the initiator. Here are the steps I did: 1. Loaded the ib_srp_target module on the first target. 2. Started srp_daemon.sh on the initiator (using the default rescan time). (The first target was added to the initiator immediately.) 3. Loaded the ib_srp_target module on the 2nd target. (5 minutes later, the 2nd target was added to the initiator.)
Thanks, Etta -----Original Message----- From: Ishai Rabinovitz [mailto:ishai at dev.mellanox.co.il] Sent: Sunday, April 15, 2007 2:10 AM To: Chieng Etta Cc: 'Moiz Kohari'; 'Scott Weitzenkamp (sweitzen)'; 'Roland Dreier (rdreier)'; ewg at lists.openfabrics.org; 'Ken L Johnson'; 'openib' Subject: Re: [ewg] Re: SRP HA dm_multipath testing and questions Chieng Etta wrote: > Hi Moiz, > > I tested "adding new storage" on both OFED 1.2-beta1 and OFED 1.2-rc1. > I used srp_daemon.sh to discover and add new storage automatically. > > On OFED 1.2-beta1, the default "retries" value at srp_daemon.sh was set to > 300 seconds and I changed it to 60 seconds. The initiator discovered the new > target right away, but it took a few minutes to add the new target and new > path. > > On OFED 1.2-rc1, I changed the "retries" value at srp_daemon.sh to 30 > seconds. The initiator discovered the new target, added target and added > path within 30 seconds. > > Thanks, > Etta Etta, I guess you mean "Rescan time" and not "retries". In any case, the "Rescan time" should not affect the time it takes to add the new target. Srp_daemon registers to get trap notices, so it knows when a new machine joins the fabric and it can check if it is a target. So the discovery of the new target should be immediate with any "Rescan time" value. The "Rescan time" option is just to be on the safe side in case of rare race conditions. If someone finds a test that causes srp_daemon to consistently discover a target too late (only on the Rescan), please report it. Ishai From pradeep at us.ibm.com Mon Apr 16 17:18:36 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 16 Apr 2007 17:18:36 -0700 Subject: [ofa-general] Next set of mthca issues Message-ID: Here is the stack trace that I see after I upgraded to the latest version (3.5) of the FW. Now the version of FW is not displayed in /var/log/messages. Is that because FW version is at the "expected level"?
However, /sys/class/infiniband/mthca0/fw_ver does indicate it is 3.5. ping seems to work fine, but I run into problems with netperf (especially when it is the receiver, i.e. running netserver). I am running these tests on a ppc64 machine. Pradeep pradeep at us.ibm.com Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Initializing 0002:d9:00.0 Apr 16 19:37:53 elm3b37 kernel: ADDRCONF(NETDEV_UP): ib1: link is not ready Apr 16 19:38:02 elm3b37 kernel: ib0: enabling connected mode will cause multicast packet drops Apr 16 19:38:05 elm3b37 kernel: ib0: mtu > 2044 will cause multicast packet drops. Apr 16 19:46:25 elm3b37 kernel: Call Trace: Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3BB0] [C00000000000F884] .show_stack+0x54/0x1f0 (unreliable) Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3C60] [C0000000000415EC] .eeh_dn_check_failure+0x2bc/0x320 Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D10] [C0000000000416E4] .eeh_check_failure+0x94/0x170 Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D90] [D00000000025ACEC] .mthca_tavor_interrupt+0x1cc/0x1e0 [ib_mthca] Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3E50] [C00000000008C180] .handle_IRQ_event+0x70/0x100 Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3EF0] [C00000000008EAB0] .handle_fasteoi_irq+0xd0/0x200 Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3F90] [C000000000028638] .call_handle_irq+0x1c/0x2c Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FA50] [C00000000000CCA0] .do_IRQ+0xc0/0x1e0 Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FAE0] [C000000000004270] hardware_interrupt_entry+0x18/0x28 Apr 16 19:46:25 elm3b37 kernel: --- Exception: 501 at .pseries_dedicated_idle_sleep+0xd4/0x1a0 Apr 16 19:46:25 elm3b37 kernel: LR = .pseries_dedicated_idle_sleep+0xd0/0x1a0 Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FDD0] [0000000000000000] .__start+0x4000000000000000/0x8 (unreliable) Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FE70]
[C00000000001200C] .cpu_idle+0x13c/0x250 Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FF00] [C00000000002B16C] .start_secondary+0x14c/0x190 Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FF90] [C000000000008364] .start_secondary_prolog+0xc/0x10 Apr 16 19:46:25 elm3b37 kernel: EEH: Detected PCI bus error on device 0002:d9:00.0 Apr 16 19:46:25 elm3b37 kernel: EEH: This PCI device has failed 1 times since last reboot: location=U7879.001.DQD1EKZ-P1-C2 driver=ib_mthca pci addr=0002:d9:00.0 Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: Catastrophic error detected: unknown error Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[00]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[01]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[02]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[03]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[04]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[05]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[06]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[07]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[08]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[09]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0a]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0b]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0c]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0d]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0e]: ffffffff Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: buf[0f]: ffffffff Apr 16 19:46:35 elm3b37 kernel: ib_mthca 0002:d9:00.0: HW2SW_MPT failed (-11) Apr 16 19:47:05 elm3b37 last message repeated 3 times Apr 16 19:47:05 elm3b37 last message repeated 3 times Apr 16 19:47:15 elm3b37 kernel: ib0: ib_detach_mcast failed 
(result = -11) Apr 16 19:47:15 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11) Apr 16 19:47:25 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11) Apr 16 19:47:25 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11) Apr 16 19:47:35 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11) Apr 16 19:47:35 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11) Apr 16 19:47:45 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11) Apr 16 19:47:45 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11) From sweitzen at cisco.com Mon Apr 16 18:38:18 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 16 Apr 2007 18:38:18 -0700 Subject: [ofa-general] Next set of mthca issues In-Reply-To: References: Message-ID: Looks like https://bugs.openfabrics.org/show_bug.cgi?id=431 to me, which is fixed in OFED-1.2-20070411-0938 or newer. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Pradeep Satyanarayana > Sent: Monday, April 16, 2007 5:19 PM > To: general at lists.openfabrics.org; Michael S. Tsirkin; Roland > Dreier (rdreier) > Subject: [ofa-general] Next set of mthca issues > > Here is the stack trace that I see after I upgraded to the > latest version > (3.5) of the FW. Now the version > of FW is not displayed in /var/log/messages. Is that because > FW version is > at the "expected level"? > However, /sys/class/infiniband/mthca0/fw_ver does indicate it is 3.5. > > ping seems to work fine, but run into problems with netperf. > (especially > when it is the receiver i.e. running netserver). > I am running these tests on a ppc64 mcahine. 
> > Pradeep > pradeep at us.ibm.com > > > Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Mellanox InfiniBand > HCA driver > v0.08 (February 14, 2006) > Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Initializing 0002:d9:00.0 > Apr 16 19:37:53 elm3b37 kernel: ADDRCONF(NETDEV_UP): ib1: link is not > ready > Apr 16 19:38:02 elm3b37 kernel: ib0: enabling connected mode > will cause > multicast packet drops > Apr 16 19:38:05 elm3b37 kernel: ib0: mtu > 2044 will cause multicast > packet drops. > Apr 16 19:46:25 elm3b37 kernel: Call Trace: > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3BB0] [C00000000000F884] > .show_stack+0x54/0x1f0 (unreliable) > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3C60] [C0000000000415EC] > .eeh_dn_check_failure+0x2bc/0x320 > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D10] [C0000000000416E4] > .eeh_check_failure+0x94/0x170 > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D90] [D00000000025ACEC] > .mthca_tavor_interrupt+0x1cc/0x1e0 [ib_mthca] > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3E50] [C00000000008C180] > .handle_IRQ_event+0x70/0x100 > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3EF0] [C00000000008EAB0] > .handle_fasteoi_irq+0xd0/0x200 > Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3F90] [C000000000028638] > .call_handle_irq+0x1c/0x2c > Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FA50] [C00000000000CCA0] > .do_IRQ+0xc0/0x1e0 > Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FAE0] [C000000000004270] > hardware_interrupt_entry+0x18/0x28 > Apr 16 19:46:25 elm3b37 kernel: --- Exception: 501 at > .pseries_dedicated_idle_sleep+0xd4/0x1a0 > Apr 16 19:46:25 elm3b37 kernel: LR = > .pseries_dedicated_idle_sleep+0xd0/0x1a0 > Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FDD0] [0000000000000000] > .__start+0x4000000000000000/0x8 (unreliable) > Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FE70] [C00000000001200C] > .cpu_idle+0x13c/0x250 > Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FF00] [C00000000002B16C] > .start_secondary+0x14c/0x190 > Apr 16 19:46:25 
elm3b37 kernel: [C0000000EB57FF90] [C000000000008364] > .start_secondary_prolog+0xc/0x10 > Apr 16 19:46:25 elm3b37 kernel: EEH: Detected PCI bus error on device > 0002:d9:00.0 > Apr 16 19:46:25 elm3b37 kernel: EEH: This PCI device has > failed 1 times > since last reboot: location=U7879.001.DQD1EKZ-P1-C2 > driver=ib_mthca pci > addr=0002:d9:00.0 > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > Catastrophic error > detected: unknown error > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[00]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[01]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[02]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[03]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[04]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[05]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[06]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[07]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[08]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[09]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0a]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0b]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0c]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0d]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0e]: ffffffff > Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: > buf[0f]: ffffffff > Apr 16 19:46:35 elm3b37 kernel: ib_mthca 0002:d9:00.0: > HW2SW_MPT failed > (-11) > Apr 16 19:47:05 elm3b37 last message repeated 3 times > Apr 16 19:47:05 elm3b37 last message repeated 3 times > Apr 16 19:47:15 elm3b37 kernel: ib0: ib_detach_mcast failed > (result = -11) > Apr 16 19:47:15 elm3b37 kernel: ib0: 
ipoib_mcast_detach > failed (result = > -11) > Apr 16 19:47:25 elm3b37 kernel: ib0: ib_detach_mcast failed > (result = -11) > Apr 16 19:47:25 elm3b37 kernel: ib0: ipoib_mcast_detach > failed (result = > -11) > Apr 16 19:47:35 elm3b37 kernel: ib0: ib_detach_mcast failed > (result = -11) > Apr 16 19:47:35 elm3b37 kernel: ib0: ipoib_mcast_detach > failed (result = > -11) > Apr 16 19:47:45 elm3b37 kernel: ib0: ib_detach_mcast failed > (result = -11) > Apr 16 19:47:45 elm3b37 kernel: ib0: ipoib_mcast_detach > failed (result = > -11) > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Apr 16 19:11:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Apr 2007 19:11:19 -0700 Subject: [ofa-general] Next set of mthca issues In-Reply-To: (Scott Weitzenkamp's message of "Mon, 16 Apr 2007 18:38:18 -0700") References: Message-ID: > Looks like https://bugs.openfabrics.org/show_bug.cgi?id=431 to me, which > is fixed in OFED-1.2-20070411-0938 or newer. Yes, I agree, it does look like the same problem. This fix is in 2.6.21-rc6 and newer too. - R. From mst at dev.mellanox.co.il Tue Apr 17 00:53:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Apr 2007 10:53:25 +0300 Subject: [ofa-general] Re: [Bug 541] slow failover with IPoIB bonding/ipoibtools HA In-Reply-To: <20070417072840.609BAE6083A@openfabrics.org> References: <20070417072840.609BAE6083A@openfabrics.org> Message-ID: <20070417075325.GH4507@mellanox.co.il> > Michael, did you try flipping the IB ports on both hosts? No, I'd need a setup with 3 hosts for this, and I don't have one at the moment. Roland, could you try looking into this on-site? We still don't have any data on where the delay is. 
-- MST From yosefe at voltaire.com Tue Apr 17 01:48:38 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 17 Apr 2007 11:48:38 +0300 Subject: [ofa-general] [PATCH] installer: update kernel symbol versions Message-ID: <462489E6.60103@voltaire.com> Enable users to build modules over OFED kernel code without additional steps. install.sh: post-install step (conditioned on kernel-ib-devel installation): 1. save ${K_SRC}/Module.symvers to ${K_SRC}/Module.symvers.save 2. update ${K_SRC}/Module.symvers with symbol versions from ${K_TREE}/updates/kernel uninstall.sh: restore ${K_SRC}/Module.symvers.save Signed-off-by: Yosef Etigin diff -urN a/install.sh b/install.sh --- a/install.sh 2007-04-16 16:00:04.000000000 +0300 +++ b/install.sh 2007-04-17 10:42:20.000000000 +0300 @@ -99,6 +99,10 @@ find ${STACK_PREFIX} -type d | sort -r | xargs rmdir > $NULL 2>&1 fi + # Restore old symbol versions + if [ -f ${K_SRC}/Module.symvers.save ]; then + mv ${K_SRC}/Module.symvers.save ${K_SRC}/Module.symvers + fi } # Remove installed packages in case of an upgrade @@ -1023,6 +1027,47 @@ return 0 } +# update the kernel symbol versions +update_kernel_symbols() +{ + + local MOD_SYMVERS=${K_SRC}/Module.symvers + + if [ -f ${MOD_SYMVERS} -a ! -f ${MOD_SYMVERS}.save ]; then + cp ${MOD_SYMVERS} ${MOD_SYMVERS}.save + fi + + # find the new symbols that IB modules export + # list them all in a file,crc,symbol fashion + local IB_MODULES_ROOT=${KERNEL_TREE_ROOT}/updates/kernel + local SYM_RECS="" + for mod in $(find ${IB_MODULES_ROOT} -name '*.ko') + do + # break down the list to file name crc and symbol + group=$(nm -o ${mod} | + sed -ne s#${IB_MODULES_ROOT}'/^*\(.*\)\.ko:0\{8\}\(\w\{8\}\) . 
__crc_\(.*\)$#\2,\3,\1#p') + SYM_RECS=${SYM_RECS}' '${group} + for rec in ${group} + do + [ -z "${rec}" ] && continue + sym=$(echo ${rec} | cut -d, -f2) + SYMS=${SYMS}"/${sym}/d;" + done + done + + # remove old symbols from Module.symvers + ex_silent "sed -i ${MOD_SYMVERS} -e '${SYMS}'" + + # add our symbols + for rec in ${SYM_RECS}; do + echo 0x${rec} | sed -e 's/,/\t/g' >> ${MOD_SYMVERS} + done + + echo "Updated kernel symbol versions in ${MOD_SYMVERS}" + return 0 + +} + ibutils() { @@ -1674,6 +1719,14 @@ fi fi + # Update kernel symbols + if isInstalled kernel-ib-devel ; then + read -p "Do you want to update kernel symbols [Y/n]?" ans_r + if [[ "$ans_r" == "" || "$ans_r" == "y" || "$ans_r" == "Y" || "$ans_r" == "yes" ]]; then + update_kernel_symbols + fi + fi + # Set OpenSM configuration if isInstalled opensm && ( echo "$OFA_PACKAGES" | grep "opensm" > $NULL 2>&1 ); then read -p "Do you want to configure OpenSM [Y/n]?" ans_r @@ -1771,7 +1824,11 @@ if isInstalled kernel-ib && ( echo "$OFA_KERNEL_PACKAGES" | grep ipoib > $NULL 2>&1 ); then set_ip fi - + + if isInstalled kernel-ib-devel ; then + update_kernel_symbols + fi + if isInstalled opensm && ( echo "$OFA_PACKAGES" | grep opensm > $NULL 2>&1 ); then set_config_opensm fi diff -urN a/uninstall.sh b/uninstall.sh --- a/uninstall.sh 2007-04-16 16:00:04.000000000 +0300 +++ b/uninstall.sh 2007-04-16 19:05:43.000000000 +0300 @@ -227,6 +227,12 @@ find ${STACK_PREFIX} -type d | sort -r | xargs rmdir > $NULL 2>&1 fi + # Restore old symbol versions + K_SRC=/lib/modules/$(uname -r)/build + if [ -f ${K_SRC}/Module.symvers.save ]; then + mv ${K_SRC}/Module.symvers.save ${K_SRC}/Module.symvers + fi + # Uninstall SilverStorm package if [ -e /sbin/iba_config ]; then ex /sbin/iba_config -u
From vlad at lists.openfabrics.org Tue Apr 17 02:37:32 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 17 Apr 2007 02:37:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070417-0200 daily build status
Message-ID: <20070417093732.47BB1E60841@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:

From mst at dev.mellanox.co.il Tue Apr 17 03:05:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Apr 2007 13:05:06 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <462489E6.60103@voltaire.com>
References: <462489E6.60103@voltaire.com>
Message-ID: <20070417100506.GD32357@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: [PATCH] installer: update kernel symbol versions
>
> Enable users to build modules over OFED kernel code without additional steps.
>
> install.sh:
> post-install step (conditioned on kernel-ib-devel installation):
> 1. save ${K_SRC}/Module.symvers to ${K_SRC}/Module.symvers.save
> 2. update ${K_SRC}/Module.symvers with symbol versions from ${K_TREE}/updates/kernel
>
> uninstall.sh:
> restore ${K_SRC}/Module.symvers.save
>
> Signed-off-by: Yosef Etigin

I don't think editing kernel sources, which is a package independent of OFED, is the way to go. Anyone building modules on top of OFED needs to find the headers, so why is it hard to do the same with symvers?

Long run, I think Lustre et al. should just become part of OFED, similar to what MPI does now.
--
MST

From vlad at dev.mellanox.co.il Tue Apr 17 06:16:05 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 17 Apr 2007 16:16:05 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <462489E6.60103@voltaire.com>
References: <462489E6.60103@voltaire.com>
Message-ID: <1176815765.11611.12.camel@vladsk-laptop>

On Tue, 2007-04-17 at 11:48 +0300, Yosef Etigin wrote:
> Enable users to build modules over OFED kernel code without additional steps.
>
> install.sh:
> post-install step (conditioned on kernel-ib-devel installation):
> 1. save ${K_SRC}/Module.symvers to ${K_SRC}/Module.symvers.save
> 2. update ${K_SRC}/Module.symvers with symbol versions from ${K_TREE}/updates/kernel
>
> uninstall.sh:
> restore ${K_SRC}/Module.symvers.save
>
> Signed-off-by: Yosef Etigin

If someone updates Module.symvers after OFED is installed, uninstalling OFED will remove that update...

See /usr/src/linux/Documentation/kbuild/modules.txt:
--- 7.3 Symbols from another external module

--
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From yosefe at voltaire.com Tue Apr 17 06:19:13 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 17 Apr 2007 16:19:13 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <20070417125212.GC5990@mellanox.co.il>
References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il>
Message-ID: <4624C951.6070708@voltaire.com>

Michael S. Tsirkin wrote:
>>1. Taking another Module.symvers, and even knowing that this is what causes the
>> module loading errors, is not as straightforward as finding the headers.
>
>
> It isn't? Why isn't it?
> As I see it, this is a problem that needs to be solved once
> when external module is packaged.
>

OK, but this solution should be part of OFED, and available for all kernels.
From what I have tested (and correct me if I'm wrong) on older kernels one must edit the original .symvers anyway to get the right versions.

So why shouldn't the installation do that?

--Yossi

From shubbell at dbresearch.net Tue Apr 17 06:49:58 2007
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Tue, 17 Apr 2007 08:49:58 -0500
Subject: [ofa-general] Multicast Question
Message-ID: <4624D086.3010100@dbresearch.net>

Hello,

I was wondering: if I ping or ibping the 224.0.0.1 address, should I receive a list of the nodes that have multicast enabled?

Thanks in advance,

Sean

From kliteyn at dev.mellanox.co.il Tue Apr 17 07:12:55 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 17 Apr 2007 17:12:55 +0300
Subject: [ofa-general] [PATCH] osm: fixing log messages in case of uncompatible MTU/RATE
Message-ID: <4624D5E7.9010401@dev.mellanox.co.il>

Hi Hal,

Log messages stating that the required mcast group RATE/MTU doesn't match the RATE/MTU of the request are misleading. Feel free to change the exact words of the log messages, but the previous messages got me confused when I was debugging mcast join failures. It's not really a bug, so please apply to master only.

Thanks.
Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_sa_mcmember_record.c | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 91db3ad..50c4f22 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -619,7 +619,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested MTU %x is not greater than %x\n", + "Requested mcast group has MTU %x, which is not greater than %x\n", mtu_mgrp, mtu_required ); return FALSE; } @@ -629,7 +629,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested MTU %x is not less than %x\n", + "Requested mcast group has MTU %x, which is not less than %x\n", mtu_mgrp, mtu_required ); return FALSE; } @@ -639,7 +639,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested MTU %x is not equal to %x\n", + "Requested mcast group has MTU %x, which is not equal to %x\n", mtu_mgrp, mtu_required ); return FALSE; } @@ -663,7 +663,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested RATE %x is not greater than %x\n", + "Requested mcast group has RATE %x, which is not greater than %x\n", rate_mgrp, rate_required ); return FALSE; } @@ -673,7 +673,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested RATE %x is not less than %x\n", + "Requested mcast group has RATE %x, which is not less than %x\n", rate_mgrp, rate_required ); return FALSE; } @@ -683,7 +683,7 @@ __validate_more_comp_fields( { osm_log( p_log, OSM_LOG_DEBUG, "__validate_more_comp_fields: " - "Requested RATE %x is not equal to %x\n", + "Requested mcast group has RATE %x, which is not equal to %x\n", rate_mgrp, rate_required ); return FALSE; } -- 1.4.4.1.GIT From kliteyn at 
dev.mellanox.co.il Tue Apr 17 07:43:10 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 17 Apr 2007 17:43:10 +0300
Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file
Message-ID: <4624DCFE.9030904@dev.mellanox.co.il>

Hi Hal,

When parsing the guid2lid file, an invalid guid string ended up unpacked as guid 0x0. This patch ignores lines with an invalid guid string. The bug doesn't look too important - I don't think that it should go to ofed_1_2. Anyway, your call.

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik
--- osm/opensm/osm_db_files.c | 15 ++++++++++++--- 1 files changed, 12 insertions(+), 3 deletions(-) diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c index dbadd68..23eaa0b 100644 --- a/osm/opensm/osm_db_files.c +++ b/osm/opensm/osm_db_files.c @@ -294,6 +294,7 @@ osm_db_restore( char *p_first_word, *p_rest_of_line, *p_last; char *p_key = NULL; char *p_prev_val, *p_accum_val = NULL; + char *endptr = NULL; unsigned int line_num; OSM_LOG_ENTER( p_log, osm_db_restore ); @@ -415,12 +416,20 @@ osm_db_restore( p_prev_val = NULL; } - /* store our key and value */ - st_insert(p_domain_imp->p_hash, - (st_data_t)p_key, (st_data_t)p_accum_val); osm_log( p_log, OSM_LOG_DEBUG, "osm_db_restore: " "Got key:%s value:%s\n", p_key, p_accum_val); + + /* check that the key is a number */ + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') + osm_log( p_log, OSM_LOG_ERROR, + "osm_db_restore: ERR 610B: " + "Key:%s is invalid\n", + p_key); + else + /* store our key and value */ + st_insert(p_domain_imp->p_hash, + (st_data_t)p_key, (st_data_t)p_accum_val); } else {
-- 1.4.4.1.GIT

From mst at dev.mellanox.co.il Tue Apr 17 09:48:14 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Apr 2007 19:48:14 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <4624C951.6070708@voltaire.com>
References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il> <4624C951.6070708@voltaire.com>
Message-ID: <20070417164814.GB10044@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: Re: [PATCH] installer: update kernel symbol versions
>
> Michael S. Tsirkin wrote:
> >>1. Taking another Module.symvers, and even knowing that this is what causes the
> >> module loading errors, is not as straightforward as finding the headers.
> >
> >
> > It isn't? Why isn't it?
> > As I see it, this is a problem that needs to be solved once
> > when external module is packaged.
> >
>
> OK, but this solution should be part of OFED, and available for all kernels.
> From what
> I have tested (and correct me if I'm wrong) on older kernels one must edit the original
> .symvers anyway to get the right versions.
>
> So why shouldn't the installation do that?

Because these are part of another package, touching them is wrong. External modules will have to solve it in some other way. It should be possible - it's all just scripts, for the most part. For example, for all I care, the build script for an external module can catenate the OFED and kernel symvers, build, and then split them back. That is already better than OFED making permanent changes.
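A minimal sketch of the build-script approach suggested above, assuming a POSIX shell and awk. The function names merge_symvers and build_with_ofed_symvers are hypothetical and not part of OFED. The sketch also makes two choices of its own: it lets OFED entries override kernel entries for the same symbol rather than plainly catenating the files, and it restores a backup copy after the build instead of splitting the merged file.

```sh
# Hypothetical helper: print a merged Module.symvers on stdout.
#   $1 = kernel Module.symvers, $2 = OFED Module.symvers
# Records are tab-separated (CRC, symbol, module); the symbol name in
# field 2 is the key, and an OFED record replaces a kernel record that
# exports the same symbol.
merge_symvers() {
    awk -F'\t' '
        NR == FNR     { ofed[$2] = $0; next }  # first file: remember OFED records
        !($2 in ofed) { print }                # kernel record with no OFED override
        END           { for (s in ofed) print ofed[s] }
    ' "$2" "$1"
}

# Hypothetical wrapper: run a build command against a temporarily merged
# Module.symvers, then put the original kernel file back.
#   $1 = kernel build dir, $2 = OFED Module.symvers, rest = build command
build_with_ofed_symvers() {
    k_src=$1
    ofed_symvers=$2
    shift 2
    backup="$k_src/Module.symvers.orig"

    cp -p "$k_src/Module.symvers" "$backup" || return 1
    if merge_symvers "$backup" "$ofed_symvers" > "$k_src/Module.symvers"; then
        "$@"          # the external module build runs here
        rc=$?
    else
        rc=1
    fi
    # Always restore, so the kernel package's file is left unchanged.
    mv -f "$backup" "$k_src/Module.symvers"
    return "$rc"
}
```

An external module's build script might call it as `build_with_ofed_symvers /lib/modules/$(uname -r)/build /path/to/ofed/Module.symvers make -C /lib/modules/$(uname -r)/build M=$PWD modules`, where the OFED Module.symvers path is a placeholder. The kernel build directory is modified only for the duration of the build.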
--
MST

From rdreier at cisco.com Tue Apr 17 14:23:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Apr 2007 14:23:47 -0700
Subject: [ofa-general] [PATCH][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules
Message-ID:

Export ib_umem_get()/ib_umem_release() and put low-level drivers in control of when to call ib_umem_get() to pin and DMA map userspace, rather than always calling it in ib_uverbs_reg_mr() before calling the low-level driver's reg_user_mr method.

Also move these functions to be in the ib_core module instead of ib_uverbs, so that driver modules using them do not depend on ib_uverbs.

This has a number of advantages:

- It is better design from the standpoint of making generic code a library that can be used or overridden by device-specific code as the details of specific devices dictate.

- Drivers that do not need to pin userspace memory regions do not need to take the performance hit of calling ib_umem_get(). For example, although I have not tried to implement it in this patch, the ipath driver should be able to avoid pinning memory and just use copy_{to,from}_user() to access userspace memory regions.

- Buffers that need special mapping treatment can be identified by the low-level driver. For example, it may be possible to solve some Altix-specific memory ordering issues with mthca CQs in userspace by mapping CQ buffers with extra flags.

- Drivers that need to pin and DMA map userspace memory for things other than memory regions can use ib_umem_get() directly, instead of hacks using extra parameters to their reg_phys_mr method. For example, the mlx4 driver that is pending being merged needs to pin and DMA map QP and CQ buffers, but it does not need to create a memory key for these buffers. So the cleanest solution is for mlx4 to call ib_umem_get() in the create_qp and create_cq methods.

Signed-off-by: Roland Dreier
---

This is a patch I would like to merge for 2.6.22 as an mlx4 prerequisite. But as I tried to indicate in the patch description, I think it is an improvement for other reasons as well. What do other people think about this?

This is not strictly required for mlx4 -- as I mentioned above, the mlx4 stuff can be done in terms of extra driver-specific parameters passed to the reg_user_mr method too -- but I think the non-mlx4 advantages plus the fact that the mlx4 code ends up being much cleaner make this patch a preferable way to go.
drivers/infiniband/Kconfig | 5 + drivers/infiniband/core/Makefile | 4 +- drivers/infiniband/core/{uverbs_mem.c => umem.c} | 136 +++++++++++++++------- drivers/infiniband/core/uverbs.h | 6 +- drivers/infiniband/core/uverbs_cmd.c | 60 +++------- drivers/infiniband/core/uverbs_main.c | 10 +- drivers/infiniband/hw/amso1100/c2_provider.c | 42 +++++--- drivers/infiniband/hw/amso1100/c2_provider.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 28 +++-- drivers/infiniband/hw/cxgb3/iwch_provider.h | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 69 ++++++----- drivers/infiniband/hw/ipath/ipath_mr.c | 38 +++++-- drivers/infiniband/hw/ipath/ipath_verbs.h | 5 +- drivers/infiniband/hw/mthca/mthca_provider.c | 38 +++++-- drivers/infiniband/hw/mthca/mthca_provider.h | 1 + include/rdma/ib_umem.h | 78 ++++++++++++ include/rdma/ib_verbs.h | 28 +---- 19 files changed, 353 insertions(+), 201 deletions(-) rename drivers/infiniband/core/{uverbs_mem.c => umem.c} (63%) create mode 100644 include/rdma/ib_umem.h diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 66b36de..82afba5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -29,6 +29,11 @@ config INFINIBAND_USER_ACCESS libibverbs, libibcm and a hardware driver library from . 
+config INFINIBAND_USER_MEM + bool + depends on INFINIBAND_USER_ACCESS != n + default y + config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND && INET diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 189e5d4..cb1ab3e 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -9,6 +9,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o +ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o @@ -28,5 +29,4 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ - uverbs_marshall.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o diff --git a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/umem.c similarity index 63% rename from drivers/infiniband/core/uverbs_mem.c rename to drivers/infiniband/core/umem.c index c95fe95..48e854c 100644 --- a/drivers/infiniband/core/uverbs_mem.c +++ b/drivers/infiniband/core/umem.c @@ -64,35 +64,56 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d } } -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write) +/** + * ib_umem_get - Pin and DMA map userspace memory. 
+ * @context: userspace context to pin memory for + * @addr: userspace virtual address to start at + * @size: length of region to pin + * @access: IB_ACCESS_xxx flags for memory being pinned + */ +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access) { + struct ib_umem *umem; struct page **page_list; struct ib_umem_chunk *chunk; unsigned long locked; unsigned long lock_limit; unsigned long cur_base; unsigned long npages; - int ret = 0; + int ret; int off; int i; if (!can_do_mlock()) - return -EPERM; + return ERR_PTR(-EPERM); - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; + umem = kmalloc(sizeof *umem, GFP_KERNEL); + if (!umem) + return ERR_PTR(-ENOMEM); - mem->user_base = (unsigned long) addr; - mem->length = size; - mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; - mem->writable = write; + umem->context = context; + umem->length = size; + umem->offset = addr & ~PAGE_MASK; + umem->page_size = PAGE_SIZE; + /* + * We ask for writable memory if any access flags other than + * "remote read" are set. "Local write" and "remote write" + * obviously require write access. "Remote atomic" can do + * things like fetch and add, which will modify memory, and + * "MW bind" can change permissions by binding a window. 
+ */ + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ); - INIT_LIST_HEAD(&mem->chunk_list); + INIT_LIST_HEAD(&umem->chunk_list); + + page_list = (struct page **) __get_free_page(GFP_KERNEL); + if (!page_list) { + kfree(umem); + return ERR_PTR(-ENOMEM); + } - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT; down_write(&current->mm->mmap_sem); @@ -104,13 +125,13 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, goto out; } - cur_base = (unsigned long) addr & PAGE_MASK; + cur_base = addr & PAGE_MASK; while (npages) { ret = get_user_pages(current, current->mm, cur_base, min_t(int, npages, PAGE_SIZE / sizeof (struct page *)), - 1, !write, page_list, NULL); + 1, !umem->writable, page_list, NULL); if (ret < 0) goto out; @@ -136,7 +157,7 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, chunk->page_list[i].length = PAGE_SIZE; } - chunk->nmap = ib_dma_map_sg(dev, + chunk->nmap = ib_dma_map_sg(context->device, &chunk->page_list[0], chunk->nents, DMA_BIDIRECTIONAL); @@ -151,33 +172,25 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, ret -= chunk->nents; off += chunk->nents; - list_add_tail(&chunk->list, &mem->chunk_list); + list_add_tail(&chunk->list, &umem->chunk_list); } ret = 0; } out: - if (ret < 0) - __ib_umem_release(dev, mem, 0); - else + if (ret < 0) { + __ib_umem_release(context->device, umem, 0); + kfree(umem); + } else current->mm->locked_vm = locked; up_write(&current->mm->mmap_sem); free_page((unsigned long) page_list); - return ret; -} - -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem) -{ - __ib_umem_release(dev, umem, 1); - - down_write(&current->mm->mmap_sem); - current->mm->locked_vm -= - PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; - up_write(&current->mm->mmap_sem); + return ret < 0 ? 
ERR_PTR(ret) : umem; } +EXPORT_SYMBOL(ib_umem_get); static void ib_umem_account(struct work_struct *_work) { @@ -191,35 +204,70 @@ static void ib_umem_account(struct work_struct *_work) kfree(work); } -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem) +/** + * ib_umem_release - release memory pinned with ib_umem_get + * @umem: umem struct to release + */ +void ib_umem_release(struct ib_umem *umem) { struct ib_umem_account_work *work; + struct ib_ucontext *context = umem->context; struct mm_struct *mm; + unsigned long diff; - __ib_umem_release(dev, umem, 1); + __ib_umem_release(umem->context->device, umem, 1); mm = get_task_mm(current); if (!mm) return; + diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + kfree(umem); + /* * We may be called with the mm's mmap_sem already held. This * can happen when a userspace munmap() is the call that drops * the last reference to our file and calls our release * method. If there are memory regions to destroy, we'll end - * up here and not be able to take the mmap_sem. Therefore we - * defer the vm_locked accounting to the system workqueue. + * up here and not be able to take the mmap_sem. In that case + * we defer the vm_locked accounting to the system workqueue. 
*/ + if (context->closing && !down_write_trylock(&mm->mmap_sem)) { + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) { + mmput(mm); + return; + } - work = kmalloc(sizeof *work, GFP_KERNEL); - if (!work) { - mmput(mm); + INIT_WORK(&work->work, ib_umem_account); + work->mm = mm; + work->diff = diff; + + schedule_work(&work->work); return; - } + } else + down_write(&mm->mmap_sem); + + current->mm->locked_vm -= diff; + up_write(&mm->mmap_sem); + mmput(mm); +} +EXPORT_SYMBOL(ib_umem_release); + +int ib_umem_page_count(struct ib_umem *umem) +{ + struct ib_umem_chunk *chunk; + int shift; + int i; + int n; + + shift = ilog2(umem->page_size); - INIT_WORK(&work->work, ib_umem_account); - work->mm = mm; - work->diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + n = 0; + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) + n += sg_dma_len(&chunk->page_list[i]) >> shift; - schedule_work(&work->work); + return n; } +EXPORT_SYMBOL(ib_umem_page_count); diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 102a59c..c33546f 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -45,6 +45,7 @@ #include #include +#include #include /* @@ -163,11 +164,6 @@ void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event); -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write); -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem); -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem); - #define IB_UVERBS_DECLARE_CMD(name) \ ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ const char __user *buf, int in_len, \ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 4fd75af..8c338bc 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ 
b/drivers/infiniband/core/uverbs_cmd.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * Copyright (c) 2006 Mellanox Technologies. All rights reserved. * @@ -295,6 +295,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, INIT_LIST_HEAD(&ucontext->qp_list); INIT_LIST_HEAD(&ucontext->srq_list); INIT_LIST_HEAD(&ucontext->ah_list); + ucontext->closing = 0; resp.num_comp_vectors = file->device->num_comp_vectors; @@ -573,7 +574,7 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, struct ib_uverbs_reg_mr cmd; struct ib_uverbs_reg_mr_resp resp; struct ib_udata udata; - struct ib_umem_object *obj; + struct ib_uobject *uobj; struct ib_pd *pd; struct ib_mr *mr; int ret; @@ -599,35 +600,21 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE)) return -EINVAL; - obj = kmalloc(sizeof *obj, GFP_KERNEL); - if (!obj) + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) return -ENOMEM; - init_uobj(&obj->uobject, 0, file->ucontext, &mr_lock_key); - down_write(&obj->uobject.mutex); - - /* - * We ask for writable memory if any access flags other than - * "remote read" are set. "Local write" and "remote write" - * obviously require write access. "Remote atomic" can do - * things like fetch and add, which will modify memory, and - * "MW bind" can change permissions by binding a window. 
- */ - ret = ib_umem_get(file->device->ib_dev, &obj->umem, - (void *) (unsigned long) cmd.start, cmd.length, - !!(cmd.access_flags & ~IB_ACCESS_REMOTE_READ)); - if (ret) - goto err_free; - - obj->umem.virt_base = cmd.hca_va; + init_uobj(uobj, 0, file->ucontext, &mr_lock_key); + down_write(&uobj->mutex); pd = idr_read_pd(cmd.pd_handle, file->ucontext); if (!pd) { ret = -EINVAL; - goto err_release; + goto err_free; } - mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); + mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, + cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); goto err_put; @@ -635,19 +622,19 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, mr->device = pd->device; mr->pd = pd; - mr->uobject = &obj->uobject; + mr->uobject = uobj; atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - obj->uobject.object = mr; - ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); + uobj->object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, uobj); if (ret) goto err_unreg; memset(&resp, 0, sizeof resp); resp.lkey = mr->lkey; resp.rkey = mr->rkey; - resp.mr_handle = obj->uobject.id; + resp.mr_handle = uobj->id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { @@ -658,17 +645,17 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, put_pd_read(pd); mutex_lock(&file->mutex); - list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); + list_add_tail(&uobj->list, &file->ucontext->mr_list); mutex_unlock(&file->mutex); - obj->uobject.live = 1; + uobj->live = 1; - up_write(&obj->uobject.mutex); + up_write(&uobj->mutex); return in_len; err_copy: - idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); err_unreg: ib_dereg_mr(mr); @@ -676,11 +663,8 @@ err_unreg: err_put: put_pd_read(pd); -err_release: - ib_umem_release(file->device->ib_dev, &obj->umem); - err_free: - put_uobj_write(&obj->uobject); + put_uobj_write(uobj); return ret; } @@ 
-691,7 +675,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; struct ib_uobject *uobj; - struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -701,8 +684,7 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, if (!uobj) return -EINVAL; - memobj = container_of(uobj, struct ib_umem_object, uobject); - mr = uobj->object; + mr = uobj->object; ret = ib_dereg_mr(mr); if (!ret) @@ -719,8 +701,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_umem_release(file->device->ib_dev, &memobj->umem); - put_uobj(uobj); return in_len; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f8bc822..8400af8 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -183,6 +183,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, if (!context) return 0; + context->closing = 1; + list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { struct ib_ah *ah = uobj->object; @@ -230,16 +232,10 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = uobj->object; - struct ib_device *mrdev = mr->device; - struct ib_umem_object *memobj; idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); - - memobj = container_of(uobj, struct ib_umem_object, uobject); - ib_umem_release_on_close(mrdev, &memobj->umem); - - kfree(memobj); + kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index fef9727..10a085d 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -56,6 +56,7 @@ #include #include +#include #include #include "c2.h" #include 
"c2_provider.h" @@ -396,6 +397,7 @@ static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, } mr->pd = to_c2pd(ib_pd); + mr->umem = NULL; pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " "*iova_start %llx, first pa %llx, last pa %llx\n", __FUNCTION__, page_shift, pbl_depth, total_len, @@ -428,8 +430,8 @@ static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); } -static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { u64 *pages; u64 kva = 0; @@ -441,15 +443,23 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct c2_mr *c2mr; pr_debug("%s:%u\n", __FUNCTION__, __LINE__); - shift = ffs(region->page_size) - 1; c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); if (!c2mr) return ERR_PTR(-ENOMEM); c2mr->pd = c2pd; + c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(c2mr->umem)) { + err = PTR_ERR(c2mr->umem); + kfree(c2mr); + return ERR_PTR(err); + } + + shift = ffs(c2mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -459,35 +469,34 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, } i = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) { for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - (region->page_size * k); + (c2mr->umem->page_size * k); } } } - kva = (u64)region->virt_base; + kva = virt; err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), pages, - region->page_size, + c2mr->umem->page_size, 
i, - region->length, - region->offset, + length, + c2mr->umem->offset, &kva, c2_convert_access(acc), c2mr); kfree(pages); - if (err) { - kfree(c2mr); - return ERR_PTR(err); - } + if (err) + goto err; return &c2mr->ibmr; err: + ib_umem_release(c2mr->umem); kfree(c2mr); return ERR_PTR(err); } @@ -502,8 +511,11 @@ static int c2_dereg_mr(struct ib_mr *ib_mr) err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); if (err) pr_debug("c2_stag_dealloc failed: %d\n", err); - else + else { + if (mr->umem) + ib_umem_release(mr->umem); kfree(mr); + } return err; } diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h index fc90622..1076df2 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.h +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -73,6 +73,7 @@ struct c2_pd { struct c2_mr { struct ib_mr ibmr; struct c2_pd *pd; + struct ib_umem *umem; }; struct c2_av; diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..98cdd13 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include "cxio_hal.h" @@ -441,6 +442,8 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr) remove_handle(rhp, &rhp->mmidr, mmid); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); + if (mhp->umem) + ib_umem_release(mhp->umem); PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); kfree(mhp); return 0; @@ -575,8 +578,8 @@ static int iwch_reregister_phys_mem(struct ib_mr *mr, } -static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { __be64 *pages; int shift, n, len; @@ -589,7 +592,6 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct 
iwch_reg_user_mr_resp uresp; PDBG("%s ib_pd %p\n", __FUNCTION__, pd); - shift = ffs(region->page_size) - 1; php = to_iwch_pd(pd); rhp = php->rhp; @@ -597,8 +599,17 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mhp->umem)) { + err = PTR_ERR(mhp->umem); + kfree(mhp); + return ERR_PTR(err); + } + + shift = ffs(mhp->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -609,13 +620,13 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, i = n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = cpu_to_be64(sg_dma_address( &chunk->page_list[j]) + - region->page_size * k); + mhp->umem->page_size * k); } } @@ -623,9 +634,9 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = iwch_ib_to_tpt_access(acc); - mhp->attr.va_fbo = region->virt_base; + mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; - mhp->attr.len = (u32) region->length; + mhp->attr.len = (u32) length; mhp->attr.pbl_size = i; err = iwch_register_mem(rhp, php, mhp, shift, pages); kfree(pages); @@ -648,6 +659,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, return &mhp->ibmr; err: + ib_umem_release(mhp->umem); kfree(mhp); return ERR_PTR(err); } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index 93bcc56..48833f3 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++
b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -73,6 +73,7 @@ struct tpt_attributes { struct iwch_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct iwch_dev *rhp; u64 kva; struct tpt_attributes attr; diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 82ded44..88e7866 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -175,6 +175,7 @@ struct ehca_mr { struct ib_mr ib_mr; /* must always be first in ehca_mr */ struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ } ib; + struct ib_umem *umem; spinlock_t mrlock; enum ehca_mr_flag flags; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..9b22c5a 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -78,8 +78,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, int num_phys_buf, int mr_access_flags, u64 *iova_start); -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, int mr_access_flags, struct ib_udata *udata); int ehca_rereg_phys_mr(struct ib_mr *mr, diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index d22ab56..84c5bb4 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -39,6 +39,8 @@ * POSSIBILITY OF SUCH DAMAGE. 
*/ +#include + #include #include "ehca_iverbs.h" @@ -238,10 +240,8 @@ reg_phys_mr_exit0: /*----------------------------------------------------------------------*/ -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, - int mr_access_flags, - struct ib_udata *udata) +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, + int mr_access_flags, struct ib_udata *udata) { struct ib_mr *ib_mr; struct ehca_mr *e_mr; @@ -257,11 +257,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ehca_gen_err("bad pd=%p", pd); return ERR_PTR(-EFAULT); } - if (!region) { - ehca_err(pd->device, "bad input values: region=%p", region); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && @@ -275,17 +271,10 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } - if (region->page_size != PAGE_SIZE) { - ehca_err(pd->device, "page size not supported, " - "region->page_size=%x", region->page_size); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } - if ((region->length == 0) || - ((region->virt_base + region->length) < region->virt_base)) { + if (length == 0 || virt + length < virt) { ehca_err(pd->device, "bad input values: length=%lx " - "virt_base=%lx", region->length, region->virt_base); + "virt_base=%lx", length, virt); ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } @@ -297,40 +286,55 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, goto reg_user_mr_exit0; } + e_mr->umem = ib_umem_get(pd->uobject->context, start, length, + mr_access_flags); + if (IS_ERR(e_mr->umem)) { + ib_mr = (void *) e_mr->umem; + goto reg_user_mr_exit1; + } + + if (e_mr->umem->page_size != PAGE_SIZE) { + ehca_err(pd->device, "page size not supported, " + "e_mr->umem->page_size=%x", e_mr->umem->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto 
reg_user_mr_exit2; + } + /* determine number of MR pages */ - num_pages_mr = (((region->virt_base % PAGE_SIZE) + region->length + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = (((region->virt_base % EHCA_PAGESIZE) + region->length + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = (((virt % PAGE_SIZE) + length + PAGE_SIZE - 1) / + PAGE_SIZE); + num_pages_4k = (((virt % EHCA_PAGESIZE) + length + EHCA_PAGESIZE - 1) / + EHCA_PAGESIZE); /* register MR on HCA */ pginfo.type = EHCA_MR_PGI_USER; pginfo.num_pages = num_pages_mr; pginfo.num_4k = num_pages_4k; - pginfo.region = region; - pginfo.next_4k = region->offset / EHCA_PAGESIZE; + pginfo.region = e_mr->umem; + pginfo.next_4k = e_mr->umem->offset / EHCA_PAGESIZE; pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, - (&region->chunk_list), + (&e_mr->umem->chunk_list), list); - ret = ehca_reg_mr(shca, e_mr, (u64*)region->virt_base, - region->length, mr_access_flags, e_pd, &pginfo, - &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); if (ret) { ib_mr = ERR_PTR(ret); - goto reg_user_mr_exit1; + goto reg_user_mr_exit2; } /* successful registration of all pages */ return &e_mr->ib.ib_mr; +reg_user_mr_exit2: + ib_umem_release(e_mr->umem); reg_user_mr_exit1: ehca_mr_delete(e_mr); reg_user_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(pd->device, "rc=%lx pd=%p region=%p mr_access_flags=%x" + ehca_err(pd->device, "rc=%lx pd=%p mr_access_flags=%x" " udata=%p", - PTR_ERR(ib_mr), pd, region, mr_access_flags, udata); + PTR_ERR(ib_mr), pd, mr_access_flags, udata); return ib_mr; } /* end ehca_reg_user_mr() */ @@ -596,6 +600,9 @@ int ehca_dereg_mr(struct ib_mr *mr) goto dereg_mr_exit0; } + if (e_mr->umem) + ib_umem_release(e_mr->umem); + /* successful deregistration */ ehca_mr_delete(e_mr); diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index 8cc8598..8e91c8b
100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -31,6 +31,7 @@ * SOFTWARE. */ +#include #include #include @@ -147,6 +148,7 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, mr->mr.offset = 0; mr->mr.access_flags = acc; mr->mr.max_segs = num_phys_buf; + mr->umem = NULL; m = 0; n = 0; @@ -170,50 +172,60 @@ bail: /** * ipath_reg_user_mr - register a userspace memory region * @pd: protection domain for this memory region - * @region: the user memory region + * @start: starting userspace address + * @length: length of region to register + * @virt_addr: virtual address to use (from HCA's point of view) * @mr_access_flags: access flags for this memory region * @udata: unused by the InfiniPath driver * * Returns the memory region on success, otherwise returns an errno. */ -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, struct ib_udata *udata) +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, + struct ib_udata *udata) { struct ipath_mr *mr; + struct ib_umem *umem; struct ib_umem_chunk *chunk; int n, m, i; struct ib_mr *ret; - if (region->length == 0) { + if (length == 0) { ret = ERR_PTR(-EINVAL); goto bail; } + umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags); + if (IS_ERR(umem)) + return (void *) umem; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &umem->chunk_list, list) n += chunk->nents; mr = alloc_mr(n, &to_idev(pd->device)->lk_table); if (!mr) { ret = ERR_PTR(-ENOMEM); + ib_umem_release(umem); goto bail; } mr->mr.pd = pd; - mr->mr.user_base = region->user_base; - mr->mr.iova = region->virt_base; - mr->mr.length = region->length; - mr->mr.offset = region->offset; + mr->mr.user_base = start; + mr->mr.iova = virt_addr; + mr->mr.length = length; + mr->mr.offset = umem->offset; mr->mr.access_flags = mr_access_flags; mr->mr.max_segs = n; +
mr->umem = umem; m = 0; n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &umem->chunk_list, list) { for (i = 0; i < chunk->nmap; i++) { mr->mr.map[m]->segs[n].vaddr = page_address(chunk->page_list[i].page); - mr->mr.map[m]->segs[n].length = region->page_size; + mr->mr.map[m]->segs[n].length = umem->page_size; n++; if (n == IPATH_SEGSZ) { m++; @@ -247,6 +259,10 @@ int ipath_dereg_mr(struct ib_mr *ibmr) i--; kfree(mr->mr.map[i]); } + + if (mr->umem) + ib_umem_release(mr->umem); + kfree(mr); return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index c0c8d5b..8f7af7a 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -248,6 +248,7 @@ struct ipath_sge { /* Memory region */ struct ipath_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct ipath_mregion mr; /* must be last */ }; @@ -726,8 +727,8 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, u64 *iova_start); -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int ipath_dereg_mr(struct ib_mr *ibmr); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 0725ad7..cd5eb60 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -37,6 +37,7 @@ */ #include +#include #include #include @@ -907,6 +908,8 @@ static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) return ERR_PTR(err); } + mr->umem = NULL; + return &mr->ibmr; } @@ -1002,11 +1005,13 @@ static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, } kfree(page_list); + mr->umem = NULL; + return &mr->ibmr; } -static struct ib_mr *mthca_reg_user_mr(struct
ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; @@ -1017,20 +1022,26 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, int err = 0; int write_mtt_size; - shift = ffs(region->page_size) - 1; - mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) return ERR_PTR(-ENOMEM); + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mr->umem)) { + err = PTR_ERR(mr->umem); + goto err; + } + + shift = ffs(mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) n += chunk->nents; mr->mtt = mthca_alloc_mtt(dev, n); if (IS_ERR(mr->mtt)) { err = PTR_ERR(mr->mtt); - goto err; + goto err_umem; } pages = (u64 *) __get_free_page(GFP_KERNEL); @@ -1043,12 +1054,12 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages)); - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - region->page_size * k; + mr->umem->page_size * k; /* * Be friendly to write_mtt and pass it chunks * of appropriate size.
@@ -1070,8 +1081,8 @@ mtt_done: if (err) goto err_mtt; - err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, region->virt_base, - region->length, convert_access(acc), mr); + err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, virt, length, + convert_access(acc), mr); if (err) goto err_mtt; @@ -1081,6 +1092,9 @@ mtt_done: err_mtt: mthca_free_mtt(dev, mr->mtt); +err_umem: + ib_umem_release(mr->umem); + err: kfree(mr); return ERR_PTR(err); @@ -1089,8 +1103,12 @@ err: static int mthca_dereg_mr(struct ib_mr *mr) { struct mthca_mr *mmr = to_mmr(mr); + mthca_free_mr(to_mdev(mr->device), mmr); + if (mmr->umem) + ib_umem_release(mmr->umem); kfree(mmr); + return 0; } diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 1d266ac..262616c 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -73,6 +73,7 @@ struct mthca_mtt; struct mthca_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct mthca_mtt *mtt; }; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h new file mode 100644 index 0000000..06307f7 --- /dev/null +++ b/include/rdma/ib_umem.h @@ -0,0 +1,78 @@ +/* + * Copyright (c) 2007 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_UMEM_H +#define IB_UMEM_H + +#include +#include + +struct ib_ucontext; + +struct ib_umem { + struct ib_ucontext *context; + size_t length; + int offset; + int page_size; + int writable; + struct list_head chunk_list; +}; + +struct ib_umem_chunk { + struct list_head list; + int nents; + int nmap; + struct scatterlist page_list[0]; +}; + +#ifdef CONFIG_INFINIBAND_USER_MEM + +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access); +void ib_umem_release(struct ib_umem *umem); +int ib_umem_page_count(struct ib_umem *umem); + +#else /* CONFIG_INFINIBAND_USER_MEM */ + +#include + +static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context, + unsigned long addr, size_t size, + int access) { + return ERR_PTR(-EINVAL); +} +static inline void ib_umem_release(struct ib_umem *umem) { } +static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; } + +#endif /* CONFIG_INFINIBAND_USER_MEM */ + +#endif /* IB_UMEM_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..b910baa 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -5,7 +5,7 @@ * Copyright (c) 2004 Topspin Corporation. All rights reserved. 
* Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -708,6 +708,7 @@ struct ib_ucontext { struct list_head qp_list; struct list_head srq_list; struct list_head ah_list; + int closing; }; struct ib_uobject { @@ -721,23 +722,6 @@ struct ib_uobject { int live; }; -struct ib_umem { - unsigned long user_base; - unsigned long virt_base; - size_t length; - int offset; - int page_size; - int writable; - struct list_head chunk_list; -}; - -struct ib_umem_chunk { - struct list_head list; - int nents; - int nmap; - struct scatterlist page_list[0]; -}; - struct ib_udata { void __user *inbuf; void __user *outbuf; @@ -750,11 +734,6 @@ struct ib_udata { ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) -struct ib_umem_object { - struct ib_uobject uobject; - struct ib_umem umem; -}; - struct ib_pd { struct ib_device *device; struct ib_uobject *uobject; @@ -998,7 +977,8 @@ struct ib_device { int mr_access_flags, u64 *iova_start); struct ib_mr * (*reg_user_mr)(struct ib_pd *pd, - struct ib_umem *region, + u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int (*query_mr)(struct ib_mr *mr, -- 1.5.1 From mst at dev.mellanox.co.il Tue Apr 17 14:48:31 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 18 Apr 2007 00:48:31 +0300 Subject: [ofa-general] Re: [PATCH][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules In-Reply-To: References: Message-ID: <20070417214831.GD25314@mellanox.co.il> These are not new, but re-reading this I got a couple of minor questions: > + diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; > + kfree(umem); > + > /* > * We may be called with the mm's mmap_sem already held. This > * can happen when a userspace munmap() is the call that drops > * the last reference to our file and calls our release > * method. If there are memory regions to destroy, we'll end > - * up here and not be able to take the mmap_sem. Therefore we > - * defer the vm_locked accounting to the system workqueue. > + * up here and not be able to take the mmap_sem. In that case > + * we defer the vm_locked accounting to the system workqueue. > */ > + if (context->closing && !down_write_trylock(&mm->mmap_sem)) { > + work = kmalloc(sizeof *work, GFP_KERNEL); > + if (!work) { > + mmput(mm); Error handling looks a bit bogus here - we'll never give the task its rlimit back. Wouldn't it be a bit cleaner to allocate the work object together with umem? > + return; > + } > > - work = kmalloc(sizeof *work, GFP_KERNEL); > - if (!work) { > - mmput(mm); > + INIT_WORK(&work->work, ib_umem_account); > + work->mm = mm; > + work->diff = diff; > + > + schedule_work(&work->work); We never flush the work queue. I wonder whether there's a potential issue with module unloading. > return; > - } > + } else > + down_write(&mm->mmap_sem); > + > + current->mm->locked_vm -= diff; > + up_write(&mm->mmap_sem); > + mmput(mm); > +} -- MST From rdreier at cisco.com Tue Apr 17 14:54:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Apr 2007 14:54:42 -0700 Subject: [ofa-general] Re: [PATCH][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules In-Reply-To: <20070417214831.GD25314@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 18 Apr 2007 00:48:31 +0300") References: <20070417214831.GD25314@mellanox.co.il> Message-ID: > Error handling looks a bit bogus here - we'll never give the task > its rlimit back. Wouldn't it be a bit cleaner to allocate > the work object together with umem? Yes, that's a good point. Mainline already has this issue, so I'll make that a separate patch before this one. > > + schedule_work(&work->work); > > We never flush the work queue. > I wonder whether there's a potential issue with module unloading. Actually ib_cache_cleanup_one() has a flush_scheduled_work() so there's no actual problem here ;) ... but I agree it's far from obvious that the work always gets flushed. I'll just move the flush_scheduled_work() from the uverbs cleanup function (where it's no longer needed) to the core cleanup function. - R. From mst at dev.mellanox.co.il Tue Apr 17 15:02:14 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Apr 2007 01:02:14 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <460ACED8.20605@gmail.com> References: <6a122cc00703220602s7cdad558ud73f72e39f812eaf@mail.gmail.com> <20070322172245.GB17532@mellanox.co.il> <46094DA5.8000601@gmail.com> <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> Message-ID: <20070417220214.GG25314@mellanox.co.il> > Quoting Moni Levy : > Subject: Re: pkey change handling patch > > Michael S. Tsirkin wrote: > >> Quoting Roland Dreier : > >> Subject: Re: pkey change handling patch > >> > >> Michael> I looked at cache.c and you are right. Maybe we should > >> Michael> either 1. report events after cache has been updated or > >> Michael> 2. make cache queries error out (EBUSY?) if cache has not > >> Michael> updated yet.
> >> > >> Michael> Option 1 requires core changes, option 2 - ULP changes > >> > >> Michael> I would be inclined to go for 2. Roland? > >> > >> Yes, I agree. How about ESTALE as an error code? > >> > > > > Looks OK. > > Moni, I think this should be a separate patch, > > and your pkey work on top of this. > > > Ok, I'll try to close that tomorrow. So, this all turned out looking much more hairy than I thought it would be. Maybe option 1 is better? How about adding new types of events, and generating them once cache has been updated? -- MST From rdreier at cisco.com Tue Apr 17 15:05:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Apr 2007 15:05:32 -0700 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070417220214.GG25314@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Apr 2007 01:02:14 +0300") References: <6a122cc00703220602s7cdad558ud73f72e39f812eaf@mail.gmail.com> <20070322172245.GB17532@mellanox.co.il> <46094DA5.8000601@gmail.com> <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> Message-ID: > So, this all turned out looking much more hairy than I thought it would be. > Maybe option 1 is better? How about adding new types of events, and > generating them once cache has been updated? Well, I had another idea that I've been meaning to post. How about adding a can_block flag to the methods that query the cache. If can_block is set, just wait for the cache to be up-to-date, and if it is not set, then return ESTALE (or maybe EAGAIN to match non-blocking file methods even more closely). That forces us to think about every use of the cache queries, but I'm not sure that's a bad thing. - R. From mst at dev.mellanox.co.il Tue Apr 17 15:35:47 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 18 Apr 2007 01:35:47 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: References: <20070322172245.GB17532@mellanox.co.il> <46094DA5.8000601@gmail.com> <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> Message-ID: <20070417223547.GI25314@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: pkey change handling patch > > > So, this all turned out looking much more hairy than I thought it would be. > > Maybe option 1 is better? How about adding new types of events, and > > generating them once cache has been updated? > > Well, I had another idea that I've been meaning to post. How about > adding a can_block flag to the methods that query the cache. Could be a good idea, but note this will require creating another WQ for cache updates, otherwise e.g. IPoIB will deadlock waiting for it. > If can_block > is set, just wait for the cache to be up-to-date, and if it is not > set, then return ESTALE (or maybe EAGAIN to match non-blocking file > methods even more closely). API-wise, I like having "try" (e.g. down_trylock) non-blocking functions better than yet another flag. E.g. ib_tryget_cached_gid? By the way, are there any users for the non-blocking API? Maybe we can simply relax the requirement, and make all APIs blocking? > That forces us to think about every use of the cache queries, but I'm > not sure that's a bad thing. Well, for the blocking case this is OK I think. However, I now think that for the non-blocking case (if we implement it), relying on timed retry from ULP is lame - we actually know when cache's up to date, why force ULPs to jump through hoops to guess? Let's send them an event.
-- MST From rdreier at cisco.com Tue Apr 17 16:03:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Apr 2007 16:03:39 -0700 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070417223547.GI25314@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Apr 2007 01:35:47 +0300") References: <20070322172245.GB17532@mellanox.co.il> <46094DA5.8000601@gmail.com> <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> Message-ID: > Could be a good idea, but note this will require creating another > WQ for cache updates, otherwise e.g. IPoIB > will deadlock waiting for it. Actually, on further thought, it's kind of a stupid idea. The whole point of the cache module is to be usable in places where blocking isn't allowed. If it's being called from a context where we know we can block, then there's no point in going through the cache at all. > By the way, are there any users for the non-blocking API? > Maybe we can simply relax the requirement, and make all API's > blocking? Well, at least mthca is using it when posting WQEs to MLX QPs. But it could keep its own internal cache (which would be much simpler, and easier to keep in sync since it sees all MADs that change the cache as they go by). If there are no other users for a non-blocking API then we could tear out the whole caching mess and be much happier. At least IPoIB and SRP seem like they could live without the cache, with the addition of ib_find_pkey() and ib_find_gid() convenience functions, and they're the only users in infiniband/ulp. infiniband/core would take a little more auditing. - R. 
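For readers following the thread, the non-blocking lookup being discussed can be sketched roughly as below. This is only an illustration in plain userspace C under stated assumptions: struct pkey_table, its up_to_date flag, and try_find_cached_pkey() are hypothetical names invented here, not existing ib_core interfaces, and comparing only the low 15 bits of the P_Key (ignoring the membership bit) is likewise an assumption of the sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a cached per-port P_Key table (not a kernel type). */
struct pkey_table {
	int      up_to_date;  /* 0 while a cache refresh is still pending */
	int      len;         /* number of valid entries in pkeys[] */
	uint16_t pkeys[8];
};

/*
 * "Try"-style lookup that never blocks.
 * Returns 0 and stores the table index in *index on success,
 * -ESTALE if the cache has not been refreshed yet,
 * -ENOENT if the P_Key is not in the table.
 */
static int try_find_cached_pkey(const struct pkey_table *t,
				uint16_t pkey, int *index)
{
	int i;

	if (!t->up_to_date)
		return -ESTALE;

	for (i = 0; i < t->len; ++i) {
		/* Assumption: ignore the top (membership) bit when matching. */
		if ((t->pkeys[i] & 0x7fff) == (pkey & 0x7fff)) {
			*index = i;
			return 0;
		}
	}

	return -ENOENT;
}
```

A caller that may sleep could skip the cache and query the device directly, while an atomic-context caller would treat -ESTALE as "retry after the next cache-updated event", which is the trade-off debated above.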
From sashak at voltaire.com Tue Apr 17 16:39:35 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Apr 2007 02:39:35 +0300
Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file
In-Reply-To: <4624DCFE.9030904@dev.mellanox.co.il>
References: <4624DCFE.9030904@dev.mellanox.co.il>
Message-ID: <20070417233935.GB29254@sashak.voltaire.com>

On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote:
> Hi Hal,
>
> When parsing guid2lid file, invalid guid string
> ended up unpacked as guid 0x0. Ignoring line with
> invalid guid string.
>
> This bug doesn't look too important - don't think
> that it should go to ofed_1_2. Anyway, your call.

It looks like a safe change for me.

BTW any reason to use strtouq() instead of more popular (IMHO) strtoul()
or strtoull()?

Sasha

>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik
> ---
> osm/opensm/osm_db_files.c | 15 ++++++++++++---
> 1 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c
> index dbadd68..23eaa0b 100644
> --- a/osm/opensm/osm_db_files.c
> +++ b/osm/opensm/osm_db_files.c
> @@ -294,6 +294,7 @@ osm_db_restore(
>    char *p_first_word, *p_rest_of_line, *p_last;
>    char *p_key = NULL;
>    char *p_prev_val, *p_accum_val = NULL;
> +  char *endptr = NULL;
>    unsigned int line_num;
>
>    OSM_LOG_ENTER( p_log, osm_db_restore );
> @@ -415,12 +416,20 @@ osm_db_restore(
>          p_prev_val = NULL;
>        }
>
> -      /* store our key and value */
> -      st_insert(p_domain_imp->p_hash,
> -                (st_data_t)p_key, (st_data_t)p_accum_val);
>        osm_log( p_log, OSM_LOG_DEBUG,
>                 "osm_db_restore: "
>                 "Got key:%s value:%s\n", p_key, p_accum_val);
> +
> +      /* check that the key is a number */
> +      if (!strtouq(p_key,&endptr,0) && *endptr != '\0')
> +        osm_log( p_log, OSM_LOG_ERROR,
> +                 "osm_db_restore: ERR 610B: "
> +                 "Key:%s is invalid\n",
> +                 p_key);
> +      else
> +        /* store our key and value */
> +        st_insert(p_domain_imp->p_hash,
> +                  (st_data_t)p_key, (st_data_t)p_accum_val);
>      }
>      else
>      {
> --
> 1.4.4.1.GIT
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From mst at dev.mellanox.co.il Tue Apr 17 16:35:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Apr 2007 02:35:00 +0300
Subject: [ofa-general] Re: pkey change handling patch
In-Reply-To:
References: <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il>
Message-ID: <20070417233500.GP25314@mellanox.co.il>

> Well, at least mthca is using it when posting WQEs to MLX QPs. But it
> could keep its own internal cache (which would be much simpler, and
> easier to keep in sync since it sees all MADs that change the cache as
> they go by).

Yes, and it can also use this cache to speed up ib_find_pkey etc queries.

--
MST

From pradeep at us.ibm.com Tue Apr 17 17:00:58 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Tue, 17 Apr 2007 17:00:58 -0700
Subject: [ofa-general] Next set of mthca issues
In-Reply-To:
Message-ID:

That was indeed the problem. Thanks for your help!

Pradeep
pradeep at us.ibm.com

Roland Dreier wrote on 04/16/2007 07:11:19 PM:

> > Looks like https://bugs.openfabrics.org/show_bug.cgi?id=431 to me, which
> > is fixed in OFED-1.2-20070411-0938 or newer.
>
> Yes, I agree, it does look like the same problem. This fix is in
> 2.6.21-rc6 and newer too.
>
> - R.
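A side note on the key check in the guid2lid patch earlier in this thread: the test (!strtouq(p_key,&endptr,0) && *endptr != '\0') only rejects a key when the converted value is zero and unparsed characters remain, so a string such as "0x12zz" would still be stored. A stricter check might look like the sketch below, under the assumption that a key has to be one complete unsigned 64-bit number. parse_guid_key is a hypothetical helper name, not an OpenSM function, and it uses strtoull() as suggested in the review.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse `key` as an unsigned 64-bit number.
 * Base 0 lets strtoull() accept decimal, octal and "0x..." input.
 * Returns 0 and stores the value on success, -1 if the string is not
 * one complete, in-range number.
 * Caveat: strtoull() itself skips leading whitespace and accepts a
 * sign; this sketch does not reject those cases. */
static int parse_guid_key(const char *key, uint64_t *out)
{
    char *endptr = NULL;
    unsigned long long val;

    if (key == NULL || *key == '\0')
        return -1;                  /* nothing to parse */

    errno = 0;
    val = strtoull(key, &endptr, 0);

    if (endptr == key)              /* no digits were consumed */
        return -1;
    if (*endptr != '\0')            /* trailing characters remain */
        return -1;
    if (errno == ERANGE)            /* value does not fit */
        return -1;

    *out = (uint64_t)val;
    return 0;
}
```

With a helper of this shape, a key of "0" is still accepted as a number, while "zz", "0x12zz" and the empty string are all rejected, which is the distinction the original condition cannot make.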
From rdreier at cisco.com Tue Apr 17 20:04:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Apr 2007 20:04:51 -0700 Subject: [ofa-general] Re: [PATCH][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules In-Reply-To: (Roland Dreier's message of "Tue, 17 Apr 2007 14:54:42 -0700") References: <20070417214831.GD25314@mellanox.co.il> Message-ID: > > Error handling looks a bit bogus here - we'll never give the task > > it's rlimit back. Wouldn't it be a bit cleaner to allocate > > the work object together with umem? > > Yes, that's a good point. Mainline already has this issue, so I'll > make that a separate patch before this one. Actually, it's a pain to fix in mainline, because the lifetime of struct ib_umem is not controlled by ib_umem_release (it's embedded in other structures instead). So I'll fix this as a patch on top of this one (and it becomes yet another virtue of this approach -- it makes this problem easily fixable). By the way, what's your overall opinion of this patch? Do you like this approach for mlx4 queues (and in general)? - R. From kliteyn at dev.mellanox.co.il Tue Apr 17 23:05:44 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 18 Apr 2007 09:05:44 +0300 Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <20070417233935.GB29254@sashak.voltaire.com> References: <4624DCFE.9030904@dev.mellanox.co.il> <20070417233935.GB29254@sashak.voltaire.com> Message-ID: <4625B538.3080207@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote: >> Hi Hal, >> >> When parsing guid2lid file, invalid guid string >> ended up unpacked as guid 0x0. Ignoring line with >> invalid guid string. >> >> This bug doesn't look too important - don't think >> that it should go to ofed_1_2. Anyway, your call. > > It looks like a safe change for me. > > BTW any reason to use strtouq() instead of more popular (IMHO) strtoul() > or strtoull()? 
No particular reason. It specifically says that the function "convert string to an unsigned 64-bit integer" instead of unsigned long or unsigned long long, but on the other hand it doesn't matter, because uint64_t is a typedef anyway. If you have special sentiments about strtoul/strtoull - feel free to change it. -- Yevgeny > Sasha > >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> osm/opensm/osm_db_files.c | 15 ++++++++++++--- >> 1 files changed, 12 insertions(+), 3 deletions(-) >> >> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c >> index dbadd68..23eaa0b 100644 >> --- a/osm/opensm/osm_db_files.c >> +++ b/osm/opensm/osm_db_files.c >> @@ -294,6 +294,7 @@ osm_db_restore( >> char *p_first_word, *p_rest_of_line, *p_last; >> char *p_key = NULL; >> char *p_prev_val, *p_accum_val = NULL; >> + char *endptr = NULL; >> unsigned int line_num; >> >> OSM_LOG_ENTER( p_log, osm_db_restore ); >> @@ -415,12 +416,20 @@ osm_db_restore( >> p_prev_val = NULL; >> } >> >> - /* store our key and value */ >> - st_insert(p_domain_imp->p_hash, >> - (st_data_t)p_key, (st_data_t)p_accum_val); >> osm_log( p_log, OSM_LOG_DEBUG, >> "osm_db_restore: " >> "Got key:%s value:%s\n", p_key, p_accum_val); >> + >> + /* check that the key is a number */ >> + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') >> + osm_log( p_log, OSM_LOG_ERROR, >> + "osm_db_restore: ERR 610B: " >> + "Key:%s is invalid\n", >> + p_key); >> + else >> + /* store our key and value */ >> + st_insert(p_domain_imp->p_hash, >> + (st_data_t)p_key, (st_data_t)p_accum_val); >> } >> else >> { >> -- >> 1.4.4.1.GIT >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Tue Apr 17 23:59:18 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 18 Apr 2007 
09:59:18 +0300
Subject: [ofa-general] Re: IPoIB bonding document ?
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net>
Message-ID: <4625C1C6.6040709@voltaire.com>

Tang, Changqing wrote:
> I know you are working on bonding, where is a good document
> about IPoIB bonding ?
> I have a few questions:
>
> 1. is bonding working on two HCAs, as well as two ports on th same HCA ?
> 2. is the second channel idle, or two channels are used during regular
> time ?

Hi CQ - I am cc-ing the general list, as other people might be
interested as well.

The package provided with OFED 1.2 contains documentation; you can
browse to the URL below to get the doc.

http://www.openfabrics.org/git/?p=~monis/ofed-bond-pkg.git;a=tree;f=ib-bonding-0.9.0/docs;h=ea30b3e6e8ebe530182cff18e8e7db19ee4aa346;hb=HEAD

Bonding works at the interface level, such that a bonding master
interface (e.g. bond0) enslaves other interfaces (e.g. ib0 and ib1). In
the IPoIB case, the enslaved devices can be bound to two (or more) ports
on the same or different HCAs; it's also possible to bond child
interfaces (e.g. ib0.8003 and ib1.8003).

The bonding driver has one HA and a few LB operation modes; currently,
only the HA mode (named Active-Backup) is supported for IPoIB.

> 3. the mesage re-transmit from first channel to second channel at TCP
> packet level, right ?

I am not sure I follow you. If you are asking whether, after a bonding
fail-over, TCP re-transmission is done over the active interface used by
bonding, the answer is yes.

Or.

From rajib.majumder at credit-suisse.com Wed Apr 18 01:10:46 2007
From: rajib.majumder at credit-suisse.com (Majumder, Rajib)
Date: Wed, 18 Apr 2007 16:10:46 +0800
Subject: [ofa-general] IB on Fiber
Message-ID:

Hi,

Does IB support Fiber as physical medium, apart from copper?
What's the max length it supports now?

Thanks
Rajib

From bs at q-leap.de Wed Apr 18 02:11:55 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Wed, 18 Apr 2007 11:11:55 +0200
Subject: [ofa-general] infinipath0: invalidaddr error
Message-ID: <200704181111.55160.bs@q-leap.de>

Hi,

from time to time I see these messages in the logs

[14052.195559] ib_ipath 0000:0a:00.0: infinipath0: invalidaddr error

What does that mean? Can it cause further problems?

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

From vlad at lists.openfabrics.org Wed Apr 18 02:36:45 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 18 Apr 2007 02:36:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070418-0200 daily build status
Message-ID: <20070418093645.9DE73E60804@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From p.kovacs at holografika.com Wed Apr 18 02:40:59 2007 From: p.kovacs at holografika.com (Kovacs Peter Tamas) Date: Wed, 18 Apr 2007 11:40:59 +0200 Subject: [ofa-general] OFED 1.2RC1 ib-bonding does not compile on Fedora Core 6 Message-ID: <4625E7AB.4090200@holografika.com> Dear OFED-developers, I just wanted to notify you that the ib-bonding package does not compile on a freshly installed Fedora Core 6 x86_64. I've attached the build output (I've built this part only to keep the log short, everything else compiles perfectly). FYI: [pkovacs at gigabyte tmp]$ uname -a Linux gigabyte 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux If you need more details, just ask! 
Please let me know if this will be fixed in the final 1.2.

Thanks for your help,
Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OFED.build.15519.log
Type: text/x-log
Size: 11269 bytes
Desc: not available
URL:

From tgenty at googlemail.com Wed Apr 18 03:43:23 2007
From: tgenty at googlemail.com (Thibaud Genty)
Date: Wed, 18 Apr 2007 11:43:23 +0100
Subject: [ofa-general] [Xen-SmartIO] Build crashed on domu_cmd.c with OFED 1.1-rc6
Message-ID: <3a7d12420704180343r5185118freb52d9b59c38b942@mail.gmail.com>

Hi,

I have a problem building the Xen-SmartIO kernel. Here is the build error:

***********************
CC [M] drivers/infiniband/hw/mthca/mthca_catas.o
LD [M] drivers/infiniband/hw/mthca/ib_mthca.o
CC [M] drivers/infiniband/ib_xen_backend/domu_cmd.o
In file included from drivers/infiniband/ib_xen_backend/domu_cmd.c:30:
drivers/infiniband/ib_xen_backend/ibad.h:36: error: array type has incomplete element type
make[6]: *** [drivers/infiniband/ib_xen_backend/domu_cmd.o] Error 1
make[5]: *** [drivers/infiniband/ib_xen_backend] Error 2
make[4]: *** [drivers/infiniband] Error 2
make[3]: *** [drivers] Error 2
make[3]: Leaving directory `/tmp/xen-smartio.hg/linux-2.6.16-rc3-xen0'
make[3]: Entering directory `/tmp/xen-smartio.hg/linux-2.6.16-rc3-xen0'
INSTALL crypto/crc32c.ko
cp: cannot stat `crypto/crc32c.ko': No such file or directory
make[4]: *** [crypto/crc32c.ko] Error 1
make[3]: *** [_modinst_] Error 2
make[3]: Leaving directory `/tmp/xen-smartio.hg/linux-2.6.16-rc3-xen0'
make[2]: *** [build] Error 2
make[2]: Leaving directory `/tmp/xen-smartio.hg'
make[1]: *** [linux-2.6-xen0-install] Error 2
make[1]: Leaving directory `/tmp/xen-smartio.hg'
make: *** [install-kernels] Error 1
*******************************

I am trying to build it on a cluster node running SLES10 Enterprise Server.
The version of OFED is currently OFED-1.1-rc6

From monisonlists at gmail.com Wed Apr 18 03:44:47 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Wed, 18 Apr 2007 13:44:47 +0300
Subject: [ofa-general] OFED 1.2RC1 ib-bonding does not compile on Fedora Core 6
In-Reply-To: <4625E7AB.4090200@holografika.com>
References: <4625E7AB.4090200@holografika.com>
Message-ID: <4625F69F.1070905@gmail.com>

Kovacs Peter Tamas wrote:
> Dear OFED-developers,
>
> I just wanted to notify you that the ib-bonding package does not compile
> on a freshly installed Fedora Core 6 x86_64.
> I've attached the build output (I've built this part only to keep the
> log short, everything else compiles perfectly).
>
> FYI:
> [pkovacs at gigabyte tmp]$ uname -a
> Linux gigabyte 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006
> x86_64 x86_64 x86_64 GNU/Linux
> If you need more details, just ask!
>
> Please let me know if this will be fixed in the final 1.2.
>
I have no immediate access to a machine that runs FC 6.
However, I can add trivial support for the 2.6.18-1.2798.fc6 kernel (this only
requires identifying this kernel in the configuration phase and acting as if
it is a RHEL5 kernel, which seems close enough).
I hope that it will work for you.
I'll try to do that ASAP.

From mst at dev.mellanox.co.il Wed Apr 18 04:05:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Apr 2007 14:05:27 +0300
Subject: [ofa-general] Re: OFED 1.2RC1 ib-bonding does not compile on Fedora Core 6
In-Reply-To: <4625F69F.1070905@gmail.com>
References: <4625E7AB.4090200@holografika.com> <4625F69F.1070905@gmail.com>
Message-ID: <20070418110527.GW25314@mellanox.co.il>

> Quoting Moni Shoua :
> Subject: Re: OFED 1.2RC1 ib-bonding does not compile on Fedora Core 6
>
> Kovacs Peter Tamas wrote:
> > Dear OFED-developers,
> >
> > I just wanted to notify you that the ib-bonding package does not compile
> > on a freshly installed Fedora Core 6 x86_64.
> > I've attached the build output (I've built this part only to keep the
> > log short, everything else compiles perfectly).
> >
> > FYI:
> > [pkovacs at gigabyte tmp]$ uname -a
> > Linux gigabyte 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006
> > x86_64 x86_64 x86_64 GNU/Linux
> > If you need more details, just ask!
> >
> > Please let me know if this will be fixed in the final 1.2.
> >
> I have no immediate access to a machine that runs FC 6.
> However, I can add trivial support to 2.6.18-1.2798.fc6 kernel (this only
> requires identifying this kernel in the configuration phase and act as if
> it is a RHEL5 kernel which seems close enough).
> I hope that it will work for you.
> I'll try to do that ASAP.

I think adding the kernel to the cross-build environment might also be a good idea.

--
MST

From tziporet at mellanox.co.il Wed Apr 18 08:01:13 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Apr 2007 18:01:13 +0300
Subject: [ofa-general] OFED 1.2 rc2 release
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>

Hi,

OFED 1.2-RC2 is available on http://www.openfabrics.org/builds/ofed-1.2/
File: OFED-1.2-rc2.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/

RC3 due date is 26-April

Tziporet & Vlad

====================================================================================
Release information:

OS support:
Novell:
- SLES 9.0 SP3
- SLES10 (and SP1 RC2 partially tested)
Redhat:
- Redhat EL4 up3 and up4
- Redhat EL5
kernel.org:
- 2.6.20
- 2.6.19

Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We
keep the backport patches for these OSes and make sure OFED compiles and
loads properly, but will not do a full QA cycle.

Systems:
* x86_64
* x86
* ia64
* ppc64

Main changes from OFED-1.1-rc1:
1. Fixed 31 bugs (see attachment for all bugs fixed)

Major limitations and known issues:

ID   Severity  Owner                       Summary
420  critical  monil at voltaire.com        PKey table reordering caused by SM failover stops ipoib traffic
556  critical  amip at dev.mellanox.co.il   OFED 1.2 SDP crashes RHEL5 ppc64
513  critical  rjwalsh at pathscale.com     error while installing ipath driver
529  critical  rjwalsh at pathscale.com     dtest fails on ipath card
534  critical  vlad at mellanox.co.il       SLES9 - Installer fails on declarations - OFED 1.2-20070409
465  critical  mst at mellanox.co.il        IPoIB CM HA fails after several hours of failovers
539  critical  tziporet at mellanox.co.il   "Catastrophic error detected" while running IPoIB bonding port failover test
547  critical  vlad at mellanox.co.il       Installer errors when using customer and command line
549  critical  amip at dev.mellanox.co.il   SDP Policy need to be consistent
553  critical  amip at mellanox.co.il       recent libsdp changes break non-blocking SDP with "both" mode
484  major     mst at dev.mellanox.co.il    mstflint -d mthca0 fails on ppc64
499  major     vlad at mellanox.co.il       module compiled over ofed won't load due to symbol version mismatch
459  major     monis at voltaire.com        support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better
508  major     mst at mellanox.co.il        IPoIB CM multicast is hogging interrupts
530  major     dannyz at mellanox.co.il     ibdiagnet -r fails on RHEL5 i686
506  major     mst at mellanox.co.il        IPoIB IPv4 multicast throughput is poor
541  major     monis at voltaire.com        slow failover with IPoIB bonding/ipoibtools HA
548  major     amip at dev.mellanox.co.il   SDP connection rate in small messages
555  major     mst at mellanox.co.il        SDP log file location as mentioned in the conf file is being ignored
558  major     rolandd at cisco.com         tvflash configure fails on SLES10 SP1 RC2

See bugzilla for all open issues.

Tasks that should be completed for RC3 (due date is 26-April):
1. Support SLES10 SP1 RC1
2. Upgrade Open MPI to version 1.2.1
3. Fix all blocker, critical and major bugs
4. Prepare all documentation (release notes, README, etc.)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fixed_bugs-rc2.csv
Type: application/octet-stream
Size: 3806 bytes
Desc: fixed_bugs-rc2.csv
URL:

From changquing.tang at hp.com Wed Apr 18 08:09:53 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Wed, 18 Apr 2007 16:09:53 +0100
Subject: [ofa-general] RE: IPoIB bonding document ?
In-Reply-To: <4625C1C6.6040709@voltaire.com>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net>

Or:
The doc is too short; I hope to get some technical details.
Suppose I run "ib-bond --bond-ip 192.186.10.100 --slaves ib0,ib2".
During regular operation, only ib0 carries traffic and ib2 is NOT used,
right?

When ib0 fails, TCP fails over to ib2. Then ib0 is repaired (a cable or
switch is replaced, for example). Later, when ib2 fails, can TCP fail
over back to ib0 again?

--CQ

> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com]
> Sent: Wednesday, April 18, 2007 1:59 AM
> To: Tang, Changqing
> Cc: OpenFabrics General; Moni Shoua
> Subject: Re: IPoIB bonding document ?
>
> Tang, Changqing wrote:
> > I know you are working on bonding, where is a good document about
> > IPoIB bonding ?
> > I have a few questions:
> >
> > 1. is bonding working on two HCAs, as well as two ports on
> th same HCA ?
> > 2. is the second channel idle, or two channels are used
> during regular
> > time ?
>
> Hi CQ - I am cc-ing the general list as well, as other people
> might be interested as well.
>
> The package provided with OFED 1.2 contains documentation,
> you can browse to the below url to get the doc.
> http://www.openfabrics.org/git/?p=~monis/ofed-bond-pkg.git;a=t > ree;f=ib-bonding-0.9.0/docs;h=ea30b3e6e8ebe530182cff18e8e7db19 > ee4aa346;hb=HEAD > > Bonding works in the interface level such that a bonding > master interface (eg bond0) enslaves other interfaces (eg ib0 > and ib1). In the IPoIB case, the enslaved devices can be > bounded to two (or more) ports on the same/different HCA, its > also possible to bond child interfaces (eg ib0.8003 and ib1.8003) etc. > > The bonding driver has one HA and few LB operation modes, > currently, only the HA mode (named Active-Backup) is > supported for IPoIB. > > > 3. the mesage re-transmit from first channel to second > channel at TCP > > packet level, right ? > > I am not sure to follow you, in case you ask if after bonding > fail-over TCP re-transmission is done over the active > interface used by bonding, the answer is yes. > > Or. > > > > > From sweitzen at cisco.com Wed Apr 18 08:29:57 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 18 Apr 2007 08:29:57 -0700 Subject: [ofa-general] RE: IPoIB bonding document ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com><45DAB3FD.8060606@voltaire.com><349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net><4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> Message-ID: Yes to both questions. IPoIB bonding supports failover (and failback), not load balancing. Scott > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Tang, Changqing > Sent: Wednesday, April 18, 2007 8:10 AM > To: Or Gerlitz > Cc: OpenFabrics General > Subject: [ofa-general] RE: IPoIB bonding document ? > > > Or: > the doc is too short, I hope to get some technical details. 
> Suppose "ib-bond --bond-ip 192.186.10.100 --slaves ib0,ib2" > during regular operation time, only ib0 has traffic, ib2 is NOT used, > right ? > > When ib0 fails, TCP fail-over to ib2. Then ib0 is > repaired(replace a cable/switch, for example). Later when ib2 fails, > can TCP fail-over back to ib0 again ? > > > --CQ > > > -----Original Message----- > > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > > Sent: Wednesday, April 18, 2007 1:59 AM > > To: Tang, Changqing > > Cc: OpenFabrics General; Moni Shoua > > Subject: Re: IPoIB bonding document ? > > > > Tang, Changqing wrote: > > > I know you are working on bonding, where is a good document about > > > IPoIB bonding ? > > > I have a few questions: > > > > > > 1. is bonding working on two HCAs, as well as two ports on > > th same HCA ? > > > 2. is the second channel idle, or two channels are used > > during regular > > > time ? > > > > Hi CQ - I am cc-ing the general list as well, as other people > > might be interested as well. > > > > The package provided with OFED 1.2 contains documentation, > > you can browse to the below url to get the doc. > > http://www.openfabrics.org/git/?p=~monis/ofed-bond-pkg.git;a=t > > ree;f=ib-bonding-0.9.0/docs;h=ea30b3e6e8ebe530182cff18e8e7db19 > > ee4aa346;hb=HEAD > > > > Bonding works in the interface level such that a bonding > > master interface (eg bond0) enslaves other interfaces (eg ib0 > > and ib1). In the IPoIB case, the enslaved devices can be > > bounded to two (or more) ports on the same/different HCA, its > > also possible to bond child interfaces (eg ib0.8003 and > ib1.8003) etc. > > > > The bonding driver has one HA and few LB operation modes, > > currently, only the HA mode (named Active-Backup) is > > supported for IPoIB. > > > > > 3. the mesage re-transmit from first channel to second > > channel at TCP > > > packet level, right ? 
> > > > I am not sure to follow you, in case you ask if after bonding > > fail-over TCP re-transmission is done over the active > > interface used by bonding, the answer is yes. > > > > Or. > > > > > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sweitzen at cisco.com Wed Apr 18 13:23:36 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 18 Apr 2007 13:23:36 -0700 Subject: [ofa-general] OFED 1.2 rc2 release In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> Message-ID: I have added version 1.2rc2 to bugzilla. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Wednesday, April 18, 2007 8:01 AM To: ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ofa-general] OFED 1.2 rc2 release Hi, OFED 1.2-RC1 is available on http://www.openfabrics.org/builds/ofed-1.2/ File: OFED-1.2-rc2.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla https://bugs.openfabrics.org/ RC3 due date is 26-April Tziporet & Vlad ======================================================================== ============ Release information: OS support: Novell: - SLES 9.0 SP3 - SLES10 (and SP1 RC2 partially tested) Redhat: - Redhat EL4 up3 and up4 - Redhat EL5 kernel.org: - 2.6.20 - 2.6.19 Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from OFED-1.1-rc1: 1. 
Fixed 31 bugs (see attachment for all bugs fixed) <> Major limitations and known issues: 420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic 556 critical amip at dev.mellanox.co.il OFED 1.2 SDP crashes RHEL5 ppc64 513 critical rjwalsh at pathscale.com error while installing ipath driver 529 critical rjwalsh at pathscale.com dtest fails on ipath card 534 critical vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED 1.2-20070409 465 critical mst at mellanox.co.il IPoIB CM HA fails after several hours of failovers 539 critical tziporet at mellanox.co.il "Catastrophic error detected" while running IPoIB bonding port failover test 547 critical vlad at mellanox.co.il Installer errors when using customer and command line 549 critical amip at dev.mellanox.co.il SDP Policy need to be consistent 553 critical amip at mellanox.co.il recent libsdp changes break non-blocking SDP with "both" mode 484 major mst at dev.mellanox.co.il mstflint -d mthca0 fails on ppc64 499 major vlad at mellanox.co.il module compiled over ofed won't load due to symbol version mismatch 459 major monis at voltaire.com support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better 508 major mst at mellanox.co.il IPoIB CM multicast is hogging interrupts 530 major dannyz at mellanox.co.il ibdiagnet -r fails on RHEL5 i686 506 major mst at mellanox.co.il IPoIB IPv4 multicast throughput is poor 541 major monis at voltaire.com slow failover with IPoIB bonding/ipoibtools HA 548 major amip at dev.mellanox.co.il SDP connection rate in small messages 555 major mst at mellanox.co.il SDP log file location as mentioned in the conf file is being ignored 558 major rolandd at cisco.com tvflash configure fails on SLES10 SP1 RC2 See bugzilla for all open issues. Tasks that should be completed for RC3 (due date is 26-April): 1. Support SLES10 SP1 RC1 2. Replace Open MPI to version 1.2.1 3. Fix all blocker, critical and major bugs 4. 
Prepare all documentation (release notes, README, etc.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From etta at systemfabricworks.com Wed Apr 18 14:38:27 2007
From: etta at systemfabricworks.com (Chieng Etta)
Date: Wed, 18 Apr 2007 16:38:27 -0500
Subject: [ofa-general] RE: [ewg] OFED 1.2 rc2 release
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
Message-ID: <001901c78201$e6998f30$c801a8c0@ettac>

Hi,

The rc2 installations on SLES10 x86 and SLES10 x86_64 were fine.
On RHEL 5 x86_64, I received the following error during the installation.

error: Failed dependencies:
        librdmacm.so is needed by dapl-1.2.1-0.x86_64
        librdmacm.so(RDMACM_1.0) is needed by dapl-1.2.1-0.x86_64

Attached is the log file.

Thanks,
Etta

_____

From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren
Sent: Wednesday, April 18, 2007 10:01 AM
To: ewg at lists.openfabrics.org
Cc: general at lists.openfabrics.org
Subject: [ewg] OFED 1.2 rc2 release

Hi,

OFED 1.2-RC2 is available on http://www.openfabrics.org/builds/ofed-1.2/
File: OFED-1.2-rc2.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/

RC3 due date is 26-April

Tziporet & Vlad

====================================================================================

Release information:

OS support:
Novell:
- SLES 9.0 SP3
- SLES10 (and SP1 RC2 partially tested)
Redhat:
- Redhat EL4 up3 and up4
- Redhat EL5
kernel.org:
- 2.6.20
- 2.6.19

Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We keep
the backport patches for these OSes and make sure OFED compiles and loads
properly, but will not do a full QA cycle.

Systems:
* x86_64
* x86
* ia64
* ppc64

Main changes from OFED-1.1-rc1:
1. Fixed 31 bugs (see attachment for all bugs fixed)

Major limitations and known issues:

ID   Severity  Assignee                     Summary
420  critical  monil at voltaire.com         PKey table reordering caused by SM failover stops ipoib traffic
556  critical  amip at dev.mellanox.co.il   OFED 1.2 SDP crashes RHEL5 ppc64
513  critical  rjwalsh at pathscale.com      error while installing ipath driver
529  critical  rjwalsh at pathscale.com      dtest fails on ipath card
534  critical  vlad at mellanox.co.il       SLES9 - Installer fails on declarations - OFED 1.2-20070409
465  critical  mst at mellanox.co.il        IPoIB CM HA fails after several hours of failovers
539  critical  tziporet at mellanox.co.il   "Catastrophic error detected" while running IPoIB bonding port failover test
547  critical  vlad at mellanox.co.il       Installer errors when using customer and command line
549  critical  amip at dev.mellanox.co.il   SDP Policy need to be consistent
553  critical  amip at mellanox.co.il       recent libsdp changes break non-blocking SDP with "both" mode
484  major     mst at dev.mellanox.co.il    mstflint -d mthca0 fails on ppc64
499  major     vlad at mellanox.co.il       module compiled over ofed won't load due to symbol version mismatch
459  major     monis at voltaire.com         support ib-bonding on RHEL4U4/RHEL5, put kernel name in RPM name, and clean up better
508  major     mst at mellanox.co.il        IPoIB CM multicast is hogging interrupts
530  major     dannyz at mellanox.co.il     ibdiagnet -r fails on RHEL5 i686
506  major     mst at mellanox.co.il        IPoIB IPv4 multicast throughput is poor
541  major     monis at voltaire.com         slow failover with IPoIB bonding/ipoibtools HA
548  major     amip at dev.mellanox.co.il   SDP connection rate in small messages
555  major     mst at mellanox.co.il        SDP log file location as mentioned in the conf file is being ignored
558  major     rolandd at cisco.com          tvflash configure fails on SLES10 SP1 RC2

See bugzilla for all open issues.

Tasks that should be completed for RC3 (due date is 26-April):
1. Support SLES10 SP1 RC1
2. Update Open MPI to version 1.2.1
3. Fix all blocker, critical and major bugs
4. Prepare all documentation (release notes, README, etc.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OFED.install.14667.log
Type: application/octet-stream
Size: 3796 bytes
Desc: not available
URL:

From rjwalsh at pathscale.com Wed Apr 18 14:42:16 2007
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Wed, 18 Apr 2007 14:42:16 -0700
Subject: [ofa-general] infinipath0: invalidaddr error
In-Reply-To: <200704181111.55160.bs@q-leap.de>
References: <200704181111.55160.bs@q-leap.de>
Message-ID: <462690B8.7000702@pathscale.com>

Bernd Schubert wrote:
> Hi,
>
> from time to time I see these messages in the logs
>
> [14052.195559] ib_ipath 0000:0a:00.0: infinipath0: invalidaddr error
>
> What does that mean? Can it cause further problems?

That depends. This means something wrote to a bogus address on the
chip. This won't cause further problems. However, that write was
ignored, so something may not have happened that should have happened.

This could be a driver bug, or a bug in a layer above it. If you're
concerned about this, I'd suggest contacting our support organization
(support at qlogic.com) and asking them for help. Give them some details
on your environment (what hardware and software you're using) and what
you were doing at the time, and they should be able to help you narrow
down the problem.

Regards,
 Robert.

From pradeep at us.ibm.com Wed Apr 18 17:56:44 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Apr 2007 18:56:44 -0600
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V2] patch for review
Message-ID:

Here is a second version of the IPOIB_CM_NOSRQ patch for review. This
patch will benefit adapters that do not support shared receive queues.
This patch incorporates the previous review comments:
- #ifdefs removed; a single binary now drives HCAs that do and do not support SRQs
- avoids linear traversal through a list of QPs
- extraneous code removed
- compile-time selection removed
- no HTML version as part of this patch

This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and
IBM HCAs on ppc64 machines. I have run netperf between two IBM HCAs and
between two Topspin HCAs, as well as between an IBM and a Topspin HCA.

Note 1: I made an interesting discovery when I ran netperf between a
Topspin and an IBM HCA: I started to see the IB_WC_RETRY_EXC_ERR error upon
send completion. This may have been due to the difference in processing
speed between the two HCAs. It was rectified by setting retry_count to a
non-zero value in ipoib_cm_send_req(). I had to do this in spite of the
comment --> /* RFC draft warns against retries */

Can someone point me to where this warning is in the RFC? I would like to
understand the reasoning.

Note 2: The IB_WC_RETRY_EXC_ERR is not seen when the two HCAs are of the
same type.

Note 3: Another small patch (not in this one) is needed to the ehca driver
for it to work on the IBM HCAs.
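For reviewers, the wr_id layout that the NOSRQ path in the patch relies on
can be summarised with a small user-space sketch (not part of the patch):
the rx_ring[] slot sits in the upper 32 bits, the rx_index_ring[] index in
the low 10 bits, and IPOIB_CM_OP_NOSRQ (bit 29) marks the completion. The
helper names nosrq_pack, nosrq_slot, nosrq_index and nosrq_is_nosrq are
hypothetical and do not appear in the patch; the constants are copied from
the ipoib.h hunk, with 1ull assumed in place of 1ul so the sketch does not
depend on the width of unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Constants from the ipoib.h hunk of the patch; 1ull instead of 1ul is an
 * assumption made so the sketch also works where unsigned long is 32 bits. */
#define IPOIB_CM_OP_NOSRQ     (1ull << 29)
#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      0x3ffull

/* Hypothetical helper: build the 64-bit wr_id posted with a receive buffer.
 * slot  - position in the per-connection rx_ring[]
 * index - position of the connection in priv->cm.rx_index_ring[] */
static inline uint64_t nosrq_pack(uint32_t slot, uint32_t index)
{
    assert(index < NOSRQ_INDEX_RING_SIZE);
    return ((uint64_t)slot << 32) | index | IPOIB_CM_OP_NOSRQ;
}

/* Hypothetical helper: recover the rx_ring[] slot, mirroring the
 * wc->wr_id >> 32 step in the completion handler. */
static inline uint32_t nosrq_slot(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> 32);
}

/* Hypothetical helper: recover the rx_index_ring[] index by clearing the
 * opcode flag and masking the low 10 bits. */
static inline uint32_t nosrq_index(uint64_t wr_id)
{
    return (uint32_t)((wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: true when the completion carries the NOSRQ flag. */
static inline int nosrq_is_nosrq(uint64_t wr_id)
{
    return (wr_id & IPOIB_CM_OP_NOSRQ) != 0;
}
```

With these helpers, nosrq_pack(5, 7) gives a wr_id whose slot decodes to 5
and whose index decodes to 7. Bit 29 lies above the 10-bit index and below
bit 32, so the flag cannot collide with either field.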
Signed-off-by: Pradeep Satyanarayana --- --- linux-2.6.21-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-02 17:44:58.000000000 -0700 +++ linux-2.6.21-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-03 10:59:54.000000000 -0700 @@ -99,6 +99,12 @@ enum { #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_NOSRQ (1ul << 29) + +/* These two go hand in hand */ +#define NOSRQ_INDEX_RING_SIZE 1024 +#define NOSRQ_INDEX_MASK 0x00000000000003ff + #else #define IPOIB_CM_OP_SRQ (0) #endif @@ -136,9 +142,11 @@ struct ipoib_cm_data { struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; struct list_head list; struct net_device *dev; unsigned long jiffies; + u32 index; }; struct ipoib_cm_tx { @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_ring; }; /* --- linux-2.6.21-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-02 17:44:58.000000000 -0700 +++ linux-2.6.21-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-18 16:23:12.000000000 -0700 @@ -76,35 +76,73 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int ipoib_cm_post_receive(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; + u32 index; + u64 wr_id; + struct ipoib_cm_rx *rx_ptr; + unsigned long flags; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + if (priv->cm.srq) { + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; /* Check id val */ - for (i = 0; i < IPOIB_CM_RX_SG; ++i) - priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = + priv->cm.srq_ring[id].mapping[i]; + + ret = 
ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + priv->cm.srq_ring[id].mapping); + dev_kfree_skb_any(priv->cm.srq_ring[id].skb); + priv->cm.srq_ring[id].skb = NULL; + } + } else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; - ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); - if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); - ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, - priv->cm.srq_ring[id].mapping); - dev_kfree_skb_any(priv->cm.srq_ring[id].skb); - priv->cm.srq_ring[id].skb = NULL; - } + /* There is a slender chance of a race between the stale_task + * running after a period of inactivity and the receipt of + * a packet being processed at about the same instant. + * Hence the lock */ + + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + } /* else NO SRQ */ return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; + unsigned long flags; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if 
(unlikely(!skb)) @@ -123,7 +161,7 @@ static struct sk_buff *ipoib_cm_alloc_rx return NULL; } - for (i = 0; i < frags; i++) { + for (i = 0; i < frags; i++) { struct page *page = alloc_page(GFP_ATOMIC); if (!page) @@ -136,7 +174,17 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -157,13 +205,20 @@ static struct ib_qp *ipoib_cm_create_rx_ struct ib_qp_init_attr attr = { .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, - .srq = priv->cm.srq, .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_sge = IPOIB_CM_RX_SG, /* Is this correct? 
*/ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + + if (priv->cm.srq) + attr.srq = priv->cm.srq; + else + attr.srq = NULL; + return ib_create_qp(priv->pd, &attr); } @@ -217,9 +272,13 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + + if (priv->cm.srq) + rep.srq = 1; + else + rep.srq = 0; return ib_send_cm_rep(cm_id, &rep); } @@ -231,6 +290,8 @@ static int ipoib_cm_req_handler(struct i unsigned long flags; unsigned psn; int ret; + u32 qp_num, index; + u64 i; ipoib_dbg(priv, "REQ arrived\n"); p = kzalloc(sizeof *p, GFP_KERNEL); @@ -244,10 +305,69 @@ static int ipoib_cm_req_handler(struct i goto err_qp; } - psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (priv->cm.srq == NULL) { /* NOSRQ */ + qp_num = p->qp->qp_num; + /* Allocate space for the rx_ring here */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, + GFP_KERNEL); + if (p->rx_ring == NULL) + return -ENOMEM; + + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irqsave(&priv->lock, flags); + list_add(&p->list, &priv->cm.passive_ids); + + /* Find an empty rx_index_ring[] entry */ + for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++) + if (priv->cm.rx_index_ring[index] == NULL) + break; + + if ( index == NOSRQ_INDEX_RING_SIZE) { + spin_unlock_irqrestore(&priv->lock, flags); + printk(KERN_WARNING "NOSRQ supports a max of %d RC " + "QPs. 
That limit has now been reached\n", + NOSRQ_INDEX_RING_SIZE); + return -EINVAL; + } + + /* Store the pointer to retrieve it later using the index */ + priv->cm.rx_index_ring[index] = p; + spin_unlock_irqrestore(&priv->lock, flags); + p->index = index; + + psn = random32() & 0xffffff; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", + ret); + goto err_modify; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + if (ipoib_cm_post_receive(dev, i << 32 | index)) { + ipoib_warn(priv, "ipoib_ib_post_receive " + "failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } + } + } else { /* SRQ */ + p->rx_ring = NULL; /* This is used only by NOSRQ */ + psn = random32() & 0xffffff; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -255,13 +375,15 @@ static int ipoib_cm_req_handler(struct i goto err_rep; } - cm_id->context = p; - p->jiffies = jiffies; - spin_lock_irqsave(&priv->lock, flags); - list_add(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); + if (priv->cm.srq) { + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irqsave(&priv->lock, flags); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); return 0; err_rep: @@ -344,12 +466,19 @@ static void skb_put_frags(struct sk_buff void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & 
~IPOIB_CM_OP_SRQ; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; - u64 mapping[IPOIB_CM_RX_SG]; + u64 mapping[IPOIB_CM_RX_SG], wr_id; + u32 index; int frags; + struct ipoib_cm_rx *rx_ptr; + + + if (priv->cm.srq) + wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + else + wr_id = wc->wr_id >> 32; ipoib_dbg_data(priv, "cm recv completion: id %d, op %d, status: %d\n", wr_id, wc->opcode, wc->status); @@ -360,7 +489,16 @@ void ipoib_cm_handle_rx_wc(struct net_de return; } - skb = priv->cm.srq_ring[wr_id].skb; + if(priv->cm.srq) + skb = priv->cm.srq_ring[wr_id].skb; + else { + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ; + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + skb = rx_ptr->rx_ring[wr_id].skb; + } /* NOSRQ */ if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " @@ -371,7 +509,13 @@ void ipoib_cm_handle_rx_wc(struct net_de } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { - p = wc->qp->qp_context; + if(priv->cm.srq == NULL) + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. 
*/ + p = rx_ptr; + else + p = wc->qp->qp_context; + if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; @@ -388,7 +532,11 @@ void ipoib_cm_handle_rx_wc(struct net_de frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; - newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); + if (priv->cm.srq) + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); + else + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); if (unlikely(!newskb)) { /* * If we can't allocate a new RX buffer, dump @@ -399,13 +547,22 @@ void ipoib_cm_handle_rx_wc(struct net_de goto repost; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + if (priv->cm.srq) { + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + } else { + ipoib_cm_dma_unmap_rx(priv, frags, + rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + } ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb->mac.raw = skb->data; @@ -418,12 +575,19 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); + if (priv->cm.srq) { + if (unlikely(ipoib_cm_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_cm_post_receive failed " + "for buf %d\n", wr_id); + } else { + if 
(unlikely(ipoib_cm_post_receive(dev, wr_id << 32 | index))) + ipoib_warn(priv, "ipoib_cm_post_receive failed " + "for buf %d\n", wr_id); + } } static inline int post_send(struct ipoib_dev_priv *priv, @@ -432,6 +596,9 @@ static inline int post_send(struct ipoib u64 addr, int len) { struct ib_send_wr *bad_wr; + struct ib_qp_attr qp_attr; + struct ib_qp_init_attr qp_init_attr; + int ret, qp_attr_mask; priv->tx_sge.addr = addr; priv->tx_sge.length = len; @@ -613,6 +780,7 @@ void ipoib_cm_dev_stop(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_rx *p; unsigned long flags; + int i; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return; @@ -621,6 +789,17 @@ void ipoib_cm_dev_stop(struct net_device spin_lock_irqsave(&priv->lock, flags); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + if (priv->cm.srq == NULL) { + for(i = 0; i < ipoib_recvq_size; ++i) + if(p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); + } list_del_init(&p->list); spin_unlock_irqrestore(&priv->lock, flags); ib_destroy_cm_id(p->id); @@ -707,9 +886,14 @@ static struct ib_qp *ipoib_cm_create_tx_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = {}; attr.recv_cq = priv->cq; - attr.srq = priv->cm.srq; + if (priv->cm.srq) + attr.srq = priv->cm.srq; + else + attr.srq = NULL; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; /* Not in MST code */ attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; /* Not in MST code */ attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -746,10 +930,13 @@ static int ipoib_cm_send_req(struct net_ req.responder_resources = 4; req.remote_cm_response_timeout = 20; req.local_cm_response_timeout = 20; - req.retry_count = 0; /* RFC draft warns against retries */ - 
req.rnr_retry_count = 0; /* RFC draft warns against retries */ + req.retry_count = 6; /* RFC draft warns against retries */ + req.rnr_retry_count = 6;/* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + if (priv->cm.srq) + req.srq = 1; + else + req.srq = 0; return ib_send_cm_req(id, &req); } @@ -1089,6 +1276,7 @@ static void ipoib_cm_stale_task(struct w cm.stale_task.work); struct ipoib_cm_rx *p; unsigned long flags; + int i; spin_lock_irqsave(&priv->lock, flags); while (!list_empty(&priv->cm.passive_ids)) { @@ -1097,6 +1285,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; + if (priv->cm.srq == NULL) { /* NOSRQ */ + for(i = 0; i < ipoib_recvq_size; ++i) + if(p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + /* Free the rx_ring */ + kfree(p->rx_ring); + priv->cm.rx_index_ring[p->index] = NULL; + } list_del_init(&p->list); spin_unlock_irqrestore(&priv->lock, flags); ib_destroy_cm_id(p->id); @@ -1154,13 +1355,9 @@ int ipoib_cm_add_mode_attr(struct net_de int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; - int ret, i; + struct ib_srq_init_attr srq_init_attr; + int ret, i, supports_srq; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1172,21 +1369,43 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + if (ret = ib_query_device(priv->ca, &attr)) return ret; + if (attr.max_srq) + supports_srq = 1; 
/* This device supports SRQ */ + else { + supports_srq = 0; } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } + if (supports_srq) { + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + priv->cm.rx_index_ring = NULL; /* Not needed for SRQ */ + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE * + sizeof *priv->cm.rx_index_ring, + GFP_KERNEL); + } for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].lkey = priv->mr->lkey; @@ -1198,19 +1417,25 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (ipoib_cm_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } - } + } /* if supports SRQ */ priv->dev->dev_addr[0] = IPOIB_FLAGS_RC; return 0; --- linux-2.6.21-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-02 17:44:58.000000000 -0700 +++ linux-2.6.21-rc5/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-03 11:00:26.000000000 -0700 @@ -282,7 +282,7 @@ static void ipoib_ib_handle_tx_wc(struct static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc) { - if (wc->wr_id & IPOIB_CM_OP_SRQ) + if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ)) ipoib_cm_handle_rx_wc(dev, wc); else if (wc->wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, wc); Pradeep pradeep at us.ibm.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ipoib_cm.nosrq.patch.v2
Type: application/octet-stream
Size: 20146 bytes
Desc: not available
URL:

From yosefe at voltaire.com Thu Apr 19 02:17:23 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 19 Apr 2007 12:17:23 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <20070417164814.GB10044@mellanox.co.il>
References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il> <4624C951.6070708@voltaire.com> <20070417164814.GB10044@mellanox.co.il>
Message-ID: <462733A3.5000403@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin :
>>Subject: Re: [PATCH] installer: update kernel symbol versions
>>
>>Michael S. Tsirkin wrote:
>>
>>>>1. Taking another Module.symvers, and even knowing that this is what causes the
>>>>   module loading errors, is not as straightforward as finding the headers.
>>>
>>>
>>>It isn't? Why isn't it?
>>>As I see it, this is a problem that needs to be solved once
>>>when external module is packaged.
>>>
>>
>>OK, but this solution should be part of OFED, and available for all kernels.
>>From what
>>I have tested (and correct me if i'm wrong) on older kernels one must edit the original
>>.symvers anyway to get the right versions.
>>
>>So why shouldn't the installation do that?
>
>
> Because, these are part of another package, touching them is wrong.
> External modules will have to solve it in some other way. It should be possible
> - it's all just scripts, for the most part. For example, for all I care, build
> script for external module can catenate ofed and kernel symvers, build and then
> split them back. Already better than OFED doing permanent changes.
>

The problem is that this is not applicable for old kernels.
Anyone using rhas4 must change his version anyway.

Another method: The kernel-ib-devel will provide a patch that a user can apply to his Module.symvers, to update it with the new versions. Signed-off-by: Yosef Etigin -- diff -urN ofed_1_2.orig/ofed_scripts/ofa_kernel.spec ofed_1_2.new/ofed_scripts/ofa_kernel.spec --- ofed_1_2.orig/ofed_scripts/ofa_kernel.spec 2007-04-19 12:13:58.000000000 +0300 +++ ofed_1_2.new/ofed_scripts/ofa_kernel.spec 2007-04-19 12:15:30.000000000 +0300 @@ -162,13 +162,57 @@ %if %{build_kernel_ib} make kernel # MODULES_DIR=/lib/modules/%{KVERSION} DESTDIR=$RPM_BUILD_ROOT make install_kernel MODULES_DIR=%{LIB_MOD_DIR} DESTDIR=$RPM_BUILD_ROOT -modsyms=`find $RPM_BUILD_DIR/%{_name}-%{_version} -name Module.symvers -o -name Modules.symvers` -for modsym in $modsyms + +# Create module symbols patch +MOD_SYMVERS_TMP=$RPM_BUILD_ROOT/%{_prefix}/src/%{_name}/Module.symvers +MOD_SYMVERS_OFA=$RPM_BUILD_ROOT/%{_prefix}/src/%{_name}/Modules.symvers +IB_MODULES_ROOT=$RPM_BUILD_DIR/%{_name}-%{_version} +MOD_SYMVERS_ORIG=%{KSRC}/Module.symvers + +cp ${MOD_SYMVERS_ORIG} ${MOD_SYMVERS_TMP} + +# Go silent - don't display the symbol string +set +x + +# Find the new symbols that IB modules export +# list them all in a file,crc,symbol fashion +SYM_RECS="" +for mod in $(find ${IB_MODULES_ROOT} -name '*.ko') do - cat $modsym >> $RPM_BUILD_ROOT/%{_prefix}/src/%{_name}/Module.symvers + # break down the list to file name crc and symbol + group=$(nm -o ${mod} | + sed -ne s#${IB_MODULES_ROOT}'/*\(.*\)\.ko:0\{8\}\(\w\{8\}\) . 
__crc_\(.*\)$#\2,\3,\1#p') + SYM_RECS=${SYM_RECS}' '${group} + for rec in ${group} + do + [ -z "${rec}" ] && continue + sym=$(echo ${rec} | cut -d, -f2) + SYMS=${SYMS}"/${sym}/d;" + done +done + +# Remove old symbols from Module.symvers +touch ${MOD_SYMVERS_TMP} +sed -i ${MOD_SYMVERS_TMP} -e "${SYMS}" + +# Add our symbols +rm -f ${MOD_SYMVERS_OFA} +touch ${MOD_SYMVERS_OFA} +for rec in ${SYM_RECS}; do + echo 0x${rec} | sed -e 's/,/\t/g' >> ${MOD_SYMVERS_OFA} done +cat ${MOD_SYMVERS_OFA} >> ${MOD_SYMVERS_TMP} + +# Go verbose +set -x + +# create patch +MODULES_PATCH_FILE=$RPM_BUILD_ROOT/%{_prefix}/src/%{_name}/ofed_kernel_syms.patch +diff -uN ${MOD_SYMVERS_ORIG} ${MOD_SYMVERS_TMP} >> ${MODULES_PATCH_FILE} || true +rm -f ${MOD_SYMVERS_TMP} + %endif - + ################################## Handle kernel modules ################################## # Fix kernel modules path in case that modules were installed under 'extra' directory From vlad at lists.openfabrics.org Thu Apr 19 02:37:17 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 19 Apr 2007 02:37:17 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070419-0200 daily build status Message-ID: <20070419093717.CD1DEE6081C@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on x86_64 with 
linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From bs at q-leap.de Thu Apr 19 02:46:17 2007 From: bs at q-leap.de (Bernd Schubert) Date: Thu, 19 Apr 2007 11:46:17 +0200 Subject: [ofa-general] infinipath0: invalidaddr error In-Reply-To: <462690B8.7000702@pathscale.com> References: <200704181111.55160.bs@q-leap.de> <462690B8.7000702@pathscale.com> Message-ID: <200704191146.17879.bs@q-leap.de> On Wednesday 18 April 2007 23:42:16 Robert Walsh wrote: > Bernd Schubert wrote: > > Hi, > > > > from time to time I see these messages in the logs > > > > [14052.195559] ib_ipath 0000:0a:00.0: infinipath0: invalidaddr error > > > > What does that mean? Can it cause further problems? > > That depends. 
This means something wrote to a bogus address on the
> chip. This won't cause further problems. However, that write was
> ignored, so something may not have happened that should have happened.
>
> This could be a driver bug, or a bug in a layer above it. If you're
> concerned about this, I'd suggest contacting our support organization
> (support at qlogic.com) and asking them for help. Give them some details
> on your environment (what hardware and software you're using) and what
> you were doing at the time and they should be able to help you narrow
> down the problem.

Thanks, going to report this. As usual I'm running lustre stress tests, and
with ipath cards I'm able to trigger a known bug in lustre which occurs
very rarely without ipath cards.
In terms of timing, the error message above is not closely correlated with
the lustre bug, and it also occurs much less often, but at least it's a
message from the ipath controller.

Thanks again,
Bernd

PS: https://bugzilla.lustre.org/show_bug.cgi?id=11544

--
Bernd Schubert
Q-Leap Networks GmbH

From mst at dev.mellanox.co.il Thu Apr 19 04:18:17 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Apr 2007 14:18:17 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <462733A3.5000403@voltaire.com>
References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il> <4624C951.6070708@voltaire.com> <20070417164814.GB10044@mellanox.co.il> <462733A3.5000403@voltaire.com>
Message-ID: <20070419111804.GA918@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: Re: [PATCH] installer: update kernel symbol versions
> > Because, these are part of another package, touching them is wrong.
> > External modules will have to solve it in some other way. It should be possible
> > - it's all just scripts, for the most part. For example, for all I care, build
> > script for external module can catenate ofed and kernel symvers, build and then
> > split them back. Already better than OFED doing permanent changes.
> >
>
> The problem is that this is not applicable for old kernels.

Why not? This should be applicable to all kernels with a bit of work.

> Anyone using rhas4 must change his version anyway.

Version of what?

> Another method:
> The kernel-ib-devel will provide a patch that a user can apply to his
> Module.symvers, to update it with the new versions.

No, I don't think we can call that a solution.

Anyway, I just looked at kernel 2.6.9, and I see there:

scripts/Makefile.modpost:symverfile := $(objtree)/Module.symvers

So it seems that just by setting symverfile=mysymverfile on the make
command line you should be able to force it to pick a symbol version file
from an alternate location.

In any case, this is something that *the external module* should do,
not OFED itself.

--
MST

From vlad at dev.mellanox.co.il Thu Apr 19 04:28:52 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 19 Apr 2007 14:28:52 +0300
Subject: [ofa-general] RE: [ewg] OFED 1.2 rc2 release
In-Reply-To: <001901c78201$e6998f30$c801a8c0@ettac>
References: <001901c78201$e6998f30$c801a8c0@ettac>
Message-ID: <1176982132.5749.11.camel@vladsk-laptop>

On Wed, 2007-04-18 at 16:38 -0500, Chieng Etta wrote:
>
>
> error: Failed dependencies:
>         librdmacm.so is needed by dapl-1.2.1-0.x86_64
>         librdmacm.so(RDMACM_1.0) is needed by dapl-1.2.1-0.x86_64
>
>

Hi,
You should uninstall the OFED RPMs that were installed with RHEL 5.0 and
then run the OFED-1.2-rc2 installation.
Use the following command:

/bin/rpm -e --allmatches opensm-libs opensm-devel opensm openmpi-libs openmpi-devel openmpi \
    openib-tvflash openib-srptools openib-perftest openib-mstflint openib-diags \
    openib librdmacm-utils librdmacm-devel librdmacm libibverbs-utils libibverbs-devel \
    libibverbs libibumad-devel libibumad libibmad-devel libibmad libibcommon-devel \
    libibcommon libibcm-devel libibcm dapl-devel dapl

Today's OFED build (OFED-1.2-20070419-0600) will include the fix for this issue.

--
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From monis at voltaire.com Thu Apr 19 04:42:53 2007
From: monis at voltaire.com (Moni Shoua)
Date: Thu, 19 Apr 2007 14:42:53 +0300
Subject: [ofa-general] Re: IPoIB bonding document ?
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net>
Message-ID: <462755BD.5020305@voltaire.com>

Tang, Changqing wrote:
> Or:
> the doc is too short, I hope to get some technical details.
> Suppose "ib-bond --bond-ip 192.186.10.100 --slaves ib0,ib2"
> during regular operation time, only ib0 has traffic, ib2 is NOT used,
> right ?

Right

>
> When ib0 fails, TCP fail-over to ib2. Then ib0 is
> repaired(replace a cable/switch, for example). Later when ib2 fails,
> can TCP fail-over back to ib0 again ?

Yes.

>
>
> --CQ
>
>> -----Original Message-----
>> From: Or Gerlitz [mailto:ogerlitz at voltaire.com]
>> Sent: Wednesday, April 18, 2007 1:59 AM
>> To: Tang, Changqing
>> Cc: OpenFabrics General; Moni Shoua
>> Subject: Re: IPoIB bonding document ?
>>
>> Tang, Changqing wrote:
>>> I know you are working on bonding, where is a good document about
>>> IPoIB bonding ?
>>> I have a few questions:
>>>
>>> 1.
is bonding working on two HCAs, as well as two ports on >> th same HCA ? >>> 2. is the second channel idle, or two channels are used >> during regular >>> time ? >> Hi CQ - I am cc-ing the general list as well, as other people >> might be interested as well. >> >> The package provided with OFED 1.2 contains documentation, >> you can browse to the below url to get the doc. >> http://www.openfabrics.org/git/?p=~monis/ofed-bond-pkg.git;a=t >> ree;f=ib-bonding-0.9.0/docs;h=ea30b3e6e8ebe530182cff18e8e7db19 >> ee4aa346;hb=HEAD >> >> Bonding works in the interface level such that a bonding >> master interface (eg bond0) enslaves other interfaces (eg ib0 >> and ib1). In the IPoIB case, the enslaved devices can be >> bounded to two (or more) ports on the same/different HCA, its >> also possible to bond child interfaces (eg ib0.8003 and ib1.8003) etc. >> >> The bonding driver has one HA and few LB operation modes, >> currently, only the HA mode (named Active-Backup) is >> supported for IPoIB. >> >>> 3. the mesage re-transmit from first channel to second >> channel at TCP >>> packet level, right ? >> I am not sure to follow you, in case you ask if after bonding >> fail-over TCP re-transmission is done over the active >> interface used by bonding, the answer is yes. >> >> Or. >> >> >> >> >> From mst at dev.mellanox.co.il Thu Apr 19 04:51:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Apr 2007 14:51:19 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: References: Message-ID: <20070419115119.GB918@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: IPOIB CM (NOSRQ)[PATCH V2] patch for review > > Here is a second version of the IPOIB_CM_NOSRQ patch for review. This > patch will benefit adapters that do not support shared receive queues. 
> > This patch incorporates the previous review comments: > -#ifdefs removed and a single binary drives HCAs that do and do not > support SRQs > -avoids linear traversal through a list of QPs > -extraneous code removed > -compile time selection removed > -No HTML version as part of this patch The patch is still severely line-wrapped, to the point of unreadability. Look at it here: http://article.gmane.org/gmane.linux.drivers.openib/38681 > This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and > IBM HCAs on ppc64 machines. I have run > netperf between two IBM HCAs and two Topspin HCAs, as well as between IBM > and Topspin HCA. > > Note 1: There was interesting discovery that I made when I ran netperf > between Topsin and IBM HCA. I started to see > the IB_WC_RETRY_EXC_ERR error upon send completion. This may have been due > to the differences in the > processing speeds of the two HCA. This was rectified by seting the > retry_count to a non-zero value in ipoib_cm_send_req(). > I had to do this inspite of the comment --> /* RFC draft warns against > retries */ This would only help if there are short bursts of high-speed activity on the receiving HCA: if the speed is different in the long run, the right thing to do is to drop some packets and have TCP adjust its window accordingly. But in that former case (short bursts), just increasing the number of pre-posted buffers on RQ should be enough, and looks like a much cleaner solution. Long-term, I think we should use the watermark event to dynamically adjust the number of RQ buffers with the incoming traffic. I'll try to work on such a patch probably for 2.6.23 timeframe. > Can someone point me to where this comment is in the RFC? I would like to > understand the reasoning. See "7.1 A Cautionary Note on IPoIB-RC". See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html By the way, as long as you are not using SRQ, why not use UC mode QPs? This would look like a cleaner solution. 
You can also try making the RNR condition cheaper to handle, by moving the QP
to RST and back to RTR and then to RTS instead of re-initiating a new
connection.

Unfortunately, I haven't the time to review the patch thoroughly in the coming
couple of weeks. A general comment however:

> @@ -360,7 +489,16 @@ void ipoib_cm_handle_rx_wc(struct net_de
>                 return;
>         }
>
> -       skb = priv->cm.srq_ring[wr_id].skb;
> +       if(priv->cm.srq)
> +               skb = priv->cm.srq_ring[wr_id].skb;
> +       else {
> +               index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ)
> & NOSRQ_INDEX_MASK ;
> +               spin_lock_irqsave(&priv->lock, flags);
> +               rx_ptr = priv->cm.rx_index_ring[index];
> +               spin_unlock_irqrestore(&priv->lock,
> flags);
> +
> +               skb = rx_ptr->rx_ring[wr_id].skb;
> +       } /* NOSRQ */
>
>         if (unlikely(wc->status != IB_WC_SUCCESS)) {
>                 ipoib_dbg(priv, "cm recv error "

In this, and other examples, you scatter "if priv->cm.srq" tests all over the
code. I think it would be much cleaner in most cases to separate the non-SRQ
code to separate functions.

If there's common SRQ/non-SRQ code it can be factored out and reused in both
places. In cases such as the above this also has speed advantages: from both
cache footprint as well as branch prediction POV. You can even have a separate
event handler for SRQ/non-SRQ code, avoiding mode tests on data path
completely.

--
MST

From yosefe at voltaire.com Thu Apr 19 04:56:21 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 19 Apr 2007 14:56:21 +0300
Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions
In-Reply-To: <20070419111804.GA918@mellanox.co.il>
References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il> <4624C951.6070708@voltaire.com> <20070417164814.GB10044@mellanox.co.il> <462733A3.5000403@voltaire.com> <20070419111804.GA918@mellanox.co.il>
Message-ID: <462758E5.2030309@voltaire.com>

Michael S.
Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCH] installer: update kernel symbol versions >> >>>Because, these are part of another package, touching them is wrong. >>>External modules will have to solve it in some other way. It should be possible >>>- it's all just scripts, for the most part. For example, for all I care, build >>>script for external module can catenate ofed and kernel symvers, build and then >>>split them back. Already better than OFED doing permanent changes. >>> >> >>The problem is that this is not applicable for old kernels. > > > Why not? > This should be applicable to all kernels with a bit of work. > > >>Anyone using rhas4 must change his version anyway. > > > version of what? > Sorry, i meant 'versions' > >>Another method: >>The kernel-ib-devel will provide a patch that a user can apply to his >>Module.symvers, to update it with the new versions. > > > No, I don't think we can call that a solution. > Anyway, I just looked at kernel 2.6.9, and I see there: > scripts/Makefile.modpost:symverfile := $(objtree)/Module.symvers > So it seems that just by setting > symverfile=mysymverfile > on makefile command line you should be able to force it to pick > a symbol version file from an alternate location. > > In any case, this is something that *the external module* > should do, not OFED itself. > The method you suggest overrides the default symvers location. This does not enable using a file that is effectively appended to the original one, as in new kernels. The user will have to do the work anyway. Why shouldn't this step be a part of OFED? --Yossi From mst at dev.mellanox.co.il Thu Apr 19 05:15:37 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Apr 2007 15:15:37 +0300 Subject: [ofa-general] Re: [PATCH] installer: update kernel symbol versions In-Reply-To: <462758E5.2030309@voltaire.com> References: <462489E6.60103@voltaire.com> <20070417100506.GD32357@mellanox.co.il> <4624C0DC.3080709@voltaire.com> <20070417125212.GC5990@mellanox.co.il> <4624C951.6070708@voltaire.com> <20070417164814.GB10044@mellanox.co.il> <462733A3.5000403@voltaire.com> <20070419111804.GA918@mellanox.co.il> <462758E5.2030309@voltaire.com> Message-ID: <20070419121536.GC918@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCH] installer: update kernel symbol versions > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [PATCH] installer: update kernel symbol versions > >> > >>>Because, these are part of another package, touching them is wrong. > >>>External modules will have to solve it in some other way. It should be possible > >>>- it's all just scripts, for the most part. For example, for all I care, build > >>>script for external module can catenate ofed and kernel symvers, build and then > >>>split them back. Already better than OFED doing permanent changes. > >>> > >> > >>The problem is that this is not applicable for old kernels. > > > > > > Why not? > > This should be applicable to all kernels with a bit of work. > > > > > >>Anyone using rhas4 must change his version anyway. > > > > > > version of what? > > > Sorry, i meant 'versions' Are you speaking about Module.symvers here? That's not true I think - OFED currently seems to work fine without the user touching the kernel's Module.symvers. > > > >>Another method: > >>The kernel-ib-devel will provide a patch that a user can apply to his > >>Module.symvers, to update it with the new versions. > > > > > > No, I don't think we can call that a solution. 
> > Anyway, I just looked at kernel 2.6.9, and I see there:
> > scripts/Makefile.modpost:symverfile := $(objtree)/Module.symvers
> > So it seems that just by setting
> > symverfile=mysymverfile
> > on makefile command line you should be able to force it to pick
> > a symbol version file from an alternate location.
> >
> > In any case, this is something that *the external module*
> > should do, not OFED itself.
> >
>
> The method you suggest overrides the default symvers location. This does
> not enable using a file that is effectively appended to the original one,
> as in new kernels.

Sorry I could not parse this. You are building an out of kernel module.
Would not something like

    cat $(objtree)/Module.symvers $OFED/Module.symvers > Module.symvers.$$
    make symverfile=Module.symvers.$$ SUBDIRS=...

do the trick for you?

> The user will have to do the work anyway.

The *user* should not need to do anything besides run the installation script.
The job should be done by *developer* that wants to write an external module
using OFED APIs. He has to write the script that does the job for the user.

> Why shouldn't this step be a part of OFED?

Because of the comment above. This is not the *user*'s problem.
This is external module developer's problem, and developers should adapt their
scripts to do the right thing.

And my point is, I'm sure it can be done without corrupting the kernel sources,
but if you *want* to corrupt kernel's sources, be my guest but OFED shouldn't
do it I think.

--
MST

From mst at dev.mellanox.co.il Thu Apr 19 05:57:22 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 19 Apr 2007 15:57:22 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: References: <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> Message-ID: <20070419125722.GD918@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: pkey change handling patch > > > Could be a good idea, but note this will require creating another > > WQ for cache updates, otherwise e.g. IPoIB > > will deadlock waiting for it. > > Actually, on further thought, it's kind of a stupid idea. The whole > point of the cache module is to be usable in places where blocking > isn't allowed. If it's being called from a context where we know we > can block, then there's no point in going through the cache at all. > > > By the way, are there any users for the non-blocking API? > > Maybe we can simply relax the requirement, and make all API's > > blocking? > > Well, at least mthca is using it when posting WQEs to MLX QPs. But it > could keep its own internal cache (which would be much simpler, and > easier to keep in sync since it sees all MADs that change the cache as > they go by). > > If there are no other users for a non-blocking API then we could tear > out the whole caching mess and be much happier. > > At least IPoIB and SRP seem like they could live without the cache, > with the addition of ib_find_pkey() and ib_find_gid() convenience > functions, and they're the only users in infiniband/ulp. > infiniband/core would take a little more auditing. Good point. 
So since all this thread was started by Moni because of IPoIB,
the path is clear in that respect, and would already be
a step in the right direction:

- a patch to add ib_find_pkey() and ib_find_gid() to core
- a patch to replace cache usage in IPoIB / SRP with uncached hardware
  accesses on top of this
- pkey change handling patch on top of these

Moni, what do you think?

As a follow-up, we can then

- get rid of cache in mthca by implementing device-specific cache.
- audit core for cache usage - most likely we'll find everything
  can be done from a context where we can block

By the way, here's yet another bright idea:
I *think* we can keep this provider-specific cache updated by snooping incoming
MADs in driver. And if it can be done this way, in all providers, we might be
able to simply require that query_pkey/query_gid in providers do not sleep.
If that's true, we'll save ourselves the work of auditing core -
ib_find_pkey() and ib_find_gid() will be a drop-in replacement for cache.

Roland, what do you think?

--
MST

From etta at systemfabricworks.com Thu Apr 19 07:10:43 2007
From: etta at systemfabricworks.com (Chieng Etta)
Date: Thu, 19 Apr 2007 09:10:43 -0500
Subject: [ofa-general] RE: [ewg] OFED 1.2 rc2 release
In-Reply-To: <1176982132.5749.11.camel@vladsk-laptop>
Message-ID: <002501c7828c$85102a90$c801a8c0@ettac>

Hi Vlad,

Actually, before installing rc2, I used uninstall script to remove the OFED
packages. I also tried the command you provided below but received a message
indicating none of listed packages were installed. I will wait and try today's
build.
Thanks, Etta -----Original Message----- From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] Sent: Thursday, April 19, 2007 6:29 AM To: Chieng Etta Cc: ewg at lists.openfabrics.org; 'Tziporet Koren'; general at lists.openfabrics.org Subject: RE: [ewg] OFED 1.2 rc2 release On Wed, 2007-04-18 at 16:38 -0500, Chieng Etta wrote: > > > error: Failed dependencies: > librdmacm.so is needed by dapl-1.2.1-0.x86_64 > librdmacm.so(RDMACM_1.0) is needed by dapl-1.2.1-0.x86_64 > > Hi, You should uninstall ofed RPMs that were installed with RH5.0 and then run OFED-1.2-rc2 installation. Use the following command: /bin/rpm -e --allmatches opensm-libs opensm-devel opensm openmpi-libs openmpi-devel openmpi \ openib-tvflash openib-srptools openib-perftest openib-mstflint openib-diags \ openib librdmacm-utils librdmacm-devel librdmacm libibverbs-utils libibverbs-devel \ libibverbs libibumad-devel libibumad libibmad-devel libibmad libibcommon-devel \ libibcommon libibcm-devel libibcm dapl-devel dapl Today's OFED build (OFED-1.2-20070419-0600) will include the fix for this issue. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From changquing.tang at hp.com Thu Apr 19 07:28:40 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 19 Apr 2007 14:28:40 -0000 Subject: [ofa-general] RE: IPoIB bonding document ? In-Reply-To: <462755BD.5020305@voltaire.com> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> > > Or: > > the doc is too short, I hope to get some technical details. > > Suppose "ib-bond --bond-ip 192.186.10.100 --slaves ib0,ib2" > > during regular operation time, only ib0 has traffic, ib2 is > NOT used, > > right ? 
> Right What is the definition of ib0, ib1 ? Are these the IB device returned by ibv_devinfo command and in that order ? I was learned a while back that ibv_get_debvice_list() does not return consistent results. Or are these the IPoIB interfaces ? then the bond-ip must be different from these two IPs, right ? then question is, can I use all the three IPs in the same time ? > > > > When ib0 fails, TCP fail-over to ib2. Then ib0 is > repaired(replace a > > cable/switch, for example). Later when ib2 fails, can TCP fail-over > > back to ib0 again ? > Yes. When TCP tries to fail-back to ib0, does it need to know that ib0 is back online again, and how does it know ? If TCP just blindly fail-over from one interface to another, what happen if both interfaces are bad ? Thanks. --CQ > > > > > > --CQ > > > >> -----Original Message----- > >> From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > >> Sent: Wednesday, April 18, 2007 1:59 AM > >> To: Tang, Changqing > >> Cc: OpenFabrics General; Moni Shoua > >> Subject: Re: IPoIB bonding document ? > >> > >> Tang, Changqing wrote: > >>> I know you are working on bonding, where is a good document about > >>> IPoIB bonding ? > >>> I have a few questions: > >>> > >>> 1. is bonding working on two HCAs, as well as two ports on > >> th same HCA ? > >>> 2. is the second channel idle, or two channels are used > >> during regular > >>> time ? > >> Hi CQ - I am cc-ing the general list as well, as other > people might > >> be interested as well. > >> > >> The package provided with OFED 1.2 contains documentation, you can > >> browse to the below url to get the doc. > >> http://www.openfabrics.org/git/?p=~monis/ofed-bond-pkg.git;a=t > >> ree;f=ib-bonding-0.9.0/docs;h=ea30b3e6e8ebe530182cff18e8e7db19 > >> ee4aa346;hb=HEAD > >> > >> Bonding works in the interface level such that a bonding master > >> interface (eg bond0) enslaves other interfaces (eg ib0 and > ib1). 
In > >> the IPoIB case, the enslaved devices can be bounded to two > (or more) > >> ports on the same/different HCA, its also possible to bond child > >> interfaces (eg ib0.8003 and ib1.8003) etc. > >> > >> The bonding driver has one HA and few LB operation modes, > currently, > >> only the HA mode (named Active-Backup) is supported for IPoIB. > >> > >>> 3. the mesage re-transmit from first channel to second > >> channel at TCP > >>> packet level, right ? > >> I am not sure to follow you, in case you ask if after bonding > >> fail-over TCP re-transmission is done over the active > interface used > >> by bonding, the answer is yes. > >> > >> Or. > >> > >> > >> > >> > >> > > > From monis at voltaire.com Thu Apr 19 07:44:30 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 19 Apr 2007 17:44:30 +0300 Subject: [ofa-general] Re: IPoIB bonding document ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> Message-ID: <4627804E.2040004@voltaire.com> Tang, Changqing wrote: > > What is the definition of ib0, ib1 ? > > Are these the IB device returned by ibv_devinfo command and in that > order ? No, ib0 and ib1 are the names of the IPoIB interfaces. ibv_devinfo shows IB port information > I was learned a while back that ibv_get_debvice_list() does not return > consistent results. > > Or are these the IPoIB interfaces ? then the bond-ip must be different > from these two IPs, right ? > then question is, can I use all the three IPs in the same time ? > Not exactly. the ibX interfaces should not have IP addresses at all. So, bond IP is the only IP you deal with. 
When you enslave ib interface (with ib-bond) the ibX interfaces will get the address of the bond interface.

From changquing.tang at hp.com Thu Apr 19 07:51:04 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 19 Apr 2007 14:51:04 -0000
Subject: [ofa-general] RE: IPoIB bonding document ?
In-Reply-To: <4627804E.2040004@voltaire.com>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> <4627804E.2040004@voltaire.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net>

> > Or are these the IPoIB interfaces ? then the bond-ip must be different
> > from these two IPs, right ?
> > then question is, can I use all the three IPs in the same time ?
> >
> Not exactly. the ibX interfaces should not have IP addresses at all.
> So, bond IP is the only IP you deal with.
> When you enslave ib interface (with ib-bond) the ibX
> interfaces will get the address of the bond interface.

But before I run the ib-bond command, there should be two IP addresses on the
two IPoIB interfaces. After I run ib-bond and specify a bond-IP address, do the
two IPoIB addresses disappear ?

--CQ

>
>
>
>

From monis at voltaire.com Thu Apr 19 08:15:12 2007
From: monis at voltaire.com (Moni Shoua)
Date: Thu, 19 Apr 2007 18:15:12 +0300
Subject: [ofa-general] Re: IPoIB bonding document ?
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> <4627804E.2040004@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net>
Message-ID: <46278780.2010900@voltaire.com>

>
> But before I run ib-bond command, there should be two IP addresses on the
> two IPoIB interfaces, after I run ib-bond and specify
> a bond-IP address, the two IPoIB addresses disappear ?

Maybe this output will clear things up.

ib0 and ib1 exist but are not UP and RUNNING

dodly2:/tmp # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:04:23:B3:26:8C
          inet addr:172.25.3.232  Bcast:172.25.255.255  Mask:255.255.0.0
          inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4300276 errors:0 dropped:0 overruns:0 frame:0
          TX packets:130132 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:502482422 (479.2 Mb)  TX bytes:19146376 (18.2 Mb)
          Base address:0xdc00 Memory:deda0000-dedc0000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:11933 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1251975 (1.1 Mb)  TX bytes:1251975 (1.1 Mb)

dodly2:/tmp # ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:04:23:B3:26:8C
          inet addr:172.25.3.232  Bcast:172.25.255.255  Mask:255.255.0.0
          inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4300312 errors:0 dropped:0 overruns:0 frame:0
          TX packets:130145 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:502485726 (479.2 Mb)  TX bytes:19148966 (18.2 Mb)
          Base address:0xdc00 Memory:deda0000-dedc0000

eth2      Link encap:Ethernet  HWaddr 00:04:23:B3:26:8D
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Base address:0xdc80 Memory:dede0000-dee00000

ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:920 (920.0 b)  TX bytes:1900 (1.8 Kb)

ib1       Link encap:UNSPEC  HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:16 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:1216 (1.1 Kb)  TX bytes:516 (516.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:11933 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1251975 (1.1 Mb)  TX bytes:1251975 (1.1 Mb)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

dodly2:/tmp # ib-bond --bond-ip 10.10.10.1 --slaves ib0,ib1

Now ib0 and ib1 are UP and RUNNING but without IP. They are used by bond0 which has an IP

dodly2:/tmp # ifconfig
bond0     Link encap:UNSPEC  HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.10.10.1  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:31 errors:0 dropped:0 overruns:0 frame:0
          TX packets:37 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2312 (2.2 Kb)  TX bytes:2796 (2.7 Kb)

eth0      Link encap:Ethernet  HWaddr 00:04:23:B3:26:8C
          inet addr:172.25.3.232  Bcast:172.25.255.255  Mask:255.255.0.0
          inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4300498 errors:0 dropped:0 overruns:0 frame:0
          TX packets:130173 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:502501178 (479.2 Mb)  TX bytes:19155374 (18.2 Mb)
          Base address:0xdc00 Memory:deda0000-dedc0000

ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet6 addr: fe80::202:c901:a31:b3f1/64 Scope:Link
          UP BROADCAST RUNNING NOARP SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:14 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:1008 (1008.0 b)  TX bytes:2056 (2.0 Kb)

ib1       Link encap:UNSPEC  HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:17 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:1304 (1.2 Kb)  TX bytes:740 (740.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:11933 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1251975 (1.1 Mb)  TX bytes:1251975 (1.1 Mb)
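
For anyone who wants to check the same thing without reading the whole listing, here is a minimal Python sketch. It is not part of ib-bond or OFED: the names parse_ifconfig and bond_summary are made up for illustration, and it assumes the old net-tools ifconfig layout seen above (interface name in column 0, "inet addr:" for IPv4, flags on the line that carries "MTU:"). The sample text is an abridged copy of the post-bonding output.

```python
import re

# Abridged copy of the ifconfig output taken after running ib-bond.
SAMPLE = """\
bond0     Link encap:UNSPEC  HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.10.10.1  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet6 addr: fe80::202:c901:a31:b3f1/64 Scope:Link
          UP BROADCAST RUNNING NOARP SLAVE MULTICAST  MTU:1500  Metric:1

ib1       Link encap:UNSPEC  HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
"""


def parse_ifconfig(text):
    """Hypothetical helper: map interface name -> IPv4 address and flag set."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between interface stanzas
        if not line[0].isspace():
            # A stanza starts with the interface name in column 0.
            current = {"ipv4": None, "flags": set()}
            interfaces[line.split()[0]] = current
            continue
        if current is None:
            continue
        match = re.search(r"inet addr:(\S+)", line)  # does not match "inet6 addr:"
        if match:
            current["ipv4"] = match.group(1)
        if "MTU:" in line:
            # Everything before "MTU:" on this line is the flag list.
            current["flags"] = set(line.split("MTU:")[0].split())
    return interfaces


def bond_summary(interfaces):
    """Hypothetical helper: list masters, slaves, and slaves that still hold an IPv4."""
    masters = sorted(n for n, i in interfaces.items() if "MASTER" in i["flags"])
    slaves = sorted(n for n, i in interfaces.items() if "SLAVE" in i["flags"])
    slaves_with_ip = sorted(n for n in slaves if interfaces[n]["ipv4"])
    return masters, slaves, slaves_with_ip


info = parse_ifconfig(SAMPLE)
masters, slaves, slaves_with_ip = bond_summary(info)
print("masters:", masters)
print("slaves:", slaves)
print("slaves holding an IPv4 address:", slaves_with_ip)
```

On a live system the text would come from running ifconfig rather than from the embedded sample; newer ifconfig or ip output uses a different layout and would need a different parser.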
From changquing.tang at hp.com Thu Apr 19 08:22:58 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 19 Apr 2007 15:22:58 -0000 Subject: [ofa-general] RE: IPoIB bonding document ? In-Reply-To: <46278780.2010900@voltaire.com> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> <4627804E.2040004@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net> <46278780.2010900@voltaire.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403011C1559@G3W0634.americas.hpqcorp.net> OK, that clear my question. Thank you very much. --CQ > -----Original Message----- > From: Moni Shoua [mailto:monis at voltaire.com] > Sent: Thursday, April 19, 2007 10:15 AM > To: Tang, Changqing > Cc: Or Gerlitz; OpenFabrics General > Subject: Re: IPoIB bonding document ? > > > > > But before I run ib-bond command, there should be two IP address on > > the two IPoIB interface, after I run ib-bond and specify a bond-IP > > address, the two IPoIB adresses are disappeared ? 
> > Maybe this output will clear things > > ib0 and ib1 exist but are not UP and RUNNING > > dodly2:/tmp # ifconfig > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300276 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130132 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502482422 (479.2 Mb) TX bytes:19146376 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > dodly2:/tmp # ifconfig -a > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300312 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130145 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502485726 (479.2 Mb) TX bytes:19148966 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > eth2 Link encap:Ethernet HWaddr 00:04:23:B3:26:8D > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > Base address:0xdc80 Memory:dede0000-dee00000 > > ib0 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:13 errors:0 dropped:0 overruns:0 frame:0 > TX packets:25 errors:0 dropped:0 overruns:0 carrier:0 > 
collisions:0 txqueuelen:128 > RX bytes:920 (920.0 b) TX bytes:1900 (1.8 Kb) > > ib1 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:16 errors:0 dropped:0 overruns:0 frame:0 > TX packets:7 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1216 (1.1 Kb) TX bytes:516 (516.0 b) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > dodly2:/tmp # ib-bond --bond-ip 10.10.10.1 --slaves ib0,ib1 > > Now ib0 and ib1 are UP and RUNNING but without IP. 
They are > used by bond0 which has an IP > > dodly2:/tmp # ifconfig > bond0 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:10.10.10.1 Bcast:10.10.10.255 Mask:255.255.255.0 > inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link > UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 > RX packets:31 errors:0 dropped:0 overruns:0 frame:0 > TX packets:37 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:2312 (2.2 Kb) TX bytes:2796 (2.7 Kb) > > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300498 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130173 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502501178 (479.2 Mb) TX bytes:19155374 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > ib0 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet6 addr: fe80::202:c901:a31:b3f1/64 Scope:Link > UP BROADCAST RUNNING NOARP SLAVE MULTICAST > MTU:1500 Metric:1 > RX packets:14 errors:0 dropped:0 overruns:0 frame:0 > TX packets:27 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1008 (1008.0 b) TX bytes:2056 (2.0 Kb) > > ib1 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:17 errors:0 dropped:0 overruns:0 frame:0 > TX packets:10 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1304 (1.2 Kb) TX bytes:740 (740.0 b) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 
overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > > > From changquing.tang at hp.com Thu Apr 19 09:20:40 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 19 Apr 2007 16:20:40 -0000 Subject: [ofa-general] RE: IPoIB bonding document ? In-Reply-To: <46278780.2010900@voltaire.com> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> <462755BD.5020305@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net> <4627804E.2040004@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net> <46278780.2010900@voltaire.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403011C16CE@G3W0634.americas.hpqcorp.net> If I have already configured IP addresses on ib0 and ib1, and both are UP and RUNNING, can I still use ib-bond on them? --CQ > -----Original Message----- > From: Moni Shoua [mailto:monis at voltaire.com] > Sent: Thursday, April 19, 2007 10:15 AM > To: Tang, Changqing > Cc: Or Gerlitz; OpenFabrics General > Subject: Re: IPoIB bonding document ? > > > > > But before I run the ib-bond command, there should be two IP addresses on > > the two IPoIB interfaces. After I run ib-bond and specify a bond-IP > > address, do the two IPoIB addresses disappear?
> > Maybe this output will clear things > > ib0 and ib1 exist but are not UP and RUNNING dodly2:/tmp # ifconfig > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300276 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130132 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502482422 (479.2 Mb) TX bytes:19146376 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > dodly2:/tmp # ifconfig -a > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300312 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130145 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502485726 (479.2 Mb) TX bytes:19148966 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > eth2 Link encap:Ethernet HWaddr 00:04:23:B3:26:8D > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > Base address:0xdc80 Memory:dede0000-dee00000 > > ib0 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:13 errors:0 dropped:0 overruns:0 frame:0 > TX packets:25 errors:0 dropped:0 overruns:0 carrier:0 > 
collisions:0 txqueuelen:128 > RX bytes:920 (920.0 b) TX bytes:1900 (1.8 Kb) > > ib1 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > BROADCAST MULTICAST MTU:1500 Metric:1 > RX packets:16 errors:0 dropped:0 overruns:0 frame:0 > TX packets:7 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1216 (1.1 Kb) TX bytes:516 (516.0 b) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > sit0 Link encap:IPv6-in-IPv4 > NOARP MTU:1480 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > dodly2:/tmp # ib-bond --bond-ip 10.10.10.1 --slaves ib0,ib1 > > Now ib0 and ib1 are UP and RUNNING but without IP. 
They are > used by bond0 which has an IP > > dodly2:/tmp # ifconfig > bond0 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:10.10.10.1 Bcast:10.10.10.255 Mask:255.255.255.0 > inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link > UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 > RX packets:31 errors:0 dropped:0 overruns:0 frame:0 > TX packets:37 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:2312 (2.2 Kb) TX bytes:2796 (2.7 Kb) > > eth0 Link encap:Ethernet HWaddr 00:04:23:B3:26:8C > inet addr:172.25.3.232 Bcast:172.25.255.255 > Mask:255.255.0.0 > inet6 addr: fe80::204:23ff:feb3:268c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4300498 errors:0 dropped:0 overruns:0 frame:0 > TX packets:130173 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:502501178 (479.2 Mb) TX bytes:19155374 (18.2 Mb) > Base address:0xdc00 Memory:deda0000-dedc0000 > > ib0 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet6 addr: fe80::202:c901:a31:b3f1/64 Scope:Link > UP BROADCAST RUNNING NOARP SLAVE MULTICAST > MTU:1500 Metric:1 > RX packets:14 errors:0 dropped:0 overruns:0 frame:0 > TX packets:27 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1008 (1008.0 b) TX bytes:2056 (2.0 Kb) > > ib1 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > inet6 addr: fe80::202:c901:a31:b3f2/64 Scope:Link > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:17 errors:0 dropped:0 overruns:0 frame:0 > TX packets:10 errors:0 dropped:1 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:1304 (1.2 Kb) TX bytes:740 (740.0 b) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:11933 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11933 errors:0 dropped:0 
overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1251975 (1.1 Mb) TX bytes:1251975 (1.1 Mb) > > > > From rdreier at cisco.com Thu Apr 19 11:25:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Apr 2007 11:25:33 -0700 Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git In-Reply-To: <000b01c777ca$10ef0ef0$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Thu, 5 Apr 2007 14:33:35 -0700") References: <000b01c777ca$10ef0ef0$ff0da8c0@amr.corp.intel.com> Message-ID: Sorry for the slow response. I applied all of these to for-2.6.22 except for "Limit CM message timeout". Was there a consensus that this was a good idea, given that fixed Engenio FW is available? The reason I'm resisting applying this is that as the patch stands, it creates another "max_timeout" knob with the potential to break things if someone sets it too low. And if we leave the knob out, then maybe we're breaking setups where a really high timeout is needed (maybe they'll send an IB router to Mars or something). - R. From etta at systemfabricworks.com Thu Apr 19 12:03:19 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Thu, 19 Apr 2007 14:03:19 -0500 Subject: [ofa-general] RE: [ewg] OFED 1.2 rc2 release - RHEL 5 installation Message-ID: <002f01c782b5$651b06f0$c801a8c0@ettac> Hi, I installed today's build OFED-1.2-20070419-0600 on RHEL5. The installation was very successful. Thanks, Etta -----Original Message----- From: Chieng Etta [mailto:etta at systemfabricworks.com] Sent: Thursday, April 19, 2007 9:11 AM To: 'vlad at dev.mellanox.co.il' Cc: 'ewg at lists.openfabrics.org'; 'Tziporet Koren'; 'general at lists.openfabrics.org' Subject: RE: [ewg] OFED 1.2 rc2 release Hi Vlad, Actually, before installing rc2, I used uninstall script to remove the OFED packages. I also tried the command you provided below but received a message indicating none of listed packages were installed. I will wait and try today's build. 
Thanks, Etta -----Original Message----- From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] Sent: Thursday, April 19, 2007 6:29 AM To: Chieng Etta Cc: ewg at lists.openfabrics.org; 'Tziporet Koren'; general at lists.openfabrics.org Subject: RE: [ewg] OFED 1.2 rc2 release On Wed, 2007-04-18 at 16:38 -0500, Chieng Etta wrote: > > > error: Failed dependencies: > librdmacm.so is needed by dapl-1.2.1-0.x86_64 > librdmacm.so(RDMACM_1.0) is needed by dapl-1.2.1-0.x86_64 > > Hi, You should uninstall ofed RPMs that were installed with RH5.0 and then run OFED-1.2-rc2 installation. Use the following command: /bin/rpm -e --allmatches opensm-libs opensm-devel opensm openmpi-libs openmpi-devel openmpi \ openib-tvflash openib-srptools openib-perftest openib-mstflint openib-diags \ openib librdmacm-utils librdmacm-devel librdmacm libibverbs-utils libibverbs-devel \ libibverbs libibumad-devel libibumad libibmad-devel libibmad libibcommon-devel \ libibcommon libibcm-devel libibcm dapl-devel dapl Today's OFED build (OFED-1.2-20070419-0600) will include the fix for this issue. -- Vladimir Sokolovsky Mellanox Technologies Ltd. 
From jsquyres at cisco.com Thu Apr 19 12:52:35 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 19 Apr 2007 15:52:35 -0400 Subject: [ofa-general] mpi-selector changes Message-ID: <6D03C475-EF62-4ADF-B2B6-85B2AC462B5F@cisco.com> Vlad -- Per https://bugs.openfabrics.org/show_bug.cgi?id=487, please apply this patch to build.sh: @@ -436,7 +436,7 @@ mpi-selector() --define \'_topdir ${RPM_DIR}\' \ --define \'ofed 1\' \ --define \'buildroot ${BUILD_ROOT}\' \ - --define \'configure_options --mandir=${mandir} --with-shell-startup-dir=/etc/profile.d\' \ + --define \'configure_options --mandir=${mandir} --with-shell-startup-dir=/etc/profile.d --localstatedir=/var/lib\' \ --define \'_prefix ${STACK_PREFIX}\' \ --define \'_mandir ${mandir}\' \ --define \'_docdir %{_prefix}/share/doc/${MPI_SELECTOR_NAME}-${MPI_SELECTOR_VERSION}\' \ Also please pull a new mpi-selector from ~jsquyres/mpi-selector (ofed_1_2 branch, of course); I made a minor change in configure.ac so that it now honors the --localstatedir option to configure. Thanks. -- Jeff Squyres Cisco Systems From rdreier at cisco.com Thu Apr 19 13:28:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Apr 2007 13:28:26 -0700 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070419125722.GD918@mellanox.co.il> (Michael S. 
Tsirkin's message of "Thu, 19 Apr 2007 15:57:22 +0300") References: <20070327205213.GD28347@mellanox.co.il> <6a122cc00703280200h33f384b9jae75592294a9cbd9@mail.gmail.com> <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> Message-ID: > So since all this thread was started by Moni because of IPoIB, > the path is clear in that respect, and would already be a step in the > right direction: > > - a patch to add ib_find_pkey() and ib_find_gid() to core > - a patch to replace cache usage in IPoIB / SRP with uncached > hardware accesses on top of this > - pkey change handling patch on top of these Makes good sense to me. > > I *think* we can keep this provider-specific cache updated by snooping incoming > MADs in driver. And if it can be done this way, in all providers, we might be able to > simply require that query_pkey/query_gid in providers do not sleep. > If that's true, we'll save ourselves the work of auditing core - ib_find_pkey() > and ib_find_gid() will be a drop-in replacement for cache. > > Roland, what do you think? I'm not sure I like changing the rules for query_pkey/query_gid -- I think slow-path operations like that should be able to sleep. And in particular, ehca does a hypercall in query_pkey that seems to sleep, and it seems quite possible that other adapters may need to do similar things in the future if they have an internal SMA (that makes snooping MADs impossible). - R. From mst at dev.mellanox.co.il Thu Apr 19 13:37:05 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Apr 2007 23:37:05 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: References: <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> Message-ID: <20070419203705.GA613@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: pkey change handling patch > > > So since all this thread was started by Moni because of IPoIB, > > the path is clear in that respect, and would already be a step in the > > right direction: > > > > - a patch to add ib_find_pkey() and ib_find_gid() to core > > - a patch to replace cache usage in IPoIB / SRP with uncached > > hardware accesses on top of this > > - pkey change handling patch on top of these > > Makes good sense to me. OK, let's do this for starters. Moni? -- MST From mshefty at ichips.intel.com Thu Apr 19 15:14:36 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Apr 2007 15:14:36 -0700 Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git In-Reply-To: References: <000b01c777ca$10ef0ef0$ff0da8c0@amr.corp.intel.com> Message-ID: <4627E9CC.1070606@ichips.intel.com> Roland Dreier wrote: > Sorry for the slow response. I applied all of these to for-2.6.22 > except for "Limit CM message timeout". Was there a consensus that > this was a good idea, given that fixed Engenio FW is available? I'm not 100% set on this patch, so I'll let others comment on its need. OFED was going to include a patch in 1.2. I created this patch to fix issues in the OFED patch. > they'll send an IB router to Mars or something). :) The default timeout is only about 1/6th of what it needs to be (including retry time) when Mars is on the far side of the sun... 
- Sean From sweitzen at cisco.com Thu Apr 19 15:38:08 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 19 Apr 2007 15:38:08 -0700 Subject: [ofa-general] Slow failover of IPoIB ipoibtools/bonding (bug 541) Message-ID: Roland, Michael, or Sean, this is what I see when IPoIB failover is slow. How do we get this fixed? ib0: Request connection 0x60406 for gid fe80:0000:0000:0000:0002:c902:0020:e1d9 qpn 0x404 ib0: REP received. ib0: REQ arrived ib0: failed cm send event (status=12, wrid=45 vend_err 81) ib0: Destroy active connection 0x60406 head 0x6546f tail 0x6546e ib0: Request connection 0x70406 for gid fe80:0000:0000:0000:0002:c902:0020:e1d9 qpn 0x404 https://bugs.openfabrics.org//show_bug.cgi?id=541 Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From rdreier at cisco.com Thu Apr 19 16:02:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Apr 2007 16:02:25 -0700 Subject: [ofa-general] [PATCH 1/2][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules In-Reply-To: (Roland Dreier's message of "Tue, 17 Apr 2007 14:23:47 -0700") References: Message-ID: Export ib_umem_get()/ib_umem_release() and put low-level drivers in control of when to call ib_umem_get() to pin and DMA map userspace, rather than always calling it in ib_uverbs_reg_mr() before calling the low-level driver's reg_user_mr method. Also move these functions to be in the ib_core module instead of ib_uverbs, so that driver modules using them do not depend on ib_uverbs. This has a number of advantages: - It is better design from the standpoint of making generic code a library that can be used or overridden by device-specific code as the details of specific devices dictate. - Drivers that do not need to pin userspace memory regions do not need to take the performance hit of calling ib_umem_get(). 
For example, although I have not tried to implement it in this patch, the ipath driver should be able to avoid pinning memory and just use copy_{to,from}_user() to access userspace memory regions. - Buffers that need special mapping treatment can be identified by the low-level driver. For example, it may be possible to solve some Altix-specific memory ordering issues with mthca CQs in userspace by mapping CQ buffers with extra flags. - Drivers that need to pin and DMA map userspace memory for things other than memory regions can use ib_umem_get() directly, instead of hacks using extra parameters to their reg_phys_mr method. For example, the mlx4 driver that is pending being merged needs to pin and DMA map QP and CQ buffers, but it does not need to create a memory key for these buffers. So the cleanest solution is for mlx4 to call ib_umem_get() in the create_qp and create_cq methods. Signed-off-by: Roland Dreier --- Here's the updated patch that puts a flush_scheduled_work() in the ib_core module's exit function. 
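As a rough illustration of the bookkeeping that ib_umem_get() does in the patch that follows, here is a minimal userspace sketch of two of its calculations: how many pages must be pinned for an unaligned start address and length, and whether the pages need to be writable given the access flags. It assumes 4 KiB pages. The SK_* macros and sketch_* function names are hypothetical stand-ins for the kernel's PAGE_* macros and are not part of the patch, and the SK_ACCESS_* values are assumed to mirror the ib_access_flags enum in include/rdma/ib_verbs.h.

```c
#include <stddef.h>

/* Assumed page geometry: 4 KiB pages (hypothetical stand-ins for PAGE_*). */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))
#define SK_PAGE_ALIGN(x) (((x) + SK_PAGE_SIZE - 1) & SK_PAGE_MASK)

/* Assumed to match enum ib_access_flags; shown for illustration only. */
enum {
	SK_ACCESS_LOCAL_WRITE   = 1,
	SK_ACCESS_REMOTE_WRITE  = 1 << 1,
	SK_ACCESS_REMOTE_READ   = 1 << 2,
	SK_ACCESS_REMOTE_ATOMIC = 1 << 3,
	SK_ACCESS_MW_BIND       = 1 << 4
};

/*
 * Number of pages covered by [addr, addr + size): the offset of addr
 * within its page is added to the length before rounding up, so a
 * small region that straddles a page boundary counts as two pages.
 */
static unsigned long sketch_npages(unsigned long addr, size_t size)
{
	unsigned long offset = addr & ~SK_PAGE_MASK;

	return SK_PAGE_ALIGN(size + offset) >> SK_PAGE_SHIFT;
}

/*
 * Writable pages are requested if any access flag other than
 * "remote read" is set, mirroring the comment in the patch.
 */
static int sketch_writable(int access)
{
	return !!(access & ~SK_ACCESS_REMOTE_READ);
}
```

With these definitions, a 32-byte region starting 16 bytes before a page boundary needs two pages, while a region registered only for remote read is pinned read-only.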
drivers/infiniband/Kconfig | 5 + drivers/infiniband/core/Makefile | 4 +- drivers/infiniband/core/device.c | 2 + drivers/infiniband/core/umem.c | 273 ++++++++++++++++++++++++++ drivers/infiniband/core/uverbs.h | 6 +- drivers/infiniband/core/uverbs_cmd.c | 60 ++---- drivers/infiniband/core/uverbs_main.c | 11 +- drivers/infiniband/core/uverbs_mem.c | 225 --------------------- drivers/infiniband/hw/amso1100/c2_provider.c | 42 +++-- drivers/infiniband/hw/amso1100/c2_provider.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 28 ++- drivers/infiniband/hw/cxgb3/iwch_provider.h | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 69 ++++--- drivers/infiniband/hw/ipath/ipath_mr.c | 38 +++- drivers/infiniband/hw/ipath/ipath_verbs.h | 5 +- drivers/infiniband/hw/mthca/mthca_provider.c | 38 +++- drivers/infiniband/hw/mthca/mthca_provider.h | 1 + include/rdma/ib_umem.h | 78 ++++++++ include/rdma/ib_verbs.h | 28 +--- 21 files changed, 536 insertions(+), 383 deletions(-) create mode 100644 drivers/infiniband/core/umem.c delete mode 100644 drivers/infiniband/core/uverbs_mem.c create mode 100644 include/rdma/ib_umem.h diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 66b36de..82afba5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -29,6 +29,11 @@ config INFINIBAND_USER_ACCESS libibverbs, libibcm and a hardware driver library from . 
+config INFINIBAND_USER_MEM + bool + depends on INFINIBAND_USER_ACCESS != n + default y + config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND && INET diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 189e5d4..cb1ab3e 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -9,6 +9,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o +ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o @@ -28,5 +29,4 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ - uverbs_marshall.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 7fabb42..592c90a 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -613,6 +613,8 @@ static void __exit ib_core_cleanup(void) { ib_cache_cleanup(); ib_sysfs_cleanup(); + /* Make sure that any pending umem accounting work is done. */ + flush_scheduled_work(); } module_init(ib_core_init); diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c new file mode 100644 index 0000000..48e854c --- /dev/null +++ b/drivers/infiniband/core/umem.c @@ -0,0 +1,273 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: uverbs_mem.c 2743 2005-06-28 22:27:59Z roland $ + */ + +#include +#include + +#include "uverbs.h" + +struct ib_umem_account_work { + struct work_struct work; + struct mm_struct *mm; + unsigned long diff; +}; + + +static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) +{ + struct ib_umem_chunk *chunk, *tmp; + int i; + + list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { + ib_dma_unmap_sg(dev, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL); + for (i = 0; i < chunk->nents; ++i) { + if (umem->writable && dirty) + set_page_dirty_lock(chunk->page_list[i].page); + put_page(chunk->page_list[i].page); + } + + kfree(chunk); + } +} + +/** + * ib_umem_get - Pin and DMA map userspace memory. 
+ * @context: userspace context to pin memory for + * @addr: userspace virtual address to start at + * @size: length of region to pin + * @access: IB_ACCESS_xxx flags for memory being pinned + */ +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access) +{ + struct ib_umem *umem; + struct page **page_list; + struct ib_umem_chunk *chunk; + unsigned long locked; + unsigned long lock_limit; + unsigned long cur_base; + unsigned long npages; + int ret; + int off; + int i; + + if (!can_do_mlock()) + return ERR_PTR(-EPERM); + + umem = kmalloc(sizeof *umem, GFP_KERNEL); + if (!umem) + return ERR_PTR(-ENOMEM); + + umem->context = context; + umem->length = size; + umem->offset = addr & ~PAGE_MASK; + umem->page_size = PAGE_SIZE; + /* + * We ask for writable memory if any access flags other than + * "remote read" are set. "Local write" and "remote write" + * obviously require write access. "Remote atomic" can do + * things like fetch and add, which will modify memory, and + * "MW bind" can change permissions by binding a window. 
+ */ + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ); + + INIT_LIST_HEAD(&umem->chunk_list); + + page_list = (struct page **) __get_free_page(GFP_KERNEL); + if (!page_list) { + kfree(umem); + return ERR_PTR(-ENOMEM); + } + + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT; + + down_write(&current->mm->mmap_sem); + + locked = npages + current->mm->locked_vm; + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; + + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { + ret = -ENOMEM; + goto out; + } + + cur_base = addr & PAGE_MASK; + + while (npages) { + ret = get_user_pages(current, current->mm, cur_base, + min_t(int, npages, + PAGE_SIZE / sizeof (struct page *)), + 1, !umem->writable, page_list, NULL); + + if (ret < 0) + goto out; + + cur_base += ret * PAGE_SIZE; + npages -= ret; + + off = 0; + + while (ret) { + chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) * + min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK), + GFP_KERNEL); + if (!chunk) { + ret = -ENOMEM; + goto out; + } + + chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); + for (i = 0; i < chunk->nents; ++i) { + chunk->page_list[i].page = page_list[i + off]; + chunk->page_list[i].offset = 0; + chunk->page_list[i].length = PAGE_SIZE; + } + + chunk->nmap = ib_dma_map_sg(context->device, + &chunk->page_list[0], + chunk->nents, + DMA_BIDIRECTIONAL); + if (chunk->nmap <= 0) { + for (i = 0; i < chunk->nents; ++i) + put_page(chunk->page_list[i].page); + kfree(chunk); + + ret = -ENOMEM; + goto out; + } + + ret -= chunk->nents; + off += chunk->nents; + list_add_tail(&chunk->list, &umem->chunk_list); + } + + ret = 0; + } + +out: + if (ret < 0) { + __ib_umem_release(context->device, umem, 0); + kfree(umem); + } else + current->mm->locked_vm = locked; + + up_write(&current->mm->mmap_sem); + free_page((unsigned long) page_list); + + return ret < 0 ? 
ERR_PTR(ret) : umem; +} +EXPORT_SYMBOL(ib_umem_get); + +static void ib_umem_account(struct work_struct *_work) +{ + struct ib_umem_account_work *work = + container_of(_work, struct ib_umem_account_work, work); + + down_write(&work->mm->mmap_sem); + work->mm->locked_vm -= work->diff; + up_write(&work->mm->mmap_sem); + mmput(work->mm); + kfree(work); +} + +/** + * ib_umem_release - release memory pinned with ib_umem_get + * @umem: umem struct to release + */ +void ib_umem_release(struct ib_umem *umem) +{ + struct ib_umem_account_work *work; + struct ib_ucontext *context = umem->context; + struct mm_struct *mm; + unsigned long diff; + + __ib_umem_release(umem->context->device, umem, 1); + + mm = get_task_mm(current); + if (!mm) + return; + + diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + kfree(umem); + + /* + * We may be called with the mm's mmap_sem already held. This + * can happen when a userspace munmap() is the call that drops + * the last reference to our file and calls our release + * method. If there are memory regions to destroy, we'll end + * up here and not be able to take the mmap_sem. In that case + * we defer the vm_locked accounting to the system workqueue. 
+ */ + if (context->closing && !down_write_trylock(&mm->mmap_sem)) { + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) { + mmput(mm); + return; + } + + INIT_WORK(&work->work, ib_umem_account); + work->mm = mm; + work->diff = diff; + + schedule_work(&work->work); + return; + } else + down_write(&mm->mmap_sem); + + current->mm->locked_vm -= diff; + up_write(&mm->mmap_sem); + mmput(mm); +} +EXPORT_SYMBOL(ib_umem_release); + +int ib_umem_page_count(struct ib_umem *umem) +{ + struct ib_umem_chunk *chunk; + int shift; + int i; + int n; + + shift = ilog2(umem->page_size); + + n = 0; + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) + n += sg_dma_len(&chunk->page_list[i]) >> shift; + + return n; +} +EXPORT_SYMBOL(ib_umem_page_count); diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 102a59c..c33546f 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -45,6 +45,7 @@ #include #include +#include #include /* @@ -163,11 +164,6 @@ void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event); -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write); -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem); -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem); - #define IB_UVERBS_DECLARE_CMD(name) \ ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ const char __user *buf, int in_len, \ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 4fd75af..8c338bc 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. 
All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * Copyright (c) 2006 Mellanox Technologies. All rights reserved. * @@ -295,6 +295,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, INIT_LIST_HEAD(&ucontext->qp_list); INIT_LIST_HEAD(&ucontext->srq_list); INIT_LIST_HEAD(&ucontext->ah_list); + ucontext->closing = 0; resp.num_comp_vectors = file->device->num_comp_vectors; @@ -573,7 +574,7 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, struct ib_uverbs_reg_mr cmd; struct ib_uverbs_reg_mr_resp resp; struct ib_udata udata; - struct ib_umem_object *obj; + struct ib_uobject *uobj; struct ib_pd *pd; struct ib_mr *mr; int ret; @@ -599,35 +600,21 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE)) return -EINVAL; - obj = kmalloc(sizeof *obj, GFP_KERNEL); - if (!obj) + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) return -ENOMEM; - init_uobj(&obj->uobject, 0, file->ucontext, &mr_lock_key); - down_write(&obj->uobject.mutex); - - /* - * We ask for writable memory if any access flags other than - * "remote read" are set. "Local write" and "remote write" - * obviously require write access. "Remote atomic" can do - * things like fetch and add, which will modify memory, and - * "MW bind" can change permissions by binding a window. 
- */ - ret = ib_umem_get(file->device->ib_dev, &obj->umem, - (void *) (unsigned long) cmd.start, cmd.length, - !!(cmd.access_flags & ~IB_ACCESS_REMOTE_READ)); - if (ret) - goto err_free; - - obj->umem.virt_base = cmd.hca_va; + init_uobj(uobj, 0, file->ucontext, &mr_lock_key); + down_write(&uobj->mutex); pd = idr_read_pd(cmd.pd_handle, file->ucontext); if (!pd) { ret = -EINVAL; - goto err_release; + goto err_free; } - mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); + mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, + cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); goto err_put; @@ -635,19 +622,19 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, mr->device = pd->device; mr->pd = pd; - mr->uobject = &obj->uobject; + mr->uobject = uobj; atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - obj->uobject.object = mr; - ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); + uobj->object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, uobj); if (ret) goto err_unreg; memset(&resp, 0, sizeof resp); resp.lkey = mr->lkey; resp.rkey = mr->rkey; - resp.mr_handle = obj->uobject.id; + resp.mr_handle = uobj->id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { @@ -658,17 +645,17 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, put_pd_read(pd); mutex_lock(&file->mutex); - list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); + list_add_tail(&uobj->list, &file->ucontext->mr_list); mutex_unlock(&file->mutex); - obj->uobject.live = 1; + uobj->live = 1; - up_write(&obj->uobject.mutex); + up_write(&uobj->mutex); return in_len; err_copy: - idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); err_unreg: ib_dereg_mr(mr); @@ -676,11 +663,8 @@ err_unreg: err_put: put_pd_read(pd); -err_release: - ib_umem_release(file->device->ib_dev, &obj->umem); - err_free: - put_uobj_write(&obj->uobject); + put_uobj_write(uobj); return ret; } @@ 
-691,7 +675,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; struct ib_uobject *uobj; - struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -701,8 +684,7 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, if (!uobj) return -EINVAL; - memobj = container_of(uobj, struct ib_umem_object, uobject); - mr = uobj->object; + mr = uobj->object; ret = ib_dereg_mr(mr); if (!ret) @@ -719,8 +701,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_umem_release(file->device->ib_dev, &memobj->umem); - put_uobj(uobj); return in_len; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f8bc822..41c2065 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -183,6 +183,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, if (!context) return 0; + context->closing = 1; + list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { struct ib_ah *ah = uobj->object; @@ -230,16 +232,10 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = uobj->object; - struct ib_device *mrdev = mr->device; - struct ib_umem_object *memobj; idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); - - memobj = container_of(uobj, struct ib_umem_object, uobject); - ib_umem_release_on_close(mrdev, &memobj->umem); - - kfree(memobj); + kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { @@ -906,7 +902,6 @@ static void __exit ib_uverbs_cleanup(void) unregister_filesystem(&uverbs_event_fs); class_destroy(uverbs_class); unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES); - flush_scheduled_work(); idr_destroy(&ib_uverbs_pd_idr); idr_destroy(&ib_uverbs_mr_idr); idr_destroy(&ib_uverbs_mw_idr); 
diff --git a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/uverbs_mem.c deleted file mode 100644 index c95fe95..0000000 --- a/drivers/infiniband/core/uverbs_mem.c +++ /dev/null @@ -1,225 +0,0 @@ -/* - * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. - * Copyright (c) 2005 Mellanox Technologies. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id: uverbs_mem.c 2743 2005-06-28 22:27:59Z roland $ - */ - -#include -#include - -#include "uverbs.h" - -struct ib_umem_account_work { - struct work_struct work; - struct mm_struct *mm; - unsigned long diff; -}; - - -static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) -{ - struct ib_umem_chunk *chunk, *tmp; - int i; - - list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { - ib_dma_unmap_sg(dev, chunk->page_list, - chunk->nents, DMA_BIDIRECTIONAL); - for (i = 0; i < chunk->nents; ++i) { - if (umem->writable && dirty) - set_page_dirty_lock(chunk->page_list[i].page); - put_page(chunk->page_list[i].page); - } - - kfree(chunk); - } -} - -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write) -{ - struct page **page_list; - struct ib_umem_chunk *chunk; - unsigned long locked; - unsigned long lock_limit; - unsigned long cur_base; - unsigned long npages; - int ret = 0; - int off; - int i; - - if (!can_do_mlock()) - return -EPERM; - - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; - - mem->user_base = (unsigned long) addr; - mem->length = size; - mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; - mem->writable = write; - - INIT_LIST_HEAD(&mem->chunk_list); - - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; - - down_write(&current->mm->mmap_sem); - - locked = npages + current->mm->locked_vm; - lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; - - if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { - ret = -ENOMEM; - goto out; - } - - cur_base = (unsigned long) addr & PAGE_MASK; - - while (npages) { - ret = get_user_pages(current, current->mm, cur_base, - min_t(int, npages, - PAGE_SIZE / sizeof (struct page *)), - 1, !write, page_list, NULL); - - if (ret < 0) - goto out; - - cur_base += ret * PAGE_SIZE; - npages -= ret; - - off = 0; - - while (ret) { - chunk = kmalloc(sizeof
*chunk + sizeof (struct scatterlist) * - min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK), - GFP_KERNEL); - if (!chunk) { - ret = -ENOMEM; - goto out; - } - - chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); - for (i = 0; i < chunk->nents; ++i) { - chunk->page_list[i].page = page_list[i + off]; - chunk->page_list[i].offset = 0; - chunk->page_list[i].length = PAGE_SIZE; - } - - chunk->nmap = ib_dma_map_sg(dev, - &chunk->page_list[0], - chunk->nents, - DMA_BIDIRECTIONAL); - if (chunk->nmap <= 0) { - for (i = 0; i < chunk->nents; ++i) - put_page(chunk->page_list[i].page); - kfree(chunk); - - ret = -ENOMEM; - goto out; - } - - ret -= chunk->nents; - off += chunk->nents; - list_add_tail(&chunk->list, &mem->chunk_list); - } - - ret = 0; - } - -out: - if (ret < 0) - __ib_umem_release(dev, mem, 0); - else - current->mm->locked_vm = locked; - - up_write(&current->mm->mmap_sem); - free_page((unsigned long) page_list); - - return ret; -} - -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem) -{ - __ib_umem_release(dev, umem, 1); - - down_write(&current->mm->mmap_sem); - current->mm->locked_vm -= - PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; - up_write(&current->mm->mmap_sem); -} - -static void ib_umem_account(struct work_struct *_work) -{ - struct ib_umem_account_work *work = - container_of(_work, struct ib_umem_account_work, work); - - down_write(&work->mm->mmap_sem); - work->mm->locked_vm -= work->diff; - up_write(&work->mm->mmap_sem); - mmput(work->mm); - kfree(work); -} - -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem) -{ - struct ib_umem_account_work *work; - struct mm_struct *mm; - - __ib_umem_release(dev, umem, 1); - - mm = get_task_mm(current); - if (!mm) - return; - - /* - * We may be called with the mm's mmap_sem already held. This - * can happen when a userspace munmap() is the call that drops - * the last reference to our file and calls our release - * method.
If there are memory regions to destroy, we'll end - * up here and not be able to take the mmap_sem. Therefore we - * defer the vm_locked accounting to the system workqueue. - */ - - work = kmalloc(sizeof *work, GFP_KERNEL); - if (!work) { - mmput(mm); - return; - } - - INIT_WORK(&work->work, ib_umem_account); - work->mm = mm; - work->diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; - - schedule_work(&work->work); -} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index fef9727..10a085d 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -56,6 +56,7 @@ #include #include +#include #include #include "c2.h" #include "c2_provider.h" @@ -396,6 +397,7 @@ static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, } mr->pd = to_c2pd(ib_pd); + mr->umem = NULL; pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " "*iova_start %llx, first pa %llx, last pa %llx\n", __FUNCTION__, page_shift, pbl_depth, total_len, @@ -428,8 +430,8 @@ static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); } -static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { u64 *pages; u64 kva = 0; @@ -441,15 +443,23 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct c2_mr *c2mr; pr_debug("%s:%u\n", __FUNCTION__, __LINE__); - shift = ffs(region->page_size) - 1; c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); if (!c2mr) return ERR_PTR(-ENOMEM); c2mr->pd = c2pd; + c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(c2mr->umem)) { + err = PTR_ERR(c2mr->umem); + kfree(c2mr); + return ERR_PTR(err); + } + + shift = ffs(c2mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, 
&region->chunk_list, list) + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -459,35 +469,34 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, } i = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) { for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - (region->page_size * k); + (c2mr->umem->page_size * k); } } } - kva = (u64)region->virt_base; + kva = virt; err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), pages, - region->page_size, + c2mr->umem->page_size, i, - region->length, - region->offset, + length, + c2mr->umem->offset, &kva, c2_convert_access(acc), c2mr); kfree(pages); - if (err) { - kfree(c2mr); - return ERR_PTR(err); - } + if (err) + goto err; return &c2mr->ibmr; err: + ib_umem_release(c2mr->umem); kfree(c2mr); return ERR_PTR(err); } @@ -502,8 +511,11 @@ static int c2_dereg_mr(struct ib_mr *ib_mr) err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); if (err) pr_debug("c2_stag_dealloc failed: %d\n", err); - else + else { + if (mr->umem) + ib_umem_release(mr->umem); kfree(mr); + } return err; } diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h index fc90622..1076df2 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.h +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -73,6 +73,7 @@ struct c2_pd { struct c2_mr { struct ib_mr ibmr; struct c2_pd *pd; + struct ib_umem *umem; }; struct c2_av; diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..98cdd13 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include
"cxio_hal.h" @@ -441,6 +442,8 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr) remove_handle(rhp, &rhp->mmidr, mmid); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); + if (mhp->umem) + ib_umem_release(mhp->umem); PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); kfree(mhp); return 0; @@ -575,8 +578,8 @@ static int iwch_reregister_phys_mem(struct ib_mr *mr, } -static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { __be64 *pages; int shift, n, len; @@ -589,7 +592,6 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct iwch_reg_user_mr_resp uresp; PDBG("%s ib_pd %p\n", __FUNCTION__, pd); - shift = ffs(region->page_size) - 1; php = to_iwch_pd(pd); rhp = php->rhp; @@ -597,8 +599,17 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mhp->umem)) { + err = PTR_ERR(mhp->umem); + kfree(mhp); + return ERR_PTR(err); + } + + shift = ffs(mhp->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -609,13 +620,13 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, i = n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = cpu_to_be64(sg_dma_address( &chunk->page_list[j]) + - region->page_size * k); + mhp->umem->page_size * k); } } @@ -623,9 +634,9 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = iwch_ib_to_tpt_access(acc); - mhp->attr.va_fbo = region->virt_base; + mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; - mhp->attr.len = (u32) region->length; + mhp->attr.len = (u32) length; mhp->attr.pbl_size = i; err = iwch_register_mem(rhp, php, mhp, shift, pages); kfree(pages); @@ -648,6 +659,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, return &mhp->ibmr; err: + ib_umem_release(mhp->umem); kfree(mhp); return ERR_PTR(err); } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index 93bcc56..48833f3 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -73,6 +73,7 @@ struct tpt_attributes { struct iwch_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct iwch_dev *rhp; u64 kva; struct tpt_attributes attr; diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 82ded44..88e7866 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -175,6 +175,7 @@ struct ehca_mr { struct ib_mr ib_mr; /* must always be first in ehca_mr */ struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ } ib; + struct ib_umem *umem; spinlock_t mrlock; enum ehca_mr_flag flags; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..9b22c5a 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -78,8 +78,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, int num_phys_buf, int mr_access_flags, u64 *iova_start); -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, int mr_access_flags, struct ib_udata *udata); int ehca_rereg_phys_mr(struct ib_mr *mr, diff --git 
a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index d22ab56..84c5bb4 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -39,6 +39,8 @@ * POSSIBILITY OF SUCH DAMAGE. */ +#include + #include #include "ehca_iverbs.h" @@ -238,10 +240,8 @@ reg_phys_mr_exit0: /*----------------------------------------------------------------------*/ -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, - int mr_access_flags, - struct ib_udata *udata) +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, + int mr_access_flags, struct ib_udata *udata) { struct ib_mr *ib_mr; struct ehca_mr *e_mr; @@ -257,11 +257,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ehca_gen_err("bad pd=%p", pd); return ERR_PTR(-EFAULT); } - if (!region) { - ehca_err(pd->device, "bad input values: region=%p", region); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && @@ -275,17 +271,10 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } - if (region->page_size != PAGE_SIZE) { - ehca_err(pd->device, "page size not supported, " - "region->page_size=%x", region->page_size); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } - if ((region->length == 0) || - ((region->virt_base + region->length) < region->virt_base)) { + if (length == 0 || virt + length < virt) { ehca_err(pd->device, "bad input values: length=%lx " - "virt_base=%lx", region->length, region->virt_base); + "virt_base=%lx", length, virt); ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } @@ -297,40 +286,55 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, goto reg_user_mr_exit0; } + e_mr->umem = ib_umem_get(pd->uobject->context, start, length, + mr_access_flags); + if (IS_ERR(e_mr->umem)) { + ib_mr =
(void *) e_mr->umem; + goto reg_user_mr_exit1; + } + + if (e_mr->umem->page_size != PAGE_SIZE) { + ehca_err(pd->device, "page size not supported, " + "e_mr->umem->page_size=%x", e_mr->umem->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit2; + } + /* determine number of MR pages */ - num_pages_mr = (((region->virt_base % PAGE_SIZE) + region->length + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = (((region->virt_base % EHCA_PAGESIZE) + region->length + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = (((virt % PAGE_SIZE) + length + PAGE_SIZE - 1) / + PAGE_SIZE); + num_pages_4k = (((virt % EHCA_PAGESIZE) + length + EHCA_PAGESIZE - 1) / + EHCA_PAGESIZE); /* register MR on HCA */ pginfo.type = EHCA_MR_PGI_USER; pginfo.num_pages = num_pages_mr; pginfo.num_4k = num_pages_4k; - pginfo.region = region; - pginfo.next_4k = region->offset / EHCA_PAGESIZE; + pginfo.region = e_mr->umem; + pginfo.next_4k = e_mr->umem->offset / EHCA_PAGESIZE; pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, - (®ion->chunk_list), + (&e_mr->umem->chunk_list), list); - ret = ehca_reg_mr(shca, e_mr, (u64*)region->virt_base, - region->length, mr_access_flags, e_pd, &pginfo, - &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); if (ret) { ib_mr = ERR_PTR(ret); - goto reg_user_mr_exit1; + goto reg_user_mr_exit2; } /* successful registration of all pages */ return &e_mr->ib.ib_mr; +reg_user_mr_exit2: + ib_umem_release(e_mr->umem); reg_user_mr_exit1: ehca_mr_delete(e_mr); reg_user_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(pd->device, "rc=%lx pd=%p region=%p mr_access_flags=%x" + ehca_err(pd->device, "rc=%lx pd=%p mr_access_flags=%x" " udata=%p", - PTR_ERR(ib_mr), pd, region, mr_access_flags, udata); + PTR_ERR(ib_mr), pd, mr_access_flags, udata); return ib_mr; } /* end ehca_reg_user_mr() */ @@ -596,6 +600,9 @@ int ehca_dereg_mr(struct ib_mr *mr) goto 
dereg_mr_exit0; } + if (e_mr->umem) + ib_umem_release(e_mr->umem); + /* successful deregistration */ ehca_mr_delete(e_mr); diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index 8cc8598..8e91c8b 100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -31,6 +31,7 @@ * SOFTWARE. */ +#include #include #include @@ -147,6 +148,7 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, mr->mr.offset = 0; mr->mr.access_flags = acc; mr->mr.max_segs = num_phys_buf; + mr->umem = NULL; m = 0; n = 0; @@ -170,50 +172,60 @@ bail: /** * ipath_reg_user_mr - register a userspace memory region * @pd: protection domain for this memory region - * @region: the user memory region + * @start: starting userspace address + * @length: length of region to register + * @virt_addr: virtual address to use (from HCA's point of view) * @mr_access_flags: access flags for this memory region * @udata: unused by the InfiniPath driver * * Returns the memory region on success, otherwise returns an errno. 
*/ -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, struct ib_udata *udata) +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, + struct ib_udata *udata) { struct ipath_mr *mr; + struct ib_umem *umem; struct ib_umem_chunk *chunk; int n, m, i; struct ib_mr *ret; - if (region->length == 0) { + if (length == 0) { ret = ERR_PTR(-EINVAL); goto bail; } + umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags); + if (IS_ERR(umem)) + return (void *) umem; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &umem->chunk_list, list) n += chunk->nents; mr = alloc_mr(n, &to_idev(pd->device)->lk_table); if (!mr) { ret = ERR_PTR(-ENOMEM); + ib_umem_release(umem); goto bail; } mr->mr.pd = pd; - mr->mr.user_base = region->user_base; - mr->mr.iova = region->virt_base; - mr->mr.length = region->length; - mr->mr.offset = region->offset; + mr->mr.user_base = start; + mr->mr.iova = virt_addr; + mr->mr.length = length; + mr->mr.offset = umem->offset; mr->mr.access_flags = mr_access_flags; mr->mr.max_segs = n; + mr->umem = umem; m = 0; n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &umem->chunk_list, list) { for (i = 0; i < chunk->nmap; i++) { mr->mr.map[m]->segs[n].vaddr = page_address(chunk->page_list[i].page); - mr->mr.map[m]->segs[n].length = region->page_size; + mr->mr.map[m]->segs[n].length = umem->page_size; n++; if (n == IPATH_SEGSZ) { m++; @@ -247,6 +259,10 @@ int ipath_dereg_mr(struct ib_mr *ibmr) i--; kfree(mr->mr.map[i]); } + + if (mr->umem) + ib_umem_release(mr->umem); + kfree(mr); return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index c0c8d5b..8f7af7a 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -248,6 +248,7 @@ struct ipath_sge { /* Memory region
*/ struct ipath_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct ipath_mregion mr; /* must be last */ }; @@ -726,8 +727,8 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, u64 *iova_start); -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int ipath_dereg_mr(struct ib_mr *ibmr); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 0725ad7..cd5eb60 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -37,6 +37,7 @@ */ #include +#include #include #include @@ -907,6 +908,8 @@ static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) return ERR_PTR(err); } + mr->umem = NULL; + return &mr->ibmr; } @@ -1002,11 +1005,13 @@ static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, } kfree(page_list); + mr->umem = NULL; + return &mr->ibmr; } -static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; @@ -1017,20 +1022,26 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, int err = 0; int write_mtt_size; - shift = ffs(region->page_size) - 1; - mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) return ERR_PTR(-ENOMEM); + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mr->umem)) { + err = PTR_ERR(mr->umem); + goto err; + } + + shift = ffs(mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) n +=
chunk->nents; mr->mtt = mthca_alloc_mtt(dev, n); if (IS_ERR(mr->mtt)) { err = PTR_ERR(mr->mtt); - goto err; + goto err_umem; } pages = (u64 *) __get_free_page(GFP_KERNEL); @@ -1043,12 +1054,12 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages)); - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - region->page_size * k; + mr->umem->page_size * k; /* * Be friendly to write_mtt and pass it chunks * of appropriate size. @@ -1070,8 +1081,8 @@ mtt_done: if (err) goto err_mtt; - err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, region->virt_base, - region->length, convert_access(acc), mr); + err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, virt, length, + convert_access(acc), mr); if (err) goto err_mtt; @@ -1081,6 +1092,9 @@ mtt_done: err_mtt: mthca_free_mtt(dev, mr->mtt); +err_umem: + ib_umem_release(mr->umem); + err: kfree(mr); return ERR_PTR(err); @@ -1089,8 +1103,12 @@ err: static int mthca_dereg_mr(struct ib_mr *mr) { struct mthca_mr *mmr = to_mmr(mr); + mthca_free_mr(to_mdev(mr->device), mmr); + if (mmr->umem) + ib_umem_release(mmr->umem); kfree(mmr); + return 0; } diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 1d266ac..262616c 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -73,6 +73,7 @@ struct mthca_mtt; struct mthca_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct mthca_mtt *mtt; }; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h new file mode 100644 index 0000000..06307f7 --- /dev/null +++ b/include/rdma/ib_umem.h @@ -0,0 +1,78 @@ +/* + * Copyright (c) 2007 Cisco Systems.
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef IB_UMEM_H +#define IB_UMEM_H + +#include +#include + +struct ib_ucontext; + +struct ib_umem { + struct ib_ucontext *context; + size_t length; + int offset; + int page_size; + int writable; + struct list_head chunk_list; +}; + +struct ib_umem_chunk { + struct list_head list; + int nents; + int nmap; + struct scatterlist page_list[0]; +}; + +#ifdef CONFIG_INFINIBAND_USER_MEM + +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access); +void ib_umem_release(struct ib_umem *umem); +int ib_umem_page_count(struct ib_umem *umem); + +#else /* CONFIG_INFINIBAND_USER_MEM */ + +#include + +static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context, + unsigned long addr, size_t size, + int access) { + return ERR_PTR(-EINVAL); +} +static inline void ib_umem_release(struct ib_umem *umem) { } +static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; } + +#endif /* CONFIG_INFINIBAND_USER_MEM */ + +#endif /* IB_UMEM_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..b910baa 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -5,7 +5,7 @@ * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -708,6 +708,7 @@ struct ib_ucontext { struct list_head qp_list; struct list_head srq_list; struct list_head ah_list; + int closing; }; struct ib_uobject { @@ -721,23 +722,6 @@ struct ib_uobject { int live; }; -struct ib_umem { - unsigned long user_base; - unsigned long virt_base; - size_t length; - int offset; - int page_size; - int writable; - struct list_head chunk_list; -}; - -struct ib_umem_chunk { - struct list_head list; - int nents; - int nmap; - struct scatterlist page_list[0]; -}; - struct ib_udata { void __user *inbuf; void __user *outbuf; @@ -750,11 +734,6 @@ struct ib_udata { ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) -struct ib_umem_object { - struct ib_uobject uobject; - struct ib_umem umem; -}; - struct ib_pd { struct ib_device *device; struct ib_uobject *uobject; @@ -998,7 +977,8 @@ struct ib_device { int mr_access_flags, u64 *iova_start); struct ib_mr * (*reg_user_mr)(struct ib_pd *pd, - struct ib_umem *region, + u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int (*query_mr)(struct ib_mr *mr, -- 1.5.1 From rdreier at cisco.com Thu Apr 19 16:03:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Apr 2007 16:03:08 -0700 Subject: [ofa-general] [PATCH 2/2][RFC] Put rlimit accounting struct in struct ib_umem In-Reply-To: (Roland Dreier's message of "Tue, 17 Apr 2007 14:23:47 -0700") References: Message-ID: When memory pinned with ib_umem_get() is released, ib_umem_release() needs to subtract the amount of memory being unpinned from mm->locked_vm. However, ib_umem_release() may be called with mm->mmap_sem already held for writing if the memory is being released as part of an munmap() call, so it is sometimes necessary to defer this accounting into a workqueue. 
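
For reference, the amount subtracted from locked_vm is a page count, which this patch series computes as PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT. The following is only a minimal userspace sketch of that arithmetic, assuming 4 KiB pages; SKETCH_PAGE_SHIFT, SKETCH_PAGE_SIZE, sketch_page_align() and pinned_pages() are illustrative names and not kernel symbols.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel uses the architecture's
 * PAGE_SHIFT and PAGE_SIZE instead of these sketch constants. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Userspace stand-in for the kernel's PAGE_ALIGN():
 * round a byte count up to the next page boundary. */
size_t sketch_page_align(size_t bytes)
{
    return (bytes + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Hypothetical helper: number of pages covered by a region that
 * starts `offset` bytes into its first page and is `length` bytes
 * long. This mirrors the "diff" subtracted from locked_vm. */
size_t pinned_pages(size_t offset, size_t length)
{
    return sketch_page_align(length + offset) >> SKETCH_PAGE_SHIFT;
}
```

Under these assumptions, a page-aligned 4096-byte region covers one page, while the same length starting one byte into a page covers two, which is why the offset has to be part of the calculation.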
However, the work struct used to defer this accounting is dynamically allocated before it is queued, so there is the possibility of failing that allocation. If the allocation fails, then ib_umem_release has no choice except to bail out and leave the process with a permanently elevated locked_vm. Fix this by allocating the structure to defer accounting as part of the original struct ib_umem, so there's no possibility of failing a later allocation if creating the struct ib_umem and pinning memory succeeds. Signed-off-by: Roland Dreier --- And this addresses MST's other comment on the previous version of the patch. drivers/infiniband/core/umem.c | 41 ++++++++++++--------------------------- include/rdma/ib_umem.h | 3 ++ 2 files changed, 16 insertions(+), 28 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 48e854c..f32ca5f 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -39,13 +39,6 @@ #include "uverbs.h" -struct ib_umem_account_work { - struct work_struct work; - struct mm_struct *mm; - unsigned long diff; -}; - - static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { struct ib_umem_chunk *chunk, *tmp; @@ -192,16 +185,15 @@ out: } EXPORT_SYMBOL(ib_umem_get); -static void ib_umem_account(struct work_struct *_work) +static void ib_umem_account(struct work_struct *work) { - struct ib_umem_account_work *work = - container_of(_work, struct ib_umem_account_work, work); - - down_write(&work->mm->mmap_sem); - work->mm->locked_vm -= work->diff; - up_write(&work->mm->mmap_sem); - mmput(work->mm); - kfree(work); + struct ib_umem *umem = container_of(work, struct ib_umem, work); + + down_write(&umem->mm->mmap_sem); + umem->mm->locked_vm -= umem->diff; + up_write(&umem->mm->mmap_sem); + mmput(umem->mm); + kfree(umem); } /** @@ -210,7 +202,6 @@ static void ib_umem_account(struct work_struct *_work) */ void ib_umem_release(struct ib_umem *umem) { - struct 
ib_umem_account_work *work; struct ib_ucontext *context = umem->context; struct mm_struct *mm; unsigned long diff; @@ -222,7 +213,6 @@ void ib_umem_release(struct ib_umem *umem) return; diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; - kfree(umem); /* * We may be called with the mm's mmap_sem already held. This @@ -233,17 +223,11 @@ void ib_umem_release(struct ib_umem *umem) * we defer the vm_locked accounting to the system workqueue. */ if (context->closing && !down_write_trylock(&mm->mmap_sem)) { - work = kmalloc(sizeof *work, GFP_KERNEL); - if (!work) { - mmput(mm); - return; - } + INIT_WORK(&umem->work, ib_umem_account); + umem->mm = mm; + umem->diff = diff; - INIT_WORK(&work->work, ib_umem_account); - work->mm = mm; - work->diff = diff; - - schedule_work(&work->work); + schedule_work(&umem->work); return; } else down_write(&mm->mmap_sem); @@ -251,6 +235,7 @@ void ib_umem_release(struct ib_umem *umem) current->mm->locked_vm -= diff; up_write(&mm->mmap_sem); mmput(mm); + kfree(umem); } EXPORT_SYMBOL(ib_umem_release); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 06307f7..b3a36f7 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -45,6 +45,9 @@ struct ib_umem { int page_size; int writable; struct list_head chunk_list; + struct work_struct work; + struct mm_struct *mm; + unsigned long diff; }; struct ib_umem_chunk { -- 1.5.1 From sean.hefty at intel.com Thu Apr 19 16:59:14 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Apr 2007 16:59:14 -0700 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache Message-ID: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> The following set of patches adds a local SA path record cache to the IB stack (currently based on 2.6.20-rc4). 
The cache is derived from the OFED 1.2 local SA cache patches, with changes based on the last round of feedback and current Path Forward feature requests:

* InformInfo/Notice support added to ib_sa
  Clients may now register to receive SA related events. The local_sa uses this to receive notification of GID up/down events in order to keep the cache up to date.

* Removal of time based cache updates
  Cache updates are now driven by local and SA events.

Most module parameters have been eliminated, and remaining options are exposed through a file system interface for dynamic control, including the ability to force a cache refresh.

Using a local SA cache we were able to establish all-to-all connections between 1024 processes (about 1 million connections) in about 3 seconds. Without the cache, connection time took about a minute, and required a substantial amount of tuning of timeout values to achieve this.

I've only updated the rdma_cm to use the cache, but similar changes could be made to SRP and ipoib (which implements its own path record cache).

I would like to get feedback on both the notice and local_sa patches for inclusion in 2.6.22 or 2.6.23 (if 2.6.22 is not possible).

Signed-off-by: Sean Hefty

From sean.hefty at intel.com Thu Apr 19 17:03:02 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 19 Apr 2007 17:03:02 -0700
Subject: [ofa-general] [RFC] [PATCH 1/3] 2.6.22 or 23 ib/sa: add registration for sa events
In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com>
Message-ID: <000101c782df$43b2fcf0$07fd070a@amr.corp.intel.com>

IB/sa: Add InformInfo/Notice support.

From: Sean Hefty

Add SA client support for notice/trap registration using InformInfo. Clients can use the ib_sa interface to register for SA events based on trap numbers, and receive SA event notification. This allows clients to receive notification, such as GID in/out of service.
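As a rough illustration of the registration model described above (not the patch's implementation), the following user-space C sketch keeps a small table of subscribers keyed by trap number and delivers a notice to each matching callback. All names here (`register_trap()`, `dispatch_notice()`, `count_events()`, the fixed-size table) are hypothetical. The trap numbers 64 and 65 are assumed to be the GID in-service and out-of-service traps. A nonzero callback return drops the subscription, mirroring how the patch unregisters a member whose callback returns nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SUBSCRIBERS 8
/* Assumed trap numbers for the GID in/out of service events. */
#define TRAP_GID_IN_SERVICE     64
#define TRAP_GID_OUT_OF_SERVICE 65

/* Callback type: status is 0 for a delivered notice. */
typedef int (*notice_cb)(int status, uint16_t trap_number, void *context);

struct subscriber {
	uint16_t trap_number;
	notice_cb callback;
	void *context;
	int in_use;
};

static struct subscriber table[MAX_SUBSCRIBERS];

/* Register for one trap number; returns a slot index, or -1 if full. */
static int register_trap(uint16_t trap_number, notice_cb cb, void *context)
{
	assert(cb != NULL);

	for (int i = 0; i < MAX_SUBSCRIBERS; i++) {
		if (!table[i].in_use) {
			table[i].trap_number = trap_number;
			table[i].callback = cb;
			table[i].context = context;
			table[i].in_use = 1;
			return i;
		}
	}
	return -1;
}

static void unregister_trap(int slot)
{
	table[slot].in_use = 0;
}

/*
 * Deliver a notice to every subscriber of trap_number and return how
 * many callbacks ran. A nonzero callback return removes the subscriber.
 */
static int dispatch_notice(uint16_t trap_number)
{
	int delivered = 0;

	for (int i = 0; i < MAX_SUBSCRIBERS; i++) {
		if (!table[i].in_use || table[i].trap_number != trap_number)
			continue;
		if (table[i].callback(0, trap_number, table[i].context))
			table[i].in_use = 0;
		delivered++;
	}
	return delivered;
}

/* Example callback: counts events in the int that context points to. */
static int count_events(int status, uint16_t trap_number, void *context)
{
	(void)status;
	(void)trap_number;
	++*(int *)context;
	return 0;
}
```

A cache built on such an interface would register one callback per trap of interest and update its entries from inside the callback.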
Signed-off-by: Sean Hefty --- drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/notice.c | 749 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/core/sa.h | 16 + drivers/infiniband/core/sa_query.c | 316 +++++++++++++++ include/rdma/ib_sa.h | 170 ++++++++ 5 files changed, 1250 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 189e5d4..2e9c4b2 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -12,7 +12,7 @@ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o +ib_sa-y := sa_query.o multicast.o notice.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c new file mode 100644 index 0000000..e4c73c8 --- /dev/null +++ b/drivers/infiniband/core/notice.c @@ -0,0 +1,749 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void inform_add_one(struct ib_device *device); +static void inform_remove_one(struct ib_device *device); + +static struct ib_client inform_client = { + .name = "ib_notice", + .add = inform_add_one, + .remove = inform_remove_one +}; + +static struct ib_sa_client sa_client; +static struct workqueue_struct *inform_wq; + +struct inform_device; + +struct inform_port { + struct inform_device *dev; + spinlock_t lock; + struct rb_root table; + atomic_t refcount; + struct completion comp; + u8 port_num; +}; + +struct inform_device { + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int end_port; + struct inform_port port[0]; +}; + +enum inform_state { + INFORM_IDLE, + INFORM_REGISTERING, + INFORM_MEMBER, + INFORM_BUSY, + INFORM_ERROR +}; + +struct inform_member; + +struct inform_group { + u16 trap_number; + struct rb_node node; + struct inform_port *port; + spinlock_t lock; + struct work_struct work; + struct list_head pending_list; + struct list_head active_list; + struct list_head notice_list; + struct inform_member *last_join; + int members; + enum inform_state join_state; /* State relative to SA */ + atomic_t refcount; + enum inform_state state; + struct ib_sa_query *query; + int query_id; +}; + +struct 
inform_member { + struct ib_inform_info info; + struct ib_sa_client *client; + struct inform_group *group; + struct list_head list; + enum inform_state state; + atomic_t refcount; + struct completion comp; +}; + +struct inform_notice { + struct list_head list; + struct ib_sa_notice notice; +}; + +static void reg_handler(int status, struct ib_sa_inform *inform, + void *context); +static void unreg_handler(int status, struct ib_sa_inform *inform, + void *context); + +static struct inform_group *inform_find(struct inform_port *port, + u16 trap_number) +{ + struct rb_node *node = port->table.rb_node; + struct inform_group *group; + + while (node) { + group = rb_entry(node, struct inform_group, node); + if (trap_number < group->trap_number) + node = node->rb_left; + else if (trap_number > group->trap_number) + node = node->rb_right; + else + return group; + } + return NULL; +} + +static struct inform_group *inform_insert(struct inform_port *port, + struct inform_group *group) +{ + struct rb_node **link = &port->table.rb_node; + struct rb_node *parent = NULL; + struct inform_group *cur_group; + + while (*link) { + parent = *link; + cur_group = rb_entry(parent, struct inform_group, node); + if (group->trap_number < cur_group->trap_number) + link = &(*link)->rb_left; + else if (group->trap_number > cur_group->trap_number) + link = &(*link)->rb_right; + else + return cur_group; + } + rb_link_node(&group->node, parent, link); + rb_insert_color(&group->node, &port->table); + return NULL; +} + +static void deref_port(struct inform_port *port) +{ + if (atomic_dec_and_test(&port->refcount)) + complete(&port->comp); +} + +static void release_group(struct inform_group *group) +{ + struct inform_port *port = group->port; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + if (atomic_dec_and_test(&group->refcount)) { + rb_erase(&group->node, &port->table); + spin_unlock_irqrestore(&port->lock, flags); + kfree(group); + deref_port(port); + } else + 
spin_unlock_irqrestore(&port->lock, flags); +} + +static void deref_member(struct inform_member *member) +{ + if (atomic_dec_and_test(&member->refcount)) + complete(&member->comp); +} + +static void queue_reg(struct inform_member *member) +{ + struct inform_group *group = member->group; + unsigned long flags; + + spin_lock_irqsave(&group->lock, flags); + list_add(&member->list, &group->pending_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + spin_unlock_irqrestore(&group->lock, flags); +} + +static int send_reg(struct inform_group *group, struct inform_member *member) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.subscribe = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number); + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + group->last_join = member; + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + port->port_num, &inform, 3000, GFP_KERNEL, + reg_handler, group,&group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static int send_unreg(struct inform_group *group) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(group->trap_number); + inform.trap.generic.qpn = IB_QP1; + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + 
port->port_num, &inform, 3000, GFP_KERNEL, + unreg_handler, group, &group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static void join_group(struct inform_group *group, struct inform_member *member) +{ + member->state = INFORM_MEMBER; + group->members++; + list_move(&member->list, &group->active_list); +} + +static int fail_join(struct inform_group *group, struct inform_member *member, + int status) +{ + spin_lock_irq(&group->lock); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + return member->info.callback(status, &member->info, NULL); +} + +static void process_group_error(struct inform_group *group) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct inform_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + group->members--; + member->state = INFORM_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->info.callback(-ENETRESET, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + group->join_state = INFORM_IDLE; + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); +} + +/* + * Report a notice to all active subscribers. We use a temporary list to + * handle unsubscription requests while the notice is being reported, which + * avoids holding the group lock while in the user's callback. 
+ */ +static void process_notice(struct inform_group *group, + struct inform_notice *info_notice) +{ + struct inform_member *member; + struct list_head list; + int ret; + + INIT_LIST_HEAD(&list); + + spin_lock_irq(&group->lock); + list_splice_init(&group->active_list, &list); + while (!list_empty(&list)) { + + member = list_entry(list.next, struct inform_member, list); + atomic_inc(&member->refcount); + list_move(&member->list, &group->active_list); + spin_unlock_irq(&group->lock); + + ret = member->info.callback(0, &member->info, + &info_notice->notice); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + spin_unlock_irq(&group->lock); +} + +static void inform_work_handler(struct work_struct *work) +{ + struct inform_group *group; + struct inform_member *member; + struct ib_inform_info *info; + struct inform_notice *info_notice; + int status, ret; + + group = container_of(work, typeof(*group), work); +retest: + spin_lock_irq(&group->lock); + while (!list_empty(&group->pending_list) || + !list_empty(&group->notice_list) || + (group->state == INFORM_ERROR)) { + + if (group->state == INFORM_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + + if (!list_empty(&group->notice_list)) { + info_notice = list_entry(group->notice_list.next, + struct inform_notice, list); + list_del(&info_notice->list); + spin_unlock_irq(&group->lock); + process_notice(group, info_notice); + kfree(info_notice); + goto retest; + } + + member = list_entry(group->pending_list.next, + struct inform_member, list); + info = &member->info; + atomic_inc(&member->refcount); + + if (group->join_state == INFORM_MEMBER) { + join_group(group, member); + spin_unlock_irq(&group->lock); + ret = info->callback(0, info, NULL); + } else { + spin_unlock_irq(&group->lock); + status = send_reg(group, member); + if (!status) { + deref_member(member); + return; + } + ret = fail_join(group, member, status); + 
} + + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + if (!group->members && (group->join_state == INFORM_MEMBER)) { + group->join_state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + if (send_unreg(group)) + goto retest; + } else { + group->state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + release_group(group); + } +} + +/* + * Fail a join request if it is still active - at the head of the pending queue. + */ +static void process_join_error(struct inform_group *group, int status) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + member = list_entry(group->pending_list.next, + struct inform_member, list); + if (group->last_join == member) { + atomic_inc(&member->refcount); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = member->info.callback(status, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + } else + spin_unlock_irq(&group->lock); +} + +static void reg_handler(int status, struct ib_sa_inform *inform, void *context) +{ + struct inform_group *group = context; + + if (status) + process_join_error(group, status); + else + group->join_state = INFORM_MEMBER; + + inform_work_handler(&group->work); +} + +static void unreg_handler(int status, struct ib_sa_inform *rec, void *context) +{ + struct inform_group *group = context; + + inform_work_handler(&group->work); +} + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice) +{ + struct inform_device *dev; + struct inform_port *port; + struct inform_group *group; + struct inform_notice *info_notice; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return 0; /* No one to give notice to. */ + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irq(&port->lock); + group = inform_find(port, __be16_to_cpu(notice->trap. 
+ generic.trap_num)); + if (!group) { + spin_unlock_irq(&port->lock); + return 0; + } + + atomic_inc(&group->refcount); + spin_unlock_irq(&port->lock); + + info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL); + if (!info_notice) { + release_group(group); + return -ENOMEM; + } + + info_notice->notice = *notice; + + spin_lock_irq(&group->lock); + list_add(&info_notice->list, &group->notice_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + inform_work_handler(&group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + return 0; +} + +static struct inform_group *acquire_group(struct inform_port *port, + u16 trap_number, gfp_t gfp_mask) +{ + struct inform_group *group, *cur_group; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + group = inform_find(port, trap_number); + if (group) + goto found; + spin_unlock_irqrestore(&port->lock, flags); + + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + + group->port = port; + group->trap_number = trap_number; + INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); + INIT_LIST_HEAD(&group->notice_list); + INIT_WORK(&group->work, inform_work_handler); + spin_lock_init(&group->lock); + + spin_lock_irqsave(&port->lock, flags); + cur_group = inform_insert(port, group); + if (cur_group) { + kfree(group); + group = cur_group; + } else + atomic_inc(&port->refcount); +found: + atomic_inc(&group->refcount); + spin_unlock_irqrestore(&port->lock, flags); + return group; +} + +/* + * We serialize all join requests to a single group to make our lives much + * easier. Otherwise, two users could try to join the same group + * simultaneously, with different configurations, one could leave while the + * join is in progress, etc., which makes locking around error recovery + * difficult. 
+ */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context) +{ + struct inform_device *dev; + struct inform_member *member; + struct ib_inform_info *info; + int ret; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return ERR_PTR(-ENODEV); + + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); + + ib_sa_client_get(client); + member->client = client; + member->info.trap_number = trap_number; + member->info.callback = callback; + member->info.context = context; + init_completion(&member->comp); + atomic_set(&member->refcount, 1); + member->state = INFORM_REGISTERING; + + member->group = acquire_group(&dev->port[port_num - dev->start_port], + trap_number, gfp_mask); + if (!member->group) { + ret = -ENOMEM; + goto err; + } + + /* + * The user will get the info structure in their callback. They + * could then free the info structure before we can return from + * this routine. So we save the pointer to return before queuing + * any callback. 
+ */ + info = &member->info; + queue_reg(member); + return info; + +err: + ib_sa_client_put(member->client); + kfree(member); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_sa_register_inform_info); + +void ib_sa_unregister_inform_info(struct ib_inform_info *info) +{ + struct inform_member *member; + struct inform_group *group; + + member = container_of(info, struct inform_member, info); + group = member->group; + + spin_lock_irq(&group->lock); + if (member->state == INFORM_MEMBER) + group->members--; + + list_del_init(&member->list); + + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + /* Continue to hold reference on group until callback */ + queue_work(inform_wq, &group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + deref_member(member); + wait_for_completion(&member->comp); + ib_sa_client_put(member->client); + kfree(member); +} +EXPORT_SYMBOL(ib_sa_unregister_inform_info); + +static void inform_groups_lost(struct inform_port *port) +{ + struct inform_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct inform_group, node); + spin_lock(&group->lock); + if (group->state == INFORM_IDLE) { + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + group->state = INFORM_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void inform_event_handler(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct inform_device *dev; + + dev = container_of(handler, struct inform_device, event_handler); + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + inform_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } +} + +static 
void inform_add_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + port->dev = dev; + port->port_num = dev->start_port + i; + spin_lock_init(&port->lock); + port->table = RB_ROOT; + init_completion(&port->comp); + atomic_set(&port->refcount, 1); + } + + dev->device = device; + ib_set_client_data(device, &inform_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, inform_event_handler); + ib_register_event_handler(&dev->event_handler); +} + +static void inform_remove_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(inform_wq); + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + deref_port(port); + wait_for_completion(&port->comp); + } + + kfree(dev); +} + +int notice_init(void) +{ + int ret; + + inform_wq = create_singlethread_workqueue("ib_inform"); + if (!inform_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + + ret = ib_register_client(&inform_client); + if (ret) + goto err; + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); + return ret; +} + +void notice_cleanup(void) +{ + ib_unregister_client(&inform_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); +} diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index 24c93fd..b8eac66 
100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -63,4 +63,20 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, int mcast_init(void); void mcast_cleanup(void); +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice); + +int notice_init(void); +void notice_cleanup(void); + #endif /* SA_H */ diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 68db633..8de4ad8 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -61,10 +61,12 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; + struct ib_mad_agent *notice_agent; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; u8 port_num; + struct ib_device *device; }; struct ib_sa_device { @@ -101,6 +103,12 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; +struct ib_sa_inform_query { + void (*callback)(int, struct ib_sa_inform *, void *); + void *context; + struct ib_sa_query sa_query; +}; + static void ib_sa_add_one(struct ib_device *device); static void ib_sa_remove_one(struct ib_device *device); @@ -352,6 +360,110 @@ static const struct ib_field service_rec_table[] = { .size_bits = 2*64 }, }; +#define INFORM_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_inform, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_inform *) 0)->field, \ + .field_name = "sa_inform:" #field + +static const struct ib_field inform_table[] = { + { INFORM_FIELD(gid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { INFORM_FIELD(lid_range_begin), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 16 }, + { 
INFORM_FIELD(lid_range_end), + .offset_words = 4, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 5, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(is_generic), + .offset_words = 5, + .offset_bits = 16, + .size_bits = 8 }, + { INFORM_FIELD(subscribe), + .offset_words = 5, + .offset_bits = 24, + .size_bits = 8 }, + { INFORM_FIELD(type), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.trap_num), + .offset_words = 6, + .offset_bits = 16, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.qpn), + .offset_words = 7, + .offset_bits = 0, + .size_bits = 24 }, + { RESERVED, + .offset_words = 7, + .offset_bits = 24, + .size_bits = 3 }, + { INFORM_FIELD(trap.generic.resp_time), + .offset_words = 7, + .offset_bits = 27, + .size_bits = 5 }, + { RESERVED, + .offset_words = 8, + .offset_bits = 0, + .size_bits = 8 }, + { INFORM_FIELD(trap.generic.producer_type), + .offset_words = 8, + .offset_bits = 8, + .size_bits = 24 }, +}; + +#define NOTICE_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_notice, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_notice *) 0)->field, \ + .field_name = "sa_notice:" #field + +static const struct ib_field notice_table[] = { + { NOTICE_FIELD(is_generic), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(type), + .offset_words = 0, + .offset_bits = 1, + .size_bits = 7 }, + { NOTICE_FIELD(trap.generic.producer_type), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 24 }, + { NOTICE_FIELD(trap.generic.trap_num), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { NOTICE_FIELD(issuer_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { NOTICE_FIELD(notice_toggle), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(notice_count), + .offset_words = 2, + .offset_bits = 1, + .size_bits = 15 }, + { NOTICE_FIELD(data_details), + .offset_words = 2, + .offset_bits = 
16, + .size_bits = 432 }, + { NOTICE_FIELD(issuer_gid), + .offset_words = 16, + .offset_bits = 0, + .size_bits = 128 }, +}; + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -892,6 +1004,153 @@ err1: return ret; } +static void ib_sa_inform_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_inform_query *query = + container_of(sa_query, struct ib_sa_inform_query, sa_query); + + if (mad) { + struct ib_sa_inform rec; + + ib_unpack(inform_table, ARRAY_SIZE(inform_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_inform_release(struct ib_sa_query *sa_query) +{ + kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query)); +} + +/** + * ib_sa_informinfo_query - Start an InformInfo registration. + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Inform record to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when notice handler registration completes, + * times out or is canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * This function sends inform info to register with SA to receive + * in-service notice. + * The callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_informinfo_query() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query.
+ */ +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_inform_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + struct ib_sa_mad *mad; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; + } + + ib_sa_client_get(client); + query->sa_query.client = client; + query->callback = callback; + query->context = context; + + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); + + query->sa_query.callback = callback ? 
ib_sa_inform_callback : NULL; + query->sa_query.release = ib_sa_inform_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_SET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_INFORM_INFO); + + ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); + if (ret < 0) + goto err2; + + return ret; + +err2: + *sa_query = NULL; + ib_sa_client_put(query->sa_query.client); + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kfree(query); + return ret; +} + +static void ib_sa_notice_resp(struct ib_sa_port *port, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_send_buf *mad_buf; + struct ib_sa_mad *mad; + int ret; + + mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0, + IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + GFP_KERNEL); + if (IS_ERR(mad_buf)) + return; + + mad = mad_buf->mad; + memcpy(mad, mad_recv_wc->recv_buf.mad, sizeof *mad); + mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP; + + spin_lock_irq(&port->ah_lock); + kref_get(&port->sm_ah->ref); + mad_buf->context[0] = &port->sm_ah->ref; + mad_buf->ah = port->sm_ah->ah; + spin_unlock_irq(&port->ah_lock); + + ret = ib_post_send_mad(mad_buf, NULL); + if (ret) + goto err; + + return; +err: + kref_put(mad_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_buf); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -946,9 +1205,36 @@ static void recv_handler(struct ib_mad_agent *mad_agent, ib_free_recv_mad(mad_recv_wc); } +static void notice_resp_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + kref_put(mad_send_wc->send_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_send_wc->send_buf); +} + +static void notice_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_notice notice; + + port = mad_agent->context; + mad = 
(struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice); + + if (!notice_dispatch(port->device, port->port_num, &notice)) + ib_sa_notice_resp(port, mad_recv_wc); + ib_free_recv_mad(mad_recv_wc); +} + static void ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; + struct ib_mad_reg_req reg_req = { + .mgmt_class = IB_MGMT_CLASS_SUBN_ADM, + .mgmt_class_version = 2 + }; int s, e, i; if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) @@ -982,6 +1268,16 @@ static void ib_sa_add_one(struct ib_device *device) if (IS_ERR(sa_dev->port[i].agent)) goto err; + sa_dev->port[i].device = device; + set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask); + sa_dev->port[i].notice_agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + &reg_req, 0, notice_resp_handler, + notice_handler, &sa_dev->port[i]); + + if (IS_ERR(sa_dev->port[i].notice_agent)) + goto err; + INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah); } @@ -1004,8 +1300,14 @@ static void ib_sa_add_one(struct ib_device *device) return; err: - while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + while (--i >= 0) { + if (!IS_ERR(sa_dev->port[i].notice_agent)) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); + } + if (!IS_ERR(sa_dev->port[i].agent)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + } kfree(sa_dev); @@ -1025,6 +1327,7 @@ static void ib_sa_remove_one(struct ib_device *device) flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); } @@ -1053,7 +1356,15 @@ static int __init ib_sa_init(void) goto err2; } + ret = notice_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize notice handling\n"); + goto err3; + } + return 0; +err3: + mcast_cleanup(); err2: ib_unregister_client(&sa_client); err1: 
@@ -1063,6 +1374,7 @@ err1: static void __exit ib_sa_cleanup(void) { mcast_cleanup(); + notice_cleanup(); ib_unregister_client(&sa_client); idr_destroy(&query_idr); } diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 5e26b2f..46b52fd 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -254,6 +254,126 @@ struct ib_sa_service_rec { u64 data64[2]; }; +enum { + IB_SA_EVENT_TYPE_FATAL = 0x0, + IB_SA_EVENT_TYPE_URGENT = 0x1, + IB_SA_EVENT_TYPE_SECURITY = 0x2, + IB_SA_EVENT_TYPE_SM = 0x3, + IB_SA_EVENT_TYPE_INFO = 0x4, + IB_SA_EVENT_TYPE_EMPTY = 0x7F, + IB_SA_EVENT_TYPE_ALL = 0xFFFF +}; + +enum { + IB_SA_EVENT_PRODUCER_TYPE_CA = 0x1, + IB_SA_EVENT_PRODUCER_TYPE_SWITCH = 0x2, + IB_SA_EVENT_PRODUCER_TYPE_ROUTER = 0x3, + IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER = 0x4, + IB_SA_EVENT_PRODUCER_TYPE_ALL = 0xFFFFFF +}; + +enum { + IB_SA_SM_TRAP_GID_IN_SERVICE = 64, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65, + IB_SA_SM_TRAP_CREATE_MC_GROUP = 66, + IB_SA_SM_TRAP_DELETE_MC_GROUP = 67, + IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128, + IB_SA_SM_TRAP_LINK_INTEGRITY = 129, + IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130, + IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131, + IB_SA_SM_TRAP_BAD_M_KEY = 256, + IB_SA_SM_TRAP_BAD_P_KEY = 257, + IB_SA_SM_TRAP_BAD_Q_KEY = 258, + IB_SA_SM_TRAP_ALL = 0xFFFF +}; + +struct ib_sa_inform { + union ib_gid gid; + __be16 lid_range_begin; + __be16 lid_range_end; + u8 is_generic; + u8 subscribe; + __be16 type; + union { + struct { + __be16 trap_num; + __be32 qpn; + u8 resp_time; + __be32 producer_type; + } generic; + struct { + __be16 device_id; + __be32 qpn; + u8 resp_time; + __be32 vendor_id; + } vendor; + } trap; +}; + +struct ib_sa_notice { + u8 is_generic; + u8 type; + union { + struct { + __be32 producer_type; + __be16 trap_num; + } generic; + struct { + __be32 vendor_id; + __be16 device_id; + } vendor; + } trap; + __be16 issuer_lid; + __be16 notice_count; + u8 notice_toggle; + /* + * Align data 16 bits off 64 bit field to match 
InformInfo definition. + * Data contained within this field will then align properly. + * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1. + */ + u8 reserved[5]; + u8 data_details[54]; + union ib_gid issuer_gid; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_GID_IN_SERVICE = 64 + * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65 + * IB_SA_SM_TRAP_CREATE_MC_GROUP = 66 + * IB_SA_SM_TRAP_DELETE_MC_GROUP = 67 + */ +struct ib_sa_notice_data_gid { + u8 reserved[6]; + u8 gid[16]; + u8 padding[32]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128 + */ +struct ib_sa_notice_data_port_change { + __be16 lid; + u8 padding[52]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_LINK_INTEGRITY = 129 + * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130 + * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131 + */ +struct ib_sa_notice_data_port_error { + u8 reserved[2]; + __be16 lid; + u8 port_num; + u8 padding[49]; +}; + struct ib_sa_client { atomic_t users; struct completion comp; @@ -382,4 +502,54 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +struct ib_inform_info { + void *context; + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice); + u16 trap_number; +}; + +/** + * ib_sa_register_inform_info - Registers to receive notice events. + * @device: Device associated with the registration. + * @port_num: Port on the specified device to associate with the registration. + * @trap_number: InformInfo trap number to register for. + * @gfp_mask: GFP mask for memory allocations. + * @callback: User callback invoked once the registration completes and to + * report noticed events. + * @context: User specified context stored with the ib_inform_reg structure. + * + * This call initiates a registration request with the SA for the specified + * trap number. 
If the operation is started successfully, it returns + * an ib_inform_info structure that is used to track the registration operation. + * Users must free this structure by calling ib_unregister_inform_info, + * even if the operation later fails. (The callback status is non-zero.) + * + * If the registration fails; status will be non-zero. If the registration + * succeeds, the callback status will be zero, but the notice parameter will + * be NULL. If the notice parameter is not NULL, a trap or notice is being + * reported to the user. + * + * A status of -ENETRESET indicates that an error occurred which requires + * reregisteration. + */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context); + +/** + * ib_sa_unregister_inform_info - Releases an InformInfo registration. + * @info: InformInfo registration tracking structure. + * + * This call blocks until the registration request is destroyed. It may + * not be called from within the registration callback. + */ +void ib_sa_unregister_inform_info(struct ib_inform_info *info); + #endif /* IB_SA_H */ From sean.hefty at intel.com Thu Apr 19 17:05:08 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Apr 2007 17:05:08 -0700 Subject: [ofa-general] [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> Message-ID: <000201c782df$8f002de0$07fd070a@amr.corp.intel.com> IB/sa: Add local SA path record caching. From: Sean Hefty Query and store path records locally to decrease path record query time and avoid SA flooding during the start-up of large clustered jobs. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/Makefile | 3 drivers/infiniband/core/local_sa.c | 1136 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/core/sa_query.c | 26 + include/rdma/ib_local_sa.h | 84 +++ include/rdma/ib_sa.h | 3 5 files changed, 1252 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 2e9c4b2..b2a6354 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -2,6 +2,7 @@ infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o user_access-$(CONFIG_INFINIBAND_ADDR_TRANS) := rdma_ucm.o obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ + ib_local_sa.o \ ib_cm.o iw_cm.o $(infiniband-y) obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ @@ -14,6 +15,8 @@ ib_mad-y := mad.o smi.o agent.o mad_rmpp.o ib_sa-y := sa_query.o multicast.o notice.o +ib_local_sa-y := local_sa.o + ib_cm-y := cm.o iw_cm-y := iwcm.o diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c new file mode 100644 index 0000000..1598be1 --- /dev/null +++ b/drivers/infiniband/core/local_sa.c @@ -0,0 +1,1136 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand subnet administration caching"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + SA_DB_MAX_PATHS_PER_DEST = 0x7F, + SA_DB_MIN_RETRY_TIMER = 4000, /* 4 sec */ + SA_DB_MAX_RETRY_TIMER = 256000 /* 256 sec */ +}; + +static unsigned long paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; +module_param(paths_per_dest, ulong, 0444); +MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve " + "to each destination (DGID). 
Set to 0 " + "to disable cache."); + +static unsigned long retry_timer = SA_DB_MIN_RETRY_TIMER; + +enum sa_db_lookup_method { + SA_DB_LOOKUP_LEAST_USED, + SA_DB_LOOKUP_RANDOM, + SA_DB_LOOKUP_MAX +}; + +static unsigned long lookup_method; + +static void sa_db_add_dev(struct ib_device *device); +static void sa_db_remove_dev(struct ib_device *device); + +static struct ib_client sa_db_client = { + .name = "local_sa", + .add = sa_db_add_dev, + .remove = sa_db_remove_dev +}; + +static struct miscdevice local_sa_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ib_local_sa", +}; + +static LIST_HEAD(dev_list); +static DECLARE_RWSEM(lock); +static struct workqueue_struct *sa_wq; +static struct ib_sa_client sa_client; + +enum sa_db_state { + SA_DB_IDLE, + SA_DB_REFRESH, + SA_DB_DESTROY +}; + +struct sa_db_port { + struct sa_db_device *dev; + struct ib_mad_agent *agent; + /* Limit number of outstanding MADs to SA to reduce SA flooding */ + struct ib_mad_send_buf *msg; + u16 sm_lid; + u8 sm_sl; + struct ib_inform_info *in_info; + struct ib_inform_info *out_info; + struct rb_root paths; + struct list_head update_list; + unsigned long update_id; + enum sa_db_state state; + struct work_struct work; + union ib_gid gid; + int port_num; +}; + +struct sa_db_device { + struct list_head list; + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int port_count; + struct sa_db_port port[0]; +}; + +struct ib_sa_iterator { + struct ib_sa_iterator *next; +}; + +struct ib_sa_attr_list { + struct ib_sa_iterator iter; + struct ib_sa_iterator *tail; + int update_id; + union ib_gid gid; + struct rb_node node; +}; + +/* maintain field order for ib_get_next_sa_attr() */ +struct ib_path_rec_info { + struct ib_sa_iterator iter; + struct ib_sa_path_rec rec; + unsigned long lookups; +}; + +struct ib_sa_iter { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + void *attr; + 
u8 attr_data[0]; +}; + +enum sa_update_type { + SA_UPDATE_FULL, + SA_UPDATE_ADD, + SA_UPDATE_REMOVE +}; + +struct update_info { + struct list_head list; + union ib_gid gid; + enum sa_update_type type; +}; + +static void process_updates(struct sa_db_port *port); + +static void free_attr_list(struct ib_sa_attr_list *attr_list) +{ + struct ib_sa_iterator *cur; + + for (cur = attr_list->iter.next; cur; cur = attr_list->iter.next) { + attr_list->iter.next = cur->next; + kfree(cur); + } + attr_list->tail = &attr_list->iter; +} + +static void remove_attr(struct rb_root *root, struct ib_sa_attr_list *attr_list) +{ + rb_erase(&attr_list->node, root); + free_attr_list(attr_list); + kfree(attr_list); +} + +static void remove_all_attrs(struct rb_root *root) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + remove_attr(root, attr_list); + } +} + +static void remove_old_attrs(struct rb_root *root, unsigned long update_id) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + if (attr_list->update_id != update_id) + remove_attr(root, attr_list); + } +} + +static struct ib_sa_attr_list *insert_attr_list(struct rb_root *root, + struct ib_sa_attr_list *attr_list) +{ + struct rb_node **link = &root->rb_node; + struct rb_node *parent = NULL; + struct ib_sa_attr_list *cur_attr_list; + int cmp; + + while (*link) { + parent = *link; + cur_attr_list = rb_entry(parent, struct ib_sa_attr_list, node); + cmp = memcmp(&cur_attr_list->gid, &attr_list->gid, + sizeof attr_list->gid); + if (cmp < 0) + link = &(*link)->rb_left; + else if (cmp > 0) + link = &(*link)->rb_right; + else + return cur_attr_list; + } + rb_link_node(&attr_list->node, parent, link); + 
rb_insert_color(&attr_list->node, root); + return NULL; +} + +static struct ib_sa_attr_list *find_attr_list(struct rb_root *root, u8 *gid) +{ + struct rb_node *node = root->rb_node; + struct ib_sa_attr_list *attr_list; + int cmp; + + while (node) { + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + cmp = memcmp(&attr_list->gid, gid, sizeof attr_list->gid); + if (cmp < 0) + node = node->rb_left; + else if (cmp > 0) + node = node->rb_right; + else + return attr_list; + } + return NULL; +} + +static int insert_attr(struct rb_root *root, unsigned long update_id, void *key, + struct ib_sa_iterator *iter) +{ + struct ib_sa_attr_list *attr_list; + void *err; + + attr_list = find_attr_list(root, key); + if (!attr_list) { + attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL); + if (!attr_list) + return -ENOMEM; + + attr_list->iter.next = NULL; + attr_list->tail = &attr_list->iter; + attr_list->update_id = update_id; + memcpy(attr_list->gid.raw, key, sizeof attr_list->gid); + + err = insert_attr_list(root, attr_list); + if (err) { + kfree(attr_list); + return PTR_ERR(err); + } + } else if (attr_list->update_id != update_id) { + free_attr_list(attr_list); + attr_list->update_id = update_id; + } + + attr_list->tail->next = iter; + iter->next = NULL; + attr_list->tail = iter; + return 0; +} + +static struct ib_sa_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_iter *iter; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_size, attr_offset; + + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = 64; /* path record length */ + if (attr_offset < attr_size) + return ERR_PTR(-EINVAL); + + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + + iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + iter->recv_wc = mad_recv_wc; + iter->recv_buf = &mad_recv_wc->recv_buf; + iter->attr_offset = attr_offset; + iter->attr_size = attr_size; + return 
iter; +} + +static void ib_sa_iter_free(struct ib_sa_iter *iter) +{ + kfree(iter); +} + +static void *ib_sa_iter_next(struct ib_sa_iter *iter) +{ + struct ib_sa_mad *mad; + int left, offset = 0; + + while (iter->data_left >= iter->attr_offset) { + while (iter->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) iter->recv_buf->mad; + + left = IB_MGMT_SA_DATA - iter->data_offset; + if (left < iter->attr_size) { + /* copy first piece of the attribute */ + iter->attr = &iter->attr_data; + memcpy(iter->attr, + &mad->data[iter->data_offset], left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(iter->attr + offset, &mad->data[0], + iter->attr_size - offset); + iter->data_offset = iter->attr_size - offset; + offset = 0; + } else { + iter->attr = &mad->data[iter->data_offset]; + iter->data_offset += iter->attr_size; + } + + iter->data_left -= iter->attr_offset; + goto out; + } + iter->data_offset = 0; + iter->recv_buf = list_entry(iter->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + iter->attr = NULL; +out: + return iter->attr; +} + +/* + * Copy path records from a received response and insert them into our cache. + * A path record in the MADs are in network order, packed, and may + * span multiple MAD buffers, just to make our life hard. 
+ */ +static void update_path_db(struct sa_db_port *port, + struct ib_mad_recv_wc *mad_recv_wc, + enum sa_update_type type) +{ + struct ib_sa_iter *iter; + struct ib_path_rec_info *path_info; + void *attr; + int ret; + + iter = ib_sa_iter_create(mad_recv_wc); + if (IS_ERR(iter)) + return; + + port->update_id += (type == SA_UPDATE_FULL); + + while ((attr = ib_sa_iter_next(iter)) && + (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) { + + ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC); + + down_write(&lock); + ret = insert_attr(&port->paths, port->update_id, + path_info->rec.dgid.raw, &path_info->iter); + up_write(&lock); + + if (ret) { + kfree(path_info); + break; + } + } + ib_sa_iter_free(iter); + + if (type == SA_UPDATE_FULL) { + down_write(&lock); + remove_old_attrs(&port->paths, port->update_id); + up_write(&lock); + } +} + +static struct ib_mad_send_buf *get_sa_msg(struct sa_db_port *port, + struct update_info *update) +{ + struct ib_ah_attr ah_attr; + struct ib_mad_send_buf *msg; + + msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, GFP_KERNEL); + if (IS_ERR(msg)) + return NULL; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port->sm_lid; + ah_attr.sl = port->sm_sl; + ah_attr.port_num = port->port_num; + + msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(msg->ah)) { + ib_free_send_mad(msg); + return NULL; + } + + msg->timeout_ms = retry_timer; + msg->retries = 0; + msg->context[0] = port; + msg->context[1] = update; + return msg; +} + +static __be64 form_tid(u32 hi_tid) +{ + static atomic_t tid; + return cpu_to_be64((((u64) hi_tid) << 32) | + ((u32) atomic_inc_return(&tid))); +} + +static void format_path_req(struct sa_db_port *port, + struct update_info *update, + struct ib_mad_send_buf *msg) +{ + struct ib_sa_mad *mad = msg->mad; + struct ib_sa_path_rec path_rec; + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + 
mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = IB_SA_METHOD_GET_TABLE; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->mad_hdr.tid = form_tid(msg->mad_agent->hi_tid); + + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH; + + path_rec.sgid = port->gid; + path_rec.numb_path = paths_per_dest; + + if (update->type == SA_UPDATE_ADD) { + mad->sa_hdr.comp_mask |= IB_SA_PATH_REC_DGID; + memcpy(&path_rec.dgid, &update->gid, sizeof path_rec.dgid); + } + + ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); +} + +static int send_query(struct sa_db_port *port, + struct update_info *update) +{ + int ret; + + port->msg = get_sa_msg(port, update); + if (!port->msg) + return -ENOMEM; + + format_path_req(port, update, port->msg); + + ret = ib_post_send_mad(port->msg, NULL); + if (ret) + goto err; + + return 0; + +err: + ib_destroy_ah(port->msg->ah); + ib_free_send_mad(port->msg); + return ret; +} + +static void add_update(struct sa_db_port *port, u8 *gid, + enum sa_update_type type) +{ + struct update_info *update; + + update = kmalloc(sizeof *update, GFP_KERNEL); + if (update) { + if (gid) + memcpy(&update->gid, gid, sizeof update->gid); + update->type = type; + list_add(&update->list, &port->update_list); + } + + if (port->state == SA_DB_IDLE) { + port->state = SA_DB_REFRESH; + process_updates(port); + } +} + +static void clean_update_list(struct sa_db_port *port) +{ + struct update_info *update; + + while (!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + list_del(&update->list); + kfree(update); + } +} + +static int notice_handler(int status, struct ib_inform_info *info, + struct ib_sa_notice *notice) +{ + struct sa_db_port *port = info->context; + struct ib_sa_notice_data_gid *gid_data; + struct ib_inform_info **pinfo; + enum sa_update_type type; + + if (info->trap_number == IB_SA_SM_TRAP_GID_IN_SERVICE) { + pinfo = &port->in_info; + type = 
SA_UPDATE_ADD; + } else { + pinfo = &port->out_info; + type = SA_UPDATE_REMOVE; + } + + down_write(&lock); + if (port->state == SA_DB_DESTROY) { + up_write(&lock); + return 0; + } + + if (notice) { + gid_data = (struct ib_sa_notice_data_gid *) + &notice->data_details; + add_update(port, gid_data->gid, type); + up_write(&lock); + } else if (status == -ENETRESET) { + *pinfo = NULL; + up_write(&lock); + } else { + if (status) + *pinfo = ERR_PTR(-EINVAL); + port->state = SA_DB_IDLE; + clean_update_list(port); + up_write(&lock); + queue_work(sa_wq, &port->work); + } + + return status; +} + +static int reg_in_info(struct sa_db_port *port) +{ + int ret = 0; + + port->in_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_IN_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->in_info)) + ret = PTR_ERR(port->in_info); + + return ret; +} + +static int reg_out_info(struct sa_db_port *port) +{ + int ret = 0; + + port->out_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->out_info)) + ret = PTR_ERR(port->out_info); + + return ret; +} + +static void cleanup_port(struct sa_db_port *port) +{ + if (port->in_info && !IS_ERR(port->in_info)) + ib_sa_unregister_inform_info(port->in_info); + + if (port->out_info && !IS_ERR(port->out_info)) + ib_sa_unregister_inform_info(port->out_info); + + port->out_info = NULL; + port->in_info = NULL; + + flush_workqueue(sa_wq); + + clean_update_list(port); + remove_all_attrs(&port->paths); +} + +static int update_port_info(struct sa_db_port *port) +{ + struct ib_port_attr port_attr; + int ret; + + ret = ib_query_port(port->dev->device, port->port_num, &port_attr); + if (ret) + return ret; + + if (port_attr.state != IB_PORT_ACTIVE) + return -ENODATA; + + port->sm_lid = port_attr.sm_lid; + port->sm_sl = port_attr.sm_sl; + return 0; +} + +static void 
process_updates(struct sa_db_port *port) +{ + struct update_info *update; + struct ib_sa_attr_list *attr_list; + int ret; + + if (!paths_per_dest || update_port_info(port)) { + cleanup_port(port); + goto out; + } + + /* Event registration is an optimization, so ignore failures. */ + if (!port->out_info) { + ret = reg_out_info(port); + if (!ret) + return; + } + + if (!port->in_info) { + ret = reg_in_info(port); + if (!ret) + return; + } + + while (!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + + if (update->type == SA_UPDATE_REMOVE) { + attr_list = find_attr_list(&port->paths, + update->gid.raw); + if (attr_list) + remove_attr(&port->paths, attr_list); + } else { + ret = send_query(port, update); + if (!ret) + return; + + } + list_del(&update->list); + kfree(update); + } +out: + port->state = SA_DB_IDLE; +} + +static void refresh_port_db(struct sa_db_port *port) +{ + if (port->state == SA_DB_DESTROY) + return; + + if (port->state == SA_DB_REFRESH) { + clean_update_list(port); + ib_cancel_mad(port->agent, port->msg); + } + + add_update(port, NULL, SA_UPDATE_FULL); +} + +static void refresh_dev_db(struct sa_db_device *dev) +{ + int i; + + for (i = 0; i < dev->port_count; i++) + refresh_port_db(&dev->port[i]); +} + +static void refresh_db(void) +{ + struct sa_db_device *dev; + + list_for_each_entry(dev, &dev_list, list) + refresh_dev_db(dev); +} + +static ssize_t do_refresh(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + down_write(&lock); + refresh_db(); + up_write(&lock); + + return count; +} +static DEVICE_ATTR(refresh, S_IWUSR, NULL, do_refresh); + +static ssize_t get_lookup_method(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, + "%c %d round robin\n" + "%c %d random\n", + (lookup_method == SA_DB_LOOKUP_LEAST_USED) ? '*' : ' ', + SA_DB_LOOKUP_LEAST_USED, + (lookup_method == SA_DB_LOOKUP_RANDOM) ? 
'*' : ' ', + SA_DB_LOOKUP_RANDOM); +} + +static ssize_t set_lookup_method(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + down_write(&lock); + lookup_method = simple_strtoul(buf, NULL, 0); + if (lookup_method > SA_DB_LOOKUP_MAX) + lookup_method = 0; + up_write(&lock); + + return count; +} +static DEVICE_ATTR(lookup_method, S_IRUGO | S_IWUSR, + get_lookup_method, set_lookup_method); + +static ssize_t get_paths_per_dest(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "%lu\n", paths_per_dest); +} + +static ssize_t set_paths_per_dest(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + down_write(&lock); + paths_per_dest = simple_strtoul(buf, NULL, 0); + if (paths_per_dest > SA_DB_MAX_PATHS_PER_DEST) + paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; + refresh_db(); + up_write(&lock); + + return count; +} +static DEVICE_ATTR(paths_per_dest, S_IRUGO | S_IWUSR, + get_paths_per_dest, set_paths_per_dest); + +static void port_work_handler(struct work_struct *work) +{ + struct sa_db_port *port; + + port = container_of(work, typeof(*port), work); + down_write(&lock); + refresh_port_db(port); + up_write(&lock); +} + +static void handle_event(struct ib_event_handler *event_handler, + struct ib_event *event) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + + dev = container_of(event_handler, typeof(*dev), event_handler); + port = &dev->port[event->element.port_num - dev->start_port]; + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + case IB_EVENT_PKEY_CHANGE: + case IB_EVENT_PORT_ACTIVE: + queue_work(sa_wq, &port->work); + break; + default: + break; + } +} + +static struct ib_sa_path_rec *get_random_path(struct ib_sa_iterator *iter, + union ib_gid *sgid, u16 pkey) +{ + struct ib_sa_path_rec *path, *rand_path = NULL; + int num, count = 0; + + for (path = 
ib_get_next_sa_attr(&iter); path; + path = ib_get_next_sa_attr(&iter)) { + if (pkey == path->pkey && + !memcmp(sgid, path->sgid.raw, sizeof *sgid)) { + get_random_bytes(&num, sizeof num); + if ((num % ++count) == 0) + rand_path = path; + } + } + + return rand_path; +} + +static struct ib_sa_path_rec *get_next_path(struct ib_sa_iterator *iter, + union ib_gid *sgid, u16 pkey) +{ + struct ib_path_rec_info *path_info, *next_path = NULL; + struct ib_sa_path_rec *path; + unsigned long lookups = ~0; + + for (path = ib_get_next_sa_attr(&iter); path; + path = ib_get_next_sa_attr(&iter)) { + if (pkey == path->pkey && + !memcmp(sgid, path->sgid.raw, sizeof *sgid)) { + + path_info = container_of(iter, struct ib_path_rec_info, + iter); + if (path_info->lookups < lookups) { + lookups = path_info->lookups; + next_path = path_info; + } + } + } + + if (next_path) { + next_path->lookups++; + return &next_path->rec; + } else + return NULL; +} + +int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, + union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec) +{ + struct ib_sa_iterator *iter; + struct ib_sa_path_rec *path; + int ret; + + iter = ib_create_path_iter(device, port_num, dgid); + if (IS_ERR(iter)) + return PTR_ERR(iter); + + if (lookup_method == SA_DB_LOOKUP_RANDOM) + path = get_random_path(iter, sgid, pkey); + else + path = get_next_path(iter, sgid, pkey); + + if (path) { + memcpy(rec, path, sizeof *rec); + ret = 0; + } else + ret = -ENODATA; + + ib_free_sa_iter(iter); + return ret; +} +EXPORT_SYMBOL(ib_get_path_rec); + +struct ib_sa_iterator *ib_create_path_iter(struct ib_device *device, + u8 port_num, union ib_gid *dgid) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + struct ib_sa_attr_list *list; + int ret; + + down_read(&lock); + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) { + ret = -ENODEV; + goto err; + } + port = &dev->port[port_num - dev->start_port]; + + list = find_attr_list(&port->paths, dgid->raw); + if 
(!list) { + ret = -ENODATA; + goto err; + } + + return &list->iter; +err: + up_read(&lock); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_create_path_iter); + +void ib_free_sa_iter(struct ib_sa_iterator *iter) +{ + up_read(&lock); +} +EXPORT_SYMBOL(ib_free_sa_iter); + +void *ib_get_next_sa_attr(struct ib_sa_iterator **iter) +{ + *iter = (*iter)->next; + return (*iter) ? ((void *)(*iter)) + sizeof(**iter) : NULL; +} +EXPORT_SYMBOL(ib_get_next_sa_attr); + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct sa_db_port *port; + struct update_info *update; + struct ib_mad_send_buf *msg; + enum sa_update_type type; + + msg = (struct ib_mad_send_buf *) (unsigned long) mad_recv_wc->wc->wr_id; + port = msg->context[0]; + update = msg->context[1]; + + down_write(&lock); + if (port->state == SA_DB_DESTROY || + update != list_entry(port->update_list.next, + struct update_info, list)) { + up_write(&lock); + } else { + type = update->type; + up_write(&lock); + update_path_db(mad_agent->context, mad_recv_wc, type); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_send_buf *msg; + struct sa_db_port *port; + struct update_info *update; + int ret; + + msg = mad_send_wc->send_buf; + port = msg->context[0]; + update = msg->context[1]; + + down_write(&lock); + if (port->state == SA_DB_DESTROY) + goto unlock; + + if (update == list_entry(port->update_list.next, + struct update_info, list)) { + + if (mad_send_wc->status == IB_WC_RESP_TIMEOUT_ERR && + msg->timeout_ms < SA_DB_MAX_RETRY_TIMER) { + + msg->timeout_ms <<= 1; + ret = ib_post_send_mad(msg, NULL); + if (!ret) { + up_write(&lock); + return; + } + } + list_del(&update->list); + kfree(update); + } + process_updates(port); +unlock: + up_write(&lock); + + ib_destroy_ah(msg->ah); + ib_free_send_mad(msg); +} + +static int init_port(struct sa_db_device *dev, int port_num) +{ + 
struct sa_db_port *port; + int ret; + + port = &dev->port[port_num - dev->start_port]; + port->dev = dev; + port->port_num = port_num; + INIT_WORK(&port->work, port_work_handler); + port->paths = RB_ROOT; + INIT_LIST_HEAD(&port->update_list); + + ret = ib_get_cached_gid(dev->device, port_num, 0, &port->gid); + if (ret) + return ret; + + port->agent = ib_register_mad_agent(dev->device, port_num, IB_QPT_GSI, + NULL, IB_MGMT_RMPP_VERSION, + send_handler, recv_handler, port); + if (IS_ERR(port->agent)) + ret = PTR_ERR(port->agent); + + return ret; +} + +static void destroy_port(struct sa_db_port *port) +{ + down_write(&lock); + port->state = SA_DB_DESTROY; + up_write(&lock); + + ib_unregister_mad_agent(port->agent); + cleanup_port(port); +} + +static void sa_db_add_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + int s, e, i, ret; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + dev = kzalloc(sizeof *dev + (e - s + 1) * sizeof *port, GFP_KERNEL); + if (!dev) + return; + + dev->start_port = s; + dev->port_count = e - s + 1; + dev->device = device; + for (i = 0; i < dev->port_count; i++) { + ret = init_port(dev, s + i); + if (ret) + goto err; + } + + ib_set_client_data(device, &sa_db_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event); + + down_write(&lock); + list_add_tail(&dev->list, &dev_list); + refresh_dev_db(dev); + up_write(&lock); + + ib_register_event_handler(&dev->event_handler); + return; +err: + while (i--) + destroy_port(&dev->port[i]); + kfree(dev); +} + +static void sa_db_remove_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + int i; + + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(sa_wq); + + for (i = 0; i < 
dev->port_count; i++) + destroy_port(&dev->port[i]); + + down_write(&lock); + list_del(&dev->list); + up_write(&lock); + + kfree(dev); +} + +static int __init sa_db_init(void) +{ + int ret; + + sa_wq = create_singlethread_workqueue("local_sa"); + if (!sa_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + ret = ib_register_client(&sa_db_client); + if (ret) + goto err1; + + ret = misc_register(&local_sa_misc); + if (ret) + goto err2; + + ret = device_create_file(local_sa_misc.this_device, &dev_attr_refresh); + if (ret) + goto err3; + + ret = device_create_file(local_sa_misc.this_device, + &dev_attr_paths_per_dest); + if (ret) + goto err4; + + ret = device_create_file(local_sa_misc.this_device, + &dev_attr_lookup_method); + if (ret) + goto err5; + + return 0; + +err5: + device_remove_file(local_sa_misc.this_device, &dev_attr_paths_per_dest); +err4: + device_remove_file(local_sa_misc.this_device, &dev_attr_refresh); +err3: + misc_deregister(&local_sa_misc); +err2: + ib_unregister_client(&sa_db_client); +err1: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); + return ret; +} + +static void __exit sa_db_cleanup(void) +{ + device_remove_file(local_sa_misc.this_device, &dev_attr_lookup_method); + device_remove_file(local_sa_misc.this_device, &dev_attr_paths_per_dest); + device_remove_file(local_sa_misc.this_device, &dev_attr_refresh); + misc_deregister(&local_sa_misc); + ib_unregister_client(&sa_db_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); +} + +module_init(sa_db_init); +module_exit(sa_db_cleanup); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 8de4ad8..1dd8063 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -464,6 +464,32 @@ static const struct ib_field notice_table[] = { .size_bits = 128 }, }; +int ib_sa_pack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + 
ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} +EXPORT_SYMBOL(ib_sa_pack_attr); + +int ib_sa_unpack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} +EXPORT_SYMBOL(ib_sa_unpack_attr); + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); diff --git a/include/rdma/ib_local_sa.h b/include/rdma/ib_local_sa.h new file mode 100644 index 0000000..0ce084b --- /dev/null +++ b/include/rdma/ib_local_sa.h @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_LOCAL_SA_H +#define IB_LOCAL_SA_H + +#include + +/** + * ib_get_path_rec - Query the local SA database for path information. + * @device: The local device to query. + * @port_num: The port of the local device being queried. + * @sgid: The source GID of the path record. + * @dgid: The destination GID of the path record. + * @pkey: The protection key of the path record. + * @rec: A reference to a path record structure that will receive a copy of + * the response. + * + * Returns a copy of a path record meeting the specified criteria to the + * location referenced by %rec. A return value < 0 indicates that an error + * occurred processing the request, or no path record was found. + */ +int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, + union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec); + +/** + * ib_create_path_iter - Create an iterator that may be used to walk through + * a list of path records. + * @device: The local device to retrieve path records for. + * @port_num: The port of the local device. + * @dgid: The destination GID of the path record. + * + * This call allocates an iterator that is used to walk through a list of + * cached path records. All path records accessed by the iterator will have the + * specified DGID. User should not hold the iterator for an extended period of + * time, and must free it by calling ib_free_sa_iter. + */ +struct ib_sa_iterator *ib_create_path_iter(struct ib_device *device, + u8 port_num, union ib_gid *dgid); + +/** + * ib_free_sa_iter - Release an iterator. + * @iter: The iterator to free. 
+ */ +void ib_free_sa_iter(struct ib_sa_iterator *iter); + +/** + * ib_get_next_sa_attr - Retrieve the next SA attribute referenced by an + * iterator. + * @iter: A reference to an iterator that points to the next attribute to + * retrieve. + */ +void *ib_get_next_sa_attr(struct ib_sa_iterator **iter); + +#endif /* IB_LOCAL_SA_H */ diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 46b52fd..1e5e630 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -552,4 +552,7 @@ ib_sa_register_inform_info(struct ib_sa_client *client, */ void ib_sa_unregister_inform_info(struct ib_inform_info *info); +int ib_sa_pack_attr(void *dst, void *src, int attr_id); +int ib_sa_unpack_attr(void *dst, void *src, int attr_id); + #endif /* IB_SA_H */ From sean.hefty at intel.com Thu Apr 19 17:06:46 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Apr 2007 17:06:46 -0700 Subject: [ofa-general] [RFC] [PATCH 3/3] 2.6.22 or 23 rdma/cm: check cache for path records In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> Message-ID: <000301c782df$c9afd490$07fd070a@amr.corp.intel.com> RDMA/cma: use local SA cache for path queries From: Sean Hefty Have the rdma_cm check the local SA cache for path records before querying the remote SA. This improves path record lookup time and scale-out connection rates. 
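For anyone following the control flow before reading the diff, here is a minimal userspace sketch of the cache-first lookup described above: consult the local cache, and fall back to the remote SA query only when the cache reports -ENODATA. This is an illustration under stated assumptions, not the kernel code. The names path_rec, cache_lookup(), remote_query() and resolve_route() are hypothetical stand-ins, the cache is a fixed table, and the remote query is faked so the example runs on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified path record: only the fields this sketch needs. */
struct path_rec {
	unsigned char dgid[16];
	unsigned short pkey;
	unsigned short dlid;
};

/* Hypothetical local cache: a fixed table standing in for the local SA database. */
static const struct path_rec cache[] = {
	{ {0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1}, 0xffff, 4 },
};

static int remote_queries;	/* counts how often the slow path was taken */

/* Returns 0 and copies the record on a hit, -ENODATA on a miss. */
static int cache_lookup(const unsigned char *dgid, unsigned short pkey,
			struct path_rec *rec)
{
	size_t i;

	assert(rec != NULL);
	for (i = 0; i < sizeof cache / sizeof cache[0]; i++) {
		if (cache[i].pkey == pkey && !memcmp(cache[i].dgid, dgid, 16)) {
			*rec = cache[i];
			return 0;
		}
	}
	return -ENODATA;
}

/* Stand-in for the remote SA query; it fabricates a record so the sketch runs. */
static int remote_query(const unsigned char *dgid, unsigned short pkey,
			struct path_rec *rec)
{
	remote_queries++;
	memcpy(rec->dgid, dgid, 16);
	rec->pkey = pkey;
	rec->dlid = 0;	/* placeholder; a real query would fill this in */
	return 0;
}

/* Cache first; fall back to the remote query only on -ENODATA. */
static int resolve_route(const unsigned char *dgid, unsigned short pkey,
			 struct path_rec *rec)
{
	int ret = cache_lookup(dgid, pkey, rec);

	if (ret == -ENODATA)
		ret = remote_query(dgid, pkey, rec);
	return ret;
}
```

Any error other than -ENODATA is returned to the caller without touching the remote SA, which mirrors the error handling in the patch that follows.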
Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 19 ++++++++++++++++--- 1 files changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index fde92ce..c8e2024 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -45,6 +45,7 @@ #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -1529,6 +1530,7 @@ out: static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { struct rdma_route *route = &id_priv->id.route; + struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct cma_work *work; int ret; @@ -1548,9 +1550,20 @@ static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) goto err1; } - ret = cma_query_ib_route(id_priv, timeout_ms, work); - if (ret) - goto err2; + ib_addr_get_sgid(addr, &route->path_rec->sgid); + ib_addr_get_dgid(addr, &route->path_rec->dgid); + ret = ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, + &route->path_rec->sgid, &route->path_rec->dgid, + ib_addr_get_pkey(addr), route->path_rec); + if (!ret) { + route->num_paths = 1; + queue_work(cma_wq, &work->work); + } else { + if (ret == -ENODATA) + ret = cma_query_ib_route(id_priv, timeout_ms, work); + if (ret) + goto err2; + } return 0; err2: From mst at dev.mellanox.co.il Thu Apr 19 20:26:49 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 20 Apr 2007 06:26:49 +0300 Subject: [ofa-general] Re: Slow failover of IPoIB ipoibtools/bonding (bug 541) In-Reply-To: References: Message-ID: <20070420032649.GB613@mellanox.co.il> > Quoting Scott Weitzenkamp (sweitzen) : > Subject: Slow failover of IPoIB ipoibtools/bonding (bug 541) > > Roland, Michael, or Sean, this is what I see when IPoIB failover is slow, how > do we get this fixed? > > > ib0: Request connection 0x60406 for gid fe80:0000:0000:0000:0002:c902:0020:e1d9 > qpn 0x404 > ib0: REP received. 
> ib0: REQ arrived > ib0: failed cm send event (status=12, wrid=45 vend_err 81) > ib0: Destroy active connection 0x60406 head 0x6546f tail 0x6546e > ib0: Request connection 0x70406 for gid fe80:0000:0000:0000:0002:c902:0020:e1d9 > qpn 0x404 Scott, this a result of port going down, the message is benign. For simplicity, could you please check whether slow failover is observed with datagram mode? This takes a couple of variables out of the equation. -- MST From devesh28 at gmail.com Thu Apr 19 23:44:36 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Fri, 20 Apr 2007 12:14:36 +0530 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> Message-ID: <309a667c0704192344gec04bd0uf9eec6c6413ea34@mail.gmail.com> Hello sean I have certain queries about local_sa_cache. Once SM is up on a node/switch whole network is up. Now is if some client is trying to establish a connection with other node, client is expected to resolve the path using sa API, I want to know how exactly it happens in the stack? second query is below. On 4/20/07, Sean Hefty wrote: > The following set of patches adds a local SA path record cache to the IB stack > (currently based on 2.6.20-rc4). The cache is derived from the OFED 1.2 local > SA cache patches, with changes based on the last round of feedback and current > Path Forward feature requests: > > * InformInfo/Notice support added to ib_sa > Clients may now register to receive SA related events. The local_sa uses this > to receive notification of GID up/down events in order to keep the cache up to > date. > Is it possible to program local_sa_cache with some dummy path records which resides in cache for long time? > * Removal of time based cache updates > Cache updates are now driven by local and SA events. 
Most module parameters > have been eliminated, and remaining options are exposed through a file system > interface for dynamic control, including the ability to force a cache refresh. > > Using a local SA cache we were able to establish all-to-all connections between > 1024 processes (about 1 million connections) in about 3 seconds. Without the > cache, connection time took about a minute, and required a substantial amount of > tuning of timeout values to achieve this. > > I've only updated the rdma_cm to use the cache, but similar changes could be > made to SRP and ipoib (which implements its own path record cache). > > I would like to get feedback on both the notice and local_sa patches for > inclusion in 2.6.22 or 2.6.23 (if 2.6.22 is not possible). > > Signed-off-by: Sean Hefty > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Apr 19 23:53:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Apr 2007 23:53:25 -0700 Subject: [ofa-general] [PATCH 1/2][RFC] IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules In-Reply-To: (Roland Dreier's message of "Thu, 19 Apr 2007 16:02:25 -0700") References: Message-ID: [Here's an easier to read git format patch, which handles the renaming of uverbs_mem.c -> umem.c and just shows the changes, rather than deleting one file and adding another one] Export ib_umem_get()/ib_umem_release() and put low-level drivers in control of when to call ib_umem_get() to pin and DMA map userspace, rather than always calling it in ib_uverbs_reg_mr() before calling the low-level driver's reg_user_mr method. Also move these functions to be in the ib_core module instead of ib_uverbs, so that driver modules using them do not depend on ib_uverbs. 
This has a number of advantages:

- It is better design from the standpoint of making generic code a library that can be used or overridden by device-specific code as the details of specific devices dictate.

- Drivers that do not need to pin userspace memory regions do not need to take the performance hit of calling ib_umem_get(). For example, although I have not tried to implement it in this patch, the ipath driver should be able to avoid pinning memory and just use copy_{to,from}_user() to access userspace memory regions.

- Buffers that need special mapping treatment can be identified by the low-level driver. For example, it may be possible to solve some Altix-specific memory ordering issues with mthca CQs in userspace by mapping CQ buffers with extra flags.

- Drivers that need to pin and DMA map userspace memory for things other than memory regions can use ib_umem_get() directly, instead of hacks using extra parameters to their reg_phys_mr method. For example, the mlx4 driver that is pending being merged needs to pin and DMA map QP and CQ buffers, but it does not need to create a memory key for these buffers. So the cleanest solution is for mlx4 to call ib_umem_get() in the create_qp and create_cq methods.
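To make the new calling convention concrete, here is a minimal userspace sketch of the pattern the driver changes below follow: the driver's own reg_user_mr calls the pinning helper, checks the returned pointer with IS_ERR(), unwinds on failure, and releases the umem at deregistration. It is only a model. umem_get(), umem_release(), reg_user_mr() and dereg_mr() are hypothetical stand-ins, the error-pointer helpers are simplified copies of the kernel idiom, and the zero-length failure is an assumption added so the error path can be exercised. The writable rule is the one from the patch: anything other than "remote read" needs write access.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Simplified access flags, modelled on the IB_ACCESS_xxx bits. */
enum {
	ACCESS_LOCAL_WRITE  = 1,
	ACCESS_REMOTE_WRITE = 1 << 1,
	ACCESS_REMOTE_READ  = 1 << 2,
};

struct umem { size_t length; int writable; };	/* hypothetical, trimmed down */
struct mr { struct umem *umem; };

static int live_umems;	/* tracks outstanding "pinned" regions */

/* Model of the pinning helper: returns a umem or an ERR_PTR() value. */
static struct umem *umem_get(size_t length, int access)
{
	struct umem *umem;

	if (length == 0)	/* assumption of this sketch, to simulate failure */
		return ERR_PTR(-EINVAL);

	umem = malloc(sizeof *umem);
	if (!umem)
		return ERR_PTR(-ENOMEM);

	umem->length = length;
	/* Writable unless "remote read" is the only access requested. */
	umem->writable = !!(access & ~ACCESS_REMOTE_READ);
	live_umems++;
	return umem;
}

static void umem_release(struct umem *umem)
{
	free(umem);
	live_umems--;
}

/* The driver, not the generic layer, decides when to pin. */
static struct mr *reg_user_mr(size_t length, int access)
{
	struct mr *mr = malloc(sizeof *mr);
	long err;

	if (!mr)
		return ERR_PTR(-ENOMEM);

	mr->umem = umem_get(length, access);
	if (IS_ERR(mr->umem)) {
		err = PTR_ERR(mr->umem);
		free(mr);
		return ERR_PTR(err);
	}
	return mr;
}

static void dereg_mr(struct mr *mr)
{
	assert(mr != NULL);
	if (mr->umem)	/* MRs created without pinning carry no umem */
		umem_release(mr->umem);
	free(mr);
}
```

A driver that never pins memory would simply leave the umem pointer NULL, which is why the release in dereg_mr() is conditional.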
Signed-off-by: Roland Dreier --- drivers/infiniband/Kconfig | 5 + drivers/infiniband/core/Makefile | 4 +- drivers/infiniband/core/device.c | 2 + drivers/infiniband/core/{uverbs_mem.c => umem.c} | 136 +++++++++++++++------- drivers/infiniband/core/uverbs.h | 6 +- drivers/infiniband/core/uverbs_cmd.c | 60 +++------- drivers/infiniband/core/uverbs_main.c | 11 +-- drivers/infiniband/hw/amso1100/c2_provider.c | 42 +++++--- drivers/infiniband/hw/amso1100/c2_provider.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 28 +++-- drivers/infiniband/hw/cxgb3/iwch_provider.h | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 69 ++++++----- drivers/infiniband/hw/ipath/ipath_mr.c | 38 +++++-- drivers/infiniband/hw/ipath/ipath_verbs.h | 5 +- drivers/infiniband/hw/mthca/mthca_provider.c | 38 +++++-- drivers/infiniband/hw/mthca/mthca_provider.h | 1 + include/rdma/ib_umem.h | 78 ++++++++++++ include/rdma/ib_verbs.h | 28 +---- 20 files changed, 355 insertions(+), 202 deletions(-) rename drivers/infiniband/core/{uverbs_mem.c => umem.c} (63%) create mode 100644 include/rdma/ib_umem.h diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 66b36de..82afba5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -29,6 +29,11 @@ config INFINIBAND_USER_ACCESS libibverbs, libibcm and a hardware driver library from . 
+config INFINIBAND_USER_MEM + bool + depends on INFINIBAND_USER_ACCESS != n + default y + config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND && INET diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 189e5d4..cb1ab3e 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -9,6 +9,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o +ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o @@ -28,5 +29,4 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ - uverbs_marshall.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 7fabb42..592c90a 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -613,6 +613,8 @@ static void __exit ib_core_cleanup(void) { ib_cache_cleanup(); ib_sysfs_cleanup(); + /* Make sure that any pending umem accounting work is done. */ + flush_scheduled_work(); } module_init(ib_core_init); diff --git a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/umem.c similarity index 63% rename from drivers/infiniband/core/uverbs_mem.c rename to drivers/infiniband/core/umem.c index c95fe95..48e854c 100644 --- a/drivers/infiniband/core/uverbs_mem.c +++ b/drivers/infiniband/core/umem.c @@ -64,35 +64,56 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d } } -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write) +/** + * ib_umem_get - Pin and DMA map userspace memory. 
+ * @context: userspace context to pin memory for + * @addr: userspace virtual address to start at + * @size: length of region to pin + * @access: IB_ACCESS_xxx flags for memory being pinned + */ +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access) { + struct ib_umem *umem; struct page **page_list; struct ib_umem_chunk *chunk; unsigned long locked; unsigned long lock_limit; unsigned long cur_base; unsigned long npages; - int ret = 0; + int ret; int off; int i; if (!can_do_mlock()) - return -EPERM; + return ERR_PTR(-EPERM); - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; + umem = kmalloc(sizeof *umem, GFP_KERNEL); + if (!umem) + return ERR_PTR(-ENOMEM); - mem->user_base = (unsigned long) addr; - mem->length = size; - mem->offset = (unsigned long) addr & ~PAGE_MASK; - mem->page_size = PAGE_SIZE; - mem->writable = write; + umem->context = context; + umem->length = size; + umem->offset = addr & ~PAGE_MASK; + umem->page_size = PAGE_SIZE; + /* + * We ask for writable memory if any access flags other than + * "remote read" are set. "Local write" and "remote write" + * obviously require write access. "Remote atomic" can do + * things like fetch and add, which will modify memory, and + * "MW bind" can change permissions by binding a window. 
+ */ + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ); - INIT_LIST_HEAD(&mem->chunk_list); + INIT_LIST_HEAD(&umem->chunk_list); + + page_list = (struct page **) __get_free_page(GFP_KERNEL); + if (!page_list) { + kfree(umem); + return ERR_PTR(-ENOMEM); + } - npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT; down_write(&current->mm->mmap_sem); @@ -104,13 +125,13 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, goto out; } - cur_base = (unsigned long) addr & PAGE_MASK; + cur_base = addr & PAGE_MASK; while (npages) { ret = get_user_pages(current, current->mm, cur_base, min_t(int, npages, PAGE_SIZE / sizeof (struct page *)), - 1, !write, page_list, NULL); + 1, !umem->writable, page_list, NULL); if (ret < 0) goto out; @@ -136,7 +157,7 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, chunk->page_list[i].length = PAGE_SIZE; } - chunk->nmap = ib_dma_map_sg(dev, + chunk->nmap = ib_dma_map_sg(context->device, &chunk->page_list[0], chunk->nents, DMA_BIDIRECTIONAL); @@ -151,33 +172,25 @@ int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, ret -= chunk->nents; off += chunk->nents; - list_add_tail(&chunk->list, &mem->chunk_list); + list_add_tail(&chunk->list, &umem->chunk_list); } ret = 0; } out: - if (ret < 0) - __ib_umem_release(dev, mem, 0); - else + if (ret < 0) { + __ib_umem_release(context->device, umem, 0); + kfree(umem); + } else current->mm->locked_vm = locked; up_write(&current->mm->mmap_sem); free_page((unsigned long) page_list); - return ret; -} - -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem) -{ - __ib_umem_release(dev, umem, 1); - - down_write(&current->mm->mmap_sem); - current->mm->locked_vm -= - PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; - up_write(&current->mm->mmap_sem); + return ret < 0 ? 
ERR_PTR(ret) : umem; } +EXPORT_SYMBOL(ib_umem_get); static void ib_umem_account(struct work_struct *_work) { @@ -191,35 +204,70 @@ static void ib_umem_account(struct work_struct *_work) kfree(work); } -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem) +/** + * ib_umem_release - release memory pinned with ib_umem_get + * @umem: umem struct to release + */ +void ib_umem_release(struct ib_umem *umem) { struct ib_umem_account_work *work; + struct ib_ucontext *context = umem->context; struct mm_struct *mm; + unsigned long diff; - __ib_umem_release(dev, umem, 1); + __ib_umem_release(umem->context->device, umem, 1); mm = get_task_mm(current); if (!mm) return; + diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + kfree(umem); + /* * We may be called with the mm's mmap_sem already held. This * can happen when a userspace munmap() is the call that drops * the last reference to our file and calls our release * method. If there are memory regions to destroy, we'll end - * up here and not be able to take the mmap_sem. Therefore we - * defer the vm_locked accounting to the system workqueue. + * up here and not be able to take the mmap_sem. In that case + * we defer the vm_locked accounting to the system workqueue. 
*/ + if (context->closing && !down_write_trylock(&mm->mmap_sem)) { + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) { + mmput(mm); + return; + } - work = kmalloc(sizeof *work, GFP_KERNEL); - if (!work) { - mmput(mm); + INIT_WORK(&work->work, ib_umem_account); + work->mm = mm; + work->diff = diff; + + schedule_work(&work->work); return; - } + } else + down_write(&mm->mmap_sem); + + current->mm->locked_vm -= diff; + up_write(&mm->mmap_sem); + mmput(mm); +} +EXPORT_SYMBOL(ib_umem_release); + +int ib_umem_page_count(struct ib_umem *umem) +{ + struct ib_umem_chunk *chunk; + int shift; + int i; + int n; + + shift = ilog2(umem->page_size); - INIT_WORK(&work->work, ib_umem_account); - work->mm = mm; - work->diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + n = 0; + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) + n += sg_dma_len(&chunk->page_list[i]) >> shift; - schedule_work(&work->work); + return n; } +EXPORT_SYMBOL(ib_umem_page_count); diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 102a59c..c33546f 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -45,6 +45,7 @@ #include #include +#include #include /* @@ -163,11 +164,6 @@ void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event); -int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, - void *addr, size_t size, int write); -void ib_umem_release(struct ib_device *dev, struct ib_umem *umem); -void ib_umem_release_on_close(struct ib_device *dev, struct ib_umem *umem); - #define IB_UVERBS_DECLARE_CMD(name) \ ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ const char __user *buf, int in_len, \ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 4fd75af..8c338bc 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ 
b/drivers/infiniband/core/uverbs_cmd.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * Copyright (c) 2006 Mellanox Technologies. All rights reserved. * @@ -295,6 +295,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, INIT_LIST_HEAD(&ucontext->qp_list); INIT_LIST_HEAD(&ucontext->srq_list); INIT_LIST_HEAD(&ucontext->ah_list); + ucontext->closing = 0; resp.num_comp_vectors = file->device->num_comp_vectors; @@ -573,7 +574,7 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, struct ib_uverbs_reg_mr cmd; struct ib_uverbs_reg_mr_resp resp; struct ib_udata udata; - struct ib_umem_object *obj; + struct ib_uobject *uobj; struct ib_pd *pd; struct ib_mr *mr; int ret; @@ -599,35 +600,21 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE)) return -EINVAL; - obj = kmalloc(sizeof *obj, GFP_KERNEL); - if (!obj) + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) return -ENOMEM; - init_uobj(&obj->uobject, 0, file->ucontext, &mr_lock_key); - down_write(&obj->uobject.mutex); - - /* - * We ask for writable memory if any access flags other than - * "remote read" are set. "Local write" and "remote write" - * obviously require write access. "Remote atomic" can do - * things like fetch and add, which will modify memory, and - * "MW bind" can change permissions by binding a window. 
- */ - ret = ib_umem_get(file->device->ib_dev, &obj->umem, - (void *) (unsigned long) cmd.start, cmd.length, - !!(cmd.access_flags & ~IB_ACCESS_REMOTE_READ)); - if (ret) - goto err_free; - - obj->umem.virt_base = cmd.hca_va; + init_uobj(uobj, 0, file->ucontext, &mr_lock_key); + down_write(&uobj->mutex); pd = idr_read_pd(cmd.pd_handle, file->ucontext); if (!pd) { ret = -EINVAL; - goto err_release; + goto err_free; } - mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); + mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, + cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); goto err_put; @@ -635,19 +622,19 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, mr->device = pd->device; mr->pd = pd; - mr->uobject = &obj->uobject; + mr->uobject = uobj; atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - obj->uobject.object = mr; - ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); + uobj->object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, uobj); if (ret) goto err_unreg; memset(&resp, 0, sizeof resp); resp.lkey = mr->lkey; resp.rkey = mr->rkey; - resp.mr_handle = obj->uobject.id; + resp.mr_handle = uobj->id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { @@ -658,17 +645,17 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, put_pd_read(pd); mutex_lock(&file->mutex); - list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); + list_add_tail(&uobj->list, &file->ucontext->mr_list); mutex_unlock(&file->mutex); - obj->uobject.live = 1; + uobj->live = 1; - up_write(&obj->uobject.mutex); + up_write(&uobj->mutex); return in_len; err_copy: - idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); err_unreg: ib_dereg_mr(mr); @@ -676,11 +663,8 @@ err_unreg: err_put: put_pd_read(pd); -err_release: - ib_umem_release(file->device->ib_dev, &obj->umem); - err_free: - put_uobj_write(&obj->uobject); + put_uobj_write(uobj); return ret; } @@ 
-691,7 +675,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; struct ib_uobject *uobj; - struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -701,8 +684,7 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, if (!uobj) return -EINVAL; - memobj = container_of(uobj, struct ib_umem_object, uobject); - mr = uobj->object; + mr = uobj->object; ret = ib_dereg_mr(mr); if (!ret) @@ -719,8 +701,6 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_umem_release(file->device->ib_dev, &memobj->umem); - put_uobj(uobj); return in_len; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f8bc822..41c2065 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -183,6 +183,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, if (!context) return 0; + context->closing = 1; + list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { struct ib_ah *ah = uobj->object; @@ -230,16 +232,10 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = uobj->object; - struct ib_device *mrdev = mr->device; - struct ib_umem_object *memobj; idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); - - memobj = container_of(uobj, struct ib_umem_object, uobject); - ib_umem_release_on_close(mrdev, &memobj->umem); - - kfree(memobj); + kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { @@ -906,7 +902,6 @@ static void __exit ib_uverbs_cleanup(void) unregister_filesystem(&uverbs_event_fs); class_destroy(uverbs_class); unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES); - flush_scheduled_work(); idr_destroy(&ib_uverbs_pd_idr); idr_destroy(&ib_uverbs_mr_idr); idr_destroy(&ib_uverbs_mw_idr); 
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index fef9727..10a085d 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -56,6 +56,7 @@ #include #include +#include #include #include "c2.h" #include "c2_provider.h" @@ -396,6 +397,7 @@ static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, } mr->pd = to_c2pd(ib_pd); + mr->umem = NULL; pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " "*iova_start %llx, first pa %llx, last pa %llx\n", __FUNCTION__, page_shift, pbl_depth, total_len, @@ -428,8 +430,8 @@ static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); } -static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { u64 *pages; u64 kva = 0; @@ -441,15 +443,23 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct c2_mr *c2mr; pr_debug("%s:%u\n", __FUNCTION__, __LINE__); - shift = ffs(region->page_size) - 1; c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); if (!c2mr) return ERR_PTR(-ENOMEM); c2mr->pd = c2pd; + c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(c2mr->umem)) { + err = PTR_ERR(c2mr->umem); + kfree(c2mr); + return ERR_PTR(err); + } + + shift = ffs(c2mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -459,35 +469,34 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, } i = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &c2mr->umem->chunk_list, list) { for (j = 0; j < chunk->nmap; ++j) { len =
sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - (region->page_size * k); + (c2mr->umem->page_size * k); } } } - kva = (u64)region->virt_base; + kva = virt; err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), pages, - region->page_size, + c2mr->umem->page_size, i, - region->length, - region->offset, + length, + c2mr->umem->offset, &kva, c2_convert_access(acc), c2mr); kfree(pages); - if (err) { - kfree(c2mr); - return ERR_PTR(err); - } + if (err) + goto err; return &c2mr->ibmr; err: + ib_umem_release(c2mr->umem); kfree(c2mr); return ERR_PTR(err); } @@ -502,8 +511,11 @@ static int c2_dereg_mr(struct ib_mr *ib_mr) err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); if (err) pr_debug("c2_stag_dealloc failed: %d\n", err); - else + else { + if (mr->umem) + ib_umem_release(mr->umem); kfree(mr); + } return err; } diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h index fc90622..1076df2 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.h +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -73,6 +73,7 @@ struct c2_pd { struct c2_mr { struct ib_mr ibmr; struct c2_pd *pd; + struct ib_umem *umem; }; struct c2_av; diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..98cdd13 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include "cxio_hal.h" @@ -441,6 +442,8 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr) remove_handle(rhp, &rhp->mmidr, mmid); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); + if (mhp->umem) + ib_umem_release(mhp->umem); PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); kfree(mhp); return 0; @@ -575,8 +578,8 @@ static int iwch_reregister_phys_mem(struct ib_mr *mr, } -static struct ib_mr *iwch_reg_user_mr(struct 
ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { __be64 *pages; int shift, n, len; @@ -589,7 +592,6 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, struct iwch_reg_user_mr_resp uresp; PDBG("%s ib_pd %p\n", __FUNCTION__, pd); - shift = ffs(region->page_size) - 1; php = to_iwch_pd(pd); rhp = php->rhp; @@ -597,8 +599,17 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mhp->umem)) { + err = PTR_ERR(mhp->umem); + kfree(mhp); + return ERR_PTR(err); + } + + shift = ffs(mhp->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) n += chunk->nents; pages = kmalloc(n * sizeof(u64), GFP_KERNEL); @@ -609,13 +620,13 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, i = n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mhp->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = cpu_to_be64(sg_dma_address( &chunk->page_list[j]) + - region->page_size * k); + mhp->umem->page_size * k); } } @@ -623,9 +634,9 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = iwch_ib_to_tpt_access(acc); - mhp->attr.va_fbo = region->virt_base; + mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; - mhp->attr.len = (u32) region->length; + mhp->attr.len = (u32) length; mhp->attr.pbl_size = i; err = iwch_register_mem(rhp, php, mhp, shift, pages); kfree(pages); @@ -648,6 +659,7 @@ static struct ib_mr *iwch_reg_user_mr(struct
ib_pd *pd, struct ib_umem *region, return &mhp->ibmr; err: + ib_umem_release(mhp->umem); kfree(mhp); return ERR_PTR(err); } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index 93bcc56..48833f3 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -73,6 +73,7 @@ struct tpt_attributes { struct iwch_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct iwch_dev *rhp; u64 kva; struct tpt_attributes attr; diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 82ded44..88e7866 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -175,6 +175,7 @@ struct ehca_mr { struct ib_mr ib_mr; /* must always be first in ehca_mr */ struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ } ib; + struct ib_umem *umem; spinlock_t mrlock; enum ehca_mr_flag flags; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..9b22c5a 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -78,8 +78,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, int num_phys_buf, int mr_access_flags, u64 *iova_start); -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, int mr_access_flags, struct ib_udata *udata); int ehca_rereg_phys_mr(struct ib_mr *mr, diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index d22ab56..84c5bb4 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -39,6 +39,8 @@ * POSSIBILITY OF SUCH DAMAGE. 
*/ +#include + #include #include "ehca_iverbs.h" @@ -238,10 +240,8 @@ reg_phys_mr_exit0: /*----------------------------------------------------------------------*/ -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, - struct ib_umem *region, - int mr_access_flags, - struct ib_udata *udata) +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, + int mr_access_flags, struct ib_udata *udata) { struct ib_mr *ib_mr; struct ehca_mr *e_mr; @@ -257,11 +257,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ehca_gen_err("bad pd=%p", pd); return ERR_PTR(-EFAULT); } - if (!region) { - ehca_err(pd->device, "bad input values: region=%p", region); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && @@ -275,17 +271,10 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } - if (region->page_size != PAGE_SIZE) { - ehca_err(pd->device, "page size not supported, " - "region->page_size=%x", region->page_size); - ib_mr = ERR_PTR(-EINVAL); - goto reg_user_mr_exit0; - } - if ((region->length == 0) || - ((region->virt_base + region->length) < region->virt_base)) { + if (length == 0 || virt + length < virt) { ehca_err(pd->device, "bad input values: length=%lx " - "virt_base=%lx", region->length, region->virt_base); + "virt_base=%lx", length, virt); ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } @@ -297,40 +286,55 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, goto reg_user_mr_exit0; } + e_mr->umem = ib_umem_get(pd->uobject->context, start, length, + mr_access_flags); + if (IS_ERR(e_mr->umem)) { + ib_mr = (void *) e_mr->umem; + goto reg_user_mr_exit1; + } + + if (e_mr->umem->page_size != PAGE_SIZE) { + ehca_err(pd->device, "page size not supported, " + "e_mr->umem->page_size=%x", e_mr->umem->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto 
reg_user_mr_exit2; + } + /* determine number of MR pages */ - num_pages_mr = (((region->virt_base % PAGE_SIZE) + region->length + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = (((region->virt_base % EHCA_PAGESIZE) + region->length + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = (((virt % PAGE_SIZE) + length + PAGE_SIZE - 1) / + PAGE_SIZE); + num_pages_4k = (((virt % EHCA_PAGESIZE) + length + EHCA_PAGESIZE - 1) / + EHCA_PAGESIZE); /* register MR on HCA */ pginfo.type = EHCA_MR_PGI_USER; pginfo.num_pages = num_pages_mr; pginfo.num_4k = num_pages_4k; - pginfo.region = region; - pginfo.next_4k = region->offset / EHCA_PAGESIZE; + pginfo.region = e_mr->umem; + pginfo.next_4k = e_mr->umem->offset / EHCA_PAGESIZE; pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, - (&region->chunk_list), + (&e_mr->umem->chunk_list), list); - ret = ehca_reg_mr(shca, e_mr, (u64*)region->virt_base, - region->length, mr_access_flags, e_pd, &pginfo, - &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); if (ret) { ib_mr = ERR_PTR(ret); - goto reg_user_mr_exit1; + goto reg_user_mr_exit2; } /* successful registration of all pages */ return &e_mr->ib.ib_mr; +reg_user_mr_exit2: + ib_umem_release(e_mr->umem); reg_user_mr_exit1: ehca_mr_delete(e_mr); reg_user_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(pd->device, "rc=%lx pd=%p region=%p mr_access_flags=%x" + ehca_err(pd->device, "rc=%lx pd=%p mr_access_flags=%x" " udata=%p", - PTR_ERR(ib_mr), pd, region, mr_access_flags, udata); + PTR_ERR(ib_mr), pd, mr_access_flags, udata); return ib_mr; } /* end ehca_reg_user_mr() */ @@ -596,6 +600,9 @@ int ehca_dereg_mr(struct ib_mr *mr) goto dereg_mr_exit0; } + if (e_mr->umem) + ib_umem_release(e_mr->umem); + /* successful deregistration */ ehca_mr_delete(e_mr); diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index 8cc8598..8e91c8b
100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -31,6 +31,7 @@ * SOFTWARE. */ +#include #include #include @@ -147,6 +148,7 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, mr->mr.offset = 0; mr->mr.access_flags = acc; mr->mr.max_segs = num_phys_buf; + mr->umem = NULL; m = 0; n = 0; @@ -170,50 +172,60 @@ bail: /** * ipath_reg_user_mr - register a userspace memory region * @pd: protection domain for this memory region - * @region: the user memory region + * @start: starting userspace address + * @length: length of region to register + * @virt_addr: virtual address to use (from HCA's point of view) * @mr_access_flags: access flags for this memory region * @udata: unused by the InfiniPath driver * * Returns the memory region on success, otherwise returns an errno. */ -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, struct ib_udata *udata) +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, + struct ib_udata *udata) { struct ipath_mr *mr; + struct ib_umem *umem; struct ib_umem_chunk *chunk; int n, m, i; struct ib_mr *ret; - if (region->length == 0) { + if (length == 0) { ret = ERR_PTR(-EINVAL); goto bail; } + umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags); + if (IS_ERR(umem)) + return (void *) umem; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &umem->chunk_list, list) n += chunk->nents; mr = alloc_mr(n, &to_idev(pd->device)->lk_table); if (!mr) { ret = ERR_PTR(-ENOMEM); + ib_umem_release(umem); goto bail; } mr->mr.pd = pd; - mr->mr.user_base = region->user_base; - mr->mr.iova = region->virt_base; - mr->mr.length = region->length; - mr->mr.offset = region->offset; + mr->mr.user_base = start; + mr->mr.iova = virt_addr; + mr->mr.length = length; + mr->mr.offset = umem->offset; mr->mr.access_flags = mr_access_flags; mr->mr.max_segs = n; +
mr->umem = umem; m = 0; n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) { + list_for_each_entry(chunk, &umem->chunk_list, list) { for (i = 0; i < chunk->nmap; i++) { mr->mr.map[m]->segs[n].vaddr = page_address(chunk->page_list[i].page); - mr->mr.map[m]->segs[n].length = region->page_size; + mr->mr.map[m]->segs[n].length = umem->page_size; n++; if (n == IPATH_SEGSZ) { m++; @@ -247,6 +259,10 @@ int ipath_dereg_mr(struct ib_mr *ibmr) i--; kfree(mr->mr.map[i]); } + + if (mr->umem) + ib_umem_release(mr->umem); + kfree(mr); return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index c0c8d5b..8f7af7a 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -248,6 +248,7 @@ struct ipath_sge { /* Memory region */ struct ipath_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct ipath_mregion mr; /* must be last */ }; @@ -726,8 +727,8 @@ struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, u64 *iova_start); -struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, - int mr_access_flags, +struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int ipath_dereg_mr(struct ib_mr *ibmr); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 0725ad7..cd5eb60 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -37,6 +37,7 @@ */ #include +#include #include #include @@ -907,6 +908,8 @@ static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) return ERR_PTR(err); } + mr->umem = NULL; + return &mr->ibmr; } @@ -1002,11 +1005,13 @@ static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, } kfree(page_list); + mr->umem = NULL; + return &mr->ibmr; } -static struct ib_mr *mthca_reg_user_mr(struct
ib_pd *pd, struct ib_umem *region, - int acc, struct ib_udata *udata) +static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; @@ -1017,20 +1022,26 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, int err = 0; int write_mtt_size; - shift = ffs(region->page_size) - 1; - mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) return ERR_PTR(-ENOMEM); + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(mr->umem)) { + err = PTR_ERR(mr->umem); + goto err; + } + + shift = ffs(mr->umem->page_size) - 1; + n = 0; - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) n += chunk->nents; mr->mtt = mthca_alloc_mtt(dev, n); if (IS_ERR(mr->mtt)) { err = PTR_ERR(mr->mtt); - goto err; + goto err_umem; } pages = (u64 *) __get_free_page(GFP_KERNEL); @@ -1043,12 +1054,12 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages)); - list_for_each_entry(chunk, &region->chunk_list, list) + list_for_each_entry(chunk, &mr->umem->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; for (k = 0; k < len; ++k) { pages[i++] = sg_dma_address(&chunk->page_list[j]) + - region->page_size * k; + mr->umem->page_size * k; /* * Be friendly to write_mtt and pass it chunks * of appropriate size.
@@ -1070,8 +1081,8 @@ mtt_done: if (err) goto err_mtt; - err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, region->virt_base, - region->length, convert_access(acc), mr); + err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, virt, length, + convert_access(acc), mr); if (err) goto err_mtt; @@ -1081,6 +1092,9 @@ mtt_done: err_mtt: mthca_free_mtt(dev, mr->mtt); +err_umem: + ib_umem_release(mr->umem); + err: kfree(mr); return ERR_PTR(err); @@ -1089,8 +1103,12 @@ err: static int mthca_dereg_mr(struct ib_mr *mr) { struct mthca_mr *mmr = to_mmr(mr); + mthca_free_mr(to_mdev(mr->device), mmr); + if (mmr->umem) + ib_umem_release(mmr->umem); kfree(mmr); + return 0; } diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 1d266ac..262616c 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -73,6 +73,7 @@ struct mthca_mtt; struct mthca_mr { struct ib_mr ibmr; + struct ib_umem *umem; struct mthca_mtt *mtt; }; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h new file mode 100644 index 0000000..06307f7 --- /dev/null +++ b/include/rdma/ib_umem.h @@ -0,0 +1,78 @@ +/* + * Copyright (c) 2007 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_UMEM_H +#define IB_UMEM_H + +#include +#include + +struct ib_ucontext; + +struct ib_umem { + struct ib_ucontext *context; + size_t length; + int offset; + int page_size; + int writable; + struct list_head chunk_list; +}; + +struct ib_umem_chunk { + struct list_head list; + int nents; + int nmap; + struct scatterlist page_list[0]; +}; + +#ifdef CONFIG_INFINIBAND_USER_MEM + +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, + size_t size, int access); +void ib_umem_release(struct ib_umem *umem); +int ib_umem_page_count(struct ib_umem *umem); + +#else /* CONFIG_INFINIBAND_USER_MEM */ + +#include + +static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context, + unsigned long addr, size_t size, + int access) { + return ERR_PTR(-EINVAL); +} +static inline void ib_umem_release(struct ib_umem *umem) { } +static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; } + +#endif /* CONFIG_INFINIBAND_USER_MEM */ + +#endif /* IB_UMEM_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..b910baa 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -5,7 +5,7 @@ * Copyright (c) 2004 Topspin Corporation. All rights reserved. 
* Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -708,6 +708,7 @@ struct ib_ucontext { struct list_head qp_list; struct list_head srq_list; struct list_head ah_list; + int closing; }; struct ib_uobject { @@ -721,23 +722,6 @@ struct ib_uobject { int live; }; -struct ib_umem { - unsigned long user_base; - unsigned long virt_base; - size_t length; - int offset; - int page_size; - int writable; - struct list_head chunk_list; -}; - -struct ib_umem_chunk { - struct list_head list; - int nents; - int nmap; - struct scatterlist page_list[0]; -}; - struct ib_udata { void __user *inbuf; void __user *outbuf; @@ -750,11 +734,6 @@ struct ib_udata { ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) -struct ib_umem_object { - struct ib_uobject uobject; - struct ib_umem umem; -}; - struct ib_pd { struct ib_device *device; struct ib_uobject *uobject; @@ -998,7 +977,8 @@ struct ib_device { int mr_access_flags, u64 *iova_start); struct ib_mr * (*reg_user_mr)(struct ib_pd *pd, - struct ib_umem *region, + u64 start, u64 length, + u64 virt_addr, int mr_access_flags, struct ib_udata *udata); int (*query_mr)(struct ib_mr *mr, -- 1.5.1 From bs at q-leap.de Fri Apr 20 01:47:46 2007 From: bs at q-leap.de (Bernd Schubert) Date: Fri, 20 Apr 2007 10:47:46 +0200 Subject: [ofa-general] ipath irq bug Message-ID: <200704201047.47093.bs@q-leap.de> Hi, with rather many kernel debug options enabled I get this trace/message: [ 2651.218740] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on() [ 2651.224696] [ 2651.224697] Call Trace: [ 2651.228784] [] release_console_sem+0x47/0x1f6 [ 2651.235709] 
[] trace_hardirqs_on+0xfd/0x154 [ 2651.241648] [] _spin_unlock_irq+0x28/0x2d [ 2651.247482] [] :ib_ipath:ipath_rc_rcv+0xf5b/0xf8e [ 2651.254058] [] :ib_ipath:ipath_lookup_qpn+0x4f/0x5a [ 2651.260791] [] :ib_ipath:ipath_qp_rcv+0x45/0x4e [ 2651.267203] [] :ib_ipath:ipath_ib_rcv+0x16a/0x1a8 [ 2651.273784] [] :ib_ipath:ipath_kreceive+0x42f/0x6b9 [ 2651.280545] [] __lock_acquire+0xc08/0xc60 [ 2651.286397] [] :ib_ipath:ipath_ib_piobufavail+0x72/0x79 [ 2651.291177] LustreError: 3433:0:(events.c:129:client_bulk_callback()) event type 0, status -5, desc ffff8100ca2ea000 [ 2651.294334] LustreError: 3433:0:(events.c:129:client_bulk_callback()) event type 0, status -5, desc ffff810070340000 [ 2651.315171] [] _spin_unlock_irqrestore+0x38/0x47 [ 2651.321572] [] :ib_ipath:ipath_intr+0x26a/0x17b6 [ 2651.328082] [] __lock_acquire+0xc08/0xc60 [ 2651.333892] [] try_to_wake_up+0x413/0x425 [ 2651.339663] [] handle_edge_irq+0x139/0x142 [ 2651.345594] [] trace_hardirqs_on_thunk+0x35/0x37 [ 2651.351975] [] trace_hardirqs_on+0x10f/0x154 [ 2651.358010] [] __lock_acquire+0xc08/0xc60 [ 2651.363789] [] handle_edge_irq+0xed/0x142 [ 2651.369899] [] handle_IRQ_event+0x20/0x55 [ 2651.376086] [] handle_edge_irq+0xf8/0x142 [ 2651.382325] [] do_IRQ+0x94/0xf9 [ 2651.387591] [] default_idle+0x0/0x51 [ 2651.393109] [] ret_from_intr+0x0/0xf [ 2651.398582] [] default_idle+0x35/0x51 [ 2651.405120] [] default_idle+0x37/0x51 [ 2651.410972] [] default_idle+0x35/0x51 [ 2651.416824] [] cpu_idle+0x5b/0x7a [ 2651.422270] [] start_secondary+0x470/0x47f [ 2651.428267] Kernel is 2.6.20.4. Any ideas? 
Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH From vlad at lists.openfabrics.org Fri Apr 20 02:38:02 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 20 Apr 2007 02:38:02 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070420-0200 daily build status Message-ID: <20070420093803.48265E6080B@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 
Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed:
From mshefty at ichips.intel.com Fri Apr 20 09:33:54 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 20 Apr 2007 09:33:54 -0700 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <309a667c0704192344gec04bd0uf9eec6c6413ea34@mail.gmail.com> References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> <309a667c0704192344gec04bd0uf9eec6c6413ea34@mail.gmail.com> Message-ID: <4628EB72.3080904@ichips.intel.com> > Once SM is up on a node/switch whole network is up. Now is if some > client is trying to establish a connection with other node, client is > expected to resolve the path using sa API, I want to know how exactly > it happens in the stack? See patch 3/3 for the use of the cache. In that patch, the rdma_cm first checks to see if a suitable path record is available in the cache. If one is not found, it issues a query to the SA. The stack impact of using the cache is less than the impact of sending a path record query to the SA. > Is it possible to program local_sa_cache with some dummy path records > which resides in cache for long time? This would require changes to the current implementation. - Sean
From pradeep at us.ibm.com Fri Apr 20 13:35:47 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 20 Apr 2007 13:35:47 -0700
Subject: [ofa-general] How to set local_ca_ack_delay?
Message-ID:

This appears to be set to 0 by default on some HCAs. How can I change
that? Basically I am trying to change the "Local Ack Timeout" and the IB
spec 1.2 (page 553) says the "Local CA Ack Delay" is used to compute the
timeout. That is why I want to change the local_ca_ack_delay. Is my
understanding correct?

Pradeep
pradeep at us.ibm.com

From rdreier at cisco.com Fri Apr 20 14:27:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Apr 2007 14:27:53 -0700
Subject: [ofa-general] ipath irq bug
In-Reply-To: <200704201047.47093.bs@q-leap.de> (Bernd Schubert's message of "Fri, 20 Apr 2007 10:47:46 +0200")
References: <200704201047.47093.bs@q-leap.de>
Message-ID:

> [ 2651.241648] [] _spin_unlock_irq+0x28/0x2d
> [ 2651.247482] [] :ib_ipath:ipath_rc_rcv+0xf5b/0xf8e
> [ 2651.254058] [] :ib_ipath:ipath_lookup_qpn+0x4f/0x5a
> [ 2651.260791] [] :ib_ipath:ipath_qp_rcv+0x45/0x4e
> [ 2651.267203] [] :ib_ipath:ipath_ib_rcv+0x16a/0x1a8
> [ 2651.273784] [] :ib_ipath:ipath_kreceive+0x42f/0x6b9
> [ 2651.286397] [] :ib_ipath:ipath_ib_piobufavail+0x72/0x79
> [ 2651.321572] [] :ib_ipath:ipath_intr+0x26a/0x17b6

[edited slightly]

It looks like ipath_intr() (presumably the interrupt handler) ends up in
ipath_rc_rcv() and in particular in ipath_rc_rcv_error() (which is
inlined), which calls spin_unlock_irq() from interrupt context, which is
the problem.
Someone who knows the driver better than I would have to confirm this analysis, and decide whether the fix is just to switch to spin_lock_irqsave() in ipath_rc_rcv_error(), or if it needs to be more elaborate. - R. From rolandd at cisco.com Fri Apr 20 15:32:37 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:37 -0700 Subject: [ofa-general] [PATCH 6/6] [RFC]mlx4 build system stuff In-Reply-To: <20074201532.Rlzy7s6yc7iv5IXX@cisco.com> Message-ID: <20074201532.jhCN1hLvxaAengJm@cisco.com> Hook up mlx4_core and mlx4_ib drivers to Kconfig and Makefiles. Signed-off-by: Roland Dreier --- infiniband/Kconfig | 2 ++ infiniband/Makefile | 1 + infiniband/hw/mlx4/Kconfig | 9 +++++++++ infiniband/hw/mlx4/Makefile | 3 +++ net/Kconfig | 14 ++++++++++++++ net/Makefile | 1 + net/mlx4/Makefile | 4 ++++ 7 files changed, 34 insertions(+) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 82afba5..37deaae 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -45,6 +45,8 @@ source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" source "drivers/infiniband/hw/cxgb3/Kconfig" +source "drivers/infiniband/hw/mlx4/Kconfig" + source "drivers/infiniband/ulp/ipoib/Kconfig" source "drivers/infiniband/ulp/srp/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index da2066c..75f325e 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -4,6 +4,7 @@ obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ +obj-$(CONFIG_MLX4_INFINIBAND) += hw/mlx4/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/mlx4/Kconfig b/drivers/infiniband/hw/mlx4/Kconfig new file mode 100644 index 0000000..b8912cd --- /dev/null +++ 
b/drivers/infiniband/hw/mlx4/Kconfig @@ -0,0 +1,9 @@ +config MLX4_INFINIBAND + tristate "Mellanox ConnectX HCA support" + depends on INFINIBAND + select MLX4_CORE + ---help--- + This driver provides low-level InfiniBand support for + Mellanox ConnectX PCI Express host channel adapters (HCAs). + This is required to use InfiniBand protocols such as + IP-over-IB or SRP with these devices. diff --git a/drivers/infiniband/hw/mlx4/Makefile b/drivers/infiniband/hw/mlx4/Makefile new file mode 100644 index 0000000..70f09c7 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/Makefile @@ -0,0 +1,3 @@ +obj-$(CONFIG_MLX4_INFINIBAND) += mlx4_ib.o + +mlx4_ib-y := ah.o cq.o doorbell.o mad.o main.o mr.o qp.o srq.o diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index c3f9f59..842f020 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2493,6 +2493,20 @@ config PASEMI_MAC This driver supports the on-chip 1/10Gbit Ethernet controller on PA Semi's PWRficient line of chips. +config MLX4_CORE + tristate + depends on PCI + default n + +config MLX4_DEBUG + bool "Verbose debugging output" if (MLX4_CORE && EMBEDDED) + default y + ---help--- + This option causes debugging code to be compiled into the + mlx4_core driver. The output can be turned on via the + debug_level module parameter (which can also be set after + the driver is loaded through sysfs). 
+ endmenu source "drivers/net/tokenring/Kconfig" diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 33af833..1604e1a 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -197,6 +197,7 @@ obj-$(CONFIG_SMC911X) += smc911x.o obj-$(CONFIG_DM9000) += dm9000.o obj-$(CONFIG_FEC_8XX) += fec_8xx/ obj-$(CONFIG_PASEMI_MAC) += pasemi_mac.o +obj-$(CONFIG_MLX4_CORE) += mlx4/ obj-$(CONFIG_MACB) += macb.o diff --git a/drivers/net/mlx4/Makefile b/drivers/net/mlx4/Makefile new file mode 100644 index 0000000..4f18889 --- /dev/null +++ b/drivers/net/mlx4/Makefile @@ -0,0 +1,4 @@ +obj-$(CONFIG_MLX4_CORE) += mlx4_core.o + +mlx4_core-y := alloc.o cmd.o cq.o eq.o fw.o icm.o intf.o main.o mcg.o mr.o \ + pd.o profile.o qp.o reset.o srq.o From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 3/6] [RFC]mlx4_core public includes In-Reply-To: <20074201532.d6hTKIczPSz0SeTA@cisco.com> Message-ID: <20074201532.7wkxUOPwHNFg96ue@cisco.com> Include files for hardware/firmware information and interface of mlx4_core module for protocol-specific drivers (such as mlx4_ib). Signed-off-by: Roland Dreier --- cmd.h | 178 +++++++++++++++++++++++++++++++++ cq.h | 123 +++++++++++++++++++++++ device.h | 323 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ doorbell.h | 97 ++++++++++++++++++ driver.h | 59 +++++++++++ qp.h | 288 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ srq.h | 42 +++++++ 7 files changed, 1110 insertions(+) diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h new file mode 100644 index 0000000..4fb552d --- /dev/null +++ b/include/linux/mlx4/cmd.h @@ -0,0 +1,178 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_CMD_H +#define MLX4_CMD_H + +#include + +enum { + /* initialization and general commands */ + MLX4_CMD_SYS_EN = 0x1, + MLX4_CMD_SYS_DIS = 0x2, + MLX4_CMD_MAP_FA = 0xfff, + MLX4_CMD_UNMAP_FA = 0xffe, + MLX4_CMD_RUN_FW = 0xff6, + MLX4_CMD_MOD_STAT_CFG = 0x34, + MLX4_CMD_QUERY_DEV_CAP = 0x3, + MLX4_CMD_QUERY_FW = 0x4, + MLX4_CMD_ENABLE_LAM = 0xff8, + MLX4_CMD_DISABLE_LAM = 0xff7, + MLX4_CMD_QUERY_DDR = 0x5, + MLX4_CMD_QUERY_ADAPTER = 0x6, + MLX4_CMD_INIT_HCA = 0x7, + MLX4_CMD_CLOSE_HCA = 0x8, + MLX4_CMD_INIT_PORT = 0x9, + MLX4_CMD_CLOSE_PORT = 0xa, + MLX4_CMD_QUERY_HCA = 0xb, + MLX4_CMD_SET_PORT = 0xc, + MLX4_CMD_ACCESS_DDR = 0x2e, + MLX4_CMD_MAP_ICM = 0xffa, + MLX4_CMD_UNMAP_ICM = 0xff9, + MLX4_CMD_MAP_ICM_AUX = 0xffc, + MLX4_CMD_UNMAP_ICM_AUX = 0xffb, + MLX4_CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + MLX4_CMD_SW2HW_MPT = 0xd, + MLX4_CMD_QUERY_MPT = 0xe, + MLX4_CMD_HW2SW_MPT = 0xf, + MLX4_CMD_READ_MTT = 0x10, + MLX4_CMD_WRITE_MTT = 0x11, + MLX4_CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + MLX4_CMD_MAP_EQ = 0x12, + MLX4_CMD_SW2HW_EQ = 0x13, + MLX4_CMD_HW2SW_EQ = 0x14, + MLX4_CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + MLX4_CMD_SW2HW_CQ = 0x16, + MLX4_CMD_HW2SW_CQ = 0x17, + MLX4_CMD_QUERY_CQ = 0x18, + MLX4_CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + MLX4_CMD_SW2HW_SRQ = 0x35, + MLX4_CMD_HW2SW_SRQ = 0x36, + MLX4_CMD_QUERY_SRQ = 0x37, + MLX4_CMD_ARM_SRQ = 0x40, + + /* QP/EE commands */ + MLX4_CMD_RST2INIT_QP = 0x19, + MLX4_CMD_INIT2RTR_QP = 0x1a, + MLX4_CMD_RTR2RTS_QP = 0x1b, + MLX4_CMD_RTS2RTS_QP = 0x1c, + MLX4_CMD_SQERR2RTS_QP = 0x1d, + MLX4_CMD_2ERR_QP = 0x1e, + MLX4_CMD_RTS2SQD_QP = 0x1f, + MLX4_CMD_SQD2SQD_QP = 0x38, + MLX4_CMD_SQD2RTS_QP = 0x20, + MLX4_CMD_2RST_QP = 0x21, + MLX4_CMD_QUERY_QP = 0x22, + MLX4_CMD_INIT2INIT_QP = 0x2d, + MLX4_CMD_SUSPEND_QP = 0x32, + MLX4_CMD_UNSUSPEND_QP = 0x33, + /* special QP and management commands */ + MLX4_CMD_CONF_SPECIAL_QP = 0x23, + MLX4_CMD_MAD_IFC = 0x24, + + /* multicast commands */ + 
MLX4_CMD_READ_MCG = 0x25, + MLX4_CMD_WRITE_MCG = 0x26, + MLX4_CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + MLX4_CMD_DIAG_RPRT = 0x30, + MLX4_CMD_NOP = 0x31, + + /* debug commands */ + MLX4_CMD_QUERY_DEBUG_MSG = 0x2a, + MLX4_CMD_SET_DEBUG_MSG = 0x2b, +}; + +enum { + MLX4_CMD_TIME_CLASS_A = 10000, + MLX4_CMD_TIME_CLASS_B = 10000, + MLX4_CMD_TIME_CLASS_C = 10000, +}; + +enum { + MLX4_MAILBOX_SIZE = 4096 +}; + +struct mlx4_dev; + +struct mlx4_cmd_mailbox { + void *buf; + dma_addr_t dma; +}; + +int __mlx4_cmd(struct mlx4_dev *dev, u64 in_param, u64 *out_param, + int out_is_imm, u32 in_modifier, u8 op_modifier, + u16 op, unsigned long timeout); + +/* Invoke a command with no output parameter */ +static inline int mlx4_cmd(struct mlx4_dev *dev, u64 in_param, u32 in_modifier, + u8 op_modifier, u16 op, unsigned long timeout) +{ + return __mlx4_cmd(dev, in_param, NULL, 0, in_modifier, + op_modifier, op, timeout); +} + +/* Invoke a command with an output mailbox */ +static inline int mlx4_cmd_box(struct mlx4_dev *dev, u64 in_param, u64 out_param, + u32 in_modifier, u8 op_modifier, u16 op, + unsigned long timeout) +{ + return __mlx4_cmd(dev, in_param, &out_param, 0, in_modifier, + op_modifier, op, timeout); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
+ */ +static inline int mlx4_cmd_imm(struct mlx4_dev *dev, u64 in_param, u64 *out_param, + u32 in_modifier, u8 op_modifier, u16 op, + unsigned long timeout) +{ + return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier, + op_modifier, op, timeout); +} + +struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev); +void mlx4_free_cmd_mailbox(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox); + +#endif /* MLX4_CMD_H */ diff --git a/include/linux/mlx4/cq.h b/include/linux/mlx4/cq.h new file mode 100644 index 0000000..0181e0a --- /dev/null +++ b/include/linux/mlx4/cq.h @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_CQ_H +#define MLX4_CQ_H + +#include + +#include +#include + +struct mlx4_cqe { + __be32 my_qpn; + __be32 immed_rss_invalid; + __be32 g_mlpath_rqpn; + u8 sl; + u8 reserved1; + __be16 rlid; + u32 reserved2; + __be32 byte_cnt; + __be16 wqe_index; + __be16 checksum; + u8 reserved3[3]; + u8 owner_sr_opcode; +}; + +struct mlx4_err_cqe { + __be32 my_qpn; + u32 reserved1[5]; + __be16 wqe_index; + u8 vendor_err_syndrome; + u8 syndrome; + u8 reserved2[3]; + u8 owner_sr_opcode; +}; + +enum { + MLX4_CQE_OWNER_MASK = 0x80, + MLX4_CQE_IS_SEND_MASK = 0x40, + MLX4_CQE_OPCODE_MASK = 0x1f +}; + +enum { + MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR = 0x01, + MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR = 0x02, + MLX4_CQE_SYNDROME_LOCAL_PROT_ERR = 0x04, + MLX4_CQE_SYNDROME_WR_FLUSH_ERR = 0x05, + MLX4_CQE_SYNDROME_MW_BIND_ERR = 0x06, + MLX4_CQE_SYNDROME_BAD_RESP_ERR = 0x10, + MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR = 0x11, + MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR = 0x13, + MLX4_CQE_SYNDROME_REMOTE_OP_ERR = 0x14, + MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR = 0x15, + MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR = 0x22, +}; + +static inline void mlx4_cq_arm(struct mlx4_cq *cq, u32 cmd, + void __iomem *uar_page, + spinlock_t *doorbell_lock) +{ + __be32 doorbell[2]; + u32 sn; + u32 ci; + + sn = cq->arm_sn & 3; + ci = cq->cons_index & 0xffffff; + + *cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci); + + /* + * Make sure that the doorbell record in host memory is + * written before ringing the doorbell via PCI MMIO. 
+ */ + wmb(); + + doorbell[0] = cpu_to_be32(sn << 28 | cmd | cq->cqn); + doorbell[1] = cpu_to_be32(ci); + + mlx4_write64(doorbell, uar_page + MLX4_CQ_DOORBELL, doorbell_lock); +} + +static inline void mlx4_cq_set_ci(struct mlx4_cq *cq) +{ + *cq->set_ci_db = cpu_to_be32(cq->cons_index & 0xffffff); +} + +enum { + MLX4_CQ_DB_REQ_NOT_SOL = 1 << 24, + MLX4_CQ_DB_REQ_NOT = 2 << 24 +}; + +#endif /* MLX4_CQ_H */ diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h new file mode 100644 index 0000000..54c5509 --- /dev/null +++ b/include/linux/mlx4/device.h @@ -0,0 +1,323 @@ +/* + * Copyright (c) 2006, 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_DEVICE_H +#define MLX4_DEVICE_H + +#include +#include +#include + +#include + +enum { + MLX4_FLAG_MSI_X = 1 << 0, +}; + +enum { + MLX4_MAX_PORTS = 2 +}; + +enum { + MLX4_DEV_CAP_FLAG_RC = 1 << 0, + MLX4_DEV_CAP_FLAG_UC = 1 << 1, + MLX4_DEV_CAP_FLAG_UD = 1 << 2, + MLX4_DEV_CAP_FLAG_SRQ = 1 << 6, + MLX4_DEV_CAP_FLAG_IPOIB_CSUM = 1 << 7, + MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR = 1 << 8, + MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR = 1 << 9, + MLX4_DEV_CAP_FLAG_MEM_WINDOW = 1 << 16, + MLX4_DEV_CAP_FLAG_APM = 1 << 17, + MLX4_DEV_CAP_FLAG_ATOMIC = 1 << 18, + MLX4_DEV_CAP_FLAG_RAW_MCAST = 1 << 19, + MLX4_DEV_CAP_FLAG_UD_AV_PORT = 1 << 20, + MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21 +}; + +enum mlx4_event { + MLX4_EVENT_TYPE_COMP = 0x00, + MLX4_EVENT_TYPE_PATH_MIG = 0x01, + MLX4_EVENT_TYPE_COMM_EST = 0x02, + MLX4_EVENT_TYPE_SQ_DRAINED = 0x03, + MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE = 0x13, + MLX4_EVENT_TYPE_SRQ_LIMIT = 0x14, + MLX4_EVENT_TYPE_CQ_ERROR = 0x04, + MLX4_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MLX4_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MLX4_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MLX4_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MLX4_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MLX4_EVENT_TYPE_PORT_CHANGE = 0x09, + MLX4_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MLX4_EVENT_TYPE_ECC_DETECT = 0x0e, + MLX4_EVENT_TYPE_CMD = 0x0a +}; + +enum { + MLX4_PERM_LOCAL_READ = 1 << 10, + MLX4_PERM_LOCAL_WRITE = 1 << 11, + MLX4_PERM_REMOTE_READ = 1 << 12, + MLX4_PERM_REMOTE_WRITE = 1 << 13, + MLX4_PERM_ATOMIC = 1 << 14 +}; + +enum { + MLX4_OPCODE_NOP = 0x00, + MLX4_OPCODE_SEND_INVAL = 0x01, + MLX4_OPCODE_RDMA_WRITE = 0x08, + MLX4_OPCODE_RDMA_WRITE_IMM = 0x09, + MLX4_OPCODE_SEND = 0x0a, + MLX4_OPCODE_SEND_IMM = 0x0b, + MLX4_OPCODE_LSO = 0x0e, + MLX4_OPCODE_RDMA_READ = 0x10, + MLX4_OPCODE_ATOMIC_CS = 0x11, + MLX4_OPCODE_ATOMIC_FA = 0x12, + MLX4_OPCODE_ATOMIC_MASK_CS = 0x14, + MLX4_OPCODE_ATOMIC_MASK_FA = 0x15, + 
MLX4_OPCODE_BIND_MW = 0x18, + MLX4_OPCODE_FMR = 0x19, + MLX4_OPCODE_LOCAL_INVAL = 0x1b, + MLX4_OPCODE_CONFIG_CMD = 0x1f, + + MLX4_RECV_OPCODE_RDMA_WRITE_IMM = 0x00, + MLX4_RECV_OPCODE_SEND = 0x01, + MLX4_RECV_OPCODE_SEND_IMM = 0x02, + MLX4_RECV_OPCODE_SEND_INVAL = 0x03, + + MLX4_CQE_OPCODE_ERROR = 0x1e, + MLX4_CQE_OPCODE_RESIZE = 0x16, +}; + +enum { + MLX4_STAT_RATE_OFFSET = 5 +}; + +struct mlx4_caps { + u64 fw_ver; + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int num_uars; + int max_sq_sg; + int max_rq_sg; + int num_qps; + int max_wqes; + int max_sq_desc_sz; + int max_rq_desc_sz; + int max_qp_init_rdma; + int max_qp_dest_rdma; + int reserved_qps; + int sqp_start; + int num_srqs; + int max_srq_wqes; + int max_srq_sge; + int reserved_srqs; + int num_cqs; + int max_cqes; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int fmr_reserved_mtts; + int reserved_mtts; + int reserved_mrws; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; + int mtt_entry_sz; + u32 page_size_cap; + u32 flags; + u16 stat_rate_support; + u8 port_width_cap; +}; + +struct mlx4_buf_list { + void *buf; + dma_addr_t map; +}; + +struct mlx4_buf { + union { + struct mlx4_buf_list direct; + struct mlx4_buf_list *page_list; + } u; + int nbufs; + int npages; + int page_shift; +}; + +struct mlx4_mtt { + u32 first_seg; + int order; + int page_shift; +}; + +struct mlx4_mr { + struct mlx4_mtt mtt; + u64 iova; + u64 size; + u32 key; + u32 pd; + u32 access; + int enabled; +}; + +struct mlx4_uar { + unsigned long pfn; + int index; +}; + +struct mlx4_cq { + void (*comp) (struct mlx4_cq *); + void (*event) (struct mlx4_cq *, enum mlx4_event); + + struct mlx4_uar *uar; + + u32 cons_index; + + __be32 *set_ci_db; + __be32 *arm_db; + int arm_sn; + + int cqn; + + atomic_t refcount; + struct completion free; +}; + +struct mlx4_qp { + void 
(*event) (struct mlx4_qp *, enum mlx4_event); + + int qpn; + + atomic_t refcount; + struct completion free; +}; + +struct mlx4_srq { + void (*event) (struct mlx4_srq *, enum mlx4_event); + + int srqn; + int max; + int max_gs; + int wqe_shift; + + atomic_t refcount; + struct completion free; +}; + +struct mlx4_av { + __be32 port_pd; + u8 reserved1; + u8 g_slid; + __be16 dlid; + u8 reserved2; + u8 gid_index; + u8 stat_rate; + u8 hop_limit; + __be32 sl_tclass_flowlabel; + u8 dgid[16]; +}; + +struct mlx4_dev { + struct pci_dev *pdev; + unsigned long flags; + struct mlx4_caps caps; + struct radix_tree_root qp_table_tree; +}; + +struct mlx4_init_port_param { + int set_guid0; + int set_node_guid; + int set_si_guid; + u16 mtu; + int port_width_cap; + u16 vl_cap; + u16 max_gid; + u16 max_pkey; + u64 guid0; + u64 node_guid; + u64 si_guid; +}; + +int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, + struct mlx4_buf *buf); +void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf); + +int mlx4_pd_alloc(struct mlx4_dev *dev, u32 *pdn); +void mlx4_pd_free(struct mlx4_dev *dev, u32 pdn); + +int mlx4_uar_alloc(struct mlx4_dev *dev, struct mlx4_uar *uar); +void mlx4_uar_free(struct mlx4_dev *dev, struct mlx4_uar *uar); + +int mlx4_mtt_init(struct mlx4_dev *dev, int npages, int page_shift, + struct mlx4_mtt *mtt); +void mlx4_mtt_cleanup(struct mlx4_dev *dev, struct mlx4_mtt *mtt); +u64 mlx4_mtt_addr(struct mlx4_dev *dev, struct mlx4_mtt *mtt); + +int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access, + int npages, int page_shift, struct mlx4_mr *mr); +void mlx4_mr_free(struct mlx4_dev *dev, struct mlx4_mr *mr); +int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr); +int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + int start_index, int npages, u64 *page_list); +int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + struct mlx4_buf *buf); + +int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, 
struct mlx4_mtt *mtt, + struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq); +void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); + +int mlx4_qp_alloc(struct mlx4_dev *dev, int sqpn, struct mlx4_qp *qp); +void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp); + +int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, struct mlx4_mtt *mtt, + u64 db_rec, struct mlx4_srq *srq); +void mlx4_srq_free(struct mlx4_dev *dev, struct mlx4_srq *srq); +int mlx4_srq_arm(struct mlx4_dev *dev, struct mlx4_srq *srq, int limit_watermark); + +int mlx4_INIT_PORT(struct mlx4_dev *dev, struct mlx4_init_port_param *param, int port); +int mlx4_CLOSE_PORT(struct mlx4_dev *dev, int port); + +int mlx4_multicast_attach(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16]); +int mlx4_multicast_detach(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16]); + +#endif /* MLX4_DEVICE_H */ diff --git a/include/linux/mlx4/doorbell.h b/include/linux/mlx4/doorbell.h new file mode 100644 index 0000000..3f2da44 --- /dev/null +++ b/include/linux/mlx4/doorbell.h @@ -0,0 +1,97 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_DOORBELL_H +#define MLX4_DOORBELL_H + +#include +#include + +#define MLX4_SEND_DOORBELL 0x14 +#define MLX4_CQ_DOORBELL 0x20 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MLX4_DECLARE_DOORBELL_LOCK(name) +#define MLX4_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MLX4_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mlx4_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writeq((__force u64) val, dest); +} + +static inline void mlx4_write64(__be32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#else + +/* + * Just fall back to a spinlock to protect the doorbell if + * BITS_PER_LONG is 32 -- there's no portable way to do atomic 64-bit + * MMIO writes. 
+ */ + +#define MLX4_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MLX4_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MLX4_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mlx4_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writel(((__force u32 *) &val)[0], dest); + __raw_writel(((__force u32 *) &val)[1], dest + 4); +} + +static inline void mlx4_write64(__be32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel((__force u32) val[0], dest); + __raw_writel((__force u32) val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif + +#endif /* MLX4_DOORBELL_H */ diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h new file mode 100644 index 0000000..61925fc --- /dev/null +++ b/include/linux/mlx4/driver.h @@ -0,0 +1,59 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_DRIVER_H +#define MLX4_DRIVER_H + +#include + +struct mlx4_dev; + +enum mlx4_dev_event { + MLX4_DEV_EVENT_CATASTROPHIC_ERROR, + MLX4_DEV_EVENT_PORT_UP, + MLX4_DEV_EVENT_PORT_DOWN, + MLX4_DEV_EVENT_PORT_REINIT, +}; + +struct mlx4_interface { + void * (*add) (struct mlx4_dev *dev); + void (*remove)(struct mlx4_dev *dev, void *context); + void (*event) (struct mlx4_dev *dev, + enum mlx4_dev_event event, + int port); + struct list_head list; +}; + +int mlx4_register_interface(struct mlx4_interface *intf); +void mlx4_unregister_interface(struct mlx4_interface *intf); + +#endif /* MLX4_DRIVER_H */ diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h new file mode 100644 index 0000000..9eeb61a --- /dev/null +++ b/include/linux/mlx4/qp.h @@ -0,0 +1,288 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_QP_H +#define MLX4_QP_H + +#include + +#include + +#define MLX4_INVALID_LKEY 0x100 + +enum mlx4_qp_optpar { + MLX4_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MLX4_QP_OPTPAR_RRE = 1 << 1, + MLX4_QP_OPTPAR_RAE = 1 << 2, + MLX4_QP_OPTPAR_RWE = 1 << 3, + MLX4_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MLX4_QP_OPTPAR_Q_KEY = 1 << 5, + MLX4_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MLX4_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MLX4_QP_OPTPAR_SRA_MAX = 1 << 8, + MLX4_QP_OPTPAR_RRA_MAX = 1 << 9, + MLX4_QP_OPTPAR_PM_STATE = 1 << 10, + MLX4_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MLX4_QP_OPTPAR_RNR_RETRY = 1 << 13, + MLX4_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MLX4_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum mlx4_qp_state { + MLX4_QP_STATE_RST = 0, + MLX4_QP_STATE_INIT = 1, + MLX4_QP_STATE_RTR = 2, + MLX4_QP_STATE_RTS = 3, + MLX4_QP_STATE_SQER = 4, + MLX4_QP_STATE_SQD = 5, + MLX4_QP_STATE_ERR = 6, + MLX4_QP_STATE_SQ_DRAINING = 7, + MLX4_QP_NUM_STATE +}; + +enum { + MLX4_QP_ST_RC = 0x0, + MLX4_QP_ST_UC = 0x1, + MLX4_QP_ST_RD = 0x2, + MLX4_QP_ST_UD = 0x3, + MLX4_QP_ST_MLX = 0x7 +}; + +enum { + MLX4_QP_PM_MIGRATED = 0x3, + MLX4_QP_PM_ARMED = 0x0, + MLX4_QP_PM_REARM = 0x1 +}; + +enum { + /* params1 */ + MLX4_QP_BIT_SRE = 1 << 15, + MLX4_QP_BIT_SWE = 1 << 14, + MLX4_QP_BIT_SAE = 1 << 
13, + /* params2 */ + MLX4_QP_BIT_RRE = 1 << 15, + MLX4_QP_BIT_RWE = 1 << 14, + MLX4_QP_BIT_RAE = 1 << 13, + MLX4_QP_BIT_RIC = 1 << 4, +}; + +struct mlx4_qp_path { + u8 fl; + u8 reserved1[2]; + u8 pkey_index; + u8 reserved2; + u8 grh_mylmc; + __be16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + __be32 tclass_flowlabel; + u8 rgid[16]; + u8 sched_queue; + u8 snooper_flags; + u8 reserved3[2]; + u8 counter_index; + u8 reserved4[7]; +}; + +struct mlx4_qp_context { + __be32 flags; + __be32 pd; + u8 mtu_msgmax; + u8 rq_size_stride; + u8 sq_size_stride; + u8 rlkey; + __be32 usr_page; + __be32 local_qpn; + __be32 remote_qpn; + struct mlx4_qp_path pri_path; + struct mlx4_qp_path alt_path; + __be32 params1; + u32 reserved1; + __be32 next_send_psn; + __be32 cqn_send; + u32 reserved2[2]; + __be32 last_acked_psn; + __be32 ssn; + __be32 params2; + __be32 rnr_nextrecvpsn; + __be32 srcd; + __be32 cqn_recv; + __be64 db_rec_addr; + __be32 qkey; + __be32 srqn; + __be32 msn; + __be16 rq_wqe_counter; + __be16 sq_wqe_counter; + u32 reserved3[2]; + __be32 param3; + __be32 nummmcpeers_basemkey; + u8 log_page_size; + u8 reserved4[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + u32 reserved5[10]; +}; + +enum { + MLX4_WQE_CTRL_FENCE = 1 << 6, + MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, + MLX4_WQE_CTRL_SOLICITED = 1 << 1, +}; + +struct mlx4_wqe_ctrl_seg { + __be32 owner_opcode; + u8 reserved2[3]; + u8 fence_size; + /* + * High 24 bits are SRC remote buffer; low 8 bits are flags: + * [7] SO (strong ordering) + * [5] TCP/UDP checksum + * [4] IP checksum + * [3:2] C (generate completion queue entry) + * [1] SE (solicited event) + */ + __be32 srcrb_flags; + /* + * imm is immediate data for send/RDMA write w/ immediate; + * also invalidation key for send with invalidate; input + * modifier for WQEs on CCQs. 
+ */ + __be32 imm; +}; + +enum { + MLX4_WQE_MLX_VL15 = 1 << 17, + MLX4_WQE_MLX_SLR = 1 << 16 +}; + +struct mlx4_wqe_mlx_seg { + u8 owner; + u8 reserved1[2]; + u8 opcode; + u8 reserved2[3]; + u8 size; + /* + * [17] VL15 + * [16] SLR + * [15:12] static rate + * [11:8] SL + * [4] ICRC + * [3:2] C + * [0] FL (force loopback) + */ + __be32 flags; + __be16 rlid; + u16 reserved3; +}; + +struct mlx4_wqe_datagram_seg { + __be32 av[8]; + __be32 dqpn; + __be32 qkey; + __be32 reservd[2]; +}; + +struct mlx4_wqe_bind_seg { + __be32 flags1; + __be32 flags2; + __be32 new_rkey; + __be32 lkey; + __be64 addr; + __be64 length; +}; + +struct mlx4_wqe_fmr_seg { + __be32 flags; + __be32 mem_key; + __be64 buf_list; + __be64 start_addr; + __be64 reg_len; + __be32 offset; + __be32 page_size; + u32 reserved[2]; +}; + +struct mlx4_wqe_fmr_ext_seg { + u8 flags; + u8 reserved; + __be16 app_mask; + __be16 wire_app_tag; + __be16 mem_app_tag; + __be32 wire_ref_tag_base; + __be32 mem_ref_tag_base; +}; + +struct mlx4_wqe_local_inval_seg { + u8 flags; + u8 reserved1[3]; + __be32 mem_key; + u8 reserved2[3]; + u8 guest_id; + __be64 pa; +}; + +struct mlx4_wqe_raddr_seg { + __be64 raddr; + __be32 rkey; + u32 reserved; +}; + +struct mlx4_wqe_atomic_seg { + __be64 swap_add; + __be64 compare; +}; + +struct mlx4_wqe_data_seg { + __be32 byte_count; + __be32 lkey; + __be64 addr; +}; + +struct mlx4_wqe_inline_seg { + __be32 byte_count; +}; + +int mlx4_qp_modify(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + enum mlx4_qp_state cur_state, enum mlx4_qp_state new_state, + struct mlx4_qp_context *context, enum mlx4_qp_optpar optpar, + int sqd_event, struct mlx4_qp *qp); + +static inline struct mlx4_qp *__mlx4_qp_lookup(struct mlx4_dev *dev, u32 qpn) +{ + return radix_tree_lookup(&dev->qp_table_tree, qpn & (dev->caps.num_qps - 1)); +} + +void mlx4_qp_remove(struct mlx4_dev *dev, struct mlx4_qp *qp); + +#endif /* MLX4_QP_H */ diff --git a/include/linux/mlx4/srq.h b/include/linux/mlx4/srq.h new file mode 100644 index 
0000000..799a069 --- /dev/null +++ b/include/linux/mlx4/srq.h @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_SRQ_H +#define MLX4_SRQ_H + +struct mlx4_wqe_srq_next_seg { + u16 reserved1; + __be16 next_wqe_index; + u32 reserved2[3]; +}; + +#endif /* MLX4_SRQ_H */ From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 1/6] [RFC]mlx4_core main files In-Reply-To: <20074201532.4PiF0gUjC19I1fhy@cisco.com> Message-ID: <20074201532.IZZ23OZ6UzBQwQQb@cisco.com> PCI driver and firmware command handling code from mlx4_core. Signed-off-by: Roland Dreier --- cmd.c | 429 ++++++++++++++++++++++++++++ fw.c | 758 ++++++++++++++++++++++++++++++++++++++++++++++++++ fw.h | 165 ++++++++++ main.c | 939 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ mlx4.h | 334 ++++++++++++++++++++++ profile.c | 238 +++++++++++++++ reset.c | 172 +++++++++++ 7 files changed, 3035 insertions(+) diff --git a/drivers/net/mlx4/cmd.c b/drivers/net/mlx4/cmd.c new file mode 100644 index 0000000..41bf47c --- /dev/null +++ b/drivers/net/mlx4/cmd.c @@ -0,0 +1,429 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include + +#include + +#include "mlx4.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + /* command completed successfully: */ + CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocated resource: */ + CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + CMD_STAT_RESOURCE_BUSY = 0x06, + /* Required capability exceeds device limits: */ + CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + CMD_STAT_BAD_QP_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has
Memory Windows bound to: */ + CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCR_T_BIT = 21, + HCR_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + GO_BIT_TIMEOUT = 10000 +}; + +struct mlx4_cmd_context { + struct completion done; + int result; + int next; + u64 out_param; + u16 token; +}; + +static int mlx4_status_to_errno(u8 status) { + static const int trans_table[] = { + [CMD_STAT_INTERNAL_ERR] = -EIO, + [CMD_STAT_BAD_OP] = -EPERM, + [CMD_STAT_BAD_PARAM] = -EINVAL, + [CMD_STAT_BAD_SYS_STATE] = -ENXIO, + [CMD_STAT_BAD_RESOURCE] = -EBADF, + [CMD_STAT_RESOURCE_BUSY] = -EBUSY, + [CMD_STAT_EXCEED_LIM] = -ENOMEM, + [CMD_STAT_BAD_RES_STATE] = -EBADF, + [CMD_STAT_BAD_INDEX] = -EBADF, + [CMD_STAT_BAD_NVMEM] = -EFAULT, + [CMD_STAT_BAD_QP_STATE] = -EINVAL, + [CMD_STAT_BAD_SEG_PARAM] = -EFAULT, + [CMD_STAT_REG_BOUND] = -EBUSY, + [CMD_STAT_LAM_NOT_PRE] = -EAGAIN, + [CMD_STAT_BAD_PKT] = -EINVAL, + [CMD_STAT_BAD_SIZE] = -ENOMEM, + }; + + if (status >= ARRAY_SIZE(trans_table) || + (status != CMD_STAT_OK && trans_table[status] == 0)) + return -EIO; + + return trans_table[status]; +} + +static int cmd_pending(struct mlx4_dev *dev) +{ + u32 status = readl(mlx4_priv(dev)->cmd.hcr + HCR_STATUS_OFFSET); + + return (status & swab32(1 << HCR_GO_BIT)) || + (mlx4_priv(dev)->cmd.toggle == + !!(status & swab32(1 << HCR_T_BIT))); +} + +static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param, + u32 in_modifier, u8 op_modifier, u16 op, u16 token, + int event) +{ + struct mlx4_cmd *cmd = &mlx4_priv(dev)->cmd; + u32 __iomem *hcr = cmd->hcr; + int ret = -EAGAIN; + unsigned long 
end; + + mutex_lock(&cmd->hcr_mutex); + + end = jiffies; + if (event) + end += HZ * 10; + + while (cmd_pending(dev)) { + if (time_after_eq(jiffies, end)) + goto out; + cond_resched(); + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel((__force u32) cpu_to_be32(in_param >> 32), hcr + 0); + __raw_writel((__force u32) cpu_to_be32(in_param & 0xfffffffful), hcr + 1); + __raw_writel((__force u32) cpu_to_be32(in_modifier), hcr + 2); + __raw_writel((__force u32) cpu_to_be32(out_param >> 32), hcr + 3); + __raw_writel((__force u32) cpu_to_be32(out_param & 0xfffffffful), hcr + 4); + __raw_writel((__force u32) cpu_to_be32(token << 16), hcr + 5); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (cmd->toggle << HCR_T_BIT) | + (event ? (1 << HCR_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), hcr + 6); + cmd->toggle = cmd->toggle ^ 1; + + ret = 0; + +out: + mutex_unlock(&cmd->hcr_mutex); + return ret; +} + +static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param, + int out_is_imm, u32 in_modifier, u8 op_modifier, + u16 op, unsigned long timeout) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + void __iomem *hcr = priv->cmd.hcr; + int err = 0; + unsigned long end; + + down(&priv->cmd.poll_sem); + + err = mlx4_cmd_post(dev, in_param, out_param ? 
*out_param : 0, + in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = msecs_to_jiffies(timeout) + jiffies; + while (cmd_pending(dev) && time_before(jiffies, end)) + cond_resched(); + + if (cmd_pending(dev)) { + err = -ETIMEDOUT; + goto out; + } + + if (out_is_imm) + *out_param = + (u64) be32_to_cpu((__force __be32) + __raw_readl(hcr + HCR_OUT_PARAM_OFFSET)) << 32 | + (u64) be32_to_cpu((__force __be32) + __raw_readl(hcr + HCR_OUT_PARAM_OFFSET + 4)); + + err = mlx4_status_to_errno(be32_to_cpu((__force __be32) + __raw_readl(hcr + HCR_STATUS_OFFSET)) >> 24); + +out: + up(&priv->cmd.poll_sem); + return err; +} + +void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cmd_context *context = + &priv->cmd.context[token & priv->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = mlx4_status_to_errno(status); + context->out_param = out_param; + + context->token += priv->cmd.token_mask + 1; + + complete(&context->done); +} + +static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param, + int out_is_imm, u32 in_modifier, u8 op_modifier, + u16 op, unsigned long timeout) +{ + struct mlx4_cmd *cmd = &mlx4_priv(dev)->cmd; + struct mlx4_cmd_context *context; + int err = 0; + + down(&cmd->event_sem); + + spin_lock(&cmd->context_lock); + BUG_ON(cmd->free_head < 0); + context = &cmd->context[cmd->free_head]; + cmd->free_head = context->next; + spin_unlock(&cmd->context_lock); + + init_completion(&context->done); + + mlx4_cmd_post(dev, in_param, out_param ? 
*out_param : 0, + in_modifier, op_modifier, op, context->token, 1); + + if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) { + err = -EBUSY; + goto out; + } + + err = context->result; + if (err) + goto out; + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&cmd->context_lock); + context->next = cmd->free_head; + cmd->free_head = context - cmd->context; + spin_unlock(&cmd->context_lock); + + up(&cmd->event_sem); + return err; +} + +int __mlx4_cmd(struct mlx4_dev *dev, u64 in_param, u64 *out_param, + int out_is_imm, u32 in_modifier, u8 op_modifier, + u16 op, unsigned long timeout) +{ + if (mlx4_priv(dev)->cmd.use_events) + return mlx4_cmd_wait(dev, in_param, out_param, out_is_imm, + in_modifier, op_modifier, op, timeout); + else + return mlx4_cmd_poll(dev, in_param, out_param, out_is_imm, + in_modifier, op_modifier, op, timeout); +} +EXPORT_SYMBOL_GPL(__mlx4_cmd); + +int mlx4_cmd_init(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + mutex_init(&priv->cmd.hcr_mutex); + sema_init(&priv->cmd.poll_sem, 1); + priv->cmd.use_events = 0; + priv->cmd.toggle = 1; + + priv->cmd.hcr = ioremap(pci_resource_start(dev->pdev, 0) + MLX4_HCR_BASE, + MLX4_HCR_SIZE); + if (!priv->cmd.hcr) { + mlx4_err(dev, "Couldn't map command register."); + return -ENOMEM; + } + + priv->cmd.pool = pci_pool_create("mlx4_cmd", dev->pdev, + MLX4_MAILBOX_SIZE, + MLX4_MAILBOX_SIZE, 0); + if (!priv->cmd.pool) { + iounmap(priv->cmd.hcr); + return -ENOMEM; + } + + return 0; +} + +void mlx4_cmd_cleanup(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + pci_pool_destroy(priv->cmd.pool); + iounmap(priv->cmd.hcr); +} + +/* + * Switch to using events to issue FW commands (can only be called + * after event queue for command events has been initialized). 
+ */ +int mlx4_cmd_use_events(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int i; + + priv->cmd.context = kmalloc(priv->cmd.max_cmds * + sizeof (struct mlx4_cmd_context), + GFP_KERNEL); + if (!priv->cmd.context) + return -ENOMEM; + + for (i = 0; i < priv->cmd.max_cmds; ++i) { + priv->cmd.context[i].token = i; + priv->cmd.context[i].next = i + 1; + } + + priv->cmd.context[priv->cmd.max_cmds - 1].next = -1; + priv->cmd.free_head = 0; + + sema_init(&priv->cmd.event_sem, priv->cmd.max_cmds); + spin_lock_init(&priv->cmd.context_lock); + + for (priv->cmd.token_mask = 1; + priv->cmd.token_mask < priv->cmd.max_cmds; + priv->cmd.token_mask <<= 1) + ; /* nothing */ + --priv->cmd.token_mask; + + priv->cmd.use_events = 1; + + down(&priv->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mlx4_cmd_use_polling(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int i; + + priv->cmd.use_events = 0; + + for (i = 0; i < priv->cmd.max_cmds; ++i) + down(&priv->cmd.event_sem); + + kfree(priv->cmd.context); + + up(&priv->cmd.poll_sem); +} + +struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev) +{ + struct mlx4_cmd_mailbox *mailbox; + + mailbox = kmalloc(sizeof *mailbox, GFP_KERNEL); + if (!mailbox) + return ERR_PTR(-ENOMEM); + + mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL, + &mailbox->dma); + if (!mailbox->buf) { + kfree(mailbox); + return ERR_PTR(-ENOMEM); + } + + return mailbox; +} +EXPORT_SYMBOL_GPL(mlx4_alloc_cmd_mailbox); + +void mlx4_free_cmd_mailbox(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox) +{ + if (!mailbox) + return; + + pci_pool_free(mlx4_priv(dev)->cmd.pool, mailbox->buf, mailbox->dma); + kfree(mailbox); +} +EXPORT_SYMBOL_GPL(mlx4_free_cmd_mailbox); diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c new file mode 100644 index 0000000..0066eb7 --- /dev/null +++ b/drivers/net/mlx4/fw.c @@ -0,0 +1,758 @@ +/* + * 
Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include "fw.h" +#include "icm.h" + +extern void __buggy_use_of_MLX4_GET(void); +extern void __buggy_use_of_MLX4_PUT(void); + +#define MLX4_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MLX4_GET(); \ + } \ + } while (0) + +#define MLX4_PUT(dest, source, offset) \ + do { \ + void *__d = ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *(u8 *) __d = (source); break; \ + case 2: *(__be16 *) __d = cpu_to_be16(source); break; \ + case 4: *(__be32 *) __d = cpu_to_be32(source); break; \ + case 8: *(__be64 *) __d = cpu_to_be64(source); break; \ + default: __buggy_use_of_MLX4_PUT(); \ + } \ + } while (0) + +static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) +{ + static const char *fname[] = { + [ 0] = "RC transport", + [ 1] = "UC transport", + [ 2] = "UD transport", + [ 3] = "SRC transport", + [ 4] = "reliable multicast", + [ 5] = "FCoIB support", + [ 6] = "SRQ support", + [ 7] = "IPoIB checksum offload", + [ 8] = "P_Key violation counter", + [ 9] = "Q_Key violation counter", + [10] = "VMM", + [16] = "MW support", + [17] = "APM support", + [18] = "Atomic ops support", + [19] = "Raw multicast support", + [20] = "Address vector port checking support", + [21] = "UD multicast support", + [24] = "Demand paging support", + [25] = "Router support" + }; + int i; + + mlx4_dbg(dev, "DEV_CAP flags:\n"); + for (i = 0; i < 32; ++i) + if (fname[i] && (flags & (1 << i))) + mlx4_dbg(dev, " %s\n", fname[i]); +} + +int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *outbox; + u8 field; + u16 size; + u16 stat_rate; + int err; + +#define QUERY_DEV_CAP_OUT_SIZE 0x100 +#define 
QUERY_DEV_CAP_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_CAP_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_CAP_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_CAP_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_CAP_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_CAP_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_CAP_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_CAP_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_CAP_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_CAP_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_CAP_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_CAP_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_CAP_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_CAP_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_CAP_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_CAP_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_CAP_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_CAP_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_CAP_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_CAP_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_CAP_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_CAP_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_CAP_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_CAP_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_CAP_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_CAP_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_CAP_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_CAP_RATE_SUPPORT_OFFSET 0x3c +#define QUERY_DEV_CAP_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_CAP_FLAGS_OFFSET 0x44 +#define QUERY_DEV_CAP_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_CAP_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_CAP_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_CAP_MAX_SG_SQ_OFFSET 0x51 +#define QUERY_DEV_CAP_MAX_DESC_SZ_SQ_OFFSET 0x52 +#define QUERY_DEV_CAP_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_CAP_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_CAP_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_CAP_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_CAP_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_CAP_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_CAP_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_CAP_RDMARC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_CAP_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_CAP_AUX_ENTRY_SZ_OFFSET 0x84 
+#define QUERY_DEV_CAP_ALTC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_CAP_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_CAP_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_CAP_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_CAP_C_MPT_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_CAP_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_CAP_D_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_CAP_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_CAP_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_CAP_MAX_ICM_SZ_OFFSET 0xa0 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + outbox = mailbox->buf; + + err = mlx4_cmd_box(dev, 0, mailbox->dma, 0, 0, MLX4_CMD_QUERY_DEV_CAP, + MLX4_CMD_TIME_CLASS_A); + + if (err) + goto out; + + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_QP_OFFSET); + dev_cap->reserved_qps = 1 << (field & 0xf); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_QP_OFFSET); + dev_cap->max_qps = 1 << (field & 0x1f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_SRQ_OFFSET); + dev_cap->reserved_srqs = 1 << (field >> 4); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SRQ_OFFSET); + dev_cap->max_srqs = 1 << (field & 0x1f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_CQ_SZ_OFFSET); + dev_cap->max_cq_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_CQ_OFFSET); + dev_cap->reserved_cqs = 1 << (field & 0xf); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_CQ_OFFSET); + dev_cap->max_cqs = 1 << (field & 0x1f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MPT_OFFSET); + dev_cap->max_mpts = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_EQ_OFFSET); + dev_cap->reserved_eqs = 1 << (field & 0xf); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_EQ_OFFSET); + dev_cap->max_eqs = 1 << (field & 0x7); + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MTT_OFFSET); + dev_cap->reserved_mtts = 1 << (field >> 4); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MRW_SZ_OFFSET); + dev_cap->max_mrw_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MRW_OFFSET); + 
dev_cap->reserved_mrws = 1 << (field & 0xf); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MTT_SEG_OFFSET); + dev_cap->max_mtt_seg = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_REQ_QP_OFFSET); + dev_cap->max_requester_per_qp = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_RES_QP_OFFSET); + dev_cap->max_responder_per_qp = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_RDMA_OFFSET); + dev_cap->max_rdma_global = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_ACK_DELAY_OFFSET); + dev_cap->local_ca_ack_delay = field & 0x1f; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MTU_WIDTH_OFFSET); + dev_cap->max_mtu = field >> 4; + dev_cap->max_port_width = field & 0xf; + MLX4_GET(field, outbox, QUERY_DEV_CAP_VL_PORT_OFFSET); + dev_cap->max_vl = field >> 4; + dev_cap->num_ports = field & 0xf; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_GID_OFFSET); + dev_cap->max_gids = 1 << (field & 0xf); + MLX4_GET(stat_rate, outbox, QUERY_DEV_CAP_RATE_SUPPORT_OFFSET); + dev_cap->stat_rate_support = stat_rate; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_PKEY_OFFSET); + dev_cap->max_pkeys = 1 << (field & 0xf); + MLX4_GET(dev_cap->flags, outbox, QUERY_DEV_CAP_FLAGS_OFFSET); + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_UAR_OFFSET); + dev_cap->reserved_uars = field >> 4; + MLX4_GET(field, outbox, QUERY_DEV_CAP_UAR_SZ_OFFSET); + dev_cap->uar_size = 1 << ((field & 0x3f) + 20); + MLX4_GET(field, outbox, QUERY_DEV_CAP_PAGE_SZ_OFFSET); + dev_cap->min_page_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SG_SQ_OFFSET); + dev_cap->max_sq_sg = field; + + MLX4_GET(size, outbox, QUERY_DEV_CAP_MAX_DESC_SZ_SQ_OFFSET); + dev_cap->max_sq_desc_sz = size; + + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_QP_MCG_OFFSET); + dev_cap->max_qp_per_mcg = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MCG_OFFSET); + dev_cap->reserved_mgms = field & 0xf; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MCG_OFFSET); + dev_cap->max_mcgs = 1 << 
field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_PD_OFFSET); + dev_cap->reserved_pds = field >> 4; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_PD_OFFSET); + dev_cap->max_pds = 1 << (field & 0x3f); + + MLX4_GET(size, outbox, QUERY_DEV_CAP_RDMARC_ENTRY_SZ_OFFSET); + dev_cap->rdmarc_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_QPC_ENTRY_SZ_OFFSET); + dev_cap->qpc_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_AUX_ENTRY_SZ_OFFSET); + dev_cap->aux_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_ALTC_ENTRY_SZ_OFFSET); + dev_cap->altc_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_EQC_ENTRY_SZ_OFFSET); + dev_cap->eqc_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_CQC_ENTRY_SZ_OFFSET); + dev_cap->cqc_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_SRQ_ENTRY_SZ_OFFSET); + dev_cap->srq_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_C_MPT_ENTRY_SZ_OFFSET); + dev_cap->cmpt_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_MTT_ENTRY_SZ_OFFSET); + dev_cap->mtt_entry_sz = size; + MLX4_GET(size, outbox, QUERY_DEV_CAP_D_MPT_ENTRY_SZ_OFFSET); + dev_cap->dmpt_entry_sz = size; + + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SRQ_SZ_OFFSET); + dev_cap->max_srq_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_QP_SZ_OFFSET); + dev_cap->max_qp_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSZ_SRQ_OFFSET); + dev_cap->resize_srq = field & 1; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SG_RQ_OFFSET); + dev_cap->max_rq_sg = field; + MLX4_GET(size, outbox, QUERY_DEV_CAP_MAX_DESC_SZ_RQ_OFFSET); + dev_cap->max_rq_desc_sz = size; + + MLX4_GET(dev_cap->bmme_flags, outbox, + QUERY_DEV_CAP_BMME_FLAGS_OFFSET); + MLX4_GET(dev_cap->reserved_lkey, outbox, + QUERY_DEV_CAP_RSVD_LKEY_OFFSET); + MLX4_GET(dev_cap->max_icm_sz, outbox, + QUERY_DEV_CAP_MAX_ICM_SZ_OFFSET); + + if (dev_cap->bmme_flags & 1) + mlx4_dbg(dev, "Base MM extensions: yes " + "(flags %d, rsvd L_Key %08x)\n", + dev_cap->bmme_flags, 
dev_cap->reserved_lkey); + else + mlx4_dbg(dev, "Base MM extensions: no\n"); + + /* + * Each UAR has 4 EQ doorbells; so if a UAR is reserved, then + * we can't use any EQs whose doorbell falls on that page, + * even if the EQ itself isn't reserved. + */ + dev_cap->reserved_eqs = max(dev_cap->reserved_uars * 4, + dev_cap->reserved_eqs); + + mlx4_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_cap->max_icm_sz >> 20); + mlx4_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_cap->max_qps, dev_cap->reserved_qps, dev_cap->qpc_entry_sz); + mlx4_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", + dev_cap->max_srqs, dev_cap->reserved_srqs, dev_cap->srq_entry_sz); + mlx4_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_cap->max_cqs, dev_cap->reserved_cqs, dev_cap->cqc_entry_sz); + mlx4_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_cap->max_eqs, dev_cap->reserved_eqs, dev_cap->eqc_entry_sz); + mlx4_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_cap->reserved_mrws, dev_cap->reserved_mtts); + mlx4_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_cap->max_pds, dev_cap->reserved_pds, dev_cap->reserved_uars); + mlx4_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_cap->max_qp_per_mcg, dev_cap->reserved_mgms); + mlx4_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", + dev_cap->max_cq_sz, dev_cap->max_qp_sz, dev_cap->max_srq_sz); + mlx4_dbg(dev, "Local CA ACK delay: %d, max MTU: %d, port width cap: %d\n", + dev_cap->local_ca_ack_delay, 128 << dev_cap->max_mtu, + dev_cap->max_port_width); + mlx4_dbg(dev, "Max SQ desc size: %d, max SQ S/G: %d\n", + dev_cap->max_sq_desc_sz, dev_cap->max_sq_sg); + mlx4_dbg(dev, "Max RQ desc size: %d, max RQ S/G: %d\n", + dev_cap->max_rq_desc_sz, dev_cap->max_rq_sg); + + dump_dev_cap_flags(dev, dev_cap->flags); + +out: + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + +int mlx4_map_cmd(struct mlx4_dev *dev, u16 op, struct mlx4_icm
*icm, u64 virt) +{ + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_icm_iter iter; + __be64 *pages; + int lg; + int nent = 0; + int i; + int err = 0; + int ts = 0, tc = 0; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + memset(mailbox->buf, 0, MLX4_MAILBOX_SIZE); + pages = mailbox->buf; + + for (mlx4_icm_first(icm, &iter); + !mlx4_icm_last(&iter); + mlx4_icm_next(&iter)) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. + */ + lg = ffs(mlx4_icm_addr(&iter) | mlx4_icm_size(&iter)) - 1; + if (lg < MLX4_ICM_PAGE_SHIFT) { + mlx4_warn(dev, "Got FW area not aligned to %d (%llx/%lx).\n", + MLX4_ICM_PAGE_SIZE, + (unsigned long long) mlx4_icm_addr(&iter), + mlx4_icm_size(&iter)); + err = -EINVAL; + goto out; + } + + for (i = 0; i < mlx4_icm_size(&iter) >> lg; ++i) { + if (virt != -1) { + pages[nent * 2] = cpu_to_be64(virt); + virt += 1 << lg; + } + + pages[nent * 2 + 1] = + cpu_to_be64((mlx4_icm_addr(&iter) + (i << lg)) | + (lg - MLX4_ICM_PAGE_SHIFT)); + ts += 1 << (lg - 10); + ++tc; + + if (++nent == MLX4_MAILBOX_SIZE / 16) { + err = mlx4_cmd(dev, mailbox->dma, nent, 0, op, + MLX4_CMD_TIME_CLASS_B); + if (err) + goto out; + nent = 0; + } + } + } + + if (nent) + err = mlx4_cmd(dev, mailbox->dma, nent, 0, op, MLX4_CMD_TIME_CLASS_B); + if (err) + goto out; + + switch (op) { + case MLX4_CMD_MAP_FA: + mlx4_dbg(dev, "Mapped %d chunks/%d KB for FW.\n", tc, ts); + break; + case MLX4_CMD_MAP_ICM_AUX: + mlx4_dbg(dev, "Mapped %d chunks/%d KB for ICM aux.\n", tc, ts); + break; + case MLX4_CMD_MAP_ICM: + mlx4_dbg(dev, "Mapped %d chunks/%d KB at %llx for ICM.\n", + tc, ts, (unsigned long long) virt - (ts << 10)); + break; + } + +out: + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + +int mlx4_MAP_FA(struct mlx4_dev *dev, struct mlx4_icm *icm) +{ + return mlx4_map_cmd(dev, MLX4_CMD_MAP_FA, icm, -1); +} + +int 
mlx4_UNMAP_FA(struct mlx4_dev *dev) +{ + return mlx4_cmd(dev, 0, 0, 0, MLX4_CMD_UNMAP_FA, MLX4_CMD_TIME_CLASS_B); +} + + +int mlx4_RUN_FW(struct mlx4_dev *dev) +{ + return mlx4_cmd(dev, 0, 0, 0, MLX4_CMD_RUN_FW, MLX4_CMD_TIME_CLASS_A); +} + +int mlx4_QUERY_FW(struct mlx4_dev *dev) +{ + struct mlx4_fw *fw = &mlx4_priv(dev)->fw; + struct mlx4_cmd *cmd = &mlx4_priv(dev)->cmd; + struct mlx4_cmd_mailbox *mailbox; + u32 *outbox; + int err = 0; + u64 fw_ver; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 +#define QUERY_FW_ERR_BAR_OFFSET 0x3c + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_CLR_INT_BAR_OFFSET 0x28 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + outbox = mailbox->buf; + + err = mlx4_cmd_box(dev, 0, mailbox->dma, 0, 0, MLX4_CMD_QUERY_FW, + MLX4_CMD_TIME_CLASS_A); + if (err) + goto out; + + MLX4_GET(fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more significant bits than minor + * version, so swap here. 
+ */ + dev->caps.fw_ver = (fw_ver & 0xffff00000000ull) | + ((fw_ver & 0xffff0000ull) >> 16) | + ((fw_ver & 0x0000ffffull) << 16); + + MLX4_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + cmd->max_cmds = 1 << lg; + + mlx4_dbg(dev, "FW version %d.%d.%03d, max commands %d\n", + (int) (dev->caps.fw_ver >> 32), + (int) (dev->caps.fw_ver >> 16) & 0xffff, + (int) dev->caps.fw_ver & 0xffff, + cmd->max_cmds); + + MLX4_GET(fw->catas_addr, outbox, QUERY_FW_ERR_START_OFFSET); + MLX4_GET(fw->catas_size, outbox, QUERY_FW_ERR_SIZE_OFFSET); + MLX4_GET(fw->catas_bar, outbox, QUERY_FW_ERR_BAR_OFFSET); + fw->catas_bar = (fw->catas_bar >> 6) * 2; + + mlx4_dbg(dev, "Catastrophic error buffer at 0x%llx, size 0x%x, BAR %d\n", + (unsigned long long) fw->catas_addr, fw->catas_size, fw->catas_bar); + + MLX4_GET(fw->fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MLX4_GET(fw->clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MLX4_GET(fw->clr_int_bar, outbox, QUERY_FW_CLR_INT_BAR_OFFSET); + fw->clr_int_bar = (fw->clr_int_bar >> 6) * 2; + + mlx4_dbg(dev, "FW size %d KB\n", fw->fw_pages >> 2); + + /* + * Round up number of system pages needed in case + * MLX4_ICM_PAGE_SIZE < PAGE_SIZE. 
+ */ + fw->fw_pages = + ALIGN(fw->fw_pages, PAGE_SIZE / MLX4_ICM_PAGE_SIZE) >> + (PAGE_SHIFT - MLX4_ICM_PAGE_SHIFT); + + mlx4_dbg(dev, "Clear int @ %llx, BAR %d\n", + (unsigned long long) fw->clr_int_base, fw->clr_int_bar); + +out: + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + +static void get_board_id(void *vsd, char *board_id) +{ + int i; + +#define VSD_OFFSET_SIG1 0x00 +#define VSD_OFFSET_SIG2 0xde +#define VSD_OFFSET_MLX_BOARD_ID 0xd0 +#define VSD_OFFSET_TS_BOARD_ID 0x20 + +#define VSD_SIGNATURE_TOPSPIN 0x5ad + + memset(board_id, 0, MLX4_BOARD_ID_LEN); + + if (be16_to_cpup(vsd + VSD_OFFSET_SIG1) == VSD_SIGNATURE_TOPSPIN && + be16_to_cpup(vsd + VSD_OFFSET_SIG2) == VSD_SIGNATURE_TOPSPIN) { + strlcpy(board_id, vsd + VSD_OFFSET_TS_BOARD_ID, MLX4_BOARD_ID_LEN); + } else { + /* + * The board ID is a string but the firmware byte + * swaps each 4-byte word before passing it back to + * us. Therefore we need to swab it before printing. + */ + for (i = 0; i < 4; ++i) + ((u32 *) board_id)[i] = + swab32(*(u32 *) (vsd + VSD_OFFSET_MLX_BOARD_ID + i * 4)); + } +} + +int mlx4_QUERY_ADAPTER(struct mlx4_dev *dev, struct mlx4_adapter *adapter) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *outbox; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 +#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 +#define QUERY_ADAPTER_VSD_OFFSET 0x20 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + outbox = mailbox->buf; + + err = mlx4_cmd_box(dev, 0, mailbox->dma, 0, 0, MLX4_CMD_QUERY_ADAPTER, + MLX4_CMD_TIME_CLASS_A); + if (err) + goto out; + + MLX4_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MLX4_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MLX4_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MLX4_GET(adapter->inta_pin, outbox, 
QUERY_ADAPTER_INTA_PIN_OFFSET); + + get_board_id(outbox + QUERY_ADAPTER_VSD_OFFSET / 4, + adapter->board_id); + +out: + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + +int mlx4_INIT_HCA(struct mlx4_dev *dev, struct mlx4_init_hca_param *param) +{ + struct mlx4_cmd_mailbox *mailbox; + __be32 *inbox; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_VERSION_OFFSET 0x000 +#define INIT_HCA_VERSION 2 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_ALTC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_AUXC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDMARC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_LOG_RD_OFFSET (INIT_HCA_QPC_OFFSET + 0x77) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_LOG_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_DMPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_CMPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x18) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_LOG_UAR_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0a) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET 
(INIT_HCA_UAR_OFFSET + 0x0b) + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + + *((u8 *) mailbox->buf + INIT_HCA_VERSION_OFFSET) = INIT_HCA_VERSION; + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* QPC/EEC/CQC/EQC/RDMARC attributes */ + + MLX4_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + MLX4_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MLX4_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MLX4_PUT(inbox, param->altc_base, INIT_HCA_ALTC_BASE_OFFSET); + MLX4_PUT(inbox, param->auxc_base, INIT_HCA_AUXC_BASE_OFFSET); + MLX4_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MLX4_PUT(inbox, param->rdmarc_base, INIT_HCA_RDMARC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_rd_per_qp, INIT_HCA_LOG_RD_OFFSET); + + /* multicast attributes */ + + MLX4_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MLX4_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MLX4_PUT(inbox, param->log_mc_hash_sz, INIT_HCA_LOG_MC_HASH_SZ_OFFSET); + MLX4_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MLX4_PUT(inbox, param->dmpt_base, INIT_HCA_DMPT_BASE_OFFSET); + MLX4_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MLX4_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + MLX4_PUT(inbox, param->cmpt_base, 
INIT_HCA_CMPT_BASE_OFFSET); + + /* UAR attributes */ + + MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); + MLX4_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 1000); + + if (err) + mlx4_err(dev, "INIT_HCA returns %d\n", err); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + +int mlx4_INIT_PORT(struct mlx4_dev *dev, struct mlx4_init_port_param *param, int port) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err; + u32 flags; + +#define INIT_PORT_IN_SIZE 256 +#define INIT_PORT_FLAGS_OFFSET 0x00 +#define INIT_PORT_FLAG_SIG (1 << 18) +#define INIT_PORT_FLAG_NG (1 << 17) +#define INIT_PORT_FLAG_G0 (1 << 16) +#define INIT_PORT_VL_SHIFT 4 +#define INIT_PORT_PORT_WIDTH_SHIFT 8 +#define INIT_PORT_MTU_OFFSET 0x04 +#define INIT_PORT_MAX_GID_OFFSET 0x06 +#define INIT_PORT_MAX_PKEY_OFFSET 0x0a +#define INIT_PORT_GUID0_OFFSET 0x10 +#define INIT_PORT_NODE_GUID_OFFSET 0x18 +#define INIT_PORT_SI_GUID_OFFSET 0x20 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, INIT_PORT_IN_SIZE); + + flags = 0; + flags |= param->set_guid0 ? INIT_PORT_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_PORT_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_PORT_FLAG_SIG : 0; + flags |= (param->vl_cap & 0xf) << INIT_PORT_VL_SHIFT; + flags |= (param->port_width_cap & 0xf) << INIT_PORT_PORT_WIDTH_SHIFT; + MLX4_PUT(inbox, flags, INIT_PORT_FLAGS_OFFSET); + + MLX4_PUT(inbox, param->mtu, INIT_PORT_MTU_OFFSET); + MLX4_PUT(inbox, param->max_gid, INIT_PORT_MAX_GID_OFFSET); + MLX4_PUT(inbox, param->max_pkey, INIT_PORT_MAX_PKEY_OFFSET); + MLX4_PUT(inbox, param->guid0, INIT_PORT_GUID0_OFFSET); + MLX4_PUT(inbox, param->node_guid, INIT_PORT_NODE_GUID_OFFSET); + MLX4_PUT(inbox, param->si_guid, INIT_PORT_SI_GUID_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_INIT_PORT, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + + return err; +} +EXPORT_SYMBOL_GPL(mlx4_INIT_PORT); + +int mlx4_CLOSE_PORT(struct mlx4_dev *dev, int port) +{ + return mlx4_cmd(dev, 0, port, 0, MLX4_CMD_CLOSE_PORT, 1000); +} +EXPORT_SYMBOL_GPL(mlx4_CLOSE_PORT); + +int mlx4_CLOSE_HCA(struct mlx4_dev *dev, int panic) +{ + return mlx4_cmd(dev, 0, 0, panic, MLX4_CMD_CLOSE_HCA, 1000); +} + +int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages) +{ + int ret = mlx4_cmd_imm(dev, icm_size, aux_pages, 0, 0, + MLX4_CMD_SET_ICM_SIZE, + MLX4_CMD_TIME_CLASS_A); + if (ret) + return ret; + + /* + * Round up number of system pages needed in case + * MLX4_ICM_PAGE_SIZE < PAGE_SIZE. + */ + *aux_pages = ALIGN(*aux_pages, PAGE_SIZE / MLX4_ICM_PAGE_SIZE) >> + (PAGE_SHIFT - MLX4_ICM_PAGE_SHIFT); + + return 0; +} + +int mlx4_NOP(struct mlx4_dev *dev) +{ + /* Input modifier of 0x1f means "finish as soon as possible." */ + return mlx4_cmd(dev, 0, 0x1f, 0, MLX4_CMD_NOP, 100); +} diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h new file mode 100644 index 0000000..63cdd4e --- /dev/null +++ b/drivers/net/mlx4/fw.h @@ -0,0 +1,165 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_FW_H +#define MLX4_FW_H + +#include "mlx4.h" +#include "icm.h" + +struct mlx4_dev_cap { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + u16 stat_rate_support; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sq_sg; + int max_sq_desc_sz; + int max_rq_sg; + int max_rq_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int qpc_entry_sz; + int rdmarc_entry_sz; + int altc_entry_sz; + int aux_entry_sz; + int srq_entry_sz; + int cqc_entry_sz; + int eqc_entry_sz; + int dmpt_entry_sz; + int cmpt_entry_sz; + int mtt_entry_sz; + int resize_srq; + u8 bmme_flags; + u32 reserved_lkey; + u64 max_icm_sz; +}; + +struct mlx4_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + char board_id[MLX4_BOARD_ID_LEN]; + u8 inta_pin; +}; + +struct mlx4_init_hca_param { + u64 qpc_base; + u64 rdmarc_base; + u64 auxc_base; + u64 altc_base; + u64 srqc_base; + u64 cqc_base; + u64 eqc_base; + u64 mc_base; + u64 dmpt_base; + u64 cmpt_base; + u64 mtt_base; + u16 log_mc_entry_sz; + u16 log_mc_hash_sz; + u8 log_num_qps; + u8 log_num_srqs; + u8 log_num_cqs; + u8 log_num_eqs; + u8 log_rd_per_qp; + u8 log_mc_table_sz; + u8 log_mpt_sz; + u8 log_uar_sz; +}; + +struct mlx4_init_ib_param { + int port_width; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +struct mlx4_set_ib_param { + int set_si_guid; + int reset_qkey_viol; + u64 
si_guid; + u32 cap_mask; +}; + +int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap); +int mlx4_MAP_FA(struct mlx4_dev *dev, struct mlx4_icm *icm); +int mlx4_UNMAP_FA(struct mlx4_dev *dev); +int mlx4_RUN_FW(struct mlx4_dev *dev); +int mlx4_QUERY_FW(struct mlx4_dev *dev); +int mlx4_QUERY_ADAPTER(struct mlx4_dev *dev, struct mlx4_adapter *adapter); +int mlx4_INIT_HCA(struct mlx4_dev *dev, struct mlx4_init_hca_param *param); +int mlx4_CLOSE_HCA(struct mlx4_dev *dev, int panic); +int mlx4_map_cmd(struct mlx4_dev *dev, u16 op, struct mlx4_icm *icm, u64 virt); +int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); +int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); +int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); +int mlx4_NOP(struct mlx4_dev *dev); + +#endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c new file mode 100644 index 0000000..a63cb8b --- /dev/null +++ b/drivers/net/mlx4/main.c @@ -0,0 +1,939 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include + +#include "mlx4.h" +#include "fw.h" +#include "icm.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox ConnectX HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_MLX4_DEBUG + +int mlx4_debug_level = 0; +module_param_named(debug_level, mlx4_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); + +#endif /* CONFIG_MLX4_DEBUG */ + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mlx4_version[] __devinitdata = + DRV_NAME ": Mellanox ConnectX core driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static struct mlx4_profile default_profile = { + .num_qp = 1 << 16, + .num_srq = 1 << 16, + .rdmarc_per_qp = 4, + .num_cq = 1 << 16, + .num_mcg = 1 << 13, + .num_mpt = 1 << 17, + .num_mtt = 1 << 20, +}; + +static int __devinit mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) +{ + int err; + + err = mlx4_QUERY_DEV_CAP(dev, dev_cap); + if (err) { + mlx4_err(dev, "QUERY_DEV_CAP 
command failed, aborting.\n"); + return err; + } + + if (dev_cap->min_page_sz > PAGE_SIZE) { + mlx4_err(dev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_cap->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_cap->num_ports > MLX4_MAX_PORTS) { + mlx4_err(dev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_cap->num_ports, MLX4_MAX_PORTS); + return -ENODEV; + } + + if (dev_cap->uar_size > pci_resource_len(dev->pdev, 2)) { + mlx4_err(dev, "HCA reported UAR size of 0x%x bigger than " + "PCI resource 2 size of 0x%llx, aborting.\n", + dev_cap->uar_size, + (unsigned long long) pci_resource_len(dev->pdev, 2)); + return -ENODEV; + } + + dev->caps.num_ports = dev_cap->num_ports; + dev->caps.num_uars = dev_cap->uar_size / PAGE_SIZE; + dev->caps.vl_cap = dev_cap->max_vl; + dev->caps.mtu_cap = dev_cap->max_mtu; + dev->caps.gid_table_len = dev_cap->max_gids; + dev->caps.pkey_table_len = dev_cap->max_pkeys; + dev->caps.local_ca_ack_delay = dev_cap->local_ca_ack_delay; + dev->caps.max_sq_sg = dev_cap->max_sq_sg; + dev->caps.max_rq_sg = dev_cap->max_rq_sg; + dev->caps.max_wqes = dev_cap->max_qp_sz; + dev->caps.max_qp_init_rdma = dev_cap->max_requester_per_qp; + dev->caps.reserved_qps = dev_cap->reserved_qps; + dev->caps.max_srq_wqes = dev_cap->max_srq_sz; + dev->caps.max_srq_sge = dev_cap->max_rq_sg - 1; + dev->caps.reserved_srqs = dev_cap->reserved_srqs; + dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz; + dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz; + /* + * Subtract 1 from the limit because we need to allocate a + * spare CQE so the HCA HW can tell the difference between an + * empty CQ and a full CQ. 
+ */ + dev->caps.max_cqes = dev_cap->max_cq_sz - 1; + dev->caps.reserved_cqs = dev_cap->reserved_cqs; + dev->caps.reserved_eqs = dev_cap->reserved_eqs; + dev->caps.reserved_mtts = dev_cap->reserved_mtts; + dev->caps.reserved_mrws = dev_cap->reserved_mrws; + dev->caps.reserved_uars = dev_cap->reserved_uars; + dev->caps.reserved_pds = dev_cap->reserved_pds; + dev->caps.port_width_cap = dev_cap->max_port_width; + dev->caps.mtt_entry_sz = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz; + dev->caps.page_size_cap = ~(u32) (dev_cap->min_page_sz - 1); + dev->caps.flags = dev_cap->flags; + dev->caps.stat_rate_support = dev_cap->stat_rate_support; + + return 0; +} + +static int __devinit mlx4_load_fw(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int err; + + priv->fw.fw_icm = mlx4_alloc_icm(dev, priv->fw.fw_pages, + GFP_HIGHUSER | __GFP_NOWARN); + if (!priv->fw.fw_icm) { + mlx4_err(dev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + err = mlx4_MAP_FA(dev, priv->fw.fw_icm); + if (err) { + mlx4_err(dev, "MAP_FA command failed, aborting.\n"); + goto err_free; + } + + err = mlx4_RUN_FW(dev); + if (err) { + mlx4_err(dev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mlx4_UNMAP_FA(dev); + +err_free: + mlx4_free_icm(dev, priv->fw.fw_icm); + return err; +} + +static int __devinit mlx4_init_cmpt_table(struct mlx4_dev *dev, u64 cmpt_base, + int cmpt_entry_sz) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int err; + + err = mlx4_init_icm_table(dev, &priv->qp_table.cmpt_table, + cmpt_base + + ((u64) (MLX4_CMPT_TYPE_QP * + cmpt_entry_sz) << MLX4_CMPT_SHIFT), + cmpt_entry_sz, dev->caps.num_qps, + dev->caps.reserved_qps, 0); + if (err) + goto err; + + err = mlx4_init_icm_table(dev, &priv->srq_table.cmpt_table, + cmpt_base + + ((u64) (MLX4_CMPT_TYPE_SRQ * + cmpt_entry_sz) << MLX4_CMPT_SHIFT), + cmpt_entry_sz, dev->caps.num_srqs, + dev->caps.reserved_srqs, 0); + if (err) + goto err_qp; + + err 
= mlx4_init_icm_table(dev, &priv->cq_table.cmpt_table, + cmpt_base + + ((u64) (MLX4_CMPT_TYPE_CQ * + cmpt_entry_sz) << MLX4_CMPT_SHIFT), + cmpt_entry_sz, dev->caps.num_cqs, + dev->caps.reserved_cqs, 0); + if (err) + goto err_srq; + + err = mlx4_init_icm_table(dev, &priv->eq_table.cmpt_table, + cmpt_base + + ((u64) (MLX4_CMPT_TYPE_EQ * + cmpt_entry_sz) << MLX4_CMPT_SHIFT), + cmpt_entry_sz, + roundup_pow_of_two(MLX4_NUM_EQ + + dev->caps.reserved_eqs), + MLX4_NUM_EQ + dev->caps.reserved_eqs, 0); + if (err) + goto err_cq; + + return 0; + +err_cq: + mlx4_cleanup_icm_table(dev, &priv->cq_table.cmpt_table); + +err_srq: + mlx4_cleanup_icm_table(dev, &priv->srq_table.cmpt_table); + +err_qp: + mlx4_cleanup_icm_table(dev, &priv->qp_table.cmpt_table); + +err: + return err; +} + +static int __devinit mlx4_init_icm(struct mlx4_dev *dev, + struct mlx4_dev_cap *dev_cap, + struct mlx4_init_hca_param *init_hca, + u64 icm_size) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + u64 aux_pages; + int err; + + err = mlx4_SET_ICM_SIZE(dev, icm_size, &aux_pages); + if (err) { + mlx4_err(dev, "SET_ICM_SIZE command failed, aborting.\n"); + return err; + } + + mlx4_dbg(dev, "%lld KB of HCA context requires %lld KB aux memory.\n", + (unsigned long long) icm_size >> 10, + (unsigned long long) aux_pages << 2); + + priv->fw.aux_icm = mlx4_alloc_icm(dev, aux_pages, + GFP_HIGHUSER | __GFP_NOWARN); + if (!priv->fw.aux_icm) { + mlx4_err(dev, "Couldn't allocate aux memory, aborting.\n"); + return -ENOMEM; + } + + err = mlx4_MAP_ICM_AUX(dev, priv->fw.aux_icm); + if (err) { + mlx4_err(dev, "MAP_ICM_AUX command failed, aborting.\n"); + goto err_free_aux; + } + + err = mlx4_init_cmpt_table(dev, init_hca->cmpt_base, dev_cap->cmpt_entry_sz); + if (err) { + mlx4_err(dev, "Failed to map cMPT context memory, aborting.\n"); + goto err_unmap_aux; + } + + err = mlx4_map_eq_icm(dev, init_hca->eqc_base); + if (err) { + mlx4_err(dev, "Failed to map EQ context memory, aborting.\n"); + goto err_unmap_cmpt; + } + + err = 
mlx4_init_icm_table(dev, &priv->mr_table.mtt_table, + init_hca->mtt_base, + dev->caps.mtt_entry_sz, + dev->caps.num_mtt_segs, + dev->caps.reserved_mtts, 1); + if (err) { + mlx4_err(dev, "Failed to map MTT context memory, aborting.\n"); + goto err_unmap_eq; + } + + err = mlx4_init_icm_table(dev, &priv->mr_table.dmpt_table, + init_hca->dmpt_base, + dev_cap->dmpt_entry_sz, + dev->caps.num_mpts, + dev->caps.reserved_mrws, 1); + if (err) { + mlx4_err(dev, "Failed to map dMPT context memory, aborting.\n"); + goto err_unmap_mtt; + } + + err = mlx4_init_icm_table(dev, &priv->qp_table.qp_table, + init_hca->qpc_base, + dev_cap->qpc_entry_sz, + dev->caps.num_qps, + dev->caps.reserved_qps, 0); + if (err) { + mlx4_err(dev, "Failed to map QP context memory, aborting.\n"); + goto err_unmap_dmpt; + } + + err = mlx4_init_icm_table(dev, &priv->qp_table.auxc_table, + init_hca->auxc_base, + dev_cap->aux_entry_sz, + dev->caps.num_qps, + dev->caps.reserved_qps, 0); + if (err) { + mlx4_err(dev, "Failed to map AUXC context memory, aborting.\n"); + goto err_unmap_qp; + } + + err = mlx4_init_icm_table(dev, &priv->qp_table.altc_table, + init_hca->altc_base, + dev_cap->altc_entry_sz, + dev->caps.num_qps, + dev->caps.reserved_qps, 0); + if (err) { + mlx4_err(dev, "Failed to map ALTC context memory, aborting.\n"); + goto err_unmap_auxc; + } + + err = mlx4_init_icm_table(dev, &priv->qp_table.rdmarc_table, + init_hca->rdmarc_base, + dev_cap->rdmarc_entry_sz << priv->qp_table.rdmarc_shift, + dev->caps.num_qps, + dev->caps.reserved_qps, 0); + if (err) { + mlx4_err(dev, "Failed to map RDMARC context memory, aborting\n"); + goto err_unmap_altc; + } + + err = mlx4_init_icm_table(dev, &priv->cq_table.table, + init_hca->cqc_base, + dev_cap->cqc_entry_sz, + dev->caps.num_cqs, + dev->caps.reserved_cqs, 0); + if (err) { + mlx4_err(dev, "Failed to map CQ context memory, aborting.\n"); + goto err_unmap_rdmarc; + } + + err = mlx4_init_icm_table(dev, &priv->srq_table.table, + init_hca->srqc_base, + 
dev_cap->srq_entry_sz, + dev->caps.num_srqs, + dev->caps.reserved_srqs, 0); + if (err) { + mlx4_err(dev, "Failed to map SRQ context memory, aborting.\n"); + goto err_unmap_cq; + } + + /* + * It's not strictly required, but for simplicity just map the + * whole multicast group table now. The table isn't very big + * and it's a lot easier than trying to track ref counts. + */ + err = mlx4_init_icm_table(dev, &priv->mcg_table.table, + init_hca->mc_base, MLX4_MGM_ENTRY_SIZE, + dev->caps.num_mgms + dev->caps.num_amgms, + dev->caps.num_mgms + dev->caps.num_amgms, + 0); + if (err) { + mlx4_err(dev, "Failed to map MCG context memory, aborting.\n"); + goto err_unmap_srq; + } + + return 0; + +err_unmap_srq: + mlx4_cleanup_icm_table(dev, &priv->srq_table.table); + +err_unmap_cq: + mlx4_cleanup_icm_table(dev, &priv->cq_table.table); + +err_unmap_rdmarc: + mlx4_cleanup_icm_table(dev, &priv->qp_table.rdmarc_table); + +err_unmap_altc: + mlx4_cleanup_icm_table(dev, &priv->qp_table.altc_table); + +err_unmap_auxc: + mlx4_cleanup_icm_table(dev, &priv->qp_table.auxc_table); + +err_unmap_qp: + mlx4_cleanup_icm_table(dev, &priv->qp_table.qp_table); + +err_unmap_dmpt: + mlx4_cleanup_icm_table(dev, &priv->mr_table.dmpt_table); + +err_unmap_mtt: + mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table); + +err_unmap_eq: + mlx4_unmap_eq_icm(dev); + +err_unmap_cmpt: + mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->cq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->srq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->qp_table.cmpt_table); + +err_unmap_aux: + mlx4_UNMAP_ICM_AUX(dev); + +err_free_aux: + mlx4_free_icm(dev, priv->fw.aux_icm); + + return err; +} + +static void mlx4_free_icms(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + mlx4_cleanup_icm_table(dev, &priv->mcg_table.table); + mlx4_cleanup_icm_table(dev, &priv->srq_table.table); + mlx4_cleanup_icm_table(dev, &priv->cq_table.table); + 
mlx4_cleanup_icm_table(dev, &priv->qp_table.rdmarc_table); + mlx4_cleanup_icm_table(dev, &priv->qp_table.altc_table); + mlx4_cleanup_icm_table(dev, &priv->qp_table.auxc_table); + mlx4_cleanup_icm_table(dev, &priv->qp_table.qp_table); + mlx4_cleanup_icm_table(dev, &priv->mr_table.dmpt_table); + mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table); + mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->cq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->srq_table.cmpt_table); + mlx4_cleanup_icm_table(dev, &priv->qp_table.cmpt_table); + mlx4_unmap_eq_icm(dev); + + mlx4_UNMAP_ICM_AUX(dev); + mlx4_free_icm(dev, priv->fw.aux_icm); +} + +static void mlx4_close_hca(struct mlx4_dev *dev) +{ + mlx4_CLOSE_HCA(dev, 0); + mlx4_free_icms(dev); + mlx4_UNMAP_FA(dev); + mlx4_free_icm(dev, mlx4_priv(dev)->fw.fw_icm); +} + +static int __devinit mlx4_init_hca(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_adapter adapter; + struct mlx4_dev_cap dev_cap; + struct mlx4_profile profile; + struct mlx4_init_hca_param init_hca; + u64 icm_size; + int err; + + err = mlx4_QUERY_FW(dev); + if (err) { + mlx4_err(dev, "QUERY_FW command failed, aborting.\n"); + return err; + } + + err = mlx4_load_fw(dev); + if (err) { + mlx4_err(dev, "Failed to start FW, aborting.\n"); + return err; + } + + err = mlx4_dev_cap(dev, &dev_cap); + if (err) { + mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); + goto err_stop_fw; + } + + profile = default_profile; + + icm_size = mlx4_make_profile(dev, &profile, &dev_cap, &init_hca); + if ((long long) icm_size < 0) { + err = icm_size; + goto err_stop_fw; + } + + init_hca.log_uar_sz = ilog2(dev->caps.num_uars); + + err = mlx4_init_icm(dev, &dev_cap, &init_hca, icm_size); + if (err) + goto err_stop_fw; + + err = mlx4_INIT_HCA(dev, &init_hca); + if (err) { + mlx4_err(dev, "INIT_HCA command failed, aborting.\n"); + goto err_free_icm; + } + + err = mlx4_QUERY_ADAPTER(dev, &adapter); 
+ if (err) { + mlx4_err(dev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_close; + } + + priv->eq_table.inta_pin = adapter.inta_pin; + priv->rev_id = adapter.revision_id; + memcpy(priv->board_id, adapter.board_id, sizeof priv->board_id); + + return 0; + +err_close: + mlx4_CLOSE_HCA(dev, 0); + +err_free_icm: + mlx4_free_icms(dev); + +err_stop_fw: + mlx4_UNMAP_FA(dev); + mlx4_free_icm(dev, priv->fw.fw_icm); + + return err; +} + +static int __devinit mlx4_setup_hca(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int err; + + MLX4_INIT_DOORBELL_LOCK(&priv->doorbell_lock); + + err = mlx4_init_uar_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "user access region table, aborting.\n"); + return err; + } + + err = mlx4_uar_alloc(dev, &priv->driver_uar); + if (err) { + mlx4_err(dev, "Failed to allocate driver access region, " + "aborting.\n"); + goto err_uar_table_free; + } + + priv->kar = ioremap(priv->driver_uar.pfn << PAGE_SHIFT, PAGE_SIZE); + if (!priv->kar) { + mlx4_err(dev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_uar_free; + } + + err = mlx4_init_pd_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + goto err_kar_unmap; + } + + err = mlx4_init_mr_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_pd_table_free; + } + + err = mlx4_pd_alloc(dev, &priv->driver_pd); + if (err) { + mlx4_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_mr_table_free; + } + + err = mlx4_init_eq_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_pd_free; + } + + err = mlx4_cmd_use_events(dev); + if (err) { + mlx4_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_eq_table_free; + } + + err = mlx4_NOP(dev); + if (err) { + mlx4_err(dev, "NOP command failed to 
generate interrupt (IRQ %d), aborting.\n", + priv->eq_table.eq[MLX4_EQ_ASYNC].irq); + if (dev->flags & MLX4_FLAG_MSI_X) + mlx4_err(dev, "Try again with MSI-X disabled.\n"); + else + mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n"); + + goto err_cmd_poll; + } + + mlx4_dbg(dev, "NOP command IRQ test passed\n"); + + err = mlx4_init_cq_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_cmd_poll; + } + + err = mlx4_init_srq_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "shared receive queue table, aborting.\n"); + goto err_cq_table_free; + } + + err = mlx4_init_qp_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_srq_table_free; + } + + err = mlx4_init_mcg_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_qp_table_free; + } + + return 0; + +err_qp_table_free: + mlx4_cleanup_qp_table(dev); + +err_srq_table_free: + mlx4_cleanup_srq_table(dev); + +err_cq_table_free: + mlx4_cleanup_cq_table(dev); + +err_cmd_poll: + mlx4_cmd_use_polling(dev); + +err_eq_table_free: + mlx4_cleanup_eq_table(dev); + +err_pd_free: + mlx4_pd_free(dev, priv->driver_pd); + +err_mr_table_free: + mlx4_cleanup_mr_table(dev); + +err_pd_table_free: + mlx4_cleanup_pd_table(dev); + +err_kar_unmap: + iounmap(priv->kar); + +err_uar_free: + mlx4_uar_free(dev, &priv->driver_uar); + +err_uar_table_free: + mlx4_cleanup_uar_table(dev); + return err; +} + +static void __devinit mlx4_enable_msi_x(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct msix_entry entries[MLX4_NUM_EQ]; + int err; + int i; + + if (msi_x) { + for (i = 0; i < MLX4_NUM_EQ; ++i) + entries[i].entry = i; + + err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mlx4_info(dev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + goto no_msi; + } 
+ + for (i = 0; i < MLX4_NUM_EQ; ++i) + priv->eq_table.eq[i].irq = entries[i].vector; + + dev->flags |= MLX4_FLAG_MSI_X; + return; + } + +no_msi: + for (i = 0; i < MLX4_NUM_EQ; ++i) + priv->eq_table.eq[i].irq = dev->pdev->irq; +} + +static int __devinit mlx4_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mlx4_version_printed = 0; + struct mlx4_priv *priv; + struct mlx4_dev *dev; + int err; + + if (!mlx4_version_printed) { + printk(KERN_INFO "%s", mlx4_version); + ++mlx4_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s\n", + pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting.\n"); + err = -ENODEV; + goto err_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM)) { + dev_err(&pdev->dev, "Missing UAR, aborting.\n"); + err = -ENODEV; + goto err_disable_pdev; + } + + err = pci_request_region(pdev, 0, DRV_NAME); + if (err) { + dev_err(&pdev->dev, "Cannot request control region, aborting.\n"); + goto err_disable_pdev; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) { + dev_err(&pdev->dev, "Cannot request UAR region, aborting.\n"); + goto err_release_bar0; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_release_bar2; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = 
pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_release_bar2; + } + } + + priv = kzalloc(sizeof *priv, GFP_KERNEL); + if (!priv) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_release_bar2; + } + + dev = &priv->dev; + dev->pdev = pdev; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. + */ + err = mlx4_reset(dev); + if (err) { + mlx4_err(dev, "Failed to reset HCA, aborting.\n"); + goto err_free_dev; + } + + mlx4_enable_msi_x(dev); + + err = mlx4_cmd_init(dev); + if (err) { + mlx4_err(dev, "Failed to init command interface, aborting.\n"); + goto err_free_dev; + } + + err = mlx4_init_hca(dev); + if (err) + goto err_cmd; + + err = mlx4_setup_hca(dev); + if (err) + goto err_close; + + err = mlx4_register_device(priv); + if (err) + goto err_cleanup; + + pci_set_drvdata(pdev, dev); + + return 0; + +err_cleanup: + mlx4_cleanup_mcg_table(dev); + mlx4_cleanup_qp_table(dev); + mlx4_cleanup_srq_table(dev); + mlx4_cleanup_cq_table(dev); + mlx4_cmd_use_polling(dev); + mlx4_cleanup_eq_table(dev); + + mlx4_pd_free(dev, priv->driver_pd); + + mlx4_cleanup_mr_table(dev); + mlx4_cleanup_pd_table(dev); + + iounmap(priv->kar); + mlx4_uar_free(dev, &priv->driver_uar); + mlx4_cleanup_uar_table(dev); + +err_close: + mlx4_close_hca(dev); + +err_cmd: + mlx4_cmd_cleanup(dev); + +err_free_dev: + if (dev->flags & MLX4_FLAG_MSI_X) + pci_disable_msix(pdev); + + kfree(priv); + +err_release_bar2: + pci_release_region(pdev, 2); + +err_release_bar0: + pci_release_region(pdev, 0); + +err_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mlx4_remove_one(struct pci_dev *pdev) +{ + struct mlx4_dev *dev = pci_get_drvdata(pdev); + struct mlx4_priv *priv = mlx4_priv(dev); + int p; + + if (dev) { + mlx4_unregister_device(priv); + + for (p = 1; p <= 
dev->caps.num_ports; ++p) + mlx4_CLOSE_PORT(dev, p); + + mlx4_cleanup_mcg_table(dev); + mlx4_cleanup_qp_table(dev); + mlx4_cleanup_srq_table(dev); + mlx4_cleanup_cq_table(dev); + mlx4_cmd_use_polling(dev); + mlx4_cleanup_eq_table(dev); + + mlx4_pd_free(dev, priv->driver_pd); + + mlx4_cleanup_mr_table(dev); + mlx4_cleanup_pd_table(dev); + + iounmap(priv->kar); + mlx4_uar_free(dev, &priv->driver_uar); + mlx4_cleanup_uar_table(dev); + mlx4_close_hca(dev); + mlx4_cmd_cleanup(dev); + + if (dev->flags & MLX4_FLAG_MSI_X) + pci_disable_msix(pdev); + + kfree(priv); + pci_release_region(pdev, 2); + pci_release_region(pdev, 0); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mlx4_pci_table[] = { + { PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */ + { PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */ + { PCI_VDEVICE(MELLANOX, 0x6354) }, /* MT25408 "Hermon" QDR */ + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mlx4_pci_table); + +static struct pci_driver mlx4_driver = { + .name = DRV_NAME, + .id_table = mlx4_pci_table, + .probe = mlx4_init_one, + .remove = __devexit_p(mlx4_remove_one) +}; + +static int __init mlx4_init(void) +{ + int ret; + + ret = pci_register_driver(&mlx4_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mlx4_cleanup(void) +{ + pci_unregister_driver(&mlx4_driver); +} + +module_init(mlx4_init); +module_exit(mlx4_cleanup); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h new file mode 100644 index 0000000..5f4d9c6 --- /dev/null +++ b/drivers/net/mlx4/mlx4.h @@ -0,0 +1,334 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005, 2006, 2007 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_H +#define MLX4_H + +#include + +#include +#include + +#define DRV_NAME "mlx4_core" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.01" +#define DRV_RELDATE "May 1, 2006" + +enum { + MLX4_HCR_BASE = 0x80680, + MLX4_HCR_SIZE = 0x0001c, + MLX4_CLR_INT_SIZE = 0x00008 +}; + +enum { + MLX4_BOARD_ID_LEN = 64 +}; + +enum { + MLX4_MGM_ENTRY_SIZE = 0x40, + MLX4_QP_PER_MGM = 4 * (MLX4_MGM_ENTRY_SIZE / 16 - 2), + MLX4_MTT_ENTRY_PER_SEG = 8 +}; + +enum { + MLX4_EQ_ASYNC, + MLX4_EQ_COMP, + MLX4_EQ_CATAS, + MLX4_NUM_EQ +}; + +enum { + MLX4_NUM_PDS = 1 << 15 +}; + +enum { + MLX4_CMPT_TYPE_QP = 0, + MLX4_CMPT_TYPE_SRQ = 1, + MLX4_CMPT_TYPE_CQ = 2, + MLX4_CMPT_TYPE_EQ = 3, + MLX4_CMPT_NUM_TYPE +}; + +enum { + MLX4_CMPT_SHIFT = 24, + MLX4_NUM_CMPTS = MLX4_CMPT_NUM_TYPE << MLX4_CMPT_SHIFT +}; + +#ifdef CONFIG_MLX4_DEBUG +extern int mlx4_debug_level; + +#define mlx4_dbg(mdev, format, arg...) \ + do { \ + if (mlx4_debug_level) \ + dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \ + } while (0) + +#else /* CONFIG_MLX4_DEBUG */ + +#define mlx4_dbg(mdev, format, arg...) do { (void) mdev; } while (0) + +#endif /* CONFIG_MLX4_DEBUG */ + +#define mlx4_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mlx4_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mlx4_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +struct mlx4_bitmap { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mlx4_buddy { + unsigned long **bits; + int max_order; + spinlock_t lock; +}; + +struct mlx4_icm; + +struct mlx4_icm_table { + u64 virt; + int num_icm; + int num_obj; + int obj_size; + int lowmem; + struct mutex mutex; + struct mlx4_icm **icm; +}; + +struct mlx4_eq { + struct mlx4_dev *dev; + void __iomem *doorbell; + int eqn; + u32 cons_index; + u16 irq; + u16 have_irq; + int nent; + struct mlx4_buf_list *page_list; + struct mlx4_mtt mtt; +}; + +struct mlx4_profile { + int num_qp; + int rdmarc_per_qp; + int num_srq; + int num_cq; + int num_mcg; + int num_mpt; + int num_mtt; +}; + +struct mlx4_fw { + u64 clr_int_base; + u64 catas_addr; + struct mlx4_icm *fw_icm; + struct mlx4_icm *aux_icm; + u32 catas_size; + u16 fw_pages; + u8 clr_int_bar; + u8 catas_bar; +}; + +struct mlx4_cmd { + struct pci_pool *pool; + void __iomem *hcr; + struct mutex hcr_mutex; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mlx4_cmd_context *context; + u16 token_mask; + u8 use_events; + u8 toggle; +}; + +struct mlx4_uar_table { + struct mlx4_bitmap bitmap; +}; + +struct mlx4_mr_table { + struct mlx4_bitmap mpt_bitmap; + struct mlx4_buddy mtt_buddy; + u64 mtt_base; + u64 mpt_base; + struct mlx4_icm_table mtt_table; + struct mlx4_icm_table dmpt_table; +}; + +struct mlx4_cq_table { + struct mlx4_bitmap bitmap; + spinlock_t lock; + struct radix_tree_root tree; + struct mlx4_icm_table table; + struct mlx4_icm_table cmpt_table; +}; + +struct mlx4_eq_table { + struct mlx4_bitmap bitmap; + void __iomem *clr_int; + void __iomem *uar_map[(MLX4_NUM_EQ + 6) / 4]; + u32 clr_mask; + struct mlx4_eq eq[MLX4_NUM_EQ]; + u64 icm_virt; + struct page *icm_page; + dma_addr_t icm_dma; + struct mlx4_icm_table cmpt_table; + int have_irq; + u8 inta_pin; +}; + +struct 
mlx4_srq_table { + struct mlx4_bitmap bitmap; + spinlock_t lock; + struct radix_tree_root tree; + struct mlx4_icm_table table; + struct mlx4_icm_table cmpt_table; +}; + +struct mlx4_qp_table { + struct mlx4_bitmap bitmap; + u32 rdmarc_base; + int rdmarc_shift; + spinlock_t lock; + struct mlx4_icm_table qp_table; + struct mlx4_icm_table auxc_table; + struct mlx4_icm_table altc_table; + struct mlx4_icm_table rdmarc_table; + struct mlx4_icm_table cmpt_table; +}; + +struct mlx4_mcg_table { + struct mutex mutex; + struct mlx4_bitmap bitmap; + struct mlx4_icm_table table; +}; + +struct mlx4_priv { + struct mlx4_dev dev; + + struct list_head dev_list; + struct list_head ctx_list; + + struct mlx4_fw fw; + struct mlx4_cmd cmd; + + struct mlx4_bitmap pd_bitmap; + struct mlx4_uar_table uar_table; + struct mlx4_mr_table mr_table; + struct mlx4_cq_table cq_table; + struct mlx4_eq_table eq_table; + struct mlx4_srq_table srq_table; + struct mlx4_qp_table qp_table; + struct mlx4_mcg_table mcg_table; + + void __iomem *clr_base; + + struct mlx4_uar driver_uar; + void __iomem *kar; + MLX4_DECLARE_DOORBELL_LOCK(doorbell_lock) + u32 driver_pd; + + u32 rev_id; + char board_id[MLX4_BOARD_ID_LEN]; +}; + +static inline struct mlx4_priv *mlx4_priv(struct mlx4_dev *dev) +{ + return container_of(dev, struct mlx4_priv, dev); +} + +u32 mlx4_bitmap_alloc(struct mlx4_bitmap *bitmap); +void mlx4_bitmap_free(struct mlx4_bitmap *bitmap, u32 obj); +int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, u32 mask, u32 reserved); +void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap); + +int mlx4_reset(struct mlx4_dev *dev); + +int mlx4_init_pd_table(struct mlx4_dev *dev); +int mlx4_init_uar_table(struct mlx4_dev *dev); +int mlx4_init_mr_table(struct mlx4_dev *dev); +int mlx4_init_eq_table(struct mlx4_dev *dev); +int mlx4_init_cq_table(struct mlx4_dev *dev); +int mlx4_init_qp_table(struct mlx4_dev *dev); +int mlx4_init_srq_table(struct mlx4_dev *dev); +int mlx4_init_mcg_table(struct mlx4_dev *dev); + 
+void mlx4_cleanup_pd_table(struct mlx4_dev *dev); +void mlx4_cleanup_uar_table(struct mlx4_dev *dev); +void mlx4_cleanup_mr_table(struct mlx4_dev *dev); +void mlx4_cleanup_eq_table(struct mlx4_dev *dev); +void mlx4_cleanup_cq_table(struct mlx4_dev *dev); +void mlx4_cleanup_qp_table(struct mlx4_dev *dev); +void mlx4_cleanup_srq_table(struct mlx4_dev *dev); +void mlx4_cleanup_mcg_table(struct mlx4_dev *dev); + +int mlx4_register_device(struct mlx4_priv *priv); +void mlx4_unregister_device(struct mlx4_priv *priv); + +struct mlx4_dev_cap; +struct mlx4_init_hca_param; + +u64 mlx4_make_profile(struct mlx4_dev *dev, + struct mlx4_profile *request, + struct mlx4_dev_cap *dev_cap, + struct mlx4_init_hca_param *init_hca); + +int mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt); +void mlx4_unmap_eq_icm(struct mlx4_dev *dev); + +int mlx4_cmd_init(struct mlx4_dev *dev); +void mlx4_cmd_cleanup(struct mlx4_dev *dev); +void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param); +int mlx4_cmd_use_events(struct mlx4_dev *dev); +void mlx4_cmd_use_polling(struct mlx4_dev *dev); + +void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn); +void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type); + +void mlx4_qp_event(struct mlx4_dev *dev, u32 qpn, int event_type); + +void mlx4_srq_event(struct mlx4_dev *dev, u32 srqn, int event_type); + +#endif /* MLX4_H */ diff --git a/drivers/net/mlx4/profile.c b/drivers/net/mlx4/profile.c new file mode 100644 index 0000000..3a5446f --- /dev/null +++ b/drivers/net/mlx4/profile.c @@ -0,0 +1,238 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include "mlx4.h" +#include "fw.h" + +enum { + MLX4_RES_QP, + MLX4_RES_RDMARC, + MLX4_RES_ALTC, + MLX4_RES_AUXC, + MLX4_RES_SRQ, + MLX4_RES_CQ, + MLX4_RES_EQ, + MLX4_RES_DMPT, + MLX4_RES_CMPT, + MLX4_RES_MTT, + MLX4_RES_MCG, + MLX4_RES_NUM +}; + +static const char *res_name[] = { + [MLX4_RES_QP] = "QP", + [MLX4_RES_RDMARC] = "RDMARC", + [MLX4_RES_ALTC] = "ALTC", + [MLX4_RES_AUXC] = "AUXC", + [MLX4_RES_SRQ] = "SRQ", + [MLX4_RES_CQ] = "CQ", + [MLX4_RES_EQ] = "EQ", + [MLX4_RES_DMPT] = "DMPT", + [MLX4_RES_CMPT] = "CMPT", + [MLX4_RES_MTT] = "MTT", + [MLX4_RES_MCG] = "MCG", +}; + +u64 mlx4_make_profile(struct mlx4_dev *dev, + struct mlx4_profile *request, + struct mlx4_dev_cap *dev_cap, + struct mlx4_init_hca_param *init_hca) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mlx4_resource *profile; + struct mlx4_resource tmp; + int i, j; + + profile = kzalloc(MLX4_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MLX4_RES_QP].size = dev_cap->qpc_entry_sz; + profile[MLX4_RES_RDMARC].size = dev_cap->rdmarc_entry_sz; + profile[MLX4_RES_ALTC].size = dev_cap->altc_entry_sz; + profile[MLX4_RES_AUXC].size = dev_cap->aux_entry_sz; + profile[MLX4_RES_SRQ].size = dev_cap->srq_entry_sz; + profile[MLX4_RES_CQ].size = dev_cap->cqc_entry_sz; + profile[MLX4_RES_EQ].size = dev_cap->eqc_entry_sz; + profile[MLX4_RES_DMPT].size = dev_cap->dmpt_entry_sz; + profile[MLX4_RES_CMPT].size = dev_cap->cmpt_entry_sz; + profile[MLX4_RES_MTT].size = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz; + profile[MLX4_RES_MCG].size = MLX4_MGM_ENTRY_SIZE; + + profile[MLX4_RES_QP].num = request->num_qp; + profile[MLX4_RES_RDMARC].num = request->num_qp * request->rdmarc_per_qp; + profile[MLX4_RES_ALTC].num = request->num_qp; + profile[MLX4_RES_AUXC].num = request->num_qp; + profile[MLX4_RES_SRQ].num = request->num_srq; + 
profile[MLX4_RES_CQ].num = request->num_cq; + profile[MLX4_RES_EQ].num = MLX4_NUM_EQ + dev_cap->reserved_eqs; + profile[MLX4_RES_DMPT].num = request->num_mpt; + profile[MLX4_RES_CMPT].num = MLX4_NUM_CMPTS; + profile[MLX4_RES_MTT].num = request->num_mtt; + profile[MLX4_RES_MCG].num = request->num_mcg; + + for (i = 0; i < MLX4_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = roundup_pow_of_two(profile[i].num); + profile[i].log_num = ilog2(profile[i].num); + profile[i].size *= profile[i].num; + profile[i].size = max(profile[i].size, (u64) PAGE_SIZE); + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. + */ + for (i = MLX4_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MLX4_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = total_size; + total_size += profile[i].size; + } + + if (total_size > dev_cap->max_icm_sz) { + mlx4_err(dev, "Profile requires 0x%llx bytes; " + "won't fit in 0x%llx bytes of context memory.\n", + (unsigned long long) total_size, + (unsigned long long) dev_cap->max_icm_sz); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mlx4_dbg(dev, " profile[%2d] (%6s): 2^%02d entries @ 0x%10llx, " + "size 0x%10llx\n", + i, res_name[profile[i].type], profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mlx4_dbg(dev, "HCA context memory: reserving %d KB\n", + (int) (total_size >> 10)); + + for (i = 0; i < MLX4_RES_NUM; ++i) { + switch (profile[i].type) { + case MLX4_RES_QP: + dev->caps.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MLX4_RES_RDMARC: + for 
(priv->qp_table.rdmarc_shift = 0; + request->num_qp << priv->qp_table.rdmarc_shift < profile[i].num; + ++priv->qp_table.rdmarc_shift) + ; /* nothing */ + dev->caps.max_qp_dest_rdma = 1 << priv->qp_table.rdmarc_shift; + priv->qp_table.rdmarc_base = (u32) profile[i].start; + init_hca->rdmarc_base = profile[i].start; + init_hca->log_rd_per_qp = priv->qp_table.rdmarc_shift; + break; + case MLX4_RES_ALTC: + init_hca->altc_base = profile[i].start; + break; + case MLX4_RES_AUXC: + init_hca->auxc_base = profile[i].start; + break; + case MLX4_RES_SRQ: + dev->caps.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MLX4_RES_CQ: + dev->caps.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MLX4_RES_EQ: + dev->caps.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MLX4_RES_DMPT: + dev->caps.num_mpts = profile[i].num; + priv->mr_table.mpt_base = profile[i].start; + init_hca->dmpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MLX4_RES_CMPT: + init_hca->cmpt_base = profile[i].start; + break; + case MLX4_RES_MTT: + dev->caps.num_mtt_segs = profile[i].num; + priv->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + break; + case MLX4_RES_MCG: + dev->caps.num_mgms = profile[i].num >> 1; + dev->caps.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ilog2(MLX4_MGM_ENTRY_SIZE); + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->log_mc_hash_sz = profile[i].log_num - 1; + break; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
+ */ + dev->caps.num_pds = MLX4_NUM_PDS; + + kfree(profile); + return total_size; +} diff --git a/drivers/net/mlx4/reset.c b/drivers/net/mlx4/reset.c new file mode 100644 index 0000000..ba16228 --- /dev/null +++ b/drivers/net/mlx4/reset.c @@ -0,0 +1,172 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include "mlx4.h" + +int mlx4_reset(struct mlx4_dev *dev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + int pcie_cap; + + u16 devctl; + u16 linkctl; + +#define MLX4_RESET_OFFSET 0xf0010 +#define MLX4_RESET_VALUE swab32(1) + + /* + * Reset the chip. 
This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + */ + + /* Do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mlx4_err(dev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + pcie_cap = pci_find_capability(dev->pdev, PCI_CAP_ID_EXP); + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(dev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mlx4_err(dev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(dev->pdev, 0) + + MLX4_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mlx4_err(dev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MLX4_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(dev->pdev, 0, &v)) { + err = -ENODEV; + mlx4_err(dev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mlx4_err(dev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (pcie_cap) { + devctl = hca_header[(pcie_cap + PCI_EXP_DEVCTL) / 4]; + if (pci_write_config_word(dev->pdev, pcie_cap + PCI_EXP_DEVCTL, + devctl)) { + err = -ENODEV; + mlx4_err(dev, "Couldn't restore HCA PCI Express " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(pcie_cap + PCI_EXP_LNKCTL) / 4]; + if 
(pci_write_config_word(dev->pdev, pcie_cap + PCI_EXP_LNKCTL, + linkctl)) { + err = -ENODEV; + mlx4_err(dev, "Couldn't restore HCA PCI Express " + "Link control register, aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(dev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mlx4_err(dev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(dev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mlx4_err(dev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + kfree(hca_header); + + return err; +} From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 0/6] [RFC]IB/mlx4: Mellanox ConnectX adapter driver In-Reply-To: Message-ID: <20074201532.4PiF0gUjC19I1fhy@cisco.com> As promised, here is a series of patches adding the mlx4_core and mlx4_ib drivers for the new Mellanox ConnectX adapter. These patches are split up in an ad hoc way to avoid mailing list size limits, but when this driver is finally merged, I will give it to Linus to pull in a single changeset. The full driver is also available via git from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git connectx and it is also in my for-mm patch, so Andrew will pick it up for -mm kernels automatically. The driver is split into two kernel modules because the ConnectX adapter can be used as an InfiniBand adapter, a 1G/10G ethernet NIC, and a Fibre Channel HBA at the same time, and so resource management and basic tasks such as issuing commands to the firmware are handled in an mlx4_core module, while everything InfiniBand-specific is in mlx4_ib. In the not-too-distant future, an mlx4_eth module that handles ethernet NIC stuff will be released. My goal is to merge this for 2.6.22. 
If you feel that would not be appropriate, please do let me know and I will hold off. And of course all criticisms, suggestions, comments, etc. are very much appreciated. My feeling is that the driver is fairly clean already (and I will do some further cleanup before merging) and seems to be reasonably usable, and I trust myself to continue cleaning things up, so there's not much to be gained by waiting a release cycle. The overall driver is not too huge -- 11371 insertions in the diffstat: drivers/infiniband/Kconfig | 2 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/mlx4/Kconfig | 9 + drivers/infiniband/hw/mlx4/Makefile | 3 + drivers/infiniband/hw/mlx4/ah.c | 100 +++ drivers/infiniband/hw/mlx4/cq.c | 525 ++++++++++++++ drivers/infiniband/hw/mlx4/doorbell.c | 215 ++++++ drivers/infiniband/hw/mlx4/mad.c | 339 +++++++++ drivers/infiniband/hw/mlx4/main.c | 612 ++++++++++++++++ drivers/infiniband/hw/mlx4/mlx4_ib.h | 285 ++++++++ drivers/infiniband/hw/mlx4/mr.c | 184 +++++ drivers/infiniband/hw/mlx4/qp.c | 1263 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/mlx4/srq.c | 334 +++++++++ drivers/infiniband/hw/mlx4/user.h | 91 +++ drivers/net/Kconfig | 14 + drivers/net/Makefile | 1 + drivers/net/mlx4/Makefile | 4 + drivers/net/mlx4/alloc.c | 179 +++++ drivers/net/mlx4/cmd.c | 429 +++++++++++ drivers/net/mlx4/cq.c | 254 +++++++ drivers/net/mlx4/eq.c | 704 ++++++++++++++++++ drivers/net/mlx4/fw.c | 758 ++++++++++++++++++++ drivers/net/mlx4/fw.h | 165 +++++ drivers/net/mlx4/icm.c | 379 ++++++++++ drivers/net/mlx4/icm.h | 135 ++++ drivers/net/mlx4/intf.c | 142 ++++ drivers/net/mlx4/main.c | 939 ++++++++++++++++++++++++ drivers/net/mlx4/mcg.c | 370 ++++++++++ drivers/net/mlx4/mlx4.h | 334 +++++++++ drivers/net/mlx4/mr.c | 482 +++++++++++++ drivers/net/mlx4/pd.c | 102 +++ drivers/net/mlx4/profile.c | 238 +++++++ drivers/net/mlx4/qp.c | 270 +++++++ drivers/net/mlx4/reset.c | 172 +++++ drivers/net/mlx4/srq.c | 227 ++++++ include/linux/mlx4/cmd.h | 178 +++++ 
include/linux/mlx4/cq.h | 123 ++++ include/linux/mlx4/device.h | 323 +++++++++ include/linux/mlx4/doorbell.h | 97 +++ include/linux/mlx4/driver.h | 59 ++ include/linux/mlx4/qp.h | 288 ++++++++ include/linux/mlx4/srq.h | 42 ++ 42 files changed, 11371 insertions(+), 0 deletions(-) From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 4/6] [RFC]mlx4_ib main files In-Reply-To: <20074201532.7wkxUOPwHNFg96ue@cisco.com> Message-ID: <20074201532.mohEXyoz7s98VaHz@cisco.com> Main include file and .c file for mlx4_ib. Signed-off-by: Roland Dreier --- main.c | 612 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ mlx4_ib.h | 285 ++++++++++++++++++++++++++++ 2 files changed, 897 insertions(+) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c new file mode 100644 index 0000000..6f7165f --- /dev/null +++ b/drivers/infiniband/hw/mlx4/main.c @@ -0,0 +1,612 @@ +/* + * Copyright (c) 2006, 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include +#include + +#include +#include + +#include "mlx4_ib.h" +#include "user.h" + +#define DRV_NAME "mlx4_ib" +#define DRV_VERSION "0.01" +#define DRV_RELDATE "May 1, 2006" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox ConnectX HCA InfiniBand driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const char mlx4_ib_version[] __devinitdata = + DRV_NAME ": Mellanox ConnectX InfiniBand driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static void init_query_mad(struct ib_smp *mad) +{ + mad->base_version = 1; + mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + mad->class_version = 1; + mad->method = IB_MGMT_METHOD_GET; +} + +static int mlx4_ib_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct mlx4_ib_dev *dev = to_mdev(ibdev); + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, 1, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + memset(props, 0, sizeof *props); + + props->fw_ver = dev->dev->caps.fw_ver; + props->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT | + IB_DEVICE_PORT_ACTIVE_EVENT | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_RC_RNR_NAK_GEN; + if 
(dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR) + props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR) + props->device_cap_flags |= IB_DEVICE_BAD_QKEY_CNTR; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_APM) + props->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UD_AV_PORT) + props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + + props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((__be16 *) (out_mad->data + 30)); + props->hw_ver = be32_to_cpup((__be32 *) (out_mad->data + 32)); + memcpy(&props->sys_image_guid, out_mad->data + 4, 8); + + props->max_mr_size = ~0ull; + props->page_size_cap = dev->dev->caps.page_size_cap; + props->max_qp = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps; + props->max_qp_wr = dev->dev->caps.max_wqes; + props->max_sge = min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg); + props->max_cq = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs; + props->max_cqe = dev->dev->caps.max_cqes; + props->max_mr = dev->dev->caps.num_mpts - dev->dev->caps.reserved_mrws; + props->max_pd = dev->dev->caps.num_pds - dev->dev->caps.reserved_pds; + props->max_qp_rd_atom = dev->dev->caps.max_qp_dest_rdma; + props->max_qp_init_rd_atom = dev->dev->caps.max_qp_init_rdma; + props->max_res_rd_atom = props->max_qp_rd_atom * props->max_qp; + props->max_srq = dev->dev->caps.num_srqs - dev->dev->caps.reserved_srqs; + props->max_srq_wr = dev->dev->caps.max_srq_wqes; + props->max_srq_sge = dev->dev->caps.max_srq_sge; + props->local_ca_ack_delay = dev->dev->caps.local_ca_ack_delay; + props->atomic_cap = dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_ATOMIC ? 
+ IB_ATOMIC_HCA : IB_ATOMIC_NONE; + props->max_pkeys = dev->dev->caps.pkey_table_len; + props->max_mcast_grp = dev->dev->caps.num_mgms + dev->dev->caps.num_amgms; + props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * + props->max_mcast_grp; + props->max_map_per_fmr = (1 << (32 - ilog2(dev->dev->caps.num_mpts))) - 1; + +out: + kfree(in_mad); + kfree(out_mad); + + return err; +} + +static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(props, 0, sizeof *props); + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + props->lid = be16_to_cpup((__be16 *) (out_mad->data + 16)); + props->lmc = out_mad->data[34] & 0x7; + props->sm_lid = be16_to_cpup((__be16 *) (out_mad->data + 18)); + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; + props->phys_state = out_mad->data[33] >> 4; + props->port_cap_flags = be32_to_cpup((__be32 *) (out_mad->data + 20)); + props->gid_tbl_len = to_mdev(ibdev)->dev->caps.gid_table_len; + props->max_msg_sz = 0x80000000; + props->pkey_tbl_len = to_mdev(ibdev)->dev->caps.pkey_table_len; + props->bad_pkey_cntr = be16_to_cpup((__be16 *) (out_mad->data + 46)); + props->qkey_viol_cntr = be16_to_cpup((__be16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; + props->max_mtu = out_mad->data[41] & 0xf; + props->active_mtu = out_mad->data[36] >> 4; + props->subnet_timeout = out_mad->data[51] & 0x1f; + props->max_vl_num = out_mad->data[37] >> 4; + props->init_type_reply = out_mad->data[41] >> 4; + +out: + 
kfree(in_mad); + kfree(out_mad); + + return err; +} + +static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + memcpy(gid->raw, out_mad->data + 8, 8); + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + memcpy(gid->raw + 8, out_mad->data + (index % 8) * 8, 8); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index, + u16 *pkey) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + *pkey = be16_to_cpu(((__be16 *) out_mad->data)[index % 32]); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mlx4_ib_modify_device(struct ib_device *ibdev, int mask, + struct ib_device_modify *props) +{ + if (mask & ~IB_DEVICE_MODIFY_NODE_DESC) + return -EOPNOTSUPP; + + if (mask & IB_DEVICE_MODIFY_NODE_DESC) { + spin_lock(&to_mdev(ibdev)->sm_lock); + memcpy(ibdev->node_desc, props->node_desc, 64); + spin_unlock(&to_mdev(ibdev)->sm_lock); + } + + 
return 0; +} + +static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols, + u32 cap_mask) +{ + struct mlx4_cmd_mailbox *mailbox; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev->dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + memset(mailbox->buf, 0, 256); + *(u8 *) mailbox->buf = !!reset_qkey_viols << 6; + ((__be32 *) mailbox->buf)[2] = cpu_to_be32(cap_mask); + + err = mlx4_cmd(dev->dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT, + MLX4_CMD_TIME_CLASS_B); + + mlx4_free_cmd_mailbox(dev->dev, mailbox); + return err; +} + +static int mlx4_ib_modify_port(struct ib_device *ibdev, u8 port, int mask, + struct ib_port_modify *props) +{ + struct ib_port_attr attr; + u32 cap_mask; + int err; + + mutex_lock(&to_mdev(ibdev)->cap_mask_mutex); + + err = mlx4_ib_query_port(ibdev, port, &attr); + if (err) + goto out; + + cap_mask = (attr.port_cap_flags | props->set_port_cap_mask) & + ~props->clr_port_cap_mask; + + err = mlx4_SET_PORT(to_mdev(ibdev), port, + !!(mask & IB_PORT_RESET_QKEY_CNTR), + cap_mask); + +out: + mutex_unlock(&to_mdev(ibdev)->cap_mask_mutex); + return err; +} + +static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct mlx4_ib_ucontext *context; + struct mlx4_ib_alloc_ucontext_resp resp; + int err; + + resp.qp_tab_size = to_mdev(ibdev)->dev->caps.num_qps; + /* FIXME blueflame info */ + resp.bf_reg_size = 0; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + err = mlx4_uar_alloc(to_mdev(ibdev)->dev, &context->uar); + if (err) { + kfree(context); + return ERR_PTR(err); + } + + INIT_LIST_HEAD(&context->db_page_list); + mutex_init(&context->db_page_mutex); + + err = ib_copy_to_udata(udata, &resp, sizeof resp); + if (err) { + mlx4_uar_free(to_mdev(ibdev)->dev, &context->uar); + kfree(context); + return ERR_PTR(-EFAULT); + } + + return &context->ibucontext; +} + +static int mlx4_ib_dealloc_ucontext(struct ib_ucontext 
*ibcontext) +{ + struct mlx4_ib_ucontext *context = to_mucontext(ibcontext); + + mlx4_uar_free(to_mdev(ibcontext->device)->dev, &context->uar); + kfree(context); + + return 0; +} + +static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + if (vma->vm_end - vma->vm_start != PAGE_SIZE) + return -EINVAL; + + /* FIXME handle mapping blueflame regs if offset !=0 */ + if (vma->vm_pgoff > 0) + return -EINVAL; + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + + if (io_remap_pfn_range(vma, vma->vm_start, + to_mucontext(context)->uar.pfn, + PAGE_SIZE, vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + +static struct ib_pd *mlx4_ib_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct mlx4_ib_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mlx4_pd_alloc(to_mdev(ibdev)->dev, &pd->pdn); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + if (context) + if (ib_copy_to_udata(udata, &pd->pdn, sizeof (__u32))) { + mlx4_pd_free(to_mdev(ibdev)->dev, pd->pdn); + kfree(pd); + return ERR_PTR(-EFAULT); + } + + return &pd->ibpd; +} + +static int mlx4_ib_dealloc_pd(struct ib_pd *pd) +{ + mlx4_pd_free(to_mdev(pd->device)->dev, to_mpd(pd)->pdn); + kfree(pd); + + return 0; +} + +static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return mlx4_multicast_attach(to_mdev(ibqp->device)->dev, + &to_mqp(ibqp)->mqp, gid->raw); +} + +static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return mlx4_multicast_detach(to_mdev(ibqp->device)->dev, + &to_mqp(ibqp)->mqp, gid->raw); +} + +static int init_node_data(struct mlx4_ib_dev *dev) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + 
in_mad->attr_id = IB_SMP_ATTR_NODE_DESC; + + err = mlx4_MAD_IFC(dev, 1, 1, 1, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + memcpy(dev->ib_dev.node_desc, out_mad->data, 64); + + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mlx4_MAD_IFC(dev, 1, 1, 1, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + memcpy(&dev->ib_dev.node_guid, out_mad->data + 12, 8); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static void *mlx4_ib_add(struct mlx4_dev *dev) +{ + struct mlx4_ib_dev *ibdev; + + ibdev = (struct mlx4_ib_dev *) ib_alloc_device(sizeof *ibdev); + if (!ibdev) { + dev_err(&dev->pdev->dev, "Device struct alloc failed\n"); + return NULL; + } + + if (mlx4_pd_alloc(dev, &ibdev->priv_pdn)) + goto err_dealloc; + + if (mlx4_uar_alloc(dev, &ibdev->priv_uar)) + goto err_pd; + + ibdev->uar_map = ioremap(ibdev->priv_uar.pfn << PAGE_SHIFT, PAGE_SIZE); + if (!ibdev->uar_map) + goto err_uar; + + INIT_LIST_HEAD(&ibdev->pgdir_list); + spin_lock_init(&ibdev->pgdir_lock); + + ibdev->dev = dev; + + strlcpy(ibdev->ib_dev.name, "mlx4_%d", IB_DEVICE_NAME_MAX); + ibdev->ib_dev.owner = THIS_MODULE; + ibdev->ib_dev.node_type = RDMA_NODE_IB_CA; + ibdev->ib_dev.phys_port_cnt = dev->caps.num_ports; + ibdev->ib_dev.dma_device = &dev->pdev->dev; + ibdev->ib_dev.class_dev.dev = &dev->pdev->dev; + + ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION; + ibdev->ib_dev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << 
IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + + ibdev->ib_dev.query_device = mlx4_ib_query_device; + ibdev->ib_dev.query_port = mlx4_ib_query_port; + ibdev->ib_dev.query_gid = mlx4_ib_query_gid; + ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey; + ibdev->ib_dev.modify_device = mlx4_ib_modify_device; + ibdev->ib_dev.modify_port = mlx4_ib_modify_port; + ibdev->ib_dev.alloc_ucontext = mlx4_ib_alloc_ucontext; + ibdev->ib_dev.dealloc_ucontext = mlx4_ib_dealloc_ucontext; + ibdev->ib_dev.mmap = mlx4_ib_mmap; + ibdev->ib_dev.alloc_pd = mlx4_ib_alloc_pd; + ibdev->ib_dev.dealloc_pd = mlx4_ib_dealloc_pd; + ibdev->ib_dev.create_ah = mlx4_ib_create_ah; + ibdev->ib_dev.query_ah = mlx4_ib_query_ah; + ibdev->ib_dev.destroy_ah = mlx4_ib_destroy_ah; + ibdev->ib_dev.create_srq = mlx4_ib_create_srq; + ibdev->ib_dev.modify_srq = mlx4_ib_modify_srq; + ibdev->ib_dev.destroy_srq = mlx4_ib_destroy_srq; + ibdev->ib_dev.post_srq_recv = mlx4_ib_post_srq_recv; + ibdev->ib_dev.create_qp = mlx4_ib_create_qp; + ibdev->ib_dev.modify_qp = mlx4_ib_modify_qp; + ibdev->ib_dev.destroy_qp = mlx4_ib_destroy_qp; + ibdev->ib_dev.post_send = mlx4_ib_post_send; + ibdev->ib_dev.post_recv = mlx4_ib_post_recv; + ibdev->ib_dev.create_cq = mlx4_ib_create_cq; + ibdev->ib_dev.destroy_cq = mlx4_ib_destroy_cq; + ibdev->ib_dev.poll_cq = mlx4_ib_poll_cq; + ibdev->ib_dev.req_notify_cq = mlx4_ib_arm_cq; + ibdev->ib_dev.get_dma_mr = mlx4_ib_get_dma_mr; + ibdev->ib_dev.reg_user_mr = mlx4_ib_reg_user_mr; + ibdev->ib_dev.dereg_mr = mlx4_ib_dereg_mr; + ibdev->ib_dev.attach_mcast = mlx4_ib_mcg_attach; + ibdev->ib_dev.detach_mcast = mlx4_ib_mcg_detach; + ibdev->ib_dev.process_mad = mlx4_ib_process_mad; + + if (init_node_data(ibdev)) + goto err_map; + + spin_lock_init(&ibdev->sm_lock); + mutex_init(&ibdev->cap_mask_mutex); + + if (ib_register_device(&ibdev->ib_dev)) + 
goto err_map; + + if (mlx4_ib_mad_init(ibdev)) + goto err_reg; + + return ibdev; + +err_reg: + ib_unregister_device(&ibdev->ib_dev); + +err_map: + iounmap(ibdev->uar_map); + +err_uar: + mlx4_uar_free(dev, &ibdev->priv_uar); + +err_pd: + mlx4_pd_free(dev, ibdev->priv_pdn); + +err_dealloc: + ib_dealloc_device(&ibdev->ib_dev); + + return NULL; +} + +static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr) +{ + struct mlx4_ib_dev *ibdev = ibdev_ptr; + int p; + + for (p = 1; p <= dev->caps.num_ports; ++p) + mlx4_CLOSE_PORT(dev, p); + + mlx4_ib_mad_cleanup(ibdev); + ib_unregister_device(&ibdev->ib_dev); + iounmap(ibdev->uar_map); + mlx4_uar_free(dev, &ibdev->priv_uar); + mlx4_pd_free(dev, ibdev->priv_pdn); + ib_dealloc_device(&ibdev->ib_dev); +} + +static struct mlx4_interface mlx4_ib_interface = { + .add = mlx4_ib_add, + .remove = mlx4_ib_remove +}; + +static int __init mlx4_ib_init(void) +{ + return mlx4_register_interface(&mlx4_ib_interface); +} + +static void __exit mlx4_ib_cleanup(void) +{ + mlx4_unregister_interface(&mlx4_ib_interface); +} + +module_init(mlx4_ib_init); +module_exit(mlx4_ib_cleanup); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h new file mode 100644 index 0000000..df3a11a --- /dev/null +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -0,0 +1,285 @@ +/* + * Copyright (c) 2006 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef MLX4_IB_H +#define MLX4_IB_H + +#include +#include + +#include +#include + +#include +#include + +enum { + MLX4_IB_DB_PER_PAGE = PAGE_SIZE / 4 +}; + +struct mlx4_ib_db_pgdir; +struct mlx4_ib_user_db_page; + +struct mlx4_ib_db { + __be32 *db; + union { + struct mlx4_ib_db_pgdir *pgdir; + struct mlx4_ib_user_db_page *user_page; + } u; + dma_addr_t dma; + int index; + int order; +}; + +struct mlx4_ib_ucontext { + struct ib_ucontext ibucontext; + struct mlx4_uar uar; + struct list_head db_page_list; + struct mutex db_page_mutex; +}; + +struct mlx4_ib_pd { + struct ib_pd ibpd; + u32 pdn; +}; + +struct mlx4_ib_cq_buf { + struct mlx4_buf buf; + struct mlx4_mtt mtt; +}; + +struct mlx4_ib_cq { + struct ib_cq ibcq; + struct mlx4_cq mcq; + struct mlx4_ib_cq_buf buf; + struct mlx4_ib_db db; + spinlock_t lock; + struct ib_umem *umem; +}; + +struct mlx4_ib_mr { + struct ib_mr ibmr; + struct mlx4_mr mmr; + struct ib_umem *umem; +}; + +struct mlx4_ib_wq { + u64 *wrid; + spinlock_t lock; + int max; + int max_gs; + int offset; + int wqe_shift; + unsigned head; + unsigned tail; +}; + +struct mlx4_ib_qp { + struct ib_qp ibqp; + struct mlx4_qp mqp; + struct mlx4_buf buf; + + struct mlx4_ib_db db; + struct mlx4_ib_wq rq; + + u32 doorbell_qpn; + __be32 sq_signal_bits; + struct mlx4_ib_wq sq; + + struct ib_umem *umem; + struct mlx4_mtt mtt; + int buf_size; + struct mutex mutex; + u8 port; + u8 alt_port; + u8 atomic_rd_en; + u8 resp_depth; + u8 state; +}; + +struct mlx4_ib_srq { + struct ib_srq ibsrq; + struct mlx4_srq msrq; + struct mlx4_buf buf; + struct mlx4_ib_db db; + u64 *wrid; + spinlock_t lock; + int head; + int tail; + u16 wqe_ctr; + struct ib_umem *umem; + struct mlx4_mtt mtt; + struct mutex mutex; +}; + +struct mlx4_ib_ah { + struct ib_ah ibah; + struct mlx4_av av; +}; + +struct mlx4_ib_dev { + struct ib_device ib_dev; + struct mlx4_dev *dev; + void __iomem *uar_map; + + struct list_head pgdir_list; + spinlock_t pgdir_lock; + + struct mlx4_uar priv_uar; + u32 
priv_pdn; + MLX4_DECLARE_DOORBELL_LOCK(uar_lock); + + struct ib_mad_agent *send_agent[MLX4_MAX_PORTS][2]; + struct ib_ah *sm_ah[MLX4_MAX_PORTS]; + spinlock_t sm_lock; + + struct mutex cap_mask_mutex; +}; + +static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mlx4_ib_dev, ib_dev); +} + +static inline struct mlx4_ib_ucontext *to_mucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct mlx4_ib_ucontext, ibucontext); +} + +static inline struct mlx4_ib_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mlx4_ib_pd, ibpd); +} + +static inline struct mlx4_ib_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mlx4_ib_cq, ibcq); +} + +static inline struct mlx4_ib_cq *to_mibcq(struct mlx4_cq *mcq) +{ + return container_of(mcq, struct mlx4_ib_cq, mcq); +} + +static inline struct mlx4_ib_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mlx4_ib_mr, ibmr); +} + +static inline struct mlx4_ib_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mlx4_ib_qp, ibqp); +} + +static inline struct mlx4_ib_qp *to_mibqp(struct mlx4_qp *mqp) +{ + return container_of(mqp, struct mlx4_ib_qp, mqp); +} + +static inline struct mlx4_ib_srq *to_msrq(struct ib_srq *ibsrq) +{ + return container_of(ibsrq, struct mlx4_ib_srq, ibsrq); +} + +static inline struct mlx4_ib_srq *to_mibsrq(struct mlx4_srq *msrq) +{ + return container_of(msrq, struct mlx4_ib_srq, msrq); +} + +static inline struct mlx4_ib_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mlx4_ib_ah, ibah); +} + +int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order); +void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db); +int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt, + struct mlx4_ib_db *db); +void mlx4_ib_db_unmap_user(struct mlx4_ib_ucontext *context, struct mlx4_ib_db *db); + +struct ib_mr 
*mlx4_ib_get_dma_mr(struct ib_pd *pd, int acc); +int mlx4_ib_umem_write_mtt(struct mlx4_ib_dev *dev, struct mlx4_mtt *mtt, + struct ib_umem *umem); +struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int access_flags, + struct ib_udata *udata); +int mlx4_ib_dereg_mr(struct ib_mr *mr); + +struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata); +int mlx4_ib_destroy_cq(struct ib_cq *cq); +int mlx4_ib_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int mlx4_ib_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +void __mlx4_ib_cq_clean(struct mlx4_ib_cq *cq, u32 qpn, struct mlx4_ib_srq *srq); +void mlx4_ib_cq_clean(struct mlx4_ib_cq *cq, u32 qpn, struct mlx4_ib_srq *srq); + +struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); +int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr); +int mlx4_ib_destroy_ah(struct ib_ah *ah); + +struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata); +int mlx4_ib_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); +int mlx4_ib_destroy_srq(struct ib_srq *srq); +void mlx4_ib_free_srq_wqe(struct mlx4_ib_srq *srq, int wqe_index); +int mlx4_ib_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); + +struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata); +int mlx4_ib_destroy_qp(struct ib_qp *qp); +int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata); +int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); + +int mlx4_MAD_IFC(struct mlx4_ib_dev *dev, int ignore_mkey, 
int ignore_bkey, + int port, struct ib_wc *in_wc, struct ib_grh *in_grh, + void *in_mad, void *response_mad); +int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad); +int mlx4_ib_mad_init(struct mlx4_ib_dev *dev); +void mlx4_ib_mad_cleanup(struct mlx4_ib_dev *dev); + +static inline int mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah) +{ + return !!(ah->av.g_slid & 0x80); +} + +#endif /* MLX4_IB_H */ From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 2/6] [RFC]mlx4_core rest of files In-Reply-To: <20074201532.IZZ23OZ6UzBQwQQb@cisco.com> Message-ID: <20074201532.d6hTKIczPSz0SeTA@cisco.com> Rest of mlx4_core code. Signed-off-by: Roland Dreier --- alloc.c | 179 ++++++++++++++++ cq.c | 254 +++++++++++++++++++++++ eq.c | 704 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ icm.c | 379 ++++++++++++++++++++++++++++++++++ icm.h | 135 ++++++++++++ mcg.c | 370 +++++++++++++++++++++++++++++++++ mr.c | 482 +++++++++++++++++++++++++++++++++++++++++++ pd.c | 102 +++++++++ qp.c | 270 ++++++++++++++++++++++++ srq.c | 227 ++++++++++++++++++++ 10 files changed, 3102 insertions(+) diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c new file mode 100644 index 0000000..9ffdb9d --- /dev/null +++ b/drivers/net/mlx4/alloc.c @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2006, 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "mlx4.h" + +u32 mlx4_bitmap_alloc(struct mlx4_bitmap *bitmap) +{ + u32 obj; + + spin_lock(&bitmap->lock); + + obj = find_next_zero_bit(bitmap->table, bitmap->max, bitmap->last); + if (obj >= bitmap->max) { + bitmap->top = (bitmap->top + bitmap->max) & bitmap->mask; + obj = find_first_zero_bit(bitmap->table, bitmap->max); + } + + if (obj < bitmap->max) { + set_bit(obj, bitmap->table); + obj |= bitmap->top; + bitmap->last = obj + 1; + } else + obj = -1; + + spin_unlock(&bitmap->lock); + + return obj; +} + +void mlx4_bitmap_free(struct mlx4_bitmap *bitmap, u32 obj) +{ + obj &= bitmap->max - 1; + + spin_lock(&bitmap->lock); + clear_bit(obj, bitmap->table); + bitmap->last = min(bitmap->last, obj); + bitmap->top = (bitmap->top + bitmap->max) & bitmap->mask; + spin_unlock(&bitmap->lock); +} + +int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, u32 mask, u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != roundup_pow_of_two(num)) + return -EINVAL; + + bitmap->last = 0; + bitmap->top = 0; + bitmap->max = num; + bitmap->mask = mask; + spin_lock_init(&bitmap->lock); + bitmap->table = kzalloc(BITS_TO_LONGS(num) * sizeof (long), GFP_KERNEL); + if (!bitmap->table) + return -ENOMEM; + + for (i = 0; i < reserved; ++i) + set_bit(i, bitmap->table); + + return 0; +} + +void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap) +{ + kfree(bitmap->table); +} + +/* + * Handling for queue buffers -- we allocate a bunch of memory and + * register it in a memory region at HCA virtual address 0. If the + * requested size is > max_direct, we split the allocation into + * multiple pages, so we don't require too much contiguous memory. 
+ */ + +int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, + struct mlx4_buf *buf) +{ + dma_addr_t t; + + if (size <= max_direct) { + buf->nbufs = 1; + buf->npages = 1; + buf->page_shift = get_order(size) + PAGE_SHIFT; + buf->u.direct.buf = dma_alloc_coherent(&dev->pdev->dev, + size, &t, GFP_KERNEL); + if (!buf->u.direct.buf) + return -ENOMEM; + + buf->u.direct.map = t; + + while (t & ((1 << buf->page_shift) - 1)) { + --buf->page_shift; + buf->npages *= 2; + } + + memset(buf->u.direct.buf, 0, size); + } else { + int i; + + buf->nbufs = (size + PAGE_SIZE - 1) / PAGE_SIZE; + buf->npages = buf->nbufs; + buf->page_shift = PAGE_SHIFT; + buf->u.page_list = kzalloc(buf->nbufs * sizeof *buf->u.page_list, + GFP_KERNEL); + if (!buf->u.page_list) + return -ENOMEM; + + for (i = 0; i < buf->nbufs; ++i) { + buf->u.page_list[i].buf = + dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, + &t, GFP_KERNEL); + if (!buf->u.page_list[i].buf) + goto err_free; + + buf->u.page_list[i].map = t; + + memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); + } + } + + return 0; + +err_free: + mlx4_buf_free(dev, size, buf); + + return -ENOMEM; +} +EXPORT_SYMBOL_GPL(mlx4_buf_alloc); + +void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) +{ + int i; + + if (buf->nbufs == 1) + dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, + buf->u.direct.map); + else { + for (i = 0; i < buf->nbufs; ++i) + dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, + buf->u.page_list[i].buf, + buf->u.page_list[i].map); + kfree(buf->u.page_list); + } +} +EXPORT_SYMBOL_GPL(mlx4_buf_free); diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c new file mode 100644 index 0000000..47e84c7 --- /dev/null +++ b/drivers/net/mlx4/cq.c @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. 
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include + +#include "mlx4.h" +#include "icm.h" + +struct mlx4_cq_context { + __be32 flags; + u16 reserved1[3]; + __be16 page_offset; + __be32 logsize_usrpage; + u8 reserved2; + u8 cq_period; + u8 reserved3; + u8 cq_max_count; + u8 reserved4[3]; + u8 comp_eqn; + u8 log_page_size; + u8 reserved5[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + __be32 last_notified_index; + __be32 solicit_producer_index; + __be32 consumer_index; + __be32 producer_index; + u8 reserved6[2]; + __be64 db_rec_addr; +}; + +#define MLX4_CQ_STATUS_OK ( 0 << 28) +#define MLX4_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MLX4_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MLX4_CQ_FLAG_CC ( 1 << 18) +#define MLX4_CQ_FLAG_OI ( 1 << 17) +#define MLX4_CQ_STATE_ARMED ( 9 << 8) +#define MLX4_CQ_STATE_ARMED_SOL ( 6 << 8) +#define MLX4_EQ_STATE_FIRED (10 << 8) + +void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn) +{ + struct mlx4_cq *cq; + + cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree, + cqn & (dev->caps.num_cqs - 1)); + if (!cq) { + mlx4_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + ++cq->arm_sn; + + cq->comp(cq); +} + +void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type) +{ + struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table; + struct mlx4_cq *cq; + + spin_lock(&cq_table->lock); + + cq = radix_tree_lookup(&cq_table->tree, cqn & (dev->caps.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + + spin_unlock(&cq_table->lock); + + if (!cq) { + mlx4_warn(dev, "Async event for bogus CQ %08x\n", cqn); + return; + } + + cq->event(cq, event_type); + + if (atomic_dec_and_test(&cq->refcount)) + complete(&cq->free); +} + +static int mlx4_SW2HW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int cq_num) +{ + return mlx4_cmd(dev, mailbox->dma, cq_num, 0, MLX4_CMD_SW2HW_CQ, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int cq_num) +{ + return 
mlx4_cmd_box(dev, 0, mailbox ? mailbox->dma : 0, cq_num, + mailbox ? 0 : 1, MLX4_CMD_HW2SW_CQ, + MLX4_CMD_TIME_CLASS_A); +} + +int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, + struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cq_table *cq_table = &priv->cq_table; + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_cq_context *cq_context; + u64 mtt_addr; + int err; + + cq->cqn = mlx4_bitmap_alloc(&cq_table->bitmap); + if (cq->cqn == -1) + return -ENOMEM; + + err = mlx4_table_get(dev, &cq_table->table, cq->cqn); + if (err) + goto err_out; + + err = mlx4_table_get(dev, &cq_table->cmpt_table, cq->cqn); + if (err) + goto err_put; + + spin_lock_irq(&cq_table->lock); + err = radix_tree_insert(&cq_table->tree, cq->cqn, cq); + spin_unlock_irq(&cq_table->lock); + if (err) + goto err_cmpt_put; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto err_radix; + } + + cq_context = mailbox->buf; + memset(cq_context, 0, sizeof *cq_context); + + cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; + + mtt_addr = mlx4_mtt_addr(dev, mtt); + cq_context->mtt_base_addr_h = mtt_addr >> 32; + cq_context->mtt_base_addr_l = cpu_to_be32(mtt_addr & 0xffffffff); + cq_context->db_rec_addr = cpu_to_be64(db_rec); + + err = mlx4_SW2HW_CQ(dev, mailbox, cq->cqn); + mlx4_free_cmd_mailbox(dev, mailbox); + if (err) + goto err_radix; + + cq->cons_index = 0; + cq->arm_sn = 1; + cq->uar = uar; + atomic_set(&cq->refcount, 1); + init_completion(&cq->free); + + return 0; + +err_radix: + spin_lock_irq(&cq_table->lock); + radix_tree_delete(&cq_table->tree, cq->cqn); + spin_unlock_irq(&cq_table->lock); + +err_cmpt_put: + mlx4_table_put(dev, &cq_table->cmpt_table, cq->cqn); + +err_put: + mlx4_table_put(dev, &cq_table->table, 
cq->cqn); + +err_out: + mlx4_bitmap_free(&cq_table->bitmap, cq->cqn); + + return err; +} +EXPORT_SYMBOL_GPL(mlx4_cq_alloc); + +void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cq_table *cq_table = &priv->cq_table; + int err; + + err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn); + if (err) + mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); + + synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + + spin_lock_irq(&cq_table->lock); + radix_tree_delete(&cq_table->tree, cq->cqn); + spin_unlock_irq(&cq_table->lock); + + if (atomic_dec_and_test(&cq->refcount)) + complete(&cq->free); + wait_for_completion(&cq->free); + + mlx4_table_put(dev, &cq_table->table, cq->cqn); + mlx4_bitmap_free(&cq_table->bitmap, cq->cqn); +} +EXPORT_SYMBOL_GPL(mlx4_cq_free); + +int __devinit mlx4_init_cq_table(struct mlx4_dev *dev) +{ + struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table; + int err; + + spin_lock_init(&cq_table->lock); + INIT_RADIX_TREE(&cq_table->tree, GFP_ATOMIC); + + err = mlx4_bitmap_init(&cq_table->bitmap, dev->caps.num_cqs, + dev->caps.num_cqs - 1, dev->caps.reserved_cqs); + if (err) + return err; + + return 0; +} + +void mlx4_cleanup_cq_table(struct mlx4_dev *dev) +{ + /* Nothing to do to clean up radix_tree */ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->cq_table.bitmap); +} diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c new file mode 100644 index 0000000..99fccd1 --- /dev/null +++ b/drivers/net/mlx4/eq.c @@ -0,0 +1,704 @@ +/* + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include + +#include "mlx4.h" +#include "fw.h" + +enum { + MLX4_NUM_ASYNC_EQE = 0x100, + MLX4_NUM_SPARE_EQE = 0x80, + MLX4_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. 
+ */ +struct mlx4_eq_context { + __be32 flags; + u16 reserved1[3]; + __be16 page_offset; + u8 log_eq_size; + u8 reserved2[4]; + u8 eq_period; + u8 reserved3; + u8 eq_max_count; + u8 reserved4[3]; + u8 intr; + u8 log_page_size; + u8 reserved5[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + u32 reserved6[2]; + __be32 consumer_index; + __be32 producer_index; + u32 reserved7[4]; +}; + +#define MLX4_EQ_STATUS_OK ( 0 << 28) +#define MLX4_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MLX4_EQ_OWNER_SW ( 0 << 24) +#define MLX4_EQ_OWNER_HW ( 1 << 24) +#define MLX4_EQ_FLAG_EC ( 1 << 18) +#define MLX4_EQ_FLAG_OI ( 1 << 17) +#define MLX4_EQ_STATE_ARMED ( 9 << 8) +#define MLX4_EQ_STATE_FIRED (10 << 8) +#define MLX4_EQ_STATE_ALWAYS_ARMED (11 << 8) + +#define MLX4_ASYNC_EVENT_MASK ((1ull << MLX4_EVENT_TYPE_PATH_MIG) | \ + (1ull << MLX4_EVENT_TYPE_COMM_EST) | \ + (1ull << MLX4_EVENT_TYPE_SQ_DRAINED) | \ + (1ull << MLX4_EVENT_TYPE_CQ_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ull << MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_PORT_CHANGE) | \ + (1ull << MLX4_EVENT_TYPE_ECC_DETECT) | \ + (1ull << MLX4_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ull << MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE) | \ + (1ull << MLX4_EVENT_TYPE_SRQ_LIMIT) | \ + (1ull << MLX4_EVENT_TYPE_CMD)) +#define MLX4_CATAS_EVENT_MASK (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) + +struct mlx4_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + __be32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + __be16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + __be64 out_param; + } __attribute__((packed)) cmd; + struct { + __be32 qpn; + } __attribute__((packed)) qp; + struct { + __be32 srqn; + } __attribute__((packed)) srq; + struct 
{ + __be32 cqn; + u32 reserved1; + u8 reserved2[3]; + u8 syndrome; + } __attribute__((packed)) cq_err; + struct { + u32 reserved1[2]; + __be32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +static void eq_set_ci(struct mlx4_eq *eq, int req_not) +{ + __raw_writel((__force u32) cpu_to_be32((eq->cons_index & 0xffffff) | + req_not << 31), + eq->doorbell); + /* We still want ordering, just not swabbing, so add a barrier */ + mb(); +} + +static struct mlx4_eqe *get_eqe(struct mlx4_eq *eq, u32 entry) +{ + unsigned long off = (entry & (eq->nent - 1)) * MLX4_EQ_ENTRY_SIZE; + return eq->page_list[off / PAGE_SIZE].buf + off % PAGE_SIZE; +} + +static struct mlx4_eqe *next_eqe_sw(struct mlx4_eq *eq) +{ + struct mlx4_eqe *eqe = get_eqe(eq, eq->cons_index); + return !!(eqe->owner & 0x80) ^ !!(eq->cons_index & eq->nent) ? NULL : eqe; +} + +static void port_change(struct mlx4_dev *dev, int port, int active) +{ + mlx4_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); +} + +static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) +{ + struct mlx4_eqe *eqe; + int cqn; + int eqes_found = 0; + int set_ci = 0; + + while ((eqe = next_eqe_sw(eq))) { + /* + * Make sure we read EQ entry contents after we've + * checked the ownership bit. 
+ */ + rmb(); + + switch (eqe->type) { + case MLX4_EVENT_TYPE_COMP: + cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + mlx4_cq_completion(dev, cqn); + break; + + case MLX4_EVENT_TYPE_PATH_MIG: + case MLX4_EVENT_TYPE_COMM_EST: + case MLX4_EVENT_TYPE_SQ_DRAINED: + case MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE: + case MLX4_EVENT_TYPE_WQ_CATAS_ERROR: + case MLX4_EVENT_TYPE_PATH_MIG_FAILED: + case MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + case MLX4_EVENT_TYPE_WQ_ACCESS_ERROR: + mlx4_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + eqe->type); + break; + + case MLX4_EVENT_TYPE_SRQ_LIMIT: + case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR: + mlx4_srq_event(dev, be32_to_cpu(eqe->event.srq.srqn) & 0xffffff, + eqe->type); + break; + + case MLX4_EVENT_TYPE_CMD: + mlx4_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + break; + + case MLX4_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MLX4_EVENT_TYPE_CQ_ERROR: + mlx4_warn(dev, "CQ %s on CQN %06x\n", + eqe->event.cq_err.syndrome == 1 ? + "overrun" : "access violation", + be32_to_cpu(eqe->event.cq_err.cqn) & 0xffffff); + mlx4_cq_event(dev, be32_to_cpu(eqe->event.cq_err.cqn), + eqe->type); + break; + + case MLX4_EVENT_TYPE_EQ_OVERFLOW: + mlx4_warn(dev, "EQ overrun on EQN %d\n", eq->eqn); + break; + + case MLX4_EVENT_TYPE_EEC_CATAS_ERROR: + case MLX4_EVENT_TYPE_ECC_DETECT: + default: + mlx4_warn(dev, "Unhandled event %02x(%02x) on EQ %d at index %u\n", + eqe->type, eqe->subtype, eq->eqn, eq->cons_index); + break; + }; + + ++eq->cons_index; + eqes_found = 1; + ++set_ci; + + /* + * The HCA will think the queue has overflowed if we + * don't tell it we've been processing events. We + * create our EQs with MLX4_NUM_SPARE_EQE extra + * entries, so we must update our consumer index at + * least that often. 
+ */ + if (unlikely(set_ci >= MLX4_NUM_SPARE_EQE)) { + /* + * Conditional on hca_type is OK here because + * this is a rare case, not the fast path. + */ + eq_set_ci(eq, 0); + set_ci = 0; + } + } + + if (eqes_found) + eq_set_ci(eq, 1); + + return eqes_found; +} + +static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) +{ + struct mlx4_dev *dev = dev_ptr; + struct mlx4_priv *priv = mlx4_priv(dev); + int work = 0; + int i; + + writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); + + for (i = 0; i < MLX4_EQ_CATAS; ++i) + work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); + + return IRQ_RETVAL(work); +} + +static irqreturn_t mlx4_msi_x_interrupt(int irq, void *eq_ptr) +{ + struct mlx4_eq *eq = eq_ptr; + struct mlx4_dev *dev = eq->dev; + + mlx4_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static irqreturn_t mlx4_catas_interrupt(int irq, void *eq_ptr) +{ + struct mlx4_eq *eq = eq_ptr; + struct mlx4_dev *dev = eq->dev; + + mlx4_err(dev, "catastrophic error detected.\n"); + /* FIXME handle catastrophic error */ + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int mlx4_MAP_EQ(struct mlx4_dev *dev, u64 event_mask, int unmap, + int eq_num) +{ + return mlx4_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, MLX4_CMD_MAP_EQ, MLX4_CMD_TIME_CLASS_B); +} + +static int mlx4_SW2HW_EQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int eq_num) +{ + return mlx4_cmd(dev, mailbox->dma, eq_num, 0, MLX4_CMD_SW2HW_EQ, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_HW2SW_EQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int eq_num) +{ + return mlx4_cmd_box(dev, 0, mailbox->dma, eq_num, 0, MLX4_CMD_HW2SW_EQ, + MLX4_CMD_TIME_CLASS_A); +} + +static void __devinit __iomem *mlx4_get_eq_uar(struct mlx4_dev *dev, + struct mlx4_eq *eq) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int index; + + index = eq->eqn / 4 - dev->caps.reserved_eqs / 4; + + if (!priv->eq_table.uar_map[index]) { + 
priv->eq_table.uar_map[index] = + ioremap(pci_resource_start(dev->pdev, 2) + + ((eq->eqn / 4) << PAGE_SHIFT), + PAGE_SIZE); + if (!priv->eq_table.uar_map[index]) { + mlx4_err(dev, "Couldn't map EQ doorbell for EQN 0x%06x\n", + eq->eqn); + return NULL; + } + } + + return priv->eq_table.uar_map[index] + 0x800 + 8 * (eq->eqn % 4); +} + +static int __devinit mlx4_create_eq(struct mlx4_dev *dev, int nent, + u8 intr, struct mlx4_eq *eq) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_eq_context *eq_context; + int npages; + u64 *dma_list = NULL; + dma_addr_t t; + u64 mtt_addr; + int err = -ENOMEM; + int i; + + eq->dev = dev; + eq->nent = roundup_pow_of_two(max(nent, 2)); + npages = PAGE_ALIGN(eq->nent * MLX4_EQ_ENTRY_SIZE) / PAGE_SIZE; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + goto err_out_free; + eq_context = mailbox->buf; + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = dma_alloc_coherent(&dev->pdev->dev, + PAGE_SIZE, &t, GFP_KERNEL); + if (!eq->page_list[i].buf) + goto err_out_free_pages; + + dma_list[i] = t; + eq->page_list[i].map = t; + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + eq->eqn = mlx4_bitmap_alloc(&priv->eq_table.bitmap); + if (eq->eqn == -1) + goto err_out_free_pages; + + eq->doorbell = mlx4_get_eq_uar(dev, eq); + if (!eq->doorbell) { + err = -ENOMEM; + goto err_out_free_eq; + } + + err = mlx4_mtt_init(dev, npages, PAGE_SHIFT, &eq->mtt); + if (err) + goto err_out_free_eq; + + err = mlx4_write_mtt(dev, &eq->mtt, 0, npages, dma_list); + if (err) + goto err_out_free_mtt; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MLX4_EQ_STATUS_OK | + MLX4_EQ_STATE_ARMED); 
+ eq_context->log_eq_size = ilog2(eq->nent); + eq_context->intr = intr; + eq_context->log_page_size = PAGE_SHIFT - MLX4_ICM_PAGE_SHIFT; + + mtt_addr = mlx4_mtt_addr(dev, &eq->mtt); + eq_context->mtt_base_addr_h = mtt_addr >> 32; + eq_context->mtt_base_addr_l = cpu_to_be32(mtt_addr & 0xffffffff); + + err = mlx4_SW2HW_EQ(dev, mailbox, eq->eqn); + if (err) { + mlx4_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mtt; + } + + kfree(dma_list); + mlx4_free_cmd_mailbox(dev, mailbox); + + eq->cons_index = 0; + + return err; + +err_out_free_mtt: + mlx4_mtt_cleanup(dev, &eq->mtt); + +err_out_free_eq: + mlx4_bitmap_free(&priv->eq_table.bitmap, eq->eqn); + +err_out_free_pages: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, + eq->page_list[i].buf, + eq->page_list[i].map); + + mlx4_free_cmd_mailbox(dev, mailbox); + +err_out_free: + kfree(eq->page_list); + kfree(dma_list); + +err_out: + return err; +} + +static void mlx4_free_eq(struct mlx4_dev *dev, + struct mlx4_eq *eq) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cmd_mailbox *mailbox; + int err; + int npages = PAGE_ALIGN(MLX4_EQ_ENTRY_SIZE * eq->nent) / PAGE_SIZE; + int i; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return; + + err = mlx4_HW2SW_EQ(dev, mailbox, eq->eqn); + if (err) + mlx4_warn(dev, "HW2SW_EQ failed (%d)\n", err); + + if (0) { + mlx4_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mlx4_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(mailbox->buf + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + mlx4_mtt_cleanup(dev, &eq->mtt); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + eq->page_list[i].map); + + kfree(eq->page_list); + mlx4_bitmap_free(&priv->eq_table.bitmap, eq->eqn); + mlx4_free_cmd_mailbox(dev, mailbox); +} + +static void 
mlx4_free_irqs(struct mlx4_dev *dev) +{ + struct mlx4_eq_table *eq_table = &mlx4_priv(dev)->eq_table; + int i; + + if (eq_table->have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MLX4_NUM_EQ; ++i) + if (eq_table->eq[i].have_irq) + free_irq(eq_table->eq[i].irq, eq_table->eq + i); +} + +static int __devinit mlx4_map_clr_int(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + priv->clr_base = ioremap(pci_resource_start(dev->pdev, priv->fw.clr_int_bar) + + priv->fw.clr_int_base, MLX4_CLR_INT_SIZE); + if (!priv->clr_base) { + mlx4_err(dev, "Couldn't map interrupt clear register, aborting.\n"); + return -ENOMEM; + } + + return 0; +} + +static void mlx4_unmap_clr_int(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + iounmap(priv->clr_base); +} + +int __devinit mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int ret; + + /* + * We assume that mapping one page is enough for the whole EQ + * context table. This is fine with all current HCAs, because + * we only use 32 EQs and each EQ uses 64 bytes of context + * memory, or 1 KB total. 
+ */ + priv->eq_table.icm_virt = icm_virt; + priv->eq_table.icm_page = alloc_page(GFP_HIGHUSER); + if (!priv->eq_table.icm_page) + return -ENOMEM; + priv->eq_table.icm_dma = pci_map_page(dev->pdev, priv->eq_table.icm_page, 0, + PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); + if (pci_dma_mapping_error(priv->eq_table.icm_dma)) { + __free_page(priv->eq_table.icm_page); + return -ENOMEM; + } + + ret = mlx4_MAP_ICM_page(dev, priv->eq_table.icm_dma, icm_virt); + if (ret) { + pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE, + PCI_DMA_BIDIRECTIONAL); + __free_page(priv->eq_table.icm_page); + } + + return ret; +} + +void mlx4_unmap_eq_icm(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + mlx4_UNMAP_ICM(dev, priv->eq_table.icm_virt, 1); + pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE, + PCI_DMA_BIDIRECTIONAL); + __free_page(priv->eq_table.icm_page); +} + +int __devinit mlx4_init_eq_table(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int err; + int i; + + err = mlx4_bitmap_init(&priv->eq_table.bitmap, dev->caps.num_eqs, + dev->caps.num_eqs - 1, dev->caps.reserved_eqs); + if (err) + return err; + + for (i = 0; i < ARRAY_SIZE(priv->eq_table.uar_map); ++i) + priv->eq_table.uar_map[i] = NULL; + + err = mlx4_map_clr_int(dev); + if (err) + goto err_out_free; + + priv->eq_table.clr_mask = + swab32(1 << (priv->eq_table.inta_pin & 31)); + priv->eq_table.clr_int = priv->clr_base + + (priv->eq_table.inta_pin < 32 ? 4 : 0); + + err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, + &priv->eq_table.eq[MLX4_EQ_COMP]); + if (err) + goto err_out_unmap; + + err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? 
MLX4_EQ_ASYNC : 0, + &priv->eq_table.eq[MLX4_EQ_ASYNC]); + if (err) + goto err_out_comp; + + if (dev->flags & MLX4_FLAG_MSI_X) { + static const char *eq_name[] = { + [MLX4_EQ_COMP] = DRV_NAME " (comp)", + [MLX4_EQ_ASYNC] = DRV_NAME " (async)", + [MLX4_EQ_CATAS] = DRV_NAME " (catas)" + }; + + err = mlx4_create_eq(dev, 1, MLX4_EQ_CATAS, + &priv->eq_table.eq[MLX4_EQ_CATAS]); + if (err) + goto err_out_async; + + for (i = 0; i <= MLX4_EQ_CATAS; ++i) { + err = request_irq(priv->eq_table.eq[i].irq, + i == MLX4_EQ_CATAS ? + mlx4_catas_interrupt : + mlx4_msi_x_interrupt, + 0, eq_name[i], priv->eq_table.eq + i); + if (err) + goto err_out_catas; + + priv->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mlx4_interrupt, + SA_SHIRQ, DRV_NAME, dev); + if (err) + goto err_out_catas; + + priv->eq_table.have_irq = 1; + } + + err = mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 0, + priv->eq_table.eq[MLX4_EQ_ASYNC].eqn); + if (err) + mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); + + for (i = 0; i < MLX4_EQ_CATAS; ++i) + eq_set_ci(&priv->eq_table.eq[i], 1); + + if (dev->flags & MLX4_FLAG_MSI_X) { + err = mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 0, + priv->eq_table.eq[MLX4_EQ_CATAS].eqn); + if (err) + mlx4_warn(dev, "MAP_EQ for catas EQ %d failed (%d)\n", + priv->eq_table.eq[MLX4_EQ_CATAS].eqn, err); + eq_set_ci(&priv->eq_table.eq[MLX4_EQ_CATAS], 1); + } + + return 0; + +err_out_catas: + if (dev->flags & MLX4_FLAG_MSI_X) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); + +err_out_async: + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); + +err_out_comp: + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + +err_out_unmap: + mlx4_unmap_clr_int(dev); + mlx4_free_irqs(dev); + +err_out_free: + mlx4_bitmap_cleanup(&priv->eq_table.bitmap); + return err; +} + +void mlx4_cleanup_eq_table(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int i; + + if (dev->flags & MLX4_FLAG_MSI_X) 
+ mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 1, + priv->eq_table.eq[MLX4_EQ_CATAS].eqn); + + mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 1, + priv->eq_table.eq[MLX4_EQ_ASYNC].eqn); + + mlx4_free_irqs(dev); + + for (i = 0; i < MLX4_EQ_CATAS; ++i) + mlx4_free_eq(dev, &priv->eq_table.eq[i]); + if (dev->flags & MLX4_FLAG_MSI_X) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); + + mlx4_unmap_clr_int(dev); + + for (i = 0; i < ARRAY_SIZE(priv->eq_table.uar_map); ++i) + if (priv->eq_table.uar_map[i]) + iounmap(priv->eq_table.uar_map[i]); + + mlx4_bitmap_cleanup(&priv->eq_table.bitmap); +} diff --git a/drivers/net/mlx4/icm.c b/drivers/net/mlx4/icm.c new file mode 100644 index 0000000..6a6c372 --- /dev/null +++ b/drivers/net/mlx4/icm.c @@ -0,0 +1,379 @@ +/* + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include + +#include "mlx4.h" +#include "icm.h" +#include "fw.h" + +/* + * We allocate in as big chunks as we can, up to a maximum of 256 KB + * per chunk. + */ +enum { + MLX4_ICM_ALLOC_SIZE = 1 << 18, + MLX4_TABLE_CHUNK_SIZE = 1 << 18 +}; + +void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm) +{ + struct mlx4_icm_chunk *chunk, *tmp; + int i; + + list_for_each_entry_safe(chunk, tmp, &icm->chunk_list, list) { + if (chunk->nsg > 0) + pci_unmap_sg(dev->pdev, chunk->mem, chunk->npages, + PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < chunk->npages; ++i) + __free_pages(chunk->mem[i].page, + get_order(chunk->mem[i].length)); + + kfree(chunk); + } + + kfree(icm); +} + +struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages, + gfp_t gfp_mask) +{ + struct mlx4_icm *icm; + struct mlx4_icm_chunk *chunk = NULL; + int cur_order; + + icm = kmalloc(sizeof *icm, gfp_mask & ~(__GFP_HIGHMEM | __GFP_NOWARN)); + if (!icm) + return icm; + + icm->refcount = 0; + INIT_LIST_HEAD(&icm->chunk_list); + + cur_order = get_order(MLX4_ICM_ALLOC_SIZE); + + while (npages > 0) { + if (!chunk) { + chunk = kmalloc(sizeof *chunk, + gfp_mask & ~(__GFP_HIGHMEM | __GFP_NOWARN)); + if (!chunk) + goto fail; + + chunk->npages = 0; + chunk->nsg = 0; + list_add_tail(&chunk->list, &icm->chunk_list); + } + + while (1 << cur_order > npages) + --cur_order; + + chunk->mem[chunk->npages].page = alloc_pages(gfp_mask, cur_order); + if (chunk->mem[chunk->npages].page) { + chunk->mem[chunk->npages].length = PAGE_SIZE << cur_order; + chunk->mem[chunk->npages].offset = 0; + + if (++chunk->npages == MLX4_ICM_CHUNK_LEN) { + chunk->nsg = pci_map_sg(dev->pdev, chunk->mem, + chunk->npages, + 
PCI_DMA_BIDIRECTIONAL); + + if (chunk->nsg <= 0) + goto fail; + + chunk = NULL; + } + + npages -= 1 << cur_order; + } else { + --cur_order; + if (cur_order < 0) + goto fail; + } + } + + if (chunk) { + chunk->nsg = pci_map_sg(dev->pdev, chunk->mem, + chunk->npages, + PCI_DMA_BIDIRECTIONAL); + + if (chunk->nsg <= 0) + goto fail; + } + + return icm; + +fail: + mlx4_free_icm(dev, icm); + return NULL; +} + +static int mlx4_MAP_ICM(struct mlx4_dev *dev, struct mlx4_icm *icm, u64 virt) +{ + return mlx4_map_cmd(dev, MLX4_CMD_MAP_ICM, icm, virt); +} + +int mlx4_UNMAP_ICM(struct mlx4_dev *dev, u64 virt, u32 page_count) +{ + return mlx4_cmd(dev, virt, page_count, 0, MLX4_CMD_UNMAP_ICM, + MLX4_CMD_TIME_CLASS_B); +} + +int mlx4_MAP_ICM_page(struct mlx4_dev *dev, u64 dma_addr, u64 virt) +{ + struct mlx4_cmd_mailbox *mailbox; + __be64 *inbox; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + inbox[0] = cpu_to_be64(virt); + inbox[1] = cpu_to_be64(dma_addr); + + err = mlx4_cmd(dev, mailbox->dma, 1, 0, MLX4_CMD_MAP_ICM, + MLX4_CMD_TIME_CLASS_B); + + mlx4_free_cmd_mailbox(dev, mailbox); + + if (!err) + mlx4_dbg(dev, "Mapped page at %llx to %llx for ICM.\n", + (unsigned long long) dma_addr, (unsigned long long) virt); + + return err; +} + +int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm) +{ + return mlx4_map_cmd(dev, MLX4_CMD_MAP_ICM_AUX, icm, -1); +} + +int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev) +{ + return mlx4_cmd(dev, 0, 0, 0, MLX4_CMD_UNMAP_ICM_AUX, MLX4_CMD_TIME_CLASS_B); +} + +int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj) +{ + int i = (obj & (table->num_obj - 1)) / (MLX4_TABLE_CHUNK_SIZE / table->obj_size); + int ret = 0; + + mutex_lock(&table->mutex); + + if (table->icm[i]) { + ++table->icm[i]->refcount; + goto out; + } + + table->icm[i] = mlx4_alloc_icm(dev, MLX4_TABLE_CHUNK_SIZE >> PAGE_SHIFT, + (table->lowmem ? 
GFP_KERNEL : GFP_HIGHUSER) | + __GFP_NOWARN); + if (!table->icm[i]) { + ret = -ENOMEM; + goto out; + } + + if (mlx4_MAP_ICM(dev, table->icm[i], table->virt + + (u64) i * MLX4_TABLE_CHUNK_SIZE)) { + mlx4_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + ret = -ENOMEM; + goto out; + } + + ++table->icm[i]->refcount; + +out: + mutex_unlock(&table->mutex); + return ret; +} + +void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj) +{ + int i; + + i = (obj & (table->num_obj - 1)) / (MLX4_TABLE_CHUNK_SIZE / table->obj_size); + + mutex_lock(&table->mutex); + + if (--table->icm[i]->refcount == 0) { + mlx4_UNMAP_ICM(dev, table->virt + i * MLX4_TABLE_CHUNK_SIZE, + MLX4_TABLE_CHUNK_SIZE / MLX4_ICM_PAGE_SIZE); + mlx4_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + } + + mutex_unlock(&table->mutex); +} + +void *mlx4_table_find(struct mlx4_icm_table *table, int obj) +{ + int idx, offset, i; + struct mlx4_icm_chunk *chunk; + struct mlx4_icm *icm; + struct page *page = NULL; + + if (!table->lowmem) + return NULL; + + mutex_lock(&table->mutex); + + idx = obj & (table->num_obj - 1); + icm = table->icm[idx / (MLX4_TABLE_CHUNK_SIZE / table->obj_size)]; + offset = idx % (MLX4_TABLE_CHUNK_SIZE / table->obj_size); + + if (!icm) + goto out; + + list_for_each_entry(chunk, &icm->chunk_list, list) { + for (i = 0; i < chunk->npages; ++i) { + if (chunk->mem[i].length > offset) { + page = chunk->mem[i].page; + goto out; + } + offset -= chunk->mem[i].length; + } + } + +out: + mutex_unlock(&table->mutex); + return page ? 
lowmem_page_address(page) + offset : NULL; +} + +int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end) +{ + int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size; + int i, err; + + for (i = start; i <= end; i += inc) { + err = mlx4_table_get(dev, table, i); + if (err) + goto fail; + } + + return 0; + +fail: + while (i > start) { + i -= inc; + mlx4_table_put(dev, table, i); + } + + return err; +} + +void mlx4_table_put_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end) +{ + int i; + + for (i = start; i <= end; i += MLX4_TABLE_CHUNK_SIZE / table->obj_size) + mlx4_table_put(dev, table, i); +} + +int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, + u64 virt, int obj_size, int nobj, int reserved, + int use_lowmem) +{ + int obj_per_chunk; + int num_icm; + unsigned chunk_size; + int i; + + obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; + num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; + + table->icm = kcalloc(num_icm, sizeof *table->icm, GFP_KERNEL); + if (!table->icm) + return -ENOMEM; + table->virt = virt; + table->num_icm = num_icm; + table->num_obj = nobj; + table->obj_size = obj_size; + table->lowmem = use_lowmem; + mutex_init(&table->mutex); + + for (i = 0; i * MLX4_TABLE_CHUNK_SIZE < reserved * obj_size; ++i) { + chunk_size = MLX4_TABLE_CHUNK_SIZE; + if ((i + 1) * MLX4_TABLE_CHUNK_SIZE > nobj * obj_size) + chunk_size = PAGE_ALIGN(nobj * obj_size - i * MLX4_TABLE_CHUNK_SIZE); + + table->icm[i] = mlx4_alloc_icm(dev, chunk_size >> PAGE_SHIFT, + (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) | + __GFP_NOWARN); + if (!table->icm[i]) + goto err; + if (mlx4_MAP_ICM(dev, table->icm[i], virt + i * MLX4_TABLE_CHUNK_SIZE)) { + mlx4_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + goto err; + } + + /* + * Add a reference to this ICM chunk so that it never + * gets freed (since it contains reserved firmware objects). 
+ */ + ++table->icm[i]->refcount; + } + + return 0; + +err: + for (i = 0; i < num_icm; ++i) + if (table->icm[i]) { + mlx4_UNMAP_ICM(dev, virt + i * MLX4_TABLE_CHUNK_SIZE, + MLX4_TABLE_CHUNK_SIZE / MLX4_ICM_PAGE_SIZE); + mlx4_free_icm(dev, table->icm[i]); + } + + return -ENOMEM; +} + +void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) +{ + int i; + + for (i = 0; i < table->num_icm; ++i) + if (table->icm[i]) { + mlx4_UNMAP_ICM(dev, table->virt + i * MLX4_TABLE_CHUNK_SIZE, + MLX4_TABLE_CHUNK_SIZE / MLX4_ICM_PAGE_SIZE); + mlx4_free_icm(dev, table->icm[i]); + } + + kfree(table->icm); +} diff --git a/drivers/net/mlx4/icm.h b/drivers/net/mlx4/icm.h new file mode 100644 index 0000000..7119edb --- /dev/null +++ b/drivers/net/mlx4/icm.h @@ -0,0 +1,135 @@ +/* + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_ICM_H +#define MLX4_ICM_H + +#include +#include +#include + +#define MLX4_ICM_CHUNK_LEN \ + ((256 - sizeof (struct list_head) - 2 * sizeof (int)) / \ + (sizeof (struct scatterlist))) + +enum { + MLX4_ICM_PAGE_SHIFT = 12, + MLX4_ICM_PAGE_SIZE = 1 << MLX4_ICM_PAGE_SHIFT, +}; + +struct mlx4_icm_chunk { + struct list_head list; + int npages; + int nsg; + struct scatterlist mem[MLX4_ICM_CHUNK_LEN]; +}; + +struct mlx4_icm { + struct list_head chunk_list; + int refcount; +}; + +struct mlx4_icm_iter { + struct mlx4_icm *icm; + struct mlx4_icm_chunk *chunk; + int page_idx; +}; + +struct mlx4_dev; + +struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages, gfp_t gfp_mask); +void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm); + +int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj); +void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj); +int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end); +void mlx4_table_put_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end); +int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, + u64 virt, int obj_size, int nobj, int reserved, + int use_lowmem); +void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table); +int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj); +void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, int obj); +void *mlx4_table_find(struct mlx4_icm_table *table, int obj); +int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end); +void 
mlx4_table_put_range(struct mlx4_dev *dev, struct mlx4_icm_table *table, + int start, int end); + +static inline void mlx4_icm_first(struct mlx4_icm *icm, + struct mlx4_icm_iter *iter) +{ + iter->icm = icm; + iter->chunk = list_empty(&icm->chunk_list) ? + NULL : list_entry(icm->chunk_list.next, + struct mlx4_icm_chunk, list); + iter->page_idx = 0; +} + +static inline int mlx4_icm_last(struct mlx4_icm_iter *iter) +{ + return !iter->chunk; +} + +static inline void mlx4_icm_next(struct mlx4_icm_iter *iter) +{ + if (++iter->page_idx >= iter->chunk->nsg) { + if (iter->chunk->list.next == &iter->icm->chunk_list) { + iter->chunk = NULL; + return; + } + + iter->chunk = list_entry(iter->chunk->list.next, + struct mlx4_icm_chunk, list); + iter->page_idx = 0; + } +} + +static inline dma_addr_t mlx4_icm_addr(struct mlx4_icm_iter *iter) +{ + return sg_dma_address(&iter->chunk->mem[iter->page_idx]); +} + +static inline unsigned long mlx4_icm_size(struct mlx4_icm_iter *iter) +{ + return sg_dma_len(&iter->chunk->mem[iter->page_idx]); +} + +int mlx4_UNMAP_ICM(struct mlx4_dev *dev, u64 virt, u32 page_count); +int mlx4_MAP_ICM_page(struct mlx4_dev *dev, u64 dma_addr, u64 virt); +int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); +int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); + +#endif /* MLX4_ICM_H */ diff --git a/drivers/net/mlx4/mcg.c b/drivers/net/mlx4/mcg.c new file mode 100644 index 0000000..a44dfd4 --- /dev/null +++ b/drivers/net/mlx4/mcg.c @@ -0,0 +1,370 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include + +#include "mlx4.h" + +struct mlx4_mgm { + __be32 next_gid_index; + __be32 members_count; + u32 reserved[2]; + u8 gid[16]; + __be32 qp[MLX4_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +static int mlx4_READ_MCG(struct mlx4_dev *dev, int index, + struct mlx4_cmd_mailbox *mailbox) +{ + return mlx4_cmd_box(dev, 0, mailbox->dma, index, 0, MLX4_CMD_READ_MCG, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_WRITE_MCG(struct mlx4_dev *dev, int index, + struct mlx4_cmd_mailbox *mailbox) +{ + return mlx4_cmd(dev, mailbox->dma, index, 0, MLX4_CMD_WRITE_MCG, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_MGID_HASH(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + u16 *hash) +{ + u64 imm; + int err; + + err = mlx4_cmd_imm(dev, mailbox->dma, &imm, 0, 0, MLX4_CMD_MGID_HASH, + MLX4_CMD_TIME_CLASS_A); + + if (!err) + *hash = imm; + + return err; +} + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
+ */ +static int find_mgm(struct mlx4_dev *dev, + u8 *gid, struct mlx4_cmd_mailbox *mgm_mailbox, + u16 *hash, int *prev, int *index) +{ + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_mgm *mgm = mgm_mailbox->buf; + u8 *mgid; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return -ENOMEM; + mgid = mailbox->buf; + + memcpy(mgid, gid, 16); + + err = mlx4_MGID_HASH(dev, mailbox, hash); + mlx4_free_cmd_mailbox(dev, mailbox); + if (err) + return err; + + if (0) + mlx4_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((__be16 *) gid)[0]), + be16_to_cpu(((__be16 *) gid)[1]), + be16_to_cpu(((__be16 *) gid)[2]), + be16_to_cpu(((__be16 *) gid)[3]), + be16_to_cpu(((__be16 *) gid)[4]), + be16_to_cpu(((__be16 *) gid)[5]), + be16_to_cpu(((__be16 *) gid)[6]), + be16_to_cpu(((__be16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mlx4_READ_MCG(dev, *index, mgm_mailbox); + if (err) + return err; + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mlx4_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + return err; + } + + if (!memcmp(mgm->gid, gid, 16)) + return err; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 6; + } while (*index); + + *index = -1; + return err; +} + +int mlx4_multicast_attach(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16]) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_mgm *mgm; + u32 members_count; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + mgm = mailbox->buf; + + mutex_lock(&priv->mcg_table.mutex); + + err = find_mgm(dev, gid, mailbox, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid, 16); + } else { + link = 1; + + index = 
mlx4_bitmap_alloc(&priv->mcg_table.bitmap); + if (index == -1) { + mlx4_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mlx4_READ_MCG(dev, index, mailbox); + if (err) + goto out; + + memset(mgm, 0, sizeof *mgm); + memcpy(mgm->gid, gid, 16); + } + + members_count = be32_to_cpu(mgm->members_count); + if (members_count == MLX4_QP_PER_MGM) { + mlx4_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + for (i = 0; i < members_count; ++i) + if (mgm->qp[i] == cpu_to_be32(qp->qpn)) { + mlx4_dbg(dev, "QP %06x already a member of MGM\n", qp->qpn); + err = 0; + goto out; + } + + mgm->qp[members_count++] = cpu_to_be32(qp->qpn); + mgm->members_count = cpu_to_be32(members_count); + + err = mlx4_WRITE_MCG(dev, index, mailbox); + if (err) + goto out; + + if (!link) + goto out; + + err = mlx4_READ_MCG(dev, prev, mailbox); + if (err) + goto out; + + mgm->next_gid_index = cpu_to_be32(index << 6); + + err = mlx4_WRITE_MCG(dev, prev, mailbox); + if (err) + goto out; + +out: + if (err && link && index != -1) { + BUG_ON(index < dev->caps.num_mgms); + mlx4_bitmap_free(&priv->mcg_table.bitmap, index); + } + mutex_unlock(&priv->mcg_table.mutex); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_multicast_attach); + +int mlx4_multicast_detach(struct mlx4_dev *dev, struct mlx4_qp *qp, u8 gid[16]) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_mgm *mgm; + u32 members_count; + u16 hash; + int prev, index; + int i, loc; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + mgm = mailbox->buf; + + mutex_lock(&priv->mcg_table.mutex); + + err = find_mgm(dev, gid, mailbox, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mlx4_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((__be16 *) gid)[0]), + be16_to_cpu(((__be16 *) gid)[1]), + 
be16_to_cpu(((__be16 *) gid)[2]), + be16_to_cpu(((__be16 *) gid)[3]), + be16_to_cpu(((__be16 *) gid)[4]), + be16_to_cpu(((__be16 *) gid)[5]), + be16_to_cpu(((__be16 *) gid)[6]), + be16_to_cpu(((__be16 *) gid)[7])); + err = -EINVAL; + goto out; + } + + members_count = be32_to_cpu(mgm->members_count); + for (loc = -1, i = 0; i < members_count; ++i) + if (mgm->qp[i] == cpu_to_be32(qp->qpn)) + loc = i; + + if (loc == -1) { + mlx4_err(dev, "QP %06x not found in MGM\n", qp->qpn); + err = -EINVAL; + goto out; + } + + + mgm->members_count = cpu_to_be32(--members_count); + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mlx4_WRITE_MCG(dev, index, mailbox); + if (err) + goto out; + + if (i != 1) + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + int amgm_index_to_free = be32_to_cpu(mgm->next_gid_index) >> 6; + if (amgm_index_to_free) { + err = mlx4_READ_MCG(dev, amgm_index_to_free, mailbox); + if (err) + goto out; + } else + memset(mgm->gid, 0, 16); + + err = mlx4_WRITE_MCG(dev, index, mailbox); + if (err) + goto out; + + if (amgm_index_to_free) { + BUG_ON(amgm_index_to_free < dev->caps.num_mgms); + mlx4_bitmap_free(&priv->mcg_table.bitmap, amgm_index_to_free); + } + } else { + /* Remove entry from AMGM */ + int curr_next_index = be32_to_cpu(mgm->next_gid_index) >> 6; + err = mlx4_READ_MCG(dev, prev, mailbox); + if (err) + goto out; + + mgm->next_gid_index = cpu_to_be32(curr_next_index << 6); + + err = mlx4_WRITE_MCG(dev, prev, mailbox); + if (err) + goto out; + + BUG_ON(index < dev->caps.num_mgms); + mlx4_bitmap_free(&priv->mcg_table.bitmap, index); + } + +out: + mutex_unlock(&priv->mcg_table.mutex); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_multicast_detach); + +int __devinit mlx4_init_mcg_table(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int table_size; + int err; + + table_size = dev->caps.num_mgms + dev->caps.num_amgms; + err = mlx4_bitmap_init(&priv->mcg_table.bitmap, + 
table_size, table_size - 1, + dev->caps.num_mgms); + if (err) + return err; + + mutex_init(&priv->mcg_table.mutex); + + return 0; +} + +void mlx4_cleanup_mcg_table(struct mlx4_dev *dev) +{ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->mcg_table.bitmap); +} diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c new file mode 100644 index 0000000..da6d49a --- /dev/null +++ b/drivers/net/mlx4/mr.c @@ -0,0 +1,482 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include + +#include "mlx4.h" +#include "icm.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mlx4_mpt_entry { + __be32 flags; + __be32 qpn; + __be32 key; + __be32 pd; + __be64 start; + __be64 length; + __be32 lkey; + __be32 win_cnt; + u8 reserved1[3]; + u8 mtt_rep; + __be64 mtt_seg; + __be32 mtt_sz; + __be32 entity_size; + __be32 first_byte_offset; +} __attribute__((packed)); + +#define MLX4_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MLX4_MPT_FLAG_MIO (1 << 17) +#define MLX4_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MLX4_MPT_FLAG_PHYSICAL (1 << 9) +#define MLX4_MPT_FLAG_REGION (1 << 8) + +#define MLX4_MTT_FLAG_PRESENT 1 + +static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&buddy->lock); + + for (o = order; o <= buddy->max_order; ++o) { + m = 1 << (buddy->max_order - o); + seg = find_first_bit(buddy->bits[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&buddy->lock); + return -1; + + found: + clear_bit(seg, buddy->bits[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, buddy->bits[o]); + } + + spin_unlock(&buddy->lock); + + seg <<= order; + + return seg; +} + +static void mlx4_buddy_free(struct mlx4_buddy *buddy, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&buddy->lock); + + while (test_bit(seg ^ 1, buddy->bits[order])) { + clear_bit(seg ^ 1, buddy->bits[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, buddy->bits[order]); + + spin_unlock(&buddy->lock); +} + +static int __devinit mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order) +{ + int i, s; + + buddy->max_order = max_order; + spin_lock_init(&buddy->lock); + + buddy->bits = kzalloc((buddy->max_order + 1) * sizeof (long *), + GFP_KERNEL); + if (!buddy->bits) + goto err_out; + + for (i = 0; i <= buddy->max_order; ++i) { + s = BITS_TO_LONGS(1 << (buddy->max_order - i)); + buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL); + if 
(!buddy->bits[i]) + goto err_out_free; + bitmap_zero(buddy->bits[i], 1 << (buddy->max_order - i)); + } + + set_bit(0, buddy->bits[buddy->max_order]); + + return 0; + +err_out_free: + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); + +err_out: + return -ENOMEM; +} + +static void mlx4_buddy_cleanup(struct mlx4_buddy *buddy) +{ + int i; + + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); +} + +static u32 mlx4_alloc_mtt_range(struct mlx4_dev *dev, int order) +{ + struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table; + u32 seg; + + seg = mlx4_buddy_alloc(&mr_table->mtt_buddy, order); + if (seg == -1) + return -1; + + if (mlx4_table_get_range(dev, &mr_table->mtt_table, seg, + seg + (1 << order) - 1)) { + mlx4_buddy_free(&mr_table->mtt_buddy, seg, order); + return -1; + } + + return seg; +} + +int mlx4_mtt_init(struct mlx4_dev *dev, int npages, int page_shift, + struct mlx4_mtt *mtt) +{ + int i; + + if (!npages) { + mtt->order = -1; + mtt->page_shift = MLX4_ICM_PAGE_SHIFT; + return 0; + } else + mtt->page_shift = page_shift; + + for (mtt->order = 0, i = MLX4_MTT_ENTRY_PER_SEG; i < npages; i <<= 1) + ++mtt->order; + + mtt->first_seg = mlx4_alloc_mtt_range(dev, mtt->order); + if (mtt->first_seg == -1) + return -ENOMEM; + + return 0; +} +EXPORT_SYMBOL_GPL(mlx4_mtt_init); + +void mlx4_mtt_cleanup(struct mlx4_dev *dev, struct mlx4_mtt *mtt) +{ + struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table; + + if (mtt->order < 0) + return; + + mlx4_buddy_free(&mr_table->mtt_buddy, mtt->first_seg, mtt->order); + mlx4_table_put_range(dev, &mr_table->mtt_table, mtt->first_seg, + mtt->first_seg + (1 << mtt->order) - 1); +} +EXPORT_SYMBOL_GPL(mlx4_mtt_cleanup); + +u64 mlx4_mtt_addr(struct mlx4_dev *dev, struct mlx4_mtt *mtt) +{ + return (u64) mtt->first_seg * dev->caps.mtt_entry_sz; +} +EXPORT_SYMBOL_GPL(mlx4_mtt_addr); + +static u32 hw_index_to_key(u32 ind) +{ + return (ind >> 24) | (ind << 8); 
+} + +static u32 key_to_hw_index(u32 key) +{ + return (key << 24) | (key >> 8); +} + +static int mlx4_SW2HW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int mpt_index) +{ + return mlx4_cmd(dev, mailbox->dma, mpt_index, 0, MLX4_CMD_SW2HW_MPT, + MLX4_CMD_TIME_CLASS_B); +} + +static int mlx4_HW2SW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int mpt_index) +{ + return mlx4_cmd_box(dev, 0, mailbox ? mailbox->dma : 0, mpt_index, + !mailbox, MLX4_CMD_HW2SW_MPT, MLX4_CMD_TIME_CLASS_B); +} + +int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access, + int npages, int page_shift, struct mlx4_mr *mr) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + u32 index; + int err; + + index = mlx4_bitmap_alloc(&priv->mr_table.mpt_bitmap); + if (index == -1) { + err = -ENOMEM; + goto err; + } + + mr->iova = iova; + mr->size = size; + mr->pd = pd; + mr->access = access; + mr->enabled = 0; + mr->key = hw_index_to_key(index); + + err = mlx4_mtt_init(dev, npages, page_shift, &mr->mtt); + if (err) + goto err_index; + + return 0; + +err_index: + mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index); + +err: + kfree(mr); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_mr_alloc); + +void mlx4_mr_free(struct mlx4_dev *dev, struct mlx4_mr *mr) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + int err; + + /*FIXME don't do this yet -- FW (2.0.138) seems to barf if we do */ + return; + + if (mr->enabled) { + err = mlx4_HW2SW_MPT(dev, NULL, + key_to_hw_index(mr->key) & + (dev->caps.num_mpts - 1)); + if (err) + mlx4_warn(dev, "HW2SW_MPT failed (%d)\n", err); + } + + mlx4_mtt_cleanup(dev, &mr->mtt); + mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, key_to_hw_index(mr->key)); +} +EXPORT_SYMBOL_GPL(mlx4_mr_free); + +int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) +{ + struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table; + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_mpt_entry *mpt_entry; + int err; + + err = mlx4_table_get(dev, 
&mr_table->dmpt_table, key_to_hw_index(mr->key)); + if (err) + return err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto err_table; + } + mpt_entry = mailbox->buf; + + memset(mpt_entry, 0, sizeof *mpt_entry); + + mpt_entry->flags = cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS | + MLX4_MPT_FLAG_MIO | + MLX4_MPT_FLAG_REGION | + mr->access); + if (mr->mtt.order < 0) + mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_PHYSICAL); + + mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key)); + mpt_entry->pd = cpu_to_be32(mr->pd); + mpt_entry->start = cpu_to_be64(mr->iova); + mpt_entry->length = cpu_to_be64(mr->size); + mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift); + mpt_entry->mtt_seg = cpu_to_be64(mlx4_mtt_addr(dev, &mr->mtt)); + + err = mlx4_SW2HW_MPT(dev, mailbox, + key_to_hw_index(mr->key) & (dev->caps.num_mpts - 1)); + if (err) { + mlx4_warn(dev, "SW2HW_MPT failed (%d)\n", err); + goto err_cmd; + } + + mr->enabled = 1; + + mlx4_free_cmd_mailbox(dev, mailbox); + + return 0; + +err_cmd: + mlx4_free_cmd_mailbox(dev, mailbox); + +err_table: + mlx4_table_put(dev, &mr_table->dmpt_table, key_to_hw_index(mr->key)); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_mr_enable); + +static int mlx4_WRITE_MTT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int num_mtt) +{ + return mlx4_cmd(dev, mailbox->dma, num_mtt, 0, MLX4_CMD_WRITE_MTT, + MLX4_CMD_TIME_CLASS_B); +} + +int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + int start_index, int npages, u64 *page_list) +{ + struct mlx4_cmd_mailbox *mailbox; + __be64 *mtt_entry; + int i; + int err = 0; + + if (mtt->order < 0) + return -EINVAL; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + mtt_entry = mailbox->buf; + + while (npages > 0) { + mtt_entry[0] = cpu_to_be64(mlx4_mtt_addr(dev, mtt) + start_index * 8); + mtt_entry[1] = 0; + + for (i = 0; i < npages && i < MLX4_MAILBOX_SIZE / 8 - 2; ++i) + mtt_entry[i + 2] = 
cpu_to_be64(page_list[i] | + MLX4_MTT_FLAG_PRESENT); + + /* + * If we have an odd number of entries to write, add + * one more dummy entry for firmware efficiency. + */ + if (i & 1) + mtt_entry[i + 2] = 0; + + err = mlx4_WRITE_MTT(dev, mailbox, (i + 1) & ~1); + if (err) + goto out; + + npages -= i; + start_index += i; + page_list += i; + } + +out: + mlx4_free_cmd_mailbox(dev, mailbox); + + return err; +} +EXPORT_SYMBOL_GPL(mlx4_write_mtt); + +int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + struct mlx4_buf *buf) +{ + u64 *page_list; + int err; + int i; + + page_list = kmalloc(buf->npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) + return -ENOMEM; + + for (i = 0; i < buf->npages; ++i) + if (buf->nbufs == 1) + page_list[i] = buf->u.direct.map + (i << buf->page_shift); + else + page_list[i] = buf->u.page_list[i].map; + + err = mlx4_write_mtt(dev, mtt, 0, buf->npages, page_list); + + kfree(page_list); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_buf_write_mtt); + +int __devinit mlx4_init_mr_table(struct mlx4_dev *dev) +{ + struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table; + int err; + + err = mlx4_bitmap_init(&mr_table->mpt_bitmap, dev->caps.num_mpts, + ~0, dev->caps.reserved_mrws); + if (err) + return err; + + err = mlx4_buddy_init(&mr_table->mtt_buddy, + ilog2(dev->caps.num_mtt_segs)); + if (err) + goto err_buddy; + + if (dev->caps.reserved_mtts) { + if (mlx4_alloc_mtt_range(dev, ilog2(dev->caps.reserved_mtts)) == -1) { + mlx4_warn(dev, "MTT table of order %d is too small.\n", + mr_table->mtt_buddy.max_order); + err = -ENOMEM; + goto err_reserve_mtts; + } + } + + return 0; + +err_reserve_mtts: + mlx4_buddy_cleanup(&mr_table->mtt_buddy); + +err_buddy: + mlx4_bitmap_cleanup(&mr_table->mpt_bitmap); + + return err; +} + +void mlx4_cleanup_mr_table(struct mlx4_dev *dev) +{ + struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table; + + mlx4_buddy_cleanup(&mr_table->mtt_buddy); + mlx4_bitmap_cleanup(&mr_table->mpt_bitmap); +} diff 
--git a/drivers/net/mlx4/pd.c b/drivers/net/mlx4/pd.c new file mode 100644 index 0000000..d2c369d --- /dev/null +++ b/drivers/net/mlx4/pd.c @@ -0,0 +1,102 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include + +#include "mlx4.h" +#include "icm.h" + +int mlx4_pd_alloc(struct mlx4_dev *dev, u32 *pdn) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + *pdn = mlx4_bitmap_alloc(&priv->pd_bitmap); + if (*pdn == -1) + return -ENOMEM; + + return 0; +} +EXPORT_SYMBOL_GPL(mlx4_pd_alloc); + +void mlx4_pd_free(struct mlx4_dev *dev, u32 pdn) +{ + mlx4_bitmap_free(&mlx4_priv(dev)->pd_bitmap, pdn); +} +EXPORT_SYMBOL_GPL(mlx4_pd_free); + +int __devinit mlx4_init_pd_table(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + return mlx4_bitmap_init(&priv->pd_bitmap, dev->caps.num_pds, + (1 << 24) - 1, dev->caps.reserved_pds); +} + +void mlx4_cleanup_pd_table(struct mlx4_dev *dev) +{ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->pd_bitmap); +} + + +int mlx4_uar_alloc(struct mlx4_dev *dev, struct mlx4_uar *uar) +{ + uar->index = mlx4_bitmap_alloc(&mlx4_priv(dev)->uar_table.bitmap); + if (uar->index == -1) + return -ENOMEM; + + uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + uar->index; + + return 0; +} +EXPORT_SYMBOL_GPL(mlx4_uar_alloc); + +void mlx4_uar_free(struct mlx4_dev *dev, struct mlx4_uar *uar) +{ + mlx4_bitmap_free(&mlx4_priv(dev)->uar_table.bitmap, uar->index); +} +EXPORT_SYMBOL_GPL(mlx4_uar_free); + +int mlx4_init_uar_table(struct mlx4_dev *dev) +{ + return mlx4_bitmap_init(&mlx4_priv(dev)->uar_table.bitmap, + dev->caps.num_uars, dev->caps.num_uars - 1, + max(128, dev->caps.reserved_uars)); +} + +void mlx4_cleanup_uar_table(struct mlx4_dev *dev) +{ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->uar_table.bitmap); +} diff --git a/drivers/net/mlx4/qp.c b/drivers/net/mlx4/qp.c new file mode 100644 index 0000000..824e0f6 --- /dev/null +++ b/drivers/net/mlx4/qp.c @@ -0,0 +1,270 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include +#include + +#include "mlx4.h" +#include "icm.h" + +void mlx4_qp_event(struct mlx4_dev *dev, u32 qpn, int event_type) +{ + struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table; + struct mlx4_qp *qp; + + spin_lock(&qp_table->lock); + + qp = __mlx4_qp_lookup(dev, qpn); + if (qp) + atomic_inc(&qp->refcount); + + spin_unlock(&qp_table->lock); + + if (!qp) { + mlx4_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + qp->event(qp, event_type); + + if (atomic_dec_and_test(&qp->refcount)) + complete(&qp->free); +} + +int mlx4_qp_modify(struct mlx4_dev *dev, struct mlx4_mtt *mtt, + enum mlx4_qp_state cur_state, enum mlx4_qp_state new_state, + struct mlx4_qp_context *context, enum mlx4_qp_optpar optpar, + int sqd_event, struct mlx4_qp *qp) +{ + static const u16 op[MLX4_QP_NUM_STATE][MLX4_QP_NUM_STATE] = { + [MLX4_QP_STATE_RST] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_INIT] = MLX4_CMD_RST2INIT_QP, + }, + [MLX4_QP_STATE_INIT] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_INIT] = MLX4_CMD_INIT2INIT_QP, + [MLX4_QP_STATE_RTR] = MLX4_CMD_INIT2RTR_QP, + }, + [MLX4_QP_STATE_RTR] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_RTS] = MLX4_CMD_RTR2RTS_QP, + }, + [MLX4_QP_STATE_RTS] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_RTS] = MLX4_CMD_RTS2RTS_QP, + [MLX4_QP_STATE_SQD] = MLX4_CMD_RTS2SQD_QP, + }, + [MLX4_QP_STATE_SQD] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_RTS] = MLX4_CMD_SQD2RTS_QP, + [MLX4_QP_STATE_SQD] = MLX4_CMD_SQD2SQD_QP, + }, + [MLX4_QP_STATE_SQER] = { + [MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + [MLX4_QP_STATE_RTS] = MLX4_CMD_SQERR2RTS_QP, + }, + [MLX4_QP_STATE_ERR] = { + 
[MLX4_QP_STATE_RST] = MLX4_CMD_2RST_QP, + [MLX4_QP_STATE_ERR] = MLX4_CMD_2ERR_QP, + } + }; + + struct mlx4_cmd_mailbox *mailbox; + int ret = 0; + + if (cur_state < 0 || cur_state >= MLX4_QP_NUM_STATE || + new_state < 0 || new_state >= MLX4_QP_NUM_STATE || + !op[cur_state][new_state]) + return -EINVAL; + + if (op[cur_state][new_state] == MLX4_CMD_2RST_QP) + return mlx4_cmd(dev, 0, qp->qpn, 2, + MLX4_CMD_2RST_QP, MLX4_CMD_TIME_CLASS_A); + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + if (cur_state == MLX4_QP_STATE_RST && new_state == MLX4_QP_STATE_INIT) { + u64 mtt_addr = mlx4_mtt_addr(dev, mtt); + context->mtt_base_addr_h = mtt_addr >> 32; + context->mtt_base_addr_l = cpu_to_be32(mtt_addr & 0xffffffff); + context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; + } + + *(__be32 *) mailbox->buf = cpu_to_be32(optpar); + memcpy(mailbox->buf + 8, context, sizeof *context); + + ret = mlx4_cmd(dev, mailbox->dma, qp->qpn | (!!sqd_event << 31), + new_state == MLX4_QP_STATE_RST ? 
2 : 0, + op[cur_state][new_state], MLX4_CMD_TIME_CLASS_C); + + mlx4_free_cmd_mailbox(dev, mailbox); + return ret; +} +EXPORT_SYMBOL_GPL(mlx4_qp_modify); + +int mlx4_qp_alloc(struct mlx4_dev *dev, int sqpn, struct mlx4_qp *qp) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_qp_table *qp_table = &priv->qp_table; + int err; + + if (sqpn) + qp->qpn = sqpn; + else { + qp->qpn = mlx4_bitmap_alloc(&qp_table->bitmap); + if (qp->qpn == -1) + return -ENOMEM; + } + + err = mlx4_table_get(dev, &qp_table->qp_table, qp->qpn); + if (err) + goto err_out; + + err = mlx4_table_get(dev, &qp_table->auxc_table, qp->qpn); + if (err) + goto err_put_qp; + + err = mlx4_table_get(dev, &qp_table->altc_table, qp->qpn); + if (err) + goto err_put_auxc; + + err = mlx4_table_get(dev, &qp_table->rdmarc_table, qp->qpn); + if (err) + goto err_put_altc; + + err = mlx4_table_get(dev, &qp_table->cmpt_table, qp->qpn); + if (err) + goto err_put_rdmarc; + + spin_lock_irq(&qp_table->lock); + err = radix_tree_insert(&dev->qp_table_tree, qp->qpn & (dev->caps.num_qps - 1), qp); + spin_unlock_irq(&qp_table->lock); + if (err) + goto err_put_cmpt; + + return 0; + +err_put_cmpt: + mlx4_table_put(dev, &qp_table->cmpt_table, qp->qpn); + +err_put_rdmarc: + mlx4_table_put(dev, &qp_table->rdmarc_table, qp->qpn); + +err_put_altc: + mlx4_table_put(dev, &qp_table->altc_table, qp->qpn); + +err_put_auxc: + mlx4_table_put(dev, &qp_table->auxc_table, qp->qpn); + +err_put_qp: + mlx4_table_put(dev, &qp_table->qp_table, qp->qpn); + +err_out: + if (!sqpn) + mlx4_bitmap_free(&qp_table->bitmap, qp->qpn); + + return err; +} +EXPORT_SYMBOL_GPL(mlx4_qp_alloc); + +void mlx4_qp_remove(struct mlx4_dev *dev, struct mlx4_qp *qp) +{ + struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table; + unsigned long flags; + + spin_lock_irqsave(&qp_table->lock, flags); + radix_tree_delete(&dev->qp_table_tree, qp->qpn & (dev->caps.num_qps - 1)); + spin_unlock_irqrestore(&qp_table->lock, flags); +} +EXPORT_SYMBOL_GPL(mlx4_qp_remove); + 
+void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp) +{ + struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table; + + mlx4_table_put(dev, &qp_table->cmpt_table, qp->qpn); + mlx4_table_put(dev, &qp_table->rdmarc_table, qp->qpn); + mlx4_table_put(dev, &qp_table->altc_table, qp->qpn); + mlx4_table_put(dev, &qp_table->auxc_table, qp->qpn); + mlx4_table_put(dev, &qp_table->qp_table, qp->qpn); + + mlx4_bitmap_free(&qp_table->bitmap, qp->qpn); +} +EXPORT_SYMBOL_GPL(mlx4_qp_free); + +static int mlx4_CONF_SPECIAL_QP(struct mlx4_dev *dev, u32 base_qpn) +{ + return mlx4_cmd(dev, 0, base_qpn, 0, MLX4_CMD_CONF_SPECIAL_QP, + MLX4_CMD_TIME_CLASS_B); +} + +int __devinit mlx4_init_qp_table(struct mlx4_dev *dev) +{ + struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table; + int err; + + spin_lock_init(&qp_table->lock); + INIT_RADIX_TREE(&dev->qp_table_tree, GFP_ATOMIC); + + /* + * We reserve 2 extra QPs per port for the special QPs. The + * block of special QPs must be aligned to a multiple of 8, so + * round up. + */ + dev->caps.sqp_start = ALIGN(dev->caps.reserved_qps, 8); + err = mlx4_bitmap_init(&qp_table->bitmap, dev->caps.num_qps, + (1 << 24) - 1, dev->caps.sqp_start + 8); + if (err) + return err; + + return mlx4_CONF_SPECIAL_QP(dev, dev->caps.sqp_start); +} + +void mlx4_cleanup_qp_table(struct mlx4_dev *dev) +{ + mlx4_CONF_SPECIAL_QP(dev, 0); + mlx4_bitmap_cleanup(&mlx4_priv(dev)->qp_table.bitmap); +} diff --git a/drivers/net/mlx4/srq.c b/drivers/net/mlx4/srq.c new file mode 100644 index 0000000..09b43ed --- /dev/null +++ b/drivers/net/mlx4/srq.c @@ -0,0 +1,227 @@ +/* + * Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include + +#include "mlx4.h" +#include "icm.h" + +struct mlx4_srq_context { + __be32 state_logsize_srqn; + u8 logstride; + u8 reserved1[3]; + u8 pg_offset; + u8 reserved2[3]; + u32 reserved3; + u8 log_page_size; + u8 reserved4[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + __be32 pd; + __be16 limit_watermark; + __be16 wqe_cnt; + u16 reserved5; + __be16 wqe_counter; + u32 reserved6; + __be64 db_rec_addr; +}; + +void mlx4_srq_event(struct mlx4_dev *dev, u32 srqn, int event_type) +{ + struct mlx4_srq_table *srq_table = &mlx4_priv(dev)->srq_table; + struct mlx4_srq *srq; + + spin_lock(&srq_table->lock); + + srq = radix_tree_lookup(&srq_table->tree, srqn & (dev->caps.num_srqs - 1)); + if (srq) + atomic_inc(&srq->refcount); + + spin_unlock(&srq_table->lock); + + if (!srq) { + mlx4_warn(dev, "Async event for bogus SRQ %08x\n", srqn); + return; + } + + srq->event(srq, event_type); + + if (atomic_dec_and_test(&srq->refcount)) + complete(&srq->free); +} + +static int mlx4_SW2HW_SRQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int srq_num) +{ + return mlx4_cmd(dev, mailbox->dma, srq_num, 0, MLX4_CMD_SW2HW_SRQ, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_HW2SW_SRQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int srq_num) +{ + return mlx4_cmd_box(dev, 0, mailbox ? mailbox->dma : 0, srq_num, + mailbox ? 
0 : 1, MLX4_CMD_HW2SW_SRQ, + MLX4_CMD_TIME_CLASS_A); +} + +static int mlx4_ARM_SRQ(struct mlx4_dev *dev, int srq_num, int limit_watermark) +{ + return mlx4_cmd(dev, limit_watermark, srq_num, 0, MLX4_CMD_ARM_SRQ, + MLX4_CMD_TIME_CLASS_B); +} + +int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, struct mlx4_mtt *mtt, + u64 db_rec, struct mlx4_srq *srq) +{ + struct mlx4_srq_table *srq_table = &mlx4_priv(dev)->srq_table; + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_srq_context *srq_context; + u64 mtt_addr; + int err; + + srq->srqn = mlx4_bitmap_alloc(&srq_table->bitmap); + if (srq->srqn == -1) + return -ENOMEM; + + err = mlx4_table_get(dev, &srq_table->table, srq->srqn); + if (err) + goto err_out; + + err = mlx4_table_get(dev, &srq_table->cmpt_table, srq->srqn); + if (err) + goto err_put; + + spin_lock_irq(&srq_table->lock); + err = radix_tree_insert(&srq_table->tree, srq->srqn, srq); + spin_unlock_irq(&srq_table->lock); + if (err) + goto err_cmpt_put; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto err_radix; + } + + srq_context = mailbox->buf; + memset(srq_context, 0, sizeof *srq_context); + + srq_context->state_logsize_srqn = cpu_to_be32((ilog2(srq->max) << 24) | + srq->srqn); + srq_context->logstride = srq->wqe_shift - 4; + srq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; + + mtt_addr = mlx4_mtt_addr(dev, mtt); + srq_context->mtt_base_addr_h = mtt_addr >> 32; + srq_context->mtt_base_addr_l = cpu_to_be32(mtt_addr & 0xffffffff); + srq_context->pd = cpu_to_be32(pdn); + srq_context->db_rec_addr = cpu_to_be64(db_rec); + + err = mlx4_SW2HW_SRQ(dev, mailbox, srq->srqn); + mlx4_free_cmd_mailbox(dev, mailbox); + if (err) + goto err_radix; + + atomic_set(&srq->refcount, 1); + init_completion(&srq->free); + + return 0; + +err_radix: + spin_lock_irq(&srq_table->lock); + radix_tree_delete(&srq_table->tree, srq->srqn); + spin_unlock_irq(&srq_table->lock); + +err_cmpt_put: + mlx4_table_put(dev, 
&srq_table->cmpt_table, srq->srqn); + +err_put: + mlx4_table_put(dev, &srq_table->table, srq->srqn); + +err_out: + mlx4_bitmap_free(&srq_table->bitmap, srq->srqn); + + return err; +} +EXPORT_SYMBOL_GPL(mlx4_srq_alloc); + +void mlx4_srq_free(struct mlx4_dev *dev, struct mlx4_srq *srq) +{ + struct mlx4_srq_table *srq_table = &mlx4_priv(dev)->srq_table; + int err; + + err = mlx4_HW2SW_SRQ(dev, NULL, srq->srqn); + if (err) + mlx4_warn(dev, "HW2SW_SRQ failed (%d) for SRQN %06x\n", err, srq->srqn); + + spin_lock_irq(&srq_table->lock); + radix_tree_delete(&srq_table->tree, srq->srqn); + spin_unlock_irq(&srq_table->lock); + + if (atomic_dec_and_test(&srq->refcount)) + complete(&srq->free); + wait_for_completion(&srq->free); + + mlx4_table_put(dev, &srq_table->table, srq->srqn); + mlx4_bitmap_free(&srq_table->bitmap, srq->srqn); +} +EXPORT_SYMBOL_GPL(mlx4_srq_free); + +int mlx4_srq_arm(struct mlx4_dev *dev, struct mlx4_srq *srq, int limit_watermark) +{ + return mlx4_ARM_SRQ(dev, srq->srqn, limit_watermark); +} +EXPORT_SYMBOL_GPL(mlx4_srq_arm); + +int __devinit mlx4_init_srq_table(struct mlx4_dev *dev) +{ + struct mlx4_srq_table *srq_table = &mlx4_priv(dev)->srq_table; + int err; + + spin_lock_init(&srq_table->lock); + INIT_RADIX_TREE(&srq_table->tree, GFP_ATOMIC); + + err = mlx4_bitmap_init(&srq_table->bitmap, dev->caps.num_srqs, + dev->caps.num_srqs - 1, dev->caps.reserved_srqs); + if (err) + return err; + + return 0; +} + +void mlx4_cleanup_srq_table(struct mlx4_dev *dev) +{ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->srq_table.bitmap); +} From rolandd at cisco.com Fri Apr 20 15:32:36 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 20 Apr 2007 15:32:36 -0700 Subject: [ofa-general] [PATCH 5/6] [RFC]mlx4_ib rest of files In-Reply-To: <20074201532.mohEXyoz7s98VaHz@cisco.com> Message-ID: <20074201532.Rlzy7s6yc7iv5IXX@cisco.com> Rest of mlx4_ib code. 
Signed-off-by: Roland Dreier --- ah.c | 100 ++++ cq.c | 525 +++++++++++++++++++++++++ doorbell.c | 215 ++++++++++ mad.c | 339 ++++++++++++++++ mr.c | 184 ++++++++ qp.c | 1263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ srq.c | 334 ++++++++++++++++ user.h | 91 ++++ 8 files changed, 3051 insertions(+) diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c new file mode 100644 index 0000000..c75ac94 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/ah.c @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include "mlx4_ib.h" + +struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct mlx4_dev *dev = to_mdev(pd->device)->dev; + struct mlx4_ib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_ATOMIC); + if (!ah) + return ERR_PTR(-ENOMEM); + + memset(&ah->av, 0, sizeof ah->av); + + ah->av.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); + ah->av.g_slid = ah_attr->src_path_bits; + ah->av.dlid = cpu_to_be16(ah_attr->dlid); + if (ah_attr->static_rate) { + ah->av.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; + while (ah->av.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << ah->av.stat_rate & dev->caps.stat_rate_support)) + --ah->av.stat_rate; + } + ah->av.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + ah->av.g_slid |= 0x80; + ah->av.gid_index = ah_attr->grh.sgid_index; + ah->av.hop_limit = ah_attr->grh.hop_limit; + ah->av.sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(ah->av.dgid, ah_attr->grh.dgid.raw, 16); + } + + return &ah->ibah; +} + +int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr) +{ + struct mlx4_ib_ah *ah = to_mah(ibah); + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = be16_to_cpu(ah->av.dlid); + ah_attr->sl = be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; + ah_attr->port_num = be32_to_cpu(ah->av.port_pd) >> 24; + if (ah->av.stat_rate) + ah_attr->static_rate = ah->av.stat_rate - MLX4_STAT_RATE_OFFSET; + ah_attr->src_path_bits = ah->av.g_slid & 0x7F; + + if (mlx4_ib_ah_grh_present(ah)) { + ah_attr->ah_flags = IB_AH_GRH; + + ah_attr->grh.traffic_class = + be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20; + ah_attr->grh.flow_label = + be32_to_cpu(ah->av.sl_tclass_flowlabel) & 0xfffff; + ah_attr->grh.hop_limit = ah->av.hop_limit; + ah_attr->grh.sgid_index = ah->av.gid_index; + memcpy(ah_attr->grh.dgid.raw, ah->av.dgid, 16); + } + + return 0; +} + +int 
mlx4_ib_destroy_ah(struct ib_ah *ah) +{ + kfree(to_mah(ah)); + return 0; +} diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c new file mode 100644 index 0000000..7884f6d --- /dev/null +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -0,0 +1,525 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "mlx4_ib.h" +#include "user.h" + +static void mlx4_ib_cq_comp(struct mlx4_cq *cq) +{ + struct ib_cq *ibcq = &to_mibcq(cq)->ibcq; + ibcq->comp_handler(ibcq, ibcq->cq_context); +} + +static void mlx4_ib_cq_event(struct mlx4_cq *cq, enum mlx4_event type) +{ + struct ib_event event; + struct ib_cq *ibcq; + + if (type != MLX4_EVENT_TYPE_CQ_ERROR) { + printk(KERN_WARNING "mlx4_ib: Unexpected event type %d " + "on CQ %06x\n", type, cq->cqn); + return; + } + + ibcq = &to_mibcq(cq)->ibcq; + if (ibcq->event_handler) { + event.device = ibcq->device; + event.event = IB_EVENT_CQ_ERR; + event.element.cq = ibcq; + ibcq->event_handler(&event, ibcq->cq_context); + } +} + +static void *get_cqe_from_buf(struct mlx4_ib_cq_buf *buf, int n) +{ + int offset = n * sizeof (struct mlx4_cqe); + + if (buf->buf.nbufs == 1) + return buf->buf.u.direct.buf + offset; + else + return buf->buf.u.page_list[offset >> PAGE_SHIFT].buf + + (offset & (PAGE_SIZE - 1)); +} + +static void *get_cqe(struct mlx4_ib_cq *cq, int n) +{ + return get_cqe_from_buf(&cq->buf, n); +} + +static void *get_sw_cqe(struct mlx4_ib_cq *cq, int n) +{ + struct mlx4_cqe *cqe = get_cqe(cq, n & cq->ibcq.cqe); + + return (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^ + !!(n & (cq->ibcq.cqe + 1))) ? 
NULL : cqe; +} + +static struct mlx4_cqe *next_cqe_sw(struct mlx4_ib_cq *cq) +{ + return get_sw_cqe(cq, cq->mcq.cons_index); +} + +struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(ibdev); + struct mlx4_ib_cq *cq; + struct mlx4_uar *uar; + int buf_size; + int err; + + if (entries < 1 || entries > dev->dev->caps.max_cqes) + return ERR_PTR(-EINVAL); + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + entries = roundup_pow_of_two(entries + 1); + cq->ibcq.cqe = entries - 1; + buf_size = entries * sizeof (struct mlx4_cqe); + spin_lock_init(&cq->lock); + + if (context) { + struct mlx4_ib_create_cq ucmd; + + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err_cq; + } + + cq->umem = ib_umem_get(context, ucmd.buf_addr, buf_size, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(cq->umem)) { + err = PTR_ERR(cq->umem); + goto err_cq; + } + + err = mlx4_mtt_init(dev->dev, ib_umem_page_count(cq->umem), + ilog2(cq->umem->page_size), &cq->buf.mtt); + if (err) + goto err_buf; + + err = mlx4_ib_umem_write_mtt(dev, &cq->buf.mtt, cq->umem); + if (err) + goto err_mtt; + + err = mlx4_ib_db_map_user(to_mucontext(context), ucmd.db_addr, + &cq->db); + if (err) + goto err_mtt; + + uar = &to_mucontext(context)->uar; + } else { + err = mlx4_ib_db_alloc(dev, &cq->db, 1); + if (err) + goto err_cq; + + cq->mcq.set_ci_db = cq->db.db; + cq->mcq.arm_db = cq->db.db + 1; + *cq->mcq.set_ci_db = 0; + *cq->mcq.arm_db = 0; + + if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &cq->buf.buf)) { + err = -ENOMEM; + goto err_db; + } + + err = mlx4_mtt_init(dev->dev, cq->buf.buf.npages, cq->buf.buf.page_shift, + &cq->buf.mtt); + if (err) + goto err_buf; + + err = mlx4_buf_write_mtt(dev->dev, &cq->buf.mtt, &cq->buf.buf); + if (err) + goto err_mtt; + + uar = &dev->priv_uar; + } + + err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, + 
cq->db.dma, &cq->mcq); + if (err) + goto err_dbmap; + + cq->mcq.comp = mlx4_ib_cq_comp; + cq->mcq.event = mlx4_ib_cq_event; + + if (context) + if (ib_copy_to_udata(udata, &cq->mcq.cqn, sizeof (__u32))) { + err = -EFAULT; + goto err_dbmap; + } + + return &cq->ibcq; + +err_dbmap: + if (context) + mlx4_ib_db_unmap_user(to_mucontext(context), &cq->db); + +err_mtt: + mlx4_mtt_cleanup(dev->dev, &cq->buf.mtt); + +err_buf: + if (context) + ib_umem_release(cq->umem); + else + mlx4_buf_free(dev->dev, entries * sizeof (struct mlx4_cqe), + &cq->buf.buf); + +err_db: + if (!context) + mlx4_ib_db_free(dev, &cq->db); + +err_cq: + kfree(cq); + + return ERR_PTR(err); +} + +int mlx4_ib_destroy_cq(struct ib_cq *cq) +{ + struct mlx4_ib_dev *dev = to_mdev(cq->device); + struct mlx4_ib_cq *mcq = to_mcq(cq); + + mlx4_cq_free(dev->dev, &mcq->mcq); + mlx4_mtt_cleanup(dev->dev, &mcq->buf.mtt); + + if (cq->uobject) { + mlx4_ib_db_unmap_user(to_mucontext(cq->uobject->context), &mcq->db); + ib_umem_release(mcq->umem); + } else { + mlx4_buf_free(dev->dev, (cq->cqe + 1) * sizeof (struct mlx4_cqe), + &mcq->buf.buf); + mlx4_ib_db_free(dev, &mcq->db); + } + + kfree(mcq); + + return 0; +} + +static void dump_cqe(void *cqe) +{ + __be32 *buf = cqe; + + printk(KERN_DEBUG "CQE contents %08x %08x %08x %08x %08x %08x %08x %08x\n", + be32_to_cpu(buf[0]), be32_to_cpu(buf[1]), be32_to_cpu(buf[2]), + be32_to_cpu(buf[3]), be32_to_cpu(buf[4]), be32_to_cpu(buf[5]), + be32_to_cpu(buf[6]), be32_to_cpu(buf[7])); +} + +static void mlx4_ib_handle_error_cqe(struct mlx4_err_cqe *cqe, + struct ib_wc *wc) +{ + if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR) { + printk(KERN_DEBUG "local QP operation err " + "(QPN %06x, WQE index %x, vendor syndrome %02x, " + "opcode = %02x)\n", + be32_to_cpu(cqe->my_qpn), be16_to_cpu(cqe->wqe_index), + cqe->vendor_err_syndrome, + cqe->owner_sr_opcode & ~MLX4_CQE_OWNER_MASK); + dump_cqe(cqe); + } + + switch (cqe->syndrome) { + case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR: + wc->status = 
IB_WC_LOC_LEN_ERR; + break; + case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR: + wc->status = IB_WC_LOC_QP_OP_ERR; + break; + case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case MLX4_CQE_SYNDROME_WR_FLUSH_ERR: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + case MLX4_CQE_SYNDROME_MW_BIND_ERR: + wc->status = IB_WC_MW_BIND_ERR; + break; + case MLX4_CQE_SYNDROME_BAD_RESP_ERR: + wc->status = IB_WC_BAD_RESP_ERR; + break; + case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR: + wc->status = IB_WC_REM_INV_REQ_ERR; + break; + case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR: + wc->status = IB_WC_REM_ACCESS_ERR; + break; + case MLX4_CQE_SYNDROME_REMOTE_OP_ERR: + wc->status = IB_WC_REM_OP_ERR; + break; + case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR: + wc->status = IB_WC_RETRY_EXC_ERR; + break; + case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR: + wc->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR: + wc->status = IB_WC_REM_ABORT_ERR; + break; + default: + wc->status = IB_WC_GENERAL_ERR; + break; + } + + wc->vendor_err = cqe->vendor_err_syndrome; +} + +static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, + struct mlx4_ib_qp **cur_qp, + struct ib_wc *wc) +{ + struct mlx4_cqe *cqe; + struct mlx4_qp *mqp; + struct mlx4_ib_wq *wq; + struct mlx4_ib_srq *srq; + int is_send; + int is_error; + u16 wqe_ctr; + + cqe = next_cqe_sw(cq); + if (!cqe) + return -EAGAIN; + + ++cq->mcq.cons_index; + + /* + * Make sure we read CQ entry contents after we've checked the + * ownership bit. + */ + rmb(); + + is_send = cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK; + is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == + MLX4_CQE_OPCODE_ERROR; + + if (!*cur_qp || + (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { + /* + * We do not have to take the QP table lock here, + * because CQs will be locked while QPs are removed + * from the table. 
+ */ + mqp = __mlx4_qp_lookup(to_mdev(cq->ibcq.device)->dev, + be32_to_cpu(cqe->my_qpn)); + if (unlikely(!mqp)) { + printk(KERN_WARNING "CQ %06x with entry for unknown QPN %06x\n", + cq->mcq.cqn, be32_to_cpu(cqe->my_qpn) & 0xffffff); + return -EINVAL; + } + + *cur_qp = to_mibqp(mqp); + } + + wc->qp = &(*cur_qp)->ibqp; + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += wqe_ctr - (u16) wq->tail; + wc->wr_id = wq->wrid[wq->tail & (wq->max - 1)]; + ++wq->tail; + } else if ((*cur_qp)->ibqp.srq) { + srq = to_msrq((*cur_qp)->ibqp.srq); + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wc->wr_id = srq->wrid[wqe_ctr]; + mlx4_ib_free_srq_wqe(srq, wqe_ctr); + } else { + wq = &(*cur_qp)->rq; + wc->wr_id = wq->wrid[wq->tail & (wq->max - 1)]; + ++wq->tail; + } + + if (unlikely(is_error)) { + mlx4_ib_handle_error_cqe((struct mlx4_err_cqe *) cqe, wc); + return 0; + } + + wc->status = IB_WC_SUCCESS; + + if (is_send) { + wc->wc_flags = 0; + switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) { + case MLX4_OPCODE_RDMA_WRITE_IMM: + wc->wc_flags |= IB_WC_WITH_IMM; + case MLX4_OPCODE_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case MLX4_OPCODE_SEND_IMM: + wc->wc_flags |= IB_WC_WITH_IMM; + case MLX4_OPCODE_SEND: + wc->opcode = IB_WC_SEND; + break; + case MLX4_OPCODE_RDMA_READ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MLX4_OPCODE_ATOMIC_CS: + wc->opcode = IB_WC_COMP_SWAP; + wc->byte_len = 8; + break; + case MLX4_OPCODE_ATOMIC_FA: + wc->opcode = IB_WC_FETCH_ADD; + wc->byte_len = 8; + break; + case MLX4_OPCODE_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + } + } else { + wc->byte_len = be32_to_cpu(cqe->byte_cnt); + + switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) { + case MLX4_RECV_OPCODE_RDMA_WRITE_IMM: + wc->opcode = IB_WC_RECV_RDMA_WITH_IMM; + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = cqe->immed_rss_invalid; + break; + case MLX4_RECV_OPCODE_SEND: + wc->opcode = IB_WC_RECV; + 
wc->wc_flags = 0; + break; + case MLX4_RECV_OPCODE_SEND_IMM: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = cqe->immed_rss_invalid; + break; + } + + wc->slid = be16_to_cpu(cqe->rlid); + wc->sl = cqe->sl >> 4; + wc->src_qp = be32_to_cpu(cqe->g_mlpath_rqpn) & 0xffffff; + wc->dlid_path_bits = (be32_to_cpu(cqe->g_mlpath_rqpn) >> 24) & 0x7f; + wc->wc_flags |= be32_to_cpu(cqe->g_mlpath_rqpn) & 0x80000000 ? + IB_WC_GRH : 0; + wc->pkey_index = be32_to_cpu(cqe->immed_rss_invalid) >> 16; + } + + return 0; +} + +int mlx4_ib_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct mlx4_ib_cq *cq = to_mcq(ibcq); + struct mlx4_ib_qp *cur_qp = NULL; + unsigned long flags; + int npolled; + int err = 0; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mlx4_ib_poll_one(cq, &cur_qp, wc + npolled); + if (err) + break; + } + + if (npolled) + mlx4_cq_set_ci(&cq->mcq); + + spin_unlock_irqrestore(&cq->lock, flags); + + if (err == 0 || err == -EAGAIN) + return npolled; + else + return err; +} + +int mlx4_ib_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + mlx4_cq_arm(&to_mcq(ibcq)->mcq, + notify == IB_CQ_SOLICITED ? MLX4_CQ_DB_REQ_NOT_SOL : + MLX4_CQ_DB_REQ_NOT, + to_mdev(ibcq->device)->uar_map, + MLX4_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->uar_lock)); + + return 0; +} + +void __mlx4_ib_cq_clean(struct mlx4_ib_cq *cq, u32 qpn, struct mlx4_ib_srq *srq) +{ + u32 prod_index; + int nfreed = 0; + struct mlx4_cqe *cqe; + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. 
+ */ + for (prod_index = cq->mcq.cons_index; get_sw_cqe(cq, prod_index); ++prod_index) + if (prod_index == cq->mcq.cons_index + cq->ibcq.cqe) + break; + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while ((int) --prod_index - (int) cq->mcq.cons_index >= 0) { + cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); + if ((be32_to_cpu(cqe->my_qpn) & 0xffffff) == qpn) { + if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) + mlx4_ib_free_srq_wqe(srq, be16_to_cpu(cqe->wqe_index)); + ++nfreed; + } else if (nfreed) + memcpy(get_cqe(cq, (prod_index + nfreed) & cq->ibcq.cqe), + cqe, sizeof *cqe); + } + + if (nfreed) { + cq->mcq.cons_index += nfreed; + /* + * Make sure update of buffer contents is done before + * updating consumer index. + */ + wmb(); + mlx4_cq_set_ci(&cq->mcq); + } +} + +void mlx4_ib_cq_clean(struct mlx4_ib_cq *cq, u32 qpn, struct mlx4_ib_srq *srq) +{ + spin_lock_irq(&cq->lock); + __mlx4_ib_cq_clean(cq, qpn, srq); + spin_unlock_irq(&cq->lock); +} diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c new file mode 100644 index 0000000..c3398a3 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -0,0 +1,215 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include + +#include "mlx4_ib.h" + +struct mlx4_ib_db_pgdir { + struct list_head list; + DECLARE_BITMAP(order0, MLX4_IB_DB_PER_PAGE); + DECLARE_BITMAP(order1, MLX4_IB_DB_PER_PAGE / 2); + unsigned long *bits[2]; + __be32 *db_page; + dma_addr_t db_dma; +}; + +static struct mlx4_ib_db_pgdir *mlx4_ib_alloc_db_pgdir(struct mlx4_ib_dev *dev) +{ + struct mlx4_ib_db_pgdir *pgdir; + + pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL); + if (!pgdir) + return NULL; + + bitmap_fill(pgdir->order1, MLX4_IB_DB_PER_PAGE / 2); + pgdir->bits[0] = pgdir->order0; + pgdir->bits[1] = pgdir->order1; + pgdir->db_page = dma_alloc_coherent(dev->ib_dev.dma_device, + PAGE_SIZE, &pgdir->db_dma, + GFP_KERNEL); + if (!pgdir->db_page) { + kfree(pgdir); + return NULL; + } + + return pgdir; +} + +static int mlx4_ib_alloc_db_from_pgdir(struct mlx4_ib_db_pgdir *pgdir, + struct mlx4_ib_db *db, int order) +{ + int o; + int i; + + for (o = order; o <= 1; ++o) { + i = find_first_bit(pgdir->bits[o], MLX4_IB_DB_PER_PAGE >> o); + if (i < MLX4_IB_DB_PER_PAGE >> o) + goto found; + } + + return -ENOMEM; + +found: + clear_bit(i, pgdir->bits[o]); + + i <<= o; + + if (o > order) + set_bit(i ^ 1, pgdir->bits[order]); + + db->u.pgdir = pgdir; + db->index = i; + db->db = pgdir->db_page + db->index; + 
db->dma = pgdir->db_dma + db->index * 4; + db->order = order; + + return 0; +} + +int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order) +{ + struct mlx4_ib_db_pgdir *pgdir; + int ret = 0; + + spin_lock(&dev->pgdir_lock); + + list_for_each_entry(pgdir, &dev->pgdir_list, list) + if (!mlx4_ib_alloc_db_from_pgdir(pgdir, db, order)) + goto out; + + pgdir = mlx4_ib_alloc_db_pgdir(dev); + if (!pgdir) { + ret = -ENOMEM; + goto out; + } + + list_add(&pgdir->list, &dev->pgdir_list); + + BUG_ON(mlx4_ib_alloc_db_from_pgdir(pgdir, db, order)); + +out: + spin_unlock(&dev->pgdir_lock); + + return ret; +} + +void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) +{ + int o; + int i; + + spin_lock(&dev->pgdir_lock); + + o = db->order; + i = db->index >> db->order; + + if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) { + clear_bit(i ^ 1, db->u.pgdir->order0); + ++o; + } + + i >>= o; + set_bit(i, db->u.pgdir->bits[o]); + + if (bitmap_full(db->u.pgdir->order1, MLX4_IB_DB_PER_PAGE / 2)) { + dma_free_coherent(dev->ib_dev.dma_device, PAGE_SIZE, + db->u.pgdir->db_page, db->u.pgdir->db_dma); + list_del(&db->u.pgdir->list); + kfree(db->u.pgdir); + } + + spin_unlock(&dev->pgdir_lock); +} + +struct mlx4_ib_user_db_page { + struct list_head list; + struct ib_umem *umem; + unsigned long user_virt; + int refcnt; +}; + +int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt, + struct mlx4_ib_db *db) +{ + struct mlx4_ib_user_db_page *page; + struct ib_umem_chunk *chunk; + int err = 0; + + mutex_lock(&context->db_page_mutex); + + list_for_each_entry(page, &context->db_page_list, list) + if (page->user_virt == (virt & PAGE_MASK)) + goto found; + + page = kmalloc(sizeof *page, GFP_KERNEL); + if (!page) { + err = -ENOMEM; + goto out; + } + + page->user_virt = (virt & PAGE_MASK); + page->refcnt = 0; + page->umem = ib_umem_get(&context->ibucontext, virt & PAGE_MASK, + PAGE_SIZE, 0); + if (IS_ERR(page->umem)) { + err = 
PTR_ERR(page->umem); + kfree(page); + goto out; + } + + list_add(&page->list, &context->db_page_list); + +found: + chunk = list_entry(page->umem->chunk_list.next, struct ib_umem_chunk, list); + db->dma = sg_dma_address(chunk->page_list) + (virt & ~PAGE_MASK); + db->u.user_page = page; + ++page->refcnt; + +out: + mutex_unlock(&context->db_page_mutex); + + return err; +} + +void mlx4_ib_db_unmap_user(struct mlx4_ib_ucontext *context, struct mlx4_ib_db *db) +{ + mutex_lock(&context->db_page_mutex); + + if (!--db->u.user_page->refcnt) { + list_del(&db->u.user_page->list); + ib_umem_release(db->u.user_page->umem); + kfree(db->u.user_page); + } + + mutex_unlock(&context->db_page_mutex); +} diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c new file mode 100644 index 0000000..3330917 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -0,0 +1,339 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include + +#include "mlx4_ib.h" + +enum { + MLX4_IB_VENDOR_CLASS1 = 0x9, + MLX4_IB_VENDOR_CLASS2 = 0xa +}; + +int mlx4_MAD_IFC(struct mlx4_ib_dev *dev, int ignore_mkey, int ignore_bkey, + int port, struct ib_wc *in_wc, struct ib_grh *in_grh, + void *in_mad, void *response_mad) +{ + struct mlx4_cmd_mailbox *inmailbox, *outmailbox; + void *inbox; + int err; + u32 in_modifier = port; + u8 op_modifier = 0; + + inmailbox = mlx4_alloc_cmd_mailbox(dev->dev); + if (IS_ERR(inmailbox)) + return PTR_ERR(inmailbox); + inbox = inmailbox->buf; + + outmailbox = mlx4_alloc_cmd_mailbox(dev->dev); + if (IS_ERR(outmailbox)) { + mlx4_free_cmd_mailbox(dev->dev, inmailbox); + return PTR_ERR(outmailbox); + } + + memcpy(inbox, in_mad, 256); + + /* + * Key check traps can't be generated unless we have in_wc to + * tell us where to send the trap. + */ + if (ignore_mkey || !in_wc) + op_modifier |= 0x1; + if (ignore_bkey || !in_wc) + op_modifier |= 0x2; + + if (in_wc) { + struct { + __be32 my_qpn; + u32 reserved1; + __be32 rqpn; + u8 sl; + u8 g_path; + u16 reserved2[2]; + __be16 pkey; + u32 reserved3[11]; + u8 grh[40]; + } *ext_info; + + memset(inbox + 256, 0, 256); + ext_info = inbox + 256; + + ext_info->my_qpn = cpu_to_be32(in_wc->qp->qp_num); + ext_info->rqpn = cpu_to_be32(in_wc->src_qp); + ext_info->sl = in_wc->sl << 4; + ext_info->g_path = in_wc->dlid_path_bits | + (in_wc->wc_flags & IB_WC_GRH ? 
0x80 : 0); + ext_info->pkey = cpu_to_be16(in_wc->pkey_index); + + if (in_grh) + memcpy(ext_info->grh, in_grh, 40); + + op_modifier |= 0x4; + + in_modifier |= in_wc->slid << 16; + } + + err = mlx4_cmd_box(dev->dev, inmailbox->dma, outmailbox->dma, + in_modifier, op_modifier, + MLX4_CMD_MAD_IFC, MLX4_CMD_TIME_CLASS_C); + + if (!err) + memcpy(response_mad, outmailbox->buf, 256); + + mlx4_free_cmd_mailbox(dev->dev, inmailbox); + mlx4_free_cmd_mailbox(dev->dev, outmailbox); + + return err; +} + +static void update_sm_ah(struct mlx4_ib_dev *dev, u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock(&dev->sm_lock); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock(&dev->sm_lock); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. 
+ */ +static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + struct ib_port_info *pinfo = + (struct ib_port_info *) ((struct ib_smp *) mad)->data; + + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpu(pinfo->sm_lid), + pinfo->neighbormtu_mastersmsl & 0xf); + + event.device = ibdev; + event.element.port_num = port_num; + + if(pinfo->clientrereg_resv_subnetto & 0x80) + event.event = IB_EVENT_CLIENT_REREGISTER; + else + event.event = IB_EVENT_LID_CHANGE; + + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void node_desc_override(struct ib_device *dev, + struct ib_mad *mad) +{ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP && + mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) { + spin_lock(&to_mdev(dev)->sm_lock); + memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64); + spin_unlock(&to_mdev(dev)->sm_lock); + } +} + +static void forward_trap(struct mlx4_ib_dev *dev, u8 port_num, struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct ib_mad_send_buf *send_buf; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + + if (agent) { + send_buf = ib_create_send_mad(agent, qpn, 0, 0, IB_MGMT_MAD_HDR, + IB_MGMT_MAD_DATA, GFP_ATOMIC); + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we 
know + * it's OK for our devices). + */ + spin_lock(&dev->sm_lock); + memcpy(send_buf->mad, mad, sizeof *mad); + if ((send_buf->ah = dev->sm_ah[port_num - 1])) + ret = ib_post_send_mad(send_buf, NULL); + else + ret = -EINVAL; + spin_unlock(&dev->sm_lock); + + if (ret) + ib_free_send_mad(send_buf); + } +} + +int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + u16 slid; + int err; + + slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. 
+ */ + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MLX4_IB_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MLX4_IB_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mlx4_MAD_IFC(to_mdev(ibdev), + mad_flags & IB_MAD_IGNORE_MKEY, + mad_flags & IB_MAD_IGNORE_BKEY, + port_num, in_wc, in_grh, in_mad, out_mad); + if (err) + return IB_MAD_RESULT_FAILURE; + + if (!out_mad->mad_hdr.status) { + smp_snoop(ibdev, port_num, in_mad); + node_desc_override(ibdev, out_mad); + } + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + ib_free_send_mad(mad_send_wc->send_buf); +} + +int mlx4_ib_mad_init(struct mlx4_ib_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + int ret; + + for (p = 0; p < dev->dev->caps.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto err; + } + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->dev->caps.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return ret; +} + +void mlx4_ib_mad_cleanup(struct mlx4_ib_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->dev->caps.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c new file mode 100644 index 0000000..85ae906 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -0,0 +1,184 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include "mlx4_ib.h" + +static u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MLX4_PERM_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MLX4_PERM_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MLX4_PERM_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? MLX4_PERM_LOCAL_WRITE : 0) | + MLX4_PERM_LOCAL_READ; +} + +struct ib_mr *mlx4_ib_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mlx4_ib_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mlx4_mr_alloc(to_mdev(pd->device)->dev, to_mpd(pd)->pdn, 0, + ~0ull, convert_access(acc), 0, 0, &mr->mmr); + if (err) + goto err_free; + + err = mlx4_mr_enable(to_mdev(pd->device)->dev, &mr->mmr); + if (err) + goto err_mr; + + mr->ibmr.rkey = mr->ibmr.lkey = mr->mmr.key; + mr->umem = NULL; + + return &mr->ibmr; + +err_mr: + mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr); + +err_free: + kfree(mr); + + return ERR_PTR(err); +} + +int mlx4_ib_umem_write_mtt(struct mlx4_ib_dev *dev, struct mlx4_mtt *mtt, + struct ib_umem *umem) +{ + u64 *pages; + struct ib_umem_chunk *chunk; + int i, j, k; + int n; + int len; + int err = 0; + + pages = (u64 *) __get_free_page(GFP_KERNEL); + if (!pages) + return -ENOMEM; + + i = n = 0; + + list_for_each_entry(chunk, &umem->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> mtt->page_shift; + for (k = 0; k < len; ++k) { + pages[i++] = 
sg_dma_address(&chunk->page_list[j]) + + umem->page_size * k; + /* + * Be friendly to WRITE_MTT firmware + * command, and pass it chunks of + * appropriate size. + */ + if (i == PAGE_SIZE / sizeof (u64) - 2) { + err = mlx4_write_mtt(dev->dev, mtt, n, + i, pages); + if (err) + goto out; + n += i; + i = 0; + } + } + } + + if (i) + err = mlx4_write_mtt(dev->dev, mtt, n, i, pages); + +out: + free_page((unsigned long) pages); + return err; +} + +struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int access_flags, + struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct mlx4_ib_mr *mr; + int shift; + int err; + int n; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + mr->umem = ib_umem_get(pd->uobject->context, start, length, access_flags); + if (IS_ERR(mr->umem)) { + err = PTR_ERR(mr->umem); + goto err_free; + } + + n = ib_umem_page_count(mr->umem); + shift = ilog2(mr->umem->page_size); + + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + convert_access(access_flags), n, shift, &mr->mmr); + if (err) + goto err_umem; + + err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); + if (err) + goto err_mr; + + err = mlx4_mr_enable(dev->dev, &mr->mmr); + if (err) + goto err_mr; + + mr->ibmr.rkey = mr->ibmr.lkey = mr->mmr.key; + + return &mr->ibmr; + +err_mr: + mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr); + +err_umem: + ib_umem_release(mr->umem); + +err_free: + kfree(mr); + + return ERR_PTR(err); +} + +int mlx4_ib_dereg_mr(struct ib_mr *ibmr) +{ + struct mlx4_ib_mr *mr = to_mmr(ibmr); + + mlx4_mr_free(to_mdev(ibmr->device)->dev, &mr->mmr); + if (mr->umem) + ib_umem_release(mr->umem); + kfree(mr); + + return 0; +} diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c new file mode 100644 index 0000000..1f51bfd --- /dev/null +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -0,0 +1,1263 @@ +/* + * Copyright (c) 2007 Cisco Systems, 
Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include + +#include "mlx4_ib.h" +#include "user.h" + +enum { + MLX4_IB_ACK_REQ_FREQ = 8, +}; + +enum { + MLX4_IB_DEFAULT_SCHED_QUEUE = 0x83, + MLX4_IB_DEFAULT_QP0_SCHED_QUEUE = 0x3f +}; + +enum { + /* + * Largest possible UD header: send with GRH and immediate data. 
+ */ + MLX4_IB_UD_HEADER_SIZE = 72 +}; + +struct mlx4_ib_sqp { + struct mlx4_ib_qp qp; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + u8 header_buf[MLX4_IB_UD_HEADER_SIZE]; +}; + +static const __be32 mlx4_ib_opcode[] = { + [IB_WR_SEND] = __constant_cpu_to_be32(MLX4_OPCODE_SEND), + [IB_WR_SEND_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_IMM), + [IB_WR_RDMA_WRITE] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_WRITE), + [IB_WR_RDMA_WRITE_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_WRITE_IMM), + [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), + [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), + [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), +}; + +static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) +{ + return container_of(mqp, struct mlx4_ib_sqp, qp); +} + +static int is_sqp(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) +{ + return qp->mqp.qpn >= dev->dev->caps.sqp_start && + qp->mqp.qpn <= dev->dev->caps.sqp_start + 3; +} + +static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) +{ + return qp->mqp.qpn >= dev->dev->caps.sqp_start && + qp->mqp.qpn <= dev->dev->caps.sqp_start + 1; +} + +static void *get_wqe(struct mlx4_ib_qp *qp, int offset) +{ + if (qp->buf.nbufs == 1) + return qp->buf.u.direct.buf + offset; + else + return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + + (offset & (PAGE_SIZE - 1)); +} + +static void *get_recv_wqe(struct mlx4_ib_qp *qp, int n) +{ + return get_wqe(qp, qp->rq.offset + (n << qp->rq.wqe_shift)); +} + +static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) +{ + return get_wqe(qp, qp->sq.offset + (n << qp->sq.wqe_shift)); +} + +static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) +{ + struct ib_event event; + struct ib_qp *ibqp = &to_mibqp(qp)->ibqp; + + if (type == MLX4_EVENT_TYPE_PATH_MIG) + to_mibqp(qp)->port = to_mibqp(qp)->alt_port; + + if (ibqp->event_handler) { + 
event.device = ibqp->device; + event.element.qp = ibqp; + switch (type) { + case MLX4_EVENT_TYPE_PATH_MIG: + event.event = IB_EVENT_PATH_MIG; + break; + case MLX4_EVENT_TYPE_COMM_EST: + event.event = IB_EVENT_COMM_EST; + break; + case MLX4_EVENT_TYPE_SQ_DRAINED: + event.event = IB_EVENT_SQ_DRAINED; + break; + case MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE: + event.event = IB_EVENT_QP_LAST_WQE_REACHED; + break; + case MLX4_EVENT_TYPE_WQ_CATAS_ERROR: + event.event = IB_EVENT_QP_FATAL; + break; + case MLX4_EVENT_TYPE_PATH_MIG_FAILED: + event.event = IB_EVENT_PATH_MIG_ERR; + break; + case MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + event.event = IB_EVENT_QP_REQ_ERR; + break; + case MLX4_EVENT_TYPE_WQ_ACCESS_ERROR: + event.event = IB_EVENT_QP_ACCESS_ERR; + break; + default: + printk(KERN_WARNING "mlx4_ib: Unexpected event type %d " + "on QP %06x\n", type, qp->qpn); + return; + } + + ibqp->event_handler(&event, ibqp->qp_context); + } +} + +static int send_wqe_overhead(enum ib_qp_type type) +{ + /* + * UD WQEs must have a datagram segment. + * RC and UC WQEs might have a remote address segment. + * MLX WQEs need two extra inline data segments (for the UD + * header and space for the ICRC). 
+ */ + switch (type) { + case IB_QPT_UD: + return sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg); + case IB_QPT_UC: + return sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_raddr_seg); + case IB_QPT_RC: + return sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_atomic_seg) + + sizeof (struct mlx4_wqe_raddr_seg); + case IB_QPT_SMI: + case IB_QPT_GSI: + return sizeof (struct mlx4_wqe_ctrl_seg) + + ALIGN(MLX4_IB_UD_HEADER_SIZE + + sizeof (struct mlx4_wqe_inline_seg), + sizeof (struct mlx4_wqe_data_seg)) + + ALIGN(4 + + sizeof (struct mlx4_wqe_inline_seg), + sizeof (struct mlx4_wqe_data_seg)); + default: + return sizeof (struct mlx4_wqe_ctrl_seg); + } +} + +static int set_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, + enum ib_qp_type type, struct mlx4_ib_qp *qp) +{ + /* Sanity check QP size before proceeding */ + if (cap->max_send_wr > dev->dev->caps.max_wqes || + cap->max_recv_wr > dev->dev->caps.max_wqes || + cap->max_send_sge > dev->dev->caps.max_sq_sg || + cap->max_recv_sge > dev->dev->caps.max_rq_sg || + cap->max_inline_data + send_wqe_overhead(type) + + sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + /* + * For MLX transport we need 2 extra S/G entries: + * one for the header and one for the checksum at the end + */ + if ((type == IB_QPT_SMI || type == IB_QPT_GSI) && + cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) + return -EINVAL; + + qp->rq.max = cap->max_recv_wr ? roundup_pow_of_two(cap->max_recv_wr) : 0; + qp->sq.max = cap->max_send_wr ? 
roundup_pow_of_two(cap->max_send_wr) : 0; + + qp->rq.wqe_shift = ilog2(roundup_pow_of_two(cap->max_recv_sge * + sizeof (struct mlx4_wqe_data_seg))); + qp->rq.max_gs = (1 << qp->rq.wqe_shift) / sizeof (struct mlx4_wqe_data_seg); + + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * + sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type))); + qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / + sizeof (struct mlx4_wqe_data_seg); + + qp->buf_size = (qp->rq.max << qp->rq.wqe_shift) + + (qp->sq.max << qp->sq.wqe_shift); + if (qp->rq.wqe_shift > qp->sq.wqe_shift) { + qp->rq.offset = 0; + qp->sq.offset = qp->rq.max << qp->rq.wqe_shift; + } else { + qp->rq.offset = qp->sq.max << qp->sq.wqe_shift; + qp->sq.offset = 0; + } + + cap->max_send_wr = qp->sq.max; + cap->max_recv_wr = qp->rq.max; + cap->max_send_sge = qp->sq.max_gs; + cap->max_recv_sge = qp->rq.max_gs; + + return 0; +} + +static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + int err; + int i; + + mutex_init(&qp->mutex); + spin_lock_init(&qp->sq.lock); + spin_lock_init(&qp->rq.lock); + + qp->state = IB_QPS_RESET; + qp->atomic_rd_en = 0; + qp->resp_depth = 0; + + qp->rq.head = 0; + qp->rq.tail = 0; + qp->sq.head = 0; + qp->sq.tail = 0; + + err = set_qp_size(dev, &init_attr->cap, init_attr->qp_type, qp); + if (err) + goto err; + + if (pd->uobject) { + struct mlx4_ib_create_qp ucmd; + + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err; + } + + qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, + qp->buf_size, 0); + if (IS_ERR(qp->umem)) { + err = PTR_ERR(qp->umem); + goto err; + } + + err = mlx4_mtt_init(dev->dev, ib_umem_page_count(qp->umem), + ilog2(qp->umem->page_size), &qp->mtt); + if (err) + goto err_buf; + + err = 
mlx4_ib_umem_write_mtt(dev, &qp->mtt, qp->umem); + if (err) + goto err_mtt; + + err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), + ucmd.db_addr, &qp->db); + if (err) + goto err_mtt; + } else { + err = mlx4_ib_db_alloc(dev, &qp->db, 0); + if (err) + goto err; + + *qp->db.db = 0; + + if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) { + err = -ENOMEM; + goto err_db; + } + + err = mlx4_mtt_init(dev->dev, qp->buf.npages, qp->buf.page_shift, + &qp->mtt); + if (err) + goto err_buf; + + err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf); + if (err) + goto err_mtt; + + for (i = 0; i < qp->sq.max; ++i) { + ctrl = get_send_wqe(qp, i); + ctrl->owner_opcode = cpu_to_be32(1 << 31); + } + + qp->sq.wrid = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL); + qp->rq.wrid = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL); + + if (!qp->sq.wrid || !qp->rq.wrid) { + err = -ENOMEM; + goto err_wrid; + } + } + + err = mlx4_qp_alloc(dev->dev, sqpn, &qp->mqp); + if (err) + goto err_wrid; + + /* + * Hardware wants QPN written in big-endian order (after + * shifting) for send doorbell. Precompute this value to save + * a little bit when posting sends. 
+ */ + qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; + + qp->mqp.event = mlx4_ib_qp_event; + + return 0; + +err_wrid: + if (pd->uobject) + mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db); + else { + kfree(qp->sq.wrid); + kfree(qp->rq.wrid); + } + +err_mtt: + mlx4_mtt_cleanup(dev->dev, &qp->mtt); + +err_buf: + if (pd->uobject) + ib_umem_release(qp->umem); + else + mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); + +err_db: + if (!pd->uobject) + mlx4_ib_db_free(dev, &qp->db); + +err: + return err; +} + +static enum mlx4_qp_state to_mlx4_state(enum ib_qp_state state) +{ + switch (state) { + case IB_QPS_RESET: return MLX4_QP_STATE_RST; + case IB_QPS_INIT: return MLX4_QP_STATE_INIT; + case IB_QPS_RTR: return MLX4_QP_STATE_RTR; + case IB_QPS_RTS: return MLX4_QP_STATE_RTS; + case IB_QPS_SQD: return MLX4_QP_STATE_SQD; + case IB_QPS_SQE: return MLX4_QP_STATE_SQER; + case IB_QPS_ERR: return MLX4_QP_STATE_ERR; + default: return -1; + } +} + +static void mlx4_ib_lock_cqs(struct mlx4_ib_cq *send_cq, struct mlx4_ib_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_lock_irq(&send_cq->lock); + else if (send_cq->mcq.cqn < recv_cq->mcq.cqn) { + spin_lock_irq(&send_cq->lock); + spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock_irq(&recv_cq->lock); + spin_lock_nested(&send_cq->lock, SINGLE_DEPTH_NESTING); + } +} + +static void mlx4_ib_unlock_cqs(struct mlx4_ib_cq *send_cq, struct mlx4_ib_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_unlock_irq(&send_cq->lock); + else if (send_cq->mcq.cqn < recv_cq->mcq.cqn) { + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + } else { + spin_unlock(&send_cq->lock); + spin_unlock_irq(&recv_cq->lock); + } +} + +static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp, + int is_user) +{ + struct mlx4_ib_cq *send_cq, *recv_cq; 
+ + if (qp->state != IB_QPS_RESET) + if (mlx4_qp_modify(dev->dev, NULL, to_mlx4_state(qp->state), + MLX4_QP_STATE_RST, NULL, 0, 0, &qp->mqp)) + printk(KERN_WARNING "mlx4_ib: modify QP %06x to RESET failed.\n", + qp->mqp.qpn); + + send_cq = to_mcq(qp->ibqp.send_cq); + recv_cq = to_mcq(qp->ibqp.recv_cq); + + mlx4_ib_lock_cqs(send_cq, recv_cq); + + if (!is_user) { + mlx4_ib_cq_clean(recv_cq, qp->mqp.qpn, + qp->ibqp.srq ? to_msrq(qp->ibqp.srq): NULL); + if (send_cq != recv_cq) + mlx4_ib_cq_clean(send_cq, qp->mqp.qpn, NULL); + } + + mlx4_qp_remove(dev->dev, &qp->mqp); + + mlx4_ib_unlock_cqs(send_cq, recv_cq); + + mlx4_qp_free(dev->dev, &qp->mqp); + mlx4_mtt_cleanup(dev->dev, &qp->mtt); + + if (is_user) { + mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context), + &qp->db); + ib_umem_release(qp->umem); + } else { + kfree(qp->sq.wrid); + kfree(qp->rq.wrid); + mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); + mlx4_ib_db_free(dev, &qp->db); + } +} + +struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct mlx4_ib_sqp *sqp; + struct mlx4_ib_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + err = create_qp_common(dev, pd, init_attr, udata, 0, qp); + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + qp->ibqp.qp_num = qp->mqp.qpn; + + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + /* Userspace is not allowed to create special QPs: */ + if (pd->uobject) + return ERR_PTR(-EINVAL); + + sqp = kmalloc(sizeof *sqp, GFP_KERNEL); + if (!sqp) + return ERR_PTR(-ENOMEM); + + qp = &sqp->qp; + + err = create_qp_common(dev, pd, init_attr, udata, + dev->dev->caps.sqp_start + + (init_attr->qp_type == IB_QPT_SMI ? 
0 : 2) + + init_attr->port_num - 1, + qp); + if (err) { + kfree(sqp); + return ERR_PTR(err); + } + + qp->port = init_attr->port_num; + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-EINVAL); + } + + return &qp->ibqp; +} + +int mlx4_ib_destroy_qp(struct ib_qp *qp) +{ + struct mlx4_ib_dev *dev = to_mdev(qp->device); + struct mlx4_ib_qp *mqp = to_mqp(qp); + + if (is_qp0(dev, mqp)) + mlx4_CLOSE_PORT(dev->dev, mqp->port); + + destroy_qp_common(dev, mqp, !!qp->pd->uobject); + + if (is_sqp(dev, mqp)) + kfree(to_msqp(mqp)); + else + kfree(mqp); + + return 0; +} + +static void init_port(struct mlx4_ib_dev *dev, int port) +{ + struct mlx4_init_port_param param; + int err; + + memset(¶m, 0, sizeof param); + + param.port_width_cap = dev->dev->caps.port_width_cap; + param.vl_cap = dev->dev->caps.vl_cap; + param.mtu = ib_mtu_enum_to_int(dev->dev->caps.mtu_cap); + param.max_gid = dev->dev->caps.gid_table_len; + param.max_pkey = dev->dev->caps.pkey_table_len; + + err = mlx4_INIT_PORT(dev->dev, ¶m, port); + if (err) + printk(KERN_WARNING "INIT_PORT failed, return code %d.\n", err); +} + +static int to_mlx4_st(enum ib_qp_type type) +{ + switch (type) { + case IB_QPT_RC: return MLX4_QP_ST_RC; + case IB_QPT_UC: return MLX4_QP_ST_UC; + case IB_QPT_UD: return MLX4_QP_ST_UD; + case IB_QPT_SMI: + case IB_QPT_GSI: return MLX4_QP_ST_MLX; + default: return -1; + } +} + +static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, struct ib_qp_attr *attr, + int attr_mask) +{ + u8 dest_rd_atomic; + u32 access_flags; + u32 hw_access_flags = 0; + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + dest_rd_atomic = attr->max_dest_rd_atomic; + else + dest_rd_atomic = qp->resp_depth; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + access_flags = attr->qp_access_flags; + else + access_flags = qp->atomic_rd_en; + + if (!dest_rd_atomic) + access_flags &= IB_ACCESS_REMOTE_WRITE; + + if (access_flags & IB_ACCESS_REMOTE_READ) + 
hw_access_flags |= MLX4_QP_BIT_RRE; + if (access_flags & IB_ACCESS_REMOTE_ATOMIC) + hw_access_flags |= MLX4_QP_BIT_RAE; + if (access_flags & IB_ACCESS_REMOTE_WRITE) + hw_access_flags |= MLX4_QP_BIT_RWE; + + return cpu_to_be32(hw_access_flags); +} + +static void store_sqp_attrs(struct mlx4_ib_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port) +{ + path->sched_queue = (path->sched_queue & 0xbf) | ((port - 1) << 6); +} + +static int mlx4_set_path(struct mlx4_ib_dev *dev, struct ib_ah_attr *ah, + struct mlx4_qp_path *path, u8 port) +{ + path->grh_mylmc = ah->src_path_bits & 0x7f; + path->rlid = cpu_to_be16(ah->dlid); + if (ah->static_rate) { + path->static_rate = ah->static_rate + MLX4_STAT_RATE_OFFSET; + while (path->static_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << path->static_rate & dev->dev->caps.stat_rate_support)) + --path->static_rate; + } else + path->static_rate = 0; + path->counter_index = 0xff; + + if (ah->ah_flags & IB_AH_GRH) { + if (ah->grh.sgid_index >= dev->dev->caps.gid_table_len) { + printk(KERN_ERR "sgid_index (%u) too large. 
max is %d\n", + ah->grh.sgid_index, dev->dev->caps.gid_table_len - 1); + return -1; + } + + path->grh_mylmc |= 1 << 7; + path->mgid_index = ah->grh.sgid_index; + path->hop_limit = ah->grh.hop_limit; + path->tclass_flowlabel = + cpu_to_be32((ah->grh.traffic_class << 20) | + (ah->grh.flow_label)); + memcpy(path->rgid, ah->grh.dgid.raw, 16); + } + + path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE | + ((port - 1) << 6) | ((ah->sl & 0xf) << 2); + + return 0; +} + +int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(ibqp->device); + struct mlx4_ib_qp *qp = to_mqp(ibqp); + struct mlx4_qp_context *context; + enum mlx4_qp_optpar optpar = 0; + enum ib_qp_state cur_state, new_state; + int sqd_event; + int err = -EINVAL; + + context = kzalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + mutex_lock(&qp->mutex); + + cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; + new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state; + + if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) + goto out; + + if ((attr_mask & IB_QP_PKEY_INDEX) && + attr->pkey_index >= dev->dev->caps.pkey_table_len) { + goto out; + } + + if ((attr_mask & IB_QP_PORT) && + (attr->port_num == 0 || attr->port_num > dev->dev->caps.num_ports)) { + goto out; + } + + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && + attr->max_rd_atomic > dev->dev->caps.max_qp_init_rdma) { + goto out; + } + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && + attr->max_dest_rd_atomic > 1 << dev->dev->caps.max_qp_dest_rdma) { + goto out; + } + + context->flags = cpu_to_be32((to_mlx4_state(new_state) << 28) | + (to_mlx4_st(ibqp->qp_type) << 16)); + context->flags |= cpu_to_be32(1 << 8); /* DE? 
*/ + + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + context->flags |= cpu_to_be32(MLX4_QP_PM_MIGRATED << 11); + else { + optpar |= MLX4_QP_OPTPAR_PM_STATE; + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + context->flags |= cpu_to_be32(MLX4_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + context->flags |= cpu_to_be32(MLX4_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + context->flags |= cpu_to_be32(MLX4_QP_PM_ARMED << 11); + break; + } + } + + if (ibqp->qp_type == IB_QPT_GSI || ibqp->qp_type == IB_QPT_SMI || + ibqp->qp_type == IB_QPT_UD) + context->mtu_msgmax = (IB_MTU_4096 << 5) | 11; + else if (attr_mask & IB_QP_PATH_MTU) { + if (attr->path_mtu < IB_MTU_256 || attr->path_mtu > IB_MTU_4096) { + printk(KERN_ERR "path MTU (%u) is invalid\n", + attr->path_mtu); + goto out; + } + context->mtu_msgmax = (attr->path_mtu << 5) | 31; + } + + if (qp->rq.max) + context->rq_size_stride = ilog2(qp->rq.max) << 3; + context->rq_size_stride |= qp->rq.wqe_shift - 4; + + if (qp->sq.max) + context->sq_size_stride = ilog2(qp->sq.max) << 3; + context->sq_size_stride |= qp->sq.wqe_shift - 4; + + /*FIXME monkeying with UAR internals?*/ + if (qp->ibqp.uobject) + context->usr_page = cpu_to_be32(to_mucontext(ibqp->uobject->context)->uar.index); + else + context->usr_page = cpu_to_be32(dev->priv_uar.index); + /*FIXME looking at mqp.qpn? 
*/ + context->local_qpn = cpu_to_be32(qp->mqp.qpn); + if (attr_mask & IB_QP_DEST_QPN) + context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + + if (attr_mask & IB_QP_PORT) { + if (cur_state == IB_QPS_SQD && new_state == IB_QPS_SQD && + !(attr_mask & IB_QP_AV)) { + mlx4_set_sched(&context->pri_path, attr->port_num); + optpar |= MLX4_QP_OPTPAR_SCHED_QUEUE; + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + context->pri_path.pkey_index = attr->pkey_index; + optpar |= MLX4_QP_OPTPAR_PKEY_INDEX; + } + + if (attr_mask & IB_QP_AV) { + if (mlx4_set_path(dev, &attr->ah_attr, &context->pri_path, + attr_mask & IB_QP_PORT ? attr->port_num : qp->port)) { + err = -EINVAL; + goto out; + } + + optpar |= (MLX4_QP_OPTPAR_PRIMARY_ADDR_PATH | + MLX4_QP_OPTPAR_SCHED_QUEUE); + } + + if (attr_mask & IB_QP_TIMEOUT) { + context->pri_path.ackto = attr->timeout << 3; + optpar |= MLX4_QP_OPTPAR_ACK_TIMEOUT; + } + + if (attr_mask & IB_QP_ALT_PATH) { + if (attr->alt_pkey_index >= dev->dev->caps.pkey_table_len) + goto out; + + if (attr->alt_port_num == 0 || + attr->alt_port_num > dev->dev->caps.num_ports) + goto out; + + if (mlx4_set_path(dev, &attr->alt_ah_attr, &context->alt_path, + attr->alt_port_num)) + goto out; + + context->alt_path.pkey_index = attr->alt_pkey_index; + context->alt_path.ackto = attr->alt_timeout << 3; + optpar |= MLX4_QP_OPTPAR_ALT_ADDR_PATH; + } + + context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pdn); + context->params1 = cpu_to_be32(MLX4_IB_ACK_REQ_FREQ << 28); + + if (attr_mask & IB_QP_RNR_RETRY) { + context->params1 |= cpu_to_be32(attr->rnr_retry << 13); + optpar |= MLX4_QP_OPTPAR_RNR_RETRY; + } + + if (attr_mask & IB_QP_RETRY_CNT) { + context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + optpar |= MLX4_QP_OPTPAR_RETRY_COUNT; + } + + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { + if (attr->max_rd_atomic) + context->params1 |= + cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21); + optpar |= MLX4_QP_OPTPAR_SRA_MAX; + } + + if (attr_mask & IB_QP_SQ_PSN) + 
context->next_send_psn = cpu_to_be32(attr->sq_psn); + /*FIXME poking into CQ for cqn?*/ + context->cqn_send = cpu_to_be32(to_mcq(ibqp->send_cq)->mcq.cqn); + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { + if (attr->max_dest_rd_atomic) + context->params2 |= + cpu_to_be32(fls(attr->max_dest_rd_atomic - 1) << 21); + optpar |= MLX4_QP_OPTPAR_RRA_MAX; + } + + if (attr_mask & (IB_QP_ACCESS_FLAGS | IB_QP_MAX_DEST_RD_ATOMIC)) { + context->params2 |= to_mlx4_access_flags(qp, attr, attr_mask); + optpar |= MLX4_QP_OPTPAR_RWE | MLX4_QP_OPTPAR_RRE | MLX4_QP_OPTPAR_RAE; + } + + if (ibqp->srq) + context->params2 |= cpu_to_be32(MLX4_QP_BIT_RIC); + + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + optpar |= MLX4_QP_OPTPAR_RNR_TIMEOUT; + } + if (attr_mask & IB_QP_RQ_PSN) + context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /*FIXME poking into CQ for cqn?*/ + context->cqn_recv = cpu_to_be32(to_mcq(ibqp->recv_cq)->mcq.cqn); + + if (attr_mask & IB_QP_QKEY) { + context->qkey = cpu_to_be32(attr->qkey); + optpar |= MLX4_QP_OPTPAR_Q_KEY; + } + + if (ibqp->srq) + context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn); + + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + context->db_rec_addr = cpu_to_be64(qp->db.dma); + + if (cur_state == IB_QPS_INIT && + new_state == IB_QPS_RTR && + (ibqp->qp_type == IB_QPT_GSI || ibqp->qp_type == IB_QPT_SMI || + ibqp->qp_type == IB_QPT_UD)) { + context->pri_path.sched_queue = (qp->port - 1) << 6; + if (is_qp0(dev, qp)) + context->pri_path.sched_queue |= MLX4_IB_DEFAULT_QP0_SCHED_QUEUE; + else + context->pri_path.sched_queue |= MLX4_IB_DEFAULT_SCHED_QUEUE; + } + + if (cur_state == IB_QPS_RTS && new_state == IB_QPS_SQD && + attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY && attr->en_sqd_async_notify) + sqd_event = 1; + else + sqd_event = 0; + + err = mlx4_qp_modify(dev->dev, &qp->mtt, to_mlx4_state(cur_state), + to_mlx4_state(new_state), context, optpar, + sqd_event, 
&qp->mqp); + if (err) + goto out; + + qp->state = new_state; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + qp->atomic_rd_en = attr->qp_access_flags; + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + qp->resp_depth = attr->max_dest_rd_atomic; + if (attr_mask & IB_QP_PORT) + qp->port = attr->port_num; + if (attr_mask & IB_QP_ALT_PATH) + qp->alt_port = attr->alt_port_num; + + if (is_sqp(dev, qp)) + store_sqp_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we moved QP0 to RTR, bring the IB link up; if we moved + * QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && new_state == IB_QPS_RTR) + init_port(dev, qp->port); + + if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || new_state == IB_QPS_ERR)) + mlx4_CLOSE_PORT(dev->dev, qp->port); + } + + /* + * If we moved a kernel QP to RESET, clean up all old CQ + * entries and reinitialize the QP. + */ + if (new_state == IB_QPS_RESET && !ibqp->uobject) { + mlx4_ib_cq_clean(to_mcq(ibqp->recv_cq), qp->mqp.qpn, + ibqp->srq ? 
to_msrq(ibqp->srq): NULL); + if (ibqp->send_cq != ibqp->recv_cq) + mlx4_ib_cq_clean(to_mcq(ibqp->send_cq), qp->mqp.qpn, NULL); + + qp->rq.head = 0; + qp->rq.tail = 0; + qp->sq.head = 0; + qp->sq.tail = 0; + *qp->db.db = 0; + } + +out: + mutex_unlock(&qp->mutex); + kfree(context); + return err; +} + +static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr, + void *wqe) +{ + struct ib_device *ib_dev = &to_mdev(sqp->qp.ibqp.device)->ib_dev; + struct mlx4_wqe_mlx_seg *mlx = wqe; + struct mlx4_wqe_inline_seg *inl = wqe + sizeof *mlx; + struct mlx4_ib_ah *ah = to_mah(wr->wr.ud.ah); + u16 pkey; + int send_size; + int header_size; + int i; + + send_size = 0; + for (i = 0; i < wr->num_sge; ++i) + send_size += wr->sg_list[i].length; + + ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), &sqp->ud_header); + + sqp->ud_header.lrh.service_level = + be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; + sqp->ud_header.lrh.destination_lid = ah->av.dlid; + sqp->ud_header.lrh.source_lid = cpu_to_be16(ah->av.g_slid & 0x7f); + if (mlx4_ib_ah_grh_present(ah)) { + sqp->ud_header.grh.traffic_class = + (be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff; + sqp->ud_header.grh.flow_label = + ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24, + ah->av.gid_index, &sqp->ud_header.grh.source_gid); + memcpy(sqp->ud_header.grh.destination_gid.raw, + ah->av.dgid, 16); + } + + mlx->flags &= cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == + IB_LID_PERMISSIVE ? 
MLX4_WQE_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == IB_LID_PERMISSIVE) + sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_get_cached_pkey(ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); + else + ib_get_cached_pkey(ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); + sqp->ud_header.bth.pkey = cpu_to_be16(pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf); + + if (0) { + printk(KERN_ERR "built UD header of size %d:\n", header_size); + for (i = 0; i < header_size / 4; ++i) { + if (i % 8 == 0) + printk(" [%02x] ", i * 4); + printk(" %08x", + be32_to_cpu(((__be32 *) sqp->header_buf)[i])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + printk("\n"); + } + + inl->byte_count = cpu_to_be32(1 << 31 | header_size); + memcpy(inl + 1, sqp->header_buf, header_size); + + return ALIGN(sizeof (struct mlx4_wqe_inline_seg) + header_size, 16); +} + +int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mlx4_ib_qp *qp = to_mqp(ibqp); + void *wqe; + struct mlx4_wqe_ctrl_seg *ctrl; + unsigned long flags; + int nreq; + int err = 0; + int ind; + int size; + int i; + + spin_lock_irqsave(&qp->rq.lock, flags); + + ind = qp->sq.head; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + /* FIXME check overflow */ + + if (unlikely(wr->num_sge > qp->sq.max_gs)) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.max - 1)); + qp->sq.wrid[ind & (qp->sq.max - 1)] = wr->wr_id; + + ctrl->srcrb_flags = + (wr->send_flags & IB_SEND_SIGNALED ? + cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | + (wr->send_flags & IB_SEND_SOLICITED ? 
+ cpu_to_be32(MLX4_WQE_CTRL_SOLICITED) : 0) | + qp->sq_signal_bits; + + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ctrl->imm = wr->imm_data; + else + ctrl->imm = 0; + + wqe += sizeof *ctrl; + size = sizeof *ctrl / 16; + + switch (ibqp->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mlx4_wqe_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mlx4_wqe_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mlx4_wqe_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mlx4_wqe_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mlx4_wqe_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mlx4_wqe_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mlx4_wqe_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mlx4_wqe_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mlx4_wqe_atomic_seg); + size += (sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_atomic_seg)) / 16; + + break; + + case IB_WR_RDMA_READ: + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + ((struct mlx4_wqe_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mlx4_wqe_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mlx4_wqe_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mlx4_wqe_raddr_seg); + size += sizeof (struct mlx4_wqe_raddr_seg) / 16; + + break; + + default: + /* No extra segments required for sends */ + break; + } + break; + + case IB_QPT_UD: + memcpy(((struct mlx4_wqe_datagram_seg *) wqe)->av, + &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av)); + ((struct mlx4_wqe_datagram_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mlx4_wqe_datagram_seg *) wqe)->qkey = + 
cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mlx4_wqe_datagram_seg); + size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + break; + + case IB_QPT_SMI: + case IB_QPT_GSI: + err = build_mlx_header(to_msqp(qp), wr, ctrl); + if (err < 0) { + *bad_wr = wr; + goto out; + } + wqe += err; + size += err / 16; + + err = 0; + break; + + default: + break; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mlx4_wqe_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mlx4_wqe_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mlx4_wqe_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + + wqe += sizeof (struct mlx4_wqe_data_seg); + size += sizeof (struct mlx4_wqe_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC for MLX sends */ + if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { + ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mlx4_wqe_data_seg); + size += sizeof (struct mlx4_wqe_data_seg) / 16; + } + + ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? + MLX4_WQE_CTRL_FENCE : 0) | size; + + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + /*FIXME check opcode validity? */ + ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | + (ind & qp->sq.max ? cpu_to_be32(1 << 31) : 0); + + ++ind; + } + +out: + if (likely(nreq)) { + qp->sq.head += nreq; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + + writel(qp->doorbell_qpn, + to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL); + + /* + * Make sure doorbells don't leak out of SQ spinlock + * and reach the HCA out of order. 
+ */ + mmiowb(); + } + + spin_unlock_irqrestore(&qp->rq.lock, flags); + + return err; +} + +int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mlx4_ib_qp *qp = to_mqp(ibqp); + struct mlx4_wqe_data_seg *scat; + unsigned long flags; + int err = 0; + int nreq; + int ind; + int i; + + spin_lock_irqsave(&qp->rq.lock, flags); + + ind = qp->rq.head & (qp->rq.max - 1); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + /* FIXME check overflow */ + + if (unlikely(wr->num_sge > qp->rq.max_gs)) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + scat = get_recv_wqe(qp, ind); + + for (i = 0; i < wr->num_sge; ++i) { + scat[i].byte_count = cpu_to_be32(wr->sg_list[i].length); + scat[i].lkey = cpu_to_be32(wr->sg_list[i].lkey); + scat[i].addr = cpu_to_be64(wr->sg_list[i].addr); + } + + if (i < qp->rq.max_gs) { + scat[i].byte_count = 0; + scat[i].lkey = cpu_to_be32(MLX4_INVALID_LKEY); + scat[i].addr = 0; + } + + qp->rq.wrid[ind] = wr->wr_id; + + ind = (ind + 1) & (qp->rq.max - 1); + } + +out: + if (likely(nreq)) { + qp->rq.head += nreq; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + + *qp->db.db = cpu_to_be32(qp->rq.head & 0xffff); + } + + spin_unlock_irqrestore(&qp->rq.lock, flags); + + return err; +} diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c new file mode 100644 index 0000000..42ab4a8 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/srq.c @@ -0,0 +1,334 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "mlx4_ib.h" +#include "user.h" + +static void *get_wqe(struct mlx4_ib_srq *srq, int n) +{ + int offset = n << srq->msrq.wqe_shift; + + if (srq->buf.nbufs == 1) + return srq->buf.u.direct.buf + offset; + else + return srq->buf.u.page_list[offset >> PAGE_SHIFT].buf + + (offset & (PAGE_SIZE - 1)); +} + +static void mlx4_ib_srq_event(struct mlx4_srq *srq, enum mlx4_event type) +{ + struct ib_event event; + struct ib_srq *ibsrq = &to_mibsrq(srq)->ibsrq; + + if (ibsrq->event_handler) { + event.device = ibsrq->device; + event.element.srq = ibsrq; + switch (type) { + case MLX4_EVENT_TYPE_SRQ_LIMIT: + event.event = IB_EVENT_SRQ_LIMIT_REACHED; + break; + case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR: + event.event = IB_EVENT_SRQ_ERR; + break; + default: + printk(KERN_WARNING "mlx4_ib: Unexpected event type %d " + "on SRQ %06x\n", type, srq->srqn); + return; + } + + ibsrq->event_handler(&event, ibsrq->srq_context); + } +} + +struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct mlx4_ib_srq *srq; + struct mlx4_wqe_srq_next_seg *next; + int desc_size; + int buf_size; + int err; + int i; + + /* Sanity check SRQ size before proceeding */ + if (init_attr->attr.max_wr >= dev->dev->caps.max_srq_wqes || + init_attr->attr.max_sge > dev->dev->caps.max_srq_sge) + return ERR_PTR(-EINVAL); + + srq = kmalloc(sizeof *srq, GFP_KERNEL); + if (!srq) + return ERR_PTR(-ENOMEM); + + mutex_init(&srq->mutex); + spin_lock_init(&srq->lock); + srq->msrq.max = roundup_pow_of_two(init_attr->attr.max_wr + 1); + srq->msrq.max_gs = init_attr->attr.max_sge; + + desc_size = max(32UL, + roundup_pow_of_two(sizeof (struct mlx4_wqe_srq_next_seg) + + srq->msrq.max_gs * + sizeof (struct mlx4_wqe_data_seg))); + srq->msrq.wqe_shift = ilog2(desc_size); + + buf_size = srq->msrq.max * desc_size; + + if (pd->uobject) { + struct mlx4_ib_create_srq ucmd; + + if 
(ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err_srq; + } + + srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, + buf_size, 0); + if (IS_ERR(srq->umem)) { + err = PTR_ERR(srq->umem); + goto err_srq; + } + + err = mlx4_mtt_init(dev->dev, ib_umem_page_count(srq->umem), + ilog2(srq->umem->page_size), &srq->mtt); + if (err) + goto err_buf; + + err = mlx4_ib_umem_write_mtt(dev, &srq->mtt, srq->umem); + if (err) + goto err_mtt; + + err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), + ucmd.db_addr, &srq->db); + if (err) + goto err_mtt; + } else { + err = mlx4_ib_db_alloc(dev, &srq->db, 0); + if (err) + goto err_srq; + + *srq->db.db = 0; + + if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &srq->buf)) { + err = -ENOMEM; + goto err_db; + } + + srq->head = 0; + srq->tail = srq->msrq.max - 1; + srq->wqe_ctr = 0; + + for (i = 0; i < srq->msrq.max; ++i) { + next = get_wqe(srq, i); + next->next_wqe_index = + cpu_to_be16((i + 1) & (srq->msrq.max - 1)); + } + + err = mlx4_mtt_init(dev->dev, srq->buf.npages, srq->buf.page_shift, + &srq->mtt); + if (err) + goto err_buf; + + err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf); + if (err) + goto err_mtt; + + srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL); + if (!srq->wrid) { + err = -ENOMEM; + goto err_mtt; + } + } + + err = mlx4_srq_alloc(dev->dev, to_mpd(pd)->pdn, &srq->mtt, + srq->db.dma, &srq->msrq); + if (err) + goto err_wrid; + + srq->msrq.event = mlx4_ib_srq_event; + + if (pd->uobject) + if (ib_copy_to_udata(udata, &srq->msrq.srqn, sizeof (__u32))) { + err = -EFAULT; + goto err_wrid; + } + + init_attr->attr.max_wr = srq->msrq.max - 1; + + return &srq->ibsrq; + +err_wrid: + if (pd->uobject) + mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &srq->db); + else + kfree(srq->wrid); + +err_mtt: + mlx4_mtt_cleanup(dev->dev, &srq->mtt); + +err_buf: + if (pd->uobject) + ib_umem_release(srq->umem); + else + mlx4_buf_free(dev->dev, buf_size, &srq->buf); + 
+err_db: + if (!pd->uobject) + mlx4_ib_db_free(dev, &srq->db); + +err_srq: + kfree(srq); + + return ERR_PTR(err); +} + +int mlx4_ib_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(ibsrq->device); + struct mlx4_ib_srq *srq = to_msrq(ibsrq); + int ret; + + /* We don't support resizing SRQs (yet?) */ + if (attr_mask & IB_SRQ_MAX_WR) + return -EINVAL; + + if (attr_mask & IB_SRQ_LIMIT) { + if (attr->srq_limit >= srq->msrq.max) + return -EINVAL; + + mutex_lock(&srq->mutex); + ret = mlx4_srq_arm(dev->dev, &srq->msrq, attr->srq_limit); + mutex_unlock(&srq->mutex); + + if (ret) + return ret; + } + + return 0; +} + +int mlx4_ib_destroy_srq(struct ib_srq *srq) +{ + struct mlx4_ib_dev *dev = to_mdev(srq->device); + struct mlx4_ib_srq *msrq = to_msrq(srq); + + mlx4_srq_free(dev->dev, &msrq->msrq); + mlx4_mtt_cleanup(dev->dev, &msrq->mtt); + + if (srq->uobject) { + mlx4_ib_db_unmap_user(to_mucontext(srq->uobject->context), &msrq->db); + ib_umem_release(msrq->umem); + } else { + kfree(msrq->wrid); + mlx4_buf_free(dev->dev, msrq->msrq.max << msrq->msrq.wqe_shift, + &msrq->buf); + mlx4_ib_db_free(dev, &msrq->db); + } + + kfree(msrq); + + return 0; +} + +void mlx4_ib_free_srq_wqe(struct mlx4_ib_srq *srq, int wqe_index) +{ + struct mlx4_wqe_srq_next_seg *next; + + /* always called with interrupts disabled. 
*/ + spin_lock(&srq->lock); + + next = get_wqe(srq, srq->tail); + next->next_wqe_index = cpu_to_be16(wqe_index); + srq->tail = wqe_index; + + spin_unlock(&srq->lock); +} + +int mlx4_ib_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mlx4_ib_srq *srq = to_msrq(ibsrq); + struct mlx4_wqe_srq_next_seg *next; + struct mlx4_wqe_data_seg *scat; + unsigned long flags; + int err = 0; + int nreq; + int i; + + spin_lock_irqsave(&srq->lock, flags); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(wr->num_sge > srq->msrq.max_gs)) { + err = -EINVAL; + *bad_wr = wr; + break; + } + + srq->wrid[srq->head] = wr->wr_id; + + next = get_wqe(srq, srq->head); + srq->head = be16_to_cpu(next->next_wqe_index); + scat = (struct mlx4_wqe_data_seg *) (next + 1); + + for (i = 0; i < wr->num_sge; ++i) { + scat[i].byte_count = cpu_to_be32(wr->sg_list[i].length); + scat[i].lkey = cpu_to_be32(wr->sg_list[i].lkey); + scat[i].addr = cpu_to_be64(wr->sg_list[i].addr); + } + + if (i < srq->msrq.max_gs) { + scat[i].byte_count = 0; + scat[i].lkey = cpu_to_be32(MLX4_INVALID_LKEY); + scat[i].addr = 0; + } + } + + if (likely(nreq)) { + srq->wqe_ctr += nreq; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + + *srq->db.db = cpu_to_be32(srq->wqe_ctr); + } + + spin_unlock_irqrestore(&srq->lock, flags); + + return err; +} diff --git a/drivers/infiniband/hw/mlx4/user.h b/drivers/infiniband/hw/mlx4/user.h new file mode 100644 index 0000000..2f382d2 --- /dev/null +++ b/drivers/infiniband/hw/mlx4/user.h @@ -0,0 +1,91 @@ +/* + * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef MLX4_IB_USER_H +#define MLX4_IB_USER_H + +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define MLX4_IB_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct mlx4_ib_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 bf_reg_size; +}; + +struct mlx4_ib_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct mlx4_ib_create_cq { + __u64 buf_addr; + __u64 db_addr; +}; + +struct mlx4_ib_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct mlx4_ib_resize_cq { + __u64 buf_addr; +}; + +struct mlx4_ib_create_srq { + __u64 buf_addr; + __u64 db_addr; +}; + +struct mlx4_ib_create_srq_resp { + __u32 srqn; + __u32 reserved; +}; + +struct mlx4_ib_create_qp { + __u64 buf_addr; + __u64 db_addr; +}; + +#endif /* MLX4_IB_USER_H */ From rjwalsh at pathscale.com Fri Apr 20 17:31:03 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Fri, 20 Apr 2007 17:31:03 -0700 Subject: [ofa-general] ipath irq bug In-Reply-To: References: <200704201047.47093.bs@q-leap.de> Message-ID: <46295B47.6080305@pathscale.com> Roland Dreier wrote: > > [ 2651.241648] [] _spin_unlock_irq+0x28/0x2d > > [ 2651.247482] [] :ib_ipath:ipath_rc_rcv+0xf5b/0xf8e > > [ 2651.254058] [] :ib_ipath:ipath_lookup_qpn+0x4f/0x5a > > [ 2651.260791] [] :ib_ipath:ipath_qp_rcv+0x45/0x4e > > [ 2651.267203] [] :ib_ipath:ipath_ib_rcv+0x16a/0x1a8 > > [ 2651.273784] [] :ib_ipath:ipath_kreceive+0x42f/0x6b9 > > [ 2651.286397] [] :ib_ipath:ipath_ib_piobufavail+0x72/0x79 > > [ 2651.321572] [] :ib_ipath:ipath_intr+0x26a/0x17b6 > > [edited slightly] > > It looks like ipath_intr() (presumably the interrupt handler) ends up > in ipath_rc_rcv() and in particular in ipath_rc_rcv_error() (which is > inlined), which calls spin_unlock_irq() from interrupt context, which > is the problem. > > Someone who knows the driver better than I would have to confirm this > analysis, and decide whether the fix is just to switch to > spin_lock_irqsave() in ipath_rc_rcv_error(), or if it needs to be more > elaborate. Yeah - that's it. We've just fixed it internally and will be sending a patch out shortly. Regards, Robert. 
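To illustrate the analysis quoted above: the following is a toy userspace model in C, not kernel code and not the ipath fix itself. A single boolean stands in for the CPU's interrupt-enable state, and the model_* helpers are hypothetical stand-ins for spin_lock_irq()/spin_unlock_irq() and spin_lock_irqsave()/spin_unlock_irqrestore(). It assumes the handler is entered with local interrupts disabled, and sketches why the _irq unlock is a problem there: it turns interrupts back on unconditionally, while the irqsave/irqrestore pair puts back whatever state it found.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for the CPU's interrupt-enable state. */
static bool irqs_enabled = true;

/* Hypothetical model of spin_lock_irq(): disable unconditionally. */
static void model_lock_irq(void)
{
	irqs_enabled = false;
}

/* Hypothetical model of spin_unlock_irq(): enable unconditionally. */
static void model_unlock_irq(void)
{
	irqs_enabled = true;
}

/* Hypothetical model of spin_lock_irqsave(): save state, then disable. */
static bool model_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Hypothetical model of spin_unlock_irqrestore(): restore saved state. */
static void model_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/*
 * Handler body using the _irq variants.  Returns the interrupt state
 * the rest of the handler would see after the critical section.
 */
static bool handler_with_irq_variant(void)
{
	irqs_enabled = false;	/* assumption: entered with IRQs off */
	model_lock_irq();
	model_unlock_irq();
	return irqs_enabled;
}

/* Same handler body using the irqsave/irqrestore variants. */
static bool handler_with_irqsave_variant(void)
{
	bool flags;

	irqs_enabled = false;	/* assumption: entered with IRQs off */
	flags = model_lock_irqsave();
	model_unlock_irqrestore(flags);
	return irqs_enabled;
}
```

Under that assumption, handler_with_irq_variant() returns true (interrupts were re-enabled behind the handler's back) and handler_with_irqsave_variant() returns false (the entry state is preserved).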
From vlad at lists.openfabrics.org Sat Apr 21 02:36:38 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 21 Apr 2007 02:36:38 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070421-0200 daily build status Message-ID: <20070421093639.10786E60821@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on 
powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed:
From sweitzen at cisco.com Sat Apr 21 23:02:16 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sat, 21 Apr 2007 23:02:16 -0700 Subject: [ofa-general] RE: Slow failover of IPoIB ipoibtools/bonding (bug 541) In-Reply-To: <20070420032649.GB613@mellanox.co.il> References: <20070420032649.GB613@mellanox.co.il> Message-ID: 10-second port failover test has been running with IPoIB UD ipoibtools HA for over 8 hours, and there have been very few slow failovers: $ grep seconds screenlog.7 | wc -l 29705 $ grep seconds screenlog.7 | fgrep -v "over 1." | fgrep -v "over 2." Interim result: 45.29 10^6bits/s over 53.21 seconds Interim result: 299.37 10^6bits/s over 7.34 seconds Interim result: 406.76 10^6bits/s over 5.84 seconds Interim result: 614.00 10^6bits/s over 3.91 seconds Interim result: 579.55 10^6bits/s over 4.06 seconds Interim result: 239.60 10^6bits/s over 10.19 seconds Scott > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Thursday, April 19, 2007 8:27 PM > To: Scott Weitzenkamp (sweitzen) > Cc: EWG; Roland Dreier (rdreier); Michael S. Tsirkin; Sean > Hefty; openib > Subject: Re: Slow failover of IPoIB ipoibtools/bonding (bug 541) > > > Quoting Scott Weitzenkamp (sweitzen) : > > Subject: Slow failover of IPoIB ipoibtools/bonding (bug 541) > > > > Roland, Michael, or Sean, this is what I see when IPoIB > failover is slow, how > > do we get this fixed? > > > > > > ib0: Request connection 0x60406 for gid > fe80:0000:0000:0000:0002:c902:0020:e1d9 > > qpn 0x404 > > ib0: REP received.
> > ib0: REQ arrived > > ib0: failed cm send event (status=12, wrid=45 vend_err 81) > > ib0: Destroy active connection 0x60406 head 0x6546f tail 0x6546e > > ib0: Request connection 0x70406 for gid > fe80:0000:0000:0000:0002:c902:0020:e1d9 > > qpn 0x404 > > Scott, this is a result of the port going down; the message is benign. > For simplicity, could you please check whether slow failover > is observed with > datagram mode? This takes a couple of variables out of the equation. > > -- > MST > From mst at dev.mellanox.co.il Sat Apr 21 23:16:10 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 22 Apr 2007 09:16:10 +0300 Subject: [ofa-general] Re: Slow failover of IPoIB ipoibtools/bonding (bug 541) In-Reply-To: References: <20070420032649.GB613@mellanox.co.il> Message-ID: <20070422061610.GB26791@mellanox.co.il> > 10-second port failover test has been running with IPoIB UD ipoibtools > HA for over 8 hours, and there have been very few slow failovers: > > $ grep seconds screenlog.7 | wc -l > 29705 > > $ grep seconds screenlog.7 | fgrep -v "over 1." | fgrep -v "over 2." > Interim result: 45.29 10^6bits/s over 53.21 seconds > Interim result: 299.37 10^6bits/s over 7.34 seconds > Interim result: 406.76 10^6bits/s over 5.84 seconds > Interim result: 614.00 10^6bits/s over 3.91 seconds > Interim result: 579.55 10^6bits/s over 4.06 seconds > Interim result: 239.60 10^6bits/s over 10.19 seconds > > Scott So it seems we are timing out the connection instead of getting RARP from the remote and tearing it down ourselves. I wonder whether the following (untested) patch improves things for you.
-- MST -------------- next part -------------- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index f2a40ae..a6d0594 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -682,6 +682,25 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) } neigh = *to_ipoib_neigh(skb->dst->neighbour); + if (unlikely(memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid))) && + likely(neigh->ah)) { + spin_lock(&priv->lock); + /* + * It's safe to call ipoib_put_ah() inside + * priv->lock here, because we know that + * path->ah will always hold one more reference, + * so ipoib_put_ah() will never do more than + * decrement the ref count. + */ + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock(&priv->lock); + ipoib_path_lookup(skb, dev); + goto out; + } if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { @@ -689,25 +708,6 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, - skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { - spin_lock(&priv->lock); - /* - * It's safe to call ipoib_put_ah() inside - * priv->lock here, because we know that - * path->ah will always hold one more reference, - * so ipoib_put_ah() will never do more than - * decrement the ref count. 
- */ - ipoib_put_ah(neigh->ah); - list_del(&neigh->list); - ipoib_neigh_free(dev, neigh); - spin_unlock(&priv->lock); - ipoib_path_lookup(skb, dev); - goto out; - } - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); goto out; } From vlad at lists.openfabrics.org Sun Apr 22 02:36:41 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 22 Apr 2007 02:36:41 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070422-0200 daily build status Message-ID: <20070422093641.5097BE60824@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on powerpc 
with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From ogerlitz at voltaire.com Sun Apr 22 02:46:28 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Apr 2007 12:46:28 +0300 Subject: [ofa-general] Re: IPoIB bonding document ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com> <45DAB3FD.8060606@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net> <4625C1C6.6040709@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net> Message-ID: <462B2EF4.4020109@voltaire.com> Tang, Changqing wrote: > Or: > the doc is too short, I hope to get some technical details. To educate yourself on bonding, get a copy of an upstream kernel and have a look at Documentation/networking/bonding.txt (and/or come to my high availability / bonding preso at Sonoma...) Or. From mst at dev.mellanox.co.il Sun Apr 22 03:37:04 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 22 Apr 2007 13:37:04 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> Message-ID: <20070422103704.GE26791@mellanox.co.il> > Using a local SA cache we were able to establish all-to-all connections between > 1024 processes (about 1 million connections) in about 3 seconds. Without the > cache, connection time took about a minute, and required a substantial amount of > tuning of timeout values to achieve this. Sean, I think there are some good ideas in this patch: - limiting the number of outstanding SA MADs - batching multiple path queries in a single table request I think these are likely to help many workloads, not just MPI all-to-all. I am very time-constrained currently due to the work on OFED 1.2 so I am responding to the design more than the code itself. Questions: 1. What happens on e.g. a heterogeneous network? It seems that the path to a specific GID might change (e.g. its MTU) without the GID going in/out of service. How would this be handled? 2. What will happen on a number of changes in the network? Would the SA not need to send a huge number of notices now? Should we be concerned? 3. Comments indicate that the main win from the patch is with all-to-all startup times on large MPI clusters. If that is so, and assuming a small number of MPI jobs is running on each node, isn't it true that the main win is not from *caching* as such (since all paths are requested at the beginning and never used after this), but rather from limiting the number of outstanding MADs to SA and from reusing multiple path queries in a single request? Could that be the case? 4. Why do we need yet another API and yet another module to speed up just RDMA/CM path record queries? We now get 2 ways to do this (with/without the cache). Shouldn't there be just one? 5.
How will the user guess the correct value for the paths_per_dest tunable, besides disabling the cache? I notice it is currently set to a value of 0x7F. Where does this value come from? > I've only updated the rdma_cm to use the cache, but similar changes could be > made to SRP and ipoib (which implements its own path record cache). > > I would like to get feedback on both the notice and local_sa patches for > inclusion in 2.6.22 or 2.6.23 (if 2.6.22 is not possible). Since OFED includes a significantly different version of this code (without notices), and this is the first time the notices code makes an appearance, I think that targeting .23, and considering alternative options such as the above, would be more prudent. -- MST From ogerlitz at voltaire.com Sun Apr 22 03:58:22 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Apr 2007 13:58:22 +0300 Subject: [ofa-general] bonding & ipoib patches to support bonding of ipoib devices (was: re [NET]: Fix neighbour destructor handling) In-Reply-To: <20070411094311.GK24730@mellanox.co.il> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> Message-ID: <462B3FCE.5050608@voltaire.com> Michael S. Tsirkin wrote: >> I did follow most of the discussions between you and MoniS re the >> ipoib/bonding integration in OFED 1.2 and elsewhere, however: I don't >> see why "bonding is basically broken for ipoib", if you don't mind, >> please tell me the bottom line from your perspective. > > Here's a short summary of issues I saw last time, I'm not sure > I haven't forgotten something but here goes: > > 1.Calling to_ipoib_neigh without device lock taken might be racy > I think you need to find another way to find the device.
> 2.Ah kept in the ipoib_neigh might belong to a device which is different > from the one start_xmit is called at. > 3.When the slave device goes down, master does not, and since > neighbours are matched to the master there's no guarantee they will be > cleaned up. > 4.Bonding module copies a pointer to the cleanup function in a manner > that is unsafe if ipoib is built as a module. > > I think these need to be addressed somehow before the patch's reposted. Michael, Roland, Following the high-availability/bonding session at Sonoma, I'd like to have a BOF to discuss the issues which from your perspective should be addressed before the patch set is merged upstream. Will you be around? Now, for 1,3,4 above i am quite confident to understand what Michael is saying, on what we agree and on what not... I just need a clarification on (2), can you educate me how can it happen? looking on the code (eg the below chains) my understanding is that address handles and struct ipoib_neigh are allocated 1:1 Or. > ipoib_neigh_alloc <-- neigh_add_path <-- ipoib_path_lookup <-- ipoib_start_xmit > ipoib_neigh_alloc <-- ipoib_mcast_send <-- ipoib_path_lookup <-- ipoib_start_xmit > ipoib_neigh_alloc <-- ipoib_mcast_send <-- ipoib_start_xmit > ipoib_create_ah <-- ipoib_mcast_join_finish > ipoib_create_ah <-- path_rec_completion From mst at dev.mellanox.co.il Sun Apr 22 04:44:01 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 22 Apr 2007 14:44:01 +0300 Subject: [ofa-general] Re: bonding & ipoib patches to support bonding of ipoib devices (was: re [NET]: Fix neighbour destructor handling) In-Reply-To: <462B3FCE.5050608@voltaire.com> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> <462B3FCE.5050608@voltaire.com> Message-ID: <20070422114401.GI26791@mellanox.co.il> > Quoting Or Gerlitz : > >2.Ah kept in the ipoib_neigh might belong to a device which is different > > from the one start_xmit is called at. > > Michael, Roland, > > Following the high-availability/bonding session at Sonoma, I'd like to > have a BOF to discuss the issues which from your perspective should be > addressed before the patch set is merged upstream. Will you be around? I won't be able to make it unfortunately. > I just need a clarification on (2), can you educate me how can it > happen? Consider a specific neighbour. Can the following happen? If no, why not? - a packet is sent through ib0, ipoib_neigh and ->ah are created - bonding switches to ib1 for some reason - before ipoib_neigh and ah are flushed, a new skb arrives. It is posted on ib1 with ah which belongs to ib0 -- MST
From ogerlitz at voltaire.com Sun Apr 22 05:37:41 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Apr 2007 15:37:41 +0300 Subject: [ofa-general] Re: bonding & ipoib patches to support bonding of ipoib devices In-Reply-To: <20070422114401.GI26791@mellanox.co.il> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> <462B3FCE.5050608@voltaire.com> <20070422114401.GI26791@mellanox.co.il> Message-ID: <462B5715.9030006@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >>> 2.Ah kept in the ipoib_neigh might belong to a device which is different >>> from the one start_xmit is called at. >> Michael, Roland, >> >> Following the high-availability/bonding session at Sonoma, I'd like to >> have a BOF to discuss the issues which from your perspective should be >> addressed before the patch set is merged upstream. Will you be around? > > I won't be able to make it unfortunately. OK, I see. I assume that this means we would have to continue at least this part of the discussion (down the road there are also configuration issues) over the list, let it be. >> I just need a clarification on (2), can you educate me how can it >> happen? > > Consider a specific neighbour. Can the following happen? If no, why not?
> - a packet is sent through ib0, ipoib_neigh and ->ah are created > - bonding switches to ib1 for some reason > - before ipoib_neigh and ah are flushed, a new skb arrives. It is posted on > ib1 with ah which belongs to ib0 oh well, this is very rare but possible. IB wise, what would happen on this case to the device (ib1 in your example) QP? will this wrong ah usage in the WR cause it get into the error case? Or. From mst at dev.mellanox.co.il Sun Apr 22 05:51:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 22 Apr 2007 15:51:30 +0300 Subject: [ofa-general] Re: bonding & ipoib patches to support bonding of ipoib devices In-Reply-To: <462B5715.9030006@voltaire.com> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> <462B3FCE.5050608@voltaire.com> <20070422114401.GI26791@mellanox.co.il> <462B5715.9030006@voltaire.com> Message-ID: <20070422125130.GN26791@mellanox.co.il> > >>I just need a clarification on (2), can you educate me how can it > >>happen? > > > >Consider a specific neighbour. Can the following happen? If no, why not? > >- a packet is sent through ib0, ipoib_neigh and ->ah are created > >- bonding switches to ib1 for some reason > >- before ipoib_neigh and ah are flushed, a new skb arrives. It is posted on > > ib1 with ah which belongs to ib0 > > oh well, this is very rare but possible. IB wise, what would happen on > this case to the device (ib1 in your example) QP? will this wrong ah > usage in the WR cause it get into the error case? Hard to say, I think the behaviour is unspecified. And even at the IPoIB level, note that at this point you are locking tx_lock at ib1 but not at ib0, so both ipoib_neigh and ->ah might disappear under your feet. 
-- MST From ogerlitz at voltaire.com Sun Apr 22 06:20:38 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Apr 2007 16:20:38 +0300 Subject: [ofa-general] Re: bonding & ipoib patches to support bonding of ipoib devices In-Reply-To: <20070422125130.GN26791@mellanox.co.il> References: <461B9830.3040507@voltaire.com> <20070410140818.GC4616@mellanox.co.il> <461B9CD1.3090709@voltaire.com> <20070410142659.GE4616@mellanox.co.il> <461BA093.6000100@voltaire.com> <20070410153003.GF4616@mellanox.co.il> <20070411094311.GK24730@mellanox.co.il> <462B3FCE.5050608@voltaire.com> <20070422114401.GI26791@mellanox.co.il> <462B5715.9030006@voltaire.com> <20070422125130.GN26791@mellanox.co.il> Message-ID: <462B6126.7040703@voltaire.com> Michael S. Tsirkin wrote: >>>> I just need a clarification on (2), can you educate me how can it >>>> happen? >>> Consider a specific neighbour. Can the following happen? If no, why not? >>> - a packet is sent through ib0, ipoib_neigh and ->ah are created >>> - bonding switches to ib1 for some reason >>> - before ipoib_neigh and ah are flushed, a new skb arrives. It is posted on >>> ib1 with ah which belongs to ib0 >> oh well, this is very rare but possible. IB wise, what would happen on >> this case to the device (ib1 in your example) QP? will this wrong ah >> usage in the WR cause it get into the error case? > Hard to say, I think the behavior is unspecified. > And even at the IPoIB level, note that at this point you are > locking tx_lock at ib1 but not at ib0, so both ipoib_neigh and ->ah > might disappear under your feet. I think I see a few possible solutions to this problem, but I want to dig into it more first. Thanks for pointing it out and clarifying... Or.
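[Editorial sketch, not part of the thread] The scenario discussed above is an address handle created for ib0 being posted on ib1 after bonding switches slaves. One conceivable guard, shown purely as an illustration, is to record which device owns each handle and to fall back to a fresh lookup when the transmitting device differs. This is a userspace toy with hypothetical toy_* types; it is not IPoIB code, it is not a fix proposed by anyone in the thread, and it does nothing about the tx_lock concern raised in the same exchange.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a net device, an address handle and a neighbour. */
struct toy_dev   { const char *name; };
struct toy_ah    { const struct toy_dev *owner; };  /* device the handle was created on */
struct toy_neigh { struct toy_ah *ah; };

enum toy_xmit_action { TOY_SEND, TOY_RELOOKUP };

/*
 * Decide what the transmit path should do for this neighbour on this device:
 * send only when a handle exists and was created on the same device,
 * otherwise ask for a new path lookup instead of using a foreign handle.
 */
enum toy_xmit_action toy_xmit_decide(const struct toy_neigh *neigh,
                                     const struct toy_dev *dev)
{
    if (neigh == NULL || neigh->ah == NULL)
        return TOY_RELOOKUP;
    if (neigh->ah->owner != dev)
        return TOY_RELOOKUP;
    return TOY_SEND;
}
```

With a handle owned by ib0, the function returns TOY_SEND for ib0 and TOY_RELOOKUP for ib1. A real solution would also have to settle who frees the stale neighbour and under which lock, which this sketch leaves open.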
From jackm at dev.mellanox.co.il Sun Apr 22 07:31:26 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 22 Apr 2007 17:31:26 +0300 Subject: [ofa-general] [PATCH] mlx4: fix destroy qp deadlock Message-ID: <200704221731.27147.jackm@dev.mellanox.co.il> Need to use cq_clean function which does not take spinlocks here. (cq locks are already taken by call to mlx4_ib_lock_cqs). Signed-off-by: Jack Morgenstein diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 1f51bfd..66ae262 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -431,10 +431,10 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp, mlx4_ib_lock_cqs(send_cq, recv_cq); if (!is_user) { - mlx4_ib_cq_clean(recv_cq, qp->mqp.qpn, + __mlx4_ib_cq_clean(recv_cq, qp->mqp.qpn, qp->ibqp.srq ? to_msrq(qp->ibqp.srq): NULL); if (send_cq != recv_cq) - mlx4_ib_cq_clean(send_cq, qp->mqp.qpn, NULL); + __mlx4_ib_cq_clean(send_cq, qp->mqp.qpn, NULL); } mlx4_qp_remove(dev->dev, &qp->mqp); From kliteyn at dev.mellanox.co.il Sun Apr 22 08:19:28 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 22 Apr 2007 18:19:28 +0300 Subject: [ofa-general] [PATCH] osm: fixing small memory leak Message-ID: <462B7D00.6060707@dev.mellanox.co.il> Hi Hal, This patch fixes a small memory leak - OpenSM was leaking ~200 bytes or more for each guid in the fabric each time the guid2lid file was re-read. Please apply to ofed_1_2 and trunk. Thanks. 
Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_lid_mgr.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c index bd5945f..d856fb0 100644 --- a/osm/opensm/osm_lid_mgr.c +++ b/osm/opensm/osm_lid_mgr.c @@ -247,6 +247,7 @@ __osm_lid_mgr_validate_db( cl_ptr_vector_set( &p_mgr->used_lids, lid, (void *)1); } } /* got a lid */ + free(p_item); p_item = (osm_db_guid_elem_t*)cl_qlist_remove_head(&guids); } /* all guids */ Exit: -- 1.4.4.1.GIT From rdreier at cisco.com Sun Apr 22 10:24:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 22 Apr 2007 10:24:03 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: fix destroy qp deadlock In-Reply-To: <200704221731.27147.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 22 Apr 2007 17:31:26 +0300") References: <200704221731.27147.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From kliteyn at dev.mellanox.co.il Sun Apr 22 12:35:43 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 22 Apr 2007 22:35:43 +0300 Subject: [ofa-general] Re: [PATCH] osm: fixing small memory leak In-Reply-To: <462B7D00.6060707@dev.mellanox.co.il> References: <462B7D00.6060707@dev.mellanox.co.il> Message-ID: <462BB90F.30709@dev.mellanox.co.il> Yevgeny Kliteynik wrote: > Hi Hal, > > This patch fixes a small memory leak - OpenSM was leaking ~200 bytes Correction (not that it matters too much) - 32 bytes, not 200. -- Yevgeny > or more for each guid in the fabric each time the guid2lid file was > re-read. > > Please apply to ofed_1_2 and trunk. > > Thanks. 
> > Signed-off-by: Yevgeny Kliteynik > --- > osm/opensm/osm_lid_mgr.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c > index bd5945f..d856fb0 100644 > --- a/osm/opensm/osm_lid_mgr.c > +++ b/osm/opensm/osm_lid_mgr.c > @@ -247,6 +247,7 @@ __osm_lid_mgr_validate_db( > cl_ptr_vector_set( &p_mgr->used_lids, lid, (void *)1); > } > } /* got a lid */ > + free(p_item); > p_item = (osm_db_guid_elem_t*)cl_qlist_remove_head(&guids); > } /* all guids */ > Exit:
From mst at dev.mellanox.co.il Sun Apr 22 21:42:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Apr 2007 07:42:15 +0300 Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.2 Message-ID: <20070423044215.GC4579@mellanox.co.il> I suggest we upgrade git on openfabrics.org to 1.5.1.2: there seem to have been several important bugfixes. Sasha? ----- Forwarded message from Junio C Hamano ----- Subject: [ANNOUNCE] GIT 1.5.1.2 Date: Sun, 22 Apr 2007 09:16:55 +0300 In-Reply-To: <7vhcrml4wx.fsf at assigned-by-dhcp.cox.net> (Junio C.
Hamano's message of "Wed, 11 Apr 2007 19:09:34 -0700")
References: <7vhcrml4wx.fsf at assigned-by-dhcp.cox.net>
From: Junio C Hamano

The latest maintenance release GIT 1.5.1.2 is available at the
usual places:

  http://www.kernel.org/pub/software/scm/git/

  git-1.5.1.2.tar.{gz,bz2}                  (tarball)
  git-htmldocs-1.5.1.2.tar.{gz,bz2}         (preformatted docs)
  git-manpages-1.5.1.2.tar.{gz,bz2}         (preformatted docs)
  RPMS/$arch/git-*-1.5.1.2-1.$arch.rpm      (RPM)

GIT v1.5.1.2 Release Notes
==========================

Fixes since v1.5.1.1
--------------------

* Bugfixes

  - "git clone" over http from a repository that has lost the
    loose refs by running "git pack-refs" were broken (a code to
    deal with this was added to "git fetch" in v1.5.0, but it
    was missing from "git clone").

  - "git diff a/ b/" incorrectly fell in "diff between two
    filesystem objects" codepath, when the user most likely
    wanted to limit the extent of output to two tracked
    directories.

  - git-quiltimport had the same bug as we fixed for
    git-applymbox in v1.5.1.1 -- it gave an alarming "did not
    have any patch" message (but did not actually fail and was
    harmless).

  - various git-svn fixes.

  - Sample update hook incorrectly always refused requests to
    delete branches through push.

  - git-blame on a very long working tree path had buffer
    overrun problem.

  - git-apply did not like to be fed two patches in a row that
    created and then modified the same file.

  - git-svn was confused when a non-project was stored directly
    under trunk/, branches/ and tags/.

  - git-svn wants the Error.pm module that was at least as new
    as what we ship as part of git; install ours in our private
    installation location if the one on the system is older.

  - An earlier update to command line integer parameter parser was
    botched and made 'update-index --cacheinfo' completely useless.

* Documentation updates

  - Various documentation updates from J. Bruce Fields, Frank
    Lichtenheld, Alex Riesen and others. Andrew Ruder started a
    war on undocumented options.
----------------------------------------------------------------

Changes since v1.5.1.1 are as follows:

Alex Riesen (3):
      Use rev-list --reverse in git-rebase.sh
      Document -g (--walk-reflogs) option of git-log
      Fix overwriting of files when applying contextually independent diffs

Andrew Ruder (8):
      Update git-am documentation
      Update git-applymbox documentation
      Update git-apply documentation
      Update git-annotate/git-blame documentation
      Update git-archive documentation
      Update git-cherry-pick documentation
      Fix unmatched emphasis tag in git-tutorial
      Update git-config documentation

Andy Whitcroft (1):
      fix up strtoul_ui error handling

Carlos Rica (1):
      Use const qualifier for 'sha1' parameter in delete_ref function

Eric Wong (4):
      git-svn: respect lower bound of -r/--revision when following parent
      git-svn: quiet some warnings when run only with --version/--help
      git-svn: don't allow globs to match regular files
      perl: install private Error.pm if the site version is older than our own

Eygene Ryabinkin (2):
      Teach gitk to use the user-defined UI font everywhere.
      Improve look-and-feel of the gitk tool.

Frank Lichtenheld (5):
      config.txt: Document gitcvs.allbinary
      config.txt: Document core.autocrlf
      config.txt: Change pserver to server in description of gitcvs.*
      config.txt: Fix grammatical error in description of http.noEPSV
      git-shortlog: Fix two formatting errors in asciidoc documentation

Gerrit Pape (2):
      variable $projectdesc needs to be set before checking against unchanged default.
      Have sample update hook not refuse deleting a branch through push.

J. Bruce Fields (7):
      Documentation: minor edits of git-lost-found manpage
      Documentation: clarify git-checkout -f, minor editing
      Documentation: clarify track/no-track option.
      user-manual: fix discussion of default clone
      user-manual: detached HEAD
      user-manual: start revising "internals" chapter
      user-manual: use detached head when rewriting history

Jim Meyering (1):
      sscanf/strtoul: parse integers robustly

Junio C Hamano (4):
      Do not default to --no-index when given two directories.
      Start preparing for 1.5.1.2
      git-clone: fix dumb protocol transport to clone from pack-pruned ref
      GIT 1.5.1.2

Linus Torvalds (1):
      git-quiltimport complaining yet still working

Matthias Lederhofer (1):
      handle_options in git wrapper miscounts the options it handled.

Michael Spang (1):
      git-blame: Fix overrun in fake_working_tree_commit()

Sam Vilain (1):
      git-tar-tree: complete deprecation conversion message

Shawn O. Pearce (1):
      git-gui: Brown paper bag fix division by 0 in blame

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

----- End forwarded message -----

-- 
MST

From devesh28 at gmail.com Sun Apr 22 23:56:00 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Mon, 23 Apr 2007 12:26:00 +0530
Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache
In-Reply-To: <4628EB72.3080904@ichips.intel.com>
References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com>
 <309a667c0704192344gec04bd0uf9eec6c6413ea34@mail.gmail.com>
 <4628EB72.3080904@ichips.intel.com>
Message-ID: <309a667c0704222356p72dbd887o605268889bfb6358@mail.gmail.com>

Hello Sean,
Thanks for replying. Still I am not able to get it completely, I have
seen through the SA code please help me out in understanding SA design
as a whole, my query is as follows

If some client calls cma_resolve_ib_route(), and let's assume that its
local cache miss, and cma_query_ib_route() is called, this will send a
SA query to the SM node CMIIW, Now on SM node I am not able to figure
out that who will respond this GMP, and how requested attribute info
is collected?

On 4/20/07, Sean Hefty wrote:
> > Once SM is up on a node/switch whole network is up. Now is if some
> > client is trying to establish a connection with other node, client is
> > expected to resolve the path using sa API, I want to know how exactly
> > it happens in the stack?
>
> See patch 3/3 for the use of the cache. In that patch, the rdma_cm first checks
> to see if a suitable path record is available in the cache. If one is not
> found, it issues a query to the SA. The stack impact of using the cache is less
> than the impact of sending a path record query to the SA.
>
> > Is it possible to program local_sa_cache with some dummy path records
> > which resides in cache for long time?
>
> This would require changes to the current implementation.

what about this :
One user command, reading path record from some file and passing this
to local_sa_cache module using sysfs/ioctl, local_cache module is
assuming it as a incoming path_record query and adding it to the
cache.
possibly some device interface is required to be added to the module
if ioctl is used

-Devesh
>
> - Sean
>

From kliteyn at dev.mellanox.co.il Mon Apr 23 02:28:01 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 23 Apr 2007 12:28:01 +0300
Subject: [ofa-general] [PATCH] osm: source and destination strings overlap when using sprintf()
Message-ID: <462C7C21.7010004@dev.mellanox.co.il>

Hi Hal,

Fixing a problematic usage of sprintf() in osm_helper.c:

When using sprintf(), source and destination strings should
not overlap, otherwise the function behavior is undefined.

Please apply to ofed_1_2 and to master.

Thanks.

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik
---
 osm/opensm/osm_helper.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c
index 14474e7..1b43ba2 100644
--- a/osm/opensm/osm_helper.c
+++ b/osm/opensm/osm_helper.c
@@ -1147,6 +1147,7 @@ osm_dump_multipath_record(
 {
   int i;
   char buf_line[1024];
+  char tmp_buf_line[1024];
   ib_gid_t const *p_gid;
   if( osm_log_is_active( p_log, log_level ) )
@@ -1157,10 +1158,11 @@ osm_dump_multipath_record(
     {
       for (i = 0; i < p_mpr->sgid_count; i++)
       {
-        sprintf( buf_line, "%s\t\t\t\tsgid%02d.................."
+        sprintf( tmp_buf_line, "\t\t\t\tsgid%02d.................."
                  "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
-                 buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ),
+                 i + 1, cl_ntoh64( p_gid->unicast.prefix ),
                  cl_ntoh64( p_gid->unicast.interface_id ) );
+        strcat( buf_line, tmp_buf_line );
         p_gid++;
       }
     }
@@ -1168,10 +1170,11 @@ osm_dump_multipath_record(
     {
       for (i = 0; i < p_mpr->dgid_count; i++)
       {
-        sprintf( buf_line, "%s\t\t\t\tdgid%02d.................."
+        sprintf( tmp_buf_line, "\t\t\t\tdgid%02d.................."
                  "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
-                 buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ),
+                 i + 1, cl_ntoh64( p_gid->unicast.prefix ),
                  cl_ntoh64( p_gid->unicast.interface_id ) );
+        strcat( buf_line, tmp_buf_line );
         p_gid++;
       }
     }
@@ -1652,13 +1655,17 @@ osm_dump_pkey_block(
 {
   int i;
   char buf_line[1024];
+  char tmp_buf_line[1024];
   if( osm_log_is_active( p_log, log_level ) )
   {
     buf_line[0] = '\0';
     for (i = 0; i < 32; i++)
-      sprintf( buf_line,"%s 0x%04x |",
-               buf_line, cl_ntoh16(p_pkey_tbl->pkey_entry[i]));
+    {
+      sprintf( tmp_buf_line," 0x%04x |",
+               cl_ntoh16(p_pkey_tbl->pkey_entry[i]));
+      strcat( buf_line, tmp_buf_line );
+    }
     osm_log( p_log, log_level,
              "P_Key table dump:\n"
@@ -1687,16 +1694,23 @@ osm_dump_slvl_map_table(
   uint8_t i;
   char buf_line1[1024];
   char buf_line2[1024];
+  char tmp_buf_line[1024];
   if( osm_log_is_active( p_log, log_level ) )
   {
     buf_line1[0] = '\0';
     buf_line2[0] = '\0';
     for (i = 0; i < 16; i++)
-      sprintf( buf_line1,"%s %-2u |", buf_line1, i);
+    {
+      sprintf( tmp_buf_line," %-2u |", i);
+      strcat( buf_line1, tmp_buf_line );
+    }
     for (i = 0; i < 16; i++)
-      sprintf( buf_line2,"%s0x%01X |",
-               buf_line2, ib_slvl_table_get(p_slvl_tbl, i));
+    {
+      sprintf( tmp_buf_line,"0x%01X |",
+               ib_slvl_table_get(p_slvl_tbl, i));
+      strcat( buf_line2, tmp_buf_line );
+    }
     osm_log( p_log, log_level,
              "SLtoVL dump:\n"
              "\t\t\tport_guid............0x%016" PRIx64 "\n"
@@ -1724,17 +1738,24 @@ osm_dump_vl_arb_table(
   int i;
   char buf_line1[1024];
   char buf_line2[1024];
+  char tmp_buf_line[1024];
   if( osm_log_is_active( p_log, log_level ) )
   {
     buf_line1[0] = '\0';
     buf_line2[0] = '\0';
     for (i = 0; i < 32; i++)
-      sprintf( buf_line1,"%s 0x%01X |",
-               buf_line1, p_vla_tbl->vl_entry[i].vl);
+    {
+      sprintf( tmp_buf_line," 0x%01X |",
+               p_vla_tbl->vl_entry[i].vl);
+      strcat( buf_line1, tmp_buf_line );
+    }
     for (i = 0; i < 32; i++)
-      sprintf( buf_line2,"%s 0x%01X |",
-               buf_line2, p_vla_tbl->vl_entry[i].weight);
+    {
+      sprintf( tmp_buf_line," 0x%01X |",
+               p_vla_tbl->vl_entry[i].weight);
+      strcat( buf_line2, tmp_buf_line );
+    }
     osm_log( p_log, log_level,
              "VlArb dump:\n"
              "\t\t\tport_guid...........0x%016" PRIx64 "\n"
-- 
1.4.4.1.GIT

From vlad at lists.openfabrics.org Mon Apr 23 02:37:05 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 23 Apr 2007 02:37:05 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070423-0200 daily build status
Message-ID: <20070423093705.8E5A0E60828@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on x86_64 with
linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:

From philippe.gregoire at cea.fr Mon Apr 23 02:40:39 2007
From: philippe.gregoire at cea.fr (GREGOIRE Philippe)
Date: Mon, 23 Apr 2007 11:40:39 +0200
Subject: [ofa-general] [RFC] IB management changes proposal
In-Reply-To: <1176224960.14140.474623.camel@localhost.localdomain>
References: <1176224960.14140.474623.camel@localhost.localdomain>
Message-ID: <462C7F17.3040707@cea.fr>

Hal Rosenstock wrote:
> The following changes are proposed for IB management (master branch of
> my management git tree):
>
> In order to better match package names, the following directory names to
> be changed from->to:
> osm->opensm
> diags->openib-diags
>
> Since opensm is a system daemon, opensm to be moved from /usr/bin to /usr/sbin
>
> For consistency with the package name, /var/cache/osm moved to
> /var/cache/opensm
>
> Also, for consistency with the package name, all config, log, and dump files named osm*
> to be changed to opensm*
>
> To avoid confusion and possible conflicts in configuring daemon options,
> only have 1 configuration file (existence of both /etc/sysconfig/opensm
> and /etc/opensm.conf is problematic). Remove the /etc/sysconfig/opensm
> file and only use opensm.conf. Move opensm.conf to /etc/rdma (as
> discussed in the thread labeled "Location and naming of RDMA enablement
> stack rpm" on general at lists.openfabrics.org).
>
> Any comments ?
>
> -- Hal
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

The commands provided by openib-diags should also be installed in /usr/sbin,
as they are privileged system administrator commands.

There are also a few commands (ib*.pl) that use the file
/tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology
instead.

Philippe

From mst at dev.mellanox.co.il Mon Apr 23 03:17:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Apr 2007 13:17:38 +0300
Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf()
In-Reply-To: <462C7C21.7010004@dev.mellanox.co.il>
References: <462C7C21.7010004@dev.mellanox.co.il>
Message-ID: <20070423101738.GG4579@mellanox.co.il>

> Quoting Yevgeny Kliteynik :
> Subject: [PATCH] osm: source and destination strings overlap when using sprintf()
>
> Hi Hal,
>
> Fixing a problematic usage of sprintf() in osm_helper.c:
>
> When using sprintf(), source and destination strings should
> not overlap, otherwise the function behavior is undefined.
>
> Please apply to ofed_1_2 and to master.
>
> Thanks.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik
> ---
> osm/opensm/osm_helper.c | 47
> ++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 34 insertions(+), 13 deletions(-)

.... skip ...

>     for (i = 0; i < 32; i++)
> -      sprintf( buf_line2,"%s 0x%01X |",
> -               buf_line2, p_vla_tbl->vl_entry[i].weight);
> +    {
> +      sprintf( tmp_buf_line," 0x%01X |",
> +               p_vla_tbl->vl_entry[i].weight);
> +      strcat( buf_line2, tmp_buf_line );
> +    }
>     osm_log( p_log, log_level,
>              "VlArb dump:\n"
>              "\t\t\tport_guid...........0x%016" PRIx64 "\n"

These tmp-bufs are quite ugly, and bloat the code up.
Since you seem to do a strcat which does an strlen anyway,
how about, for example:

-      sprintf( buf_line1,"%s 0x%01x |",
-               buf_line1, p_vla_tbl->vl_entry[i].vl);
+      sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |",
+               p_vla_tbl->vl_entry[i].vl);

and so on in all the other places?

-- 
MST

From jsquyres at cisco.com Mon Apr 23 05:10:45 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 23 Apr 2007 08:10:45 -0400
Subject: [ofa-general] Reminder: NO EWG TELECONF TODAY
Message-ID:

Gentle reminder: there is no EWG teleconference today in observance
of the Israel holiday.

This week's teleconference was moved to Wednesday, 25 April:

- 11:30am US Pacific
- 2:30pm US Eastern
- 9:30pm Israel

I just noticed that I was an hour off when I moved the meeting
invitation last week. You will get a new meeting invite *FOR THIS
WEEK ONLY* shortly. Here is the updated information:

Global Access Numbers:
http://cisco.com/en/US/about/doing_business/conferencing/index.html

Meeting ID: 2102061

US/Canada: +1.866.432.9903
United Kingdom: +44.20.8824.0117
India: +91.80.4103.3979
Germany: +49.619.6773.9002
Israel: +972.9.892.7026

-- 
Jeff Squyres
Cisco Systems

From pradeep at us.ibm.com Mon Apr 23 08:31:13 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Mon, 23 Apr 2007 08:31:13 -0700
Subject: [ofa-general] Re: How to set local_ca_ack_delay?
In-Reply-To:
Message-ID:

If this is a reasonable thing to do, can someone tell me how to change
this? I did not see any API available to do this.

Pradeep
pradeep at us.ibm.com

Pradeep Satyanarayana/Beaverton/IBM
04/20/2007 01:35 PM

To: general at lists.openfabrics.org
cc:
Subject: How to set local_ca_ack_delay?

This appears to be set to 0 by default on some HCAs. How can I change
that? Basically I am trying to change the "Local Ack Timeout" and the IB
spec 1.2 (page 553) says the "Local CA Ack Delay" is used to compute the
timeout. That is why I want to change the local_ca_ack_delay. Is my
understanding correct?

Pradeep
pradeep at us.ibm.com

From rdreier at cisco.com Mon Apr 23 08:53:23 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Apr 2007 08:53:23 -0700
Subject: [ofa-general] Re: How to set local_ca_ack_delay?
In-Reply-To: (Pradeep Satyanarayana's message of "Mon, 23 Apr 2007 08:31:13 -0700")
References:
Message-ID:

Local CA ACK delay is not something that you set. It is read-only
information that the HCA is giving you about how long it will take in
the worst case to generate and send an ACK. If some HCA is returning
0 for this, then that is most likely a bug in the HCA or driver.

 - R.

From rdreier at cisco.com Mon Apr 23 09:02:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Apr 2007 09:02:09 -0700
Subject: [ofa-general] [PATCH] IPoIB/cm: Convert spin_lock_irqsave() to spin_lock_irq()
Message-ID:

There are quite a few places in ipoib_cm.c where we know IRQs are
enabled because we do something that sleeps in the same function, so
we can convert several occurrences of spin_lock_irqsave() to a plain
spin_lock_irq(). This cleans up the source a little and makes the
code smaller too:

add/remove: 0/0 grow/shrink: 1/5 up/down: 3/-51 (-48)
function                                     old     new   delta
ipoib_cm_tx_reap                             403     406      +3
ipoib_cm_stale_task                          146     145      -1
ipoib_cm_dev_stop                            173     172      -1
ipoib_cm_tx_handler                          964     956      -8
ipoib_cm_rx_handler                          956     937     -19
ipoib_cm_skb_reap                            212     190     -22

Signed-off-by: Roland Dreier
---
Does this seem OK to merge for 2.6.22?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 7a4af7a..da7e102 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -228,7 +228,6 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct net_device *dev = cm_id->context;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx *p;
-	unsigned long flags;
 	unsigned psn;
 	int ret;
@@ -257,9 +256,9 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	cm_id->context = p;
 	p->jiffies = jiffies;
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 	list_add(&p->list, &priv->cm.passive_ids);
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock_irq(&priv->lock);
 	queue_delayed_work(ipoib_workqueue, &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	return 0;
@@ -277,7 +276,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
-	unsigned long flags;
 	int ret;
 	switch (event->event) {
@@ -290,14 +288,14 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 	case IB_CM_REJ_RECEIVED:
 		p = cm_id->context;
 		priv = netdev_priv(p->dev);
-		spin_lock_irqsave(&priv->lock, flags);
+		spin_lock_irq(&priv->lock);
 		if (list_empty(&p->list))
 			ret = 0; /* Connection is going away already. */
 		else {
 			list_del_init(&p->list);
 			ret = -ECONNRESET;
 		}
-		spin_unlock_irqrestore(&priv->lock, flags);
+		spin_unlock_irq(&priv->lock);
 		if (ret) {
 			ib_destroy_qp(p->qp);
 			kfree(p);
@@ -612,23 +610,22 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx *p;
-	unsigned long flags;
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return;
 	ib_destroy_cm_id(priv->cm.id);
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
 		list_del_init(&p->list);
-		spin_unlock_irqrestore(&priv->lock, flags);
+		spin_unlock_irq(&priv->lock);
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irqsave(&priv->lock, flags);
+		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock_irq(&priv->lock);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
@@ -642,7 +639,6 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
 	struct sk_buff *skb;
-	unsigned long flags;
 	p->mtu = be32_to_cpu(data->mtu);
@@ -680,12 +676,12 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	skb_queue_head_init(&skqueue);
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 	set_bit(IPOIB_FLAG_OPER_UP, &p->flags);
 	if (p->neigh)
 		while ((skb = __skb_dequeue(&p->neigh->queue)))
 			__skb_queue_tail(&skqueue, skb);
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock_irq(&priv->lock);
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = p->dev;
@@ -895,7 +891,6 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
 	struct net_device *dev = priv->dev;
 	struct ipoib_neigh *neigh;
-	unsigned long flags;
 	int ret;
 	switch (event->event) {
@@ -914,7 +909,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 	case IB_CM_REJ_RECEIVED:
 	case IB_CM_TIMEWAIT_EXIT:
 		ipoib_dbg(priv, "CM error %d.\n", event->event);
-		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock_irq(&priv->tx_lock);
 		spin_lock(&priv->lock);
 		neigh = tx->neigh;
@@ -934,7 +929,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 		}
 		spin_unlock(&priv->lock);
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		spin_unlock_irq(&priv->tx_lock);
 		break;
 	default:
 		break;
@@ -1023,21 +1018,20 @@ static void ipoib_cm_tx_reap(struct work_struct *work)
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.reap_task);
 	struct ipoib_cm_tx *p;
-	unsigned long flags;
-	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock_irq(&priv->tx_lock);
 	spin_lock(&priv->lock);
 	while (!list_empty(&priv->cm.reap_list)) {
 		p = list_entry(priv->cm.reap_list.next, typeof(*p), list);
 		list_del(&p->list);
 		spin_unlock(&priv->lock);
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		spin_unlock_irq(&priv->tx_lock);
 		ipoib_cm_tx_destroy(p);
-		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock_irq(&priv->tx_lock);
 		spin_lock(&priv->lock);
 	}
 	spin_unlock(&priv->lock);
-	spin_unlock_irqrestore(&priv->tx_lock, flags);
+	spin_unlock_irq(&priv->tx_lock);
 }
 static void ipoib_cm_skb_reap(struct work_struct *work)
@@ -1046,15 +1040,14 @@ static void ipoib_cm_skb_reap(struct work_struct *work)
 	cm.skb_task);
 	struct net_device *dev = priv->dev;
 	struct sk_buff *skb;
-	unsigned long flags;
 	unsigned mtu = priv->mcast_mtu;
-	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock_irq(&priv->tx_lock);
 	spin_lock(&priv->lock);
 	while ((skb = skb_dequeue(&priv->cm.skb_queue))) {
 		spin_unlock(&priv->lock);
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		spin_unlock_irq(&priv->tx_lock);
 		if (skb->protocol == htons(ETH_P_IP))
 			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -1062,11 +1055,11 @@ static void ipoib_cm_skb_reap(struct work_struct *work)
 			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, dev);
 #endif
 		dev_kfree_skb_any(skb);
-		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock_irq(&priv->tx_lock);
 		spin_lock(&priv->lock);
 	}
 	spin_unlock(&priv->lock);
-	spin_unlock_irqrestore(&priv->tx_lock, flags);
+	spin_unlock_irq(&priv->tx_lock);
 }
 void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb,
@@ -1088,9 +1081,8 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work);
 	struct ipoib_cm_rx *p;
-	unsigned long flags;
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		/* List if sorted by LRU, start from tail,
 		 * stop when we see a recently used entry */
@@ -1098,13 +1090,13 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
 		list_del_init(&p->list);
-		spin_unlock_irqrestore(&priv->lock, flags);
+		spin_unlock_irq(&priv->lock);
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irqsave(&priv->lock, flags);
+		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock_irq(&priv->lock);
 }

From fenkes at de.ibm.com Mon Apr 23 09:20:27 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 23 Apr 2007 18:20:27 +0200
Subject: [ofa-general] [PATCH] ib_core: Add missing device link to class device
Message-ID: <200704231820.27752.fenkes@de.ibm.com>

Add the missing device link from /sys/class/infiniband/* to the actual device.

Signed-off-by: Joachim Fenkes
---

 sysfs.c | 1 +
 1 file changed, 1 insertion(+)

--- linux-2.6.20/drivers/infiniband/core/sysfs.c.old 2007-04-23 15:37:37.000000000 +0200
+++ linux-2.6.20/drivers/infiniband/core/sysfs.c 2007-04-23 15:38:22.000000000 +0200
@@ -683,6 +683,7 @@ int ib_device_register_sysfs(struct ib_d
 	class_dev->class = &ib_class;
 	class_dev->class_data = device;
+	class_dev->dev = device->dma_device;
 	strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE);
 	INIT_LIST_HEAD(&device->port_list);
-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com  --  Phone: +49 7031 16 1239

From fenkes at de.ibm.com Mon Apr 23 09:23:48 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 23 Apr 2007 18:23:48 +0200
Subject: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb
Message-ID: <200704231823.48723.fenkes@de.ibm.com>

Add "Modify Port" verb support to eHCA driver.
ib_cm needs this to initialize properly.

Signed-off-by: Joachim Fenkes
---

 ehca_hca.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
 hcp_if.c   | 24 ++++++++++++++++++++++++
 hcp_if.h   | 4 ++++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff -urp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
--- a/drivers/infiniband/hw/ehca/ehca_hca.c 2007-02-04 19:44:54.000000000 +0100
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2007-04-23 18:09:38.000000000 +0200
@@ -147,6 +147,7 @@ int ehca_query_port(struct ib_device *ib
 		break;
 	}
+	props->port_cap_flags = rblock->capability_mask;
 	props->gid_tbl_len = rblock->gid_tbl_len;
 	props->max_msg_sz = rblock->max_msg_sz;
 	props->bad_pkey_cntr = rblock->bad_pkey_cntr;
@@ -233,10 +234,53 @@ query_gid1:
 	return ret;
 }
+const u32 allowed_port_caps = (
+	IB_PORT_SM | IB_PORT_LED_INFO_SUP | IB_PORT_CM_SUP |
+	IB_PORT_SNMP_TUNNEL_SUP | IB_PORT_DEVICE_MGMT_SUP |
+	IB_PORT_VENDOR_CLASS_SUP);
+
 int ehca_modify_port(struct ib_device *ibdev,
 		     u8 port, int port_modify_mask,
 		     struct ib_port_modify *props)
 {
-	/* Not implemented yet */
-	return -EFAULT;
+	int ret = 0;
+	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device);
+	struct hipz_query_port *rblock;
+	u32 cap;
+	u64 hret;
+
+	if ((props->set_port_cap_mask | props->clr_port_cap_mask)
+	    & ~allowed_port_caps) {
+		ehca_err(&shca->ib_device, "Non-changeable bits set in masks "
+			 "set=%x clr=%x allowed=%x", props->set_port_cap_mask,
+			 props->clr_port_cap_mask, allowed_port_caps);
+		return -EINVAL;
+	}
+
+	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
+	if (!rblock) {
+		ehca_err(&shca->ib_device, "Can't allocate rblock memory.");
+		return -ENOMEM;
+	}
+
+	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+		ehca_err(&shca->ib_device, "Can't query port properties");
+		ret = -EINVAL;
+		goto modify_port1;
+	}
+
+	cap = (rblock->capability_mask | props->set_port_cap_mask)
+		& ~props->clr_port_cap_mask;
+
+	hret = hipz_h_modify_port(shca->ipz_hca_handle, port,
+				  cap, props->init_type, port_modify_mask);
+	if (hret != H_SUCCESS) {
+		ehca_err(&shca->ib_device, "Modify port failed hret=%lx", hret);
+		ret = -EINVAL;
+	}
+
+modify_port1:
+	ehca_free_fw_ctrlblock(rblock);
+
+	return ret;
 }
diff -urp a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
--- a/drivers/infiniband/hw/ehca/hcp_if.c 2007-02-04 19:44:54.000000000 +0100
+++ b/drivers/infiniband/hw/ehca/hcp_if.c 2007-04-23 18:06:09.000000000 +0200
@@ -70,6 +70,10 @@
 #define H_ALL_RES_QP_SQUEUE_SIZE_PAGES EHCA_BMASK_IBM(0, 31)
 #define H_ALL_RES_QP_RQUEUE_SIZE_PAGES EHCA_BMASK_IBM(32, 63)
+#define H_MP_INIT_TYPE EHCA_BMASK_IBM(44, 47)
+#define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48)
+#define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49)
+
 /* direct access qp controls */
 #define DAQP_CTRL_ENABLE 0x01
 #define DAQP_CTRL_SEND_COMP 0x20
@@ -364,6 +368,26 @@ u64 hipz_h_query_port(const struct ipz_a
 	return ret;
 }
+u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle,
+		       const u8 port_id, const u32 port_cap,
+		       const u8 init_type, const int modify_mask)
+{
+	u64 port_attributes = port_cap;
+
+	if (modify_mask & IB_PORT_SHUTDOWN)
+		port_attributes |= EHCA_BMASK_SET(H_MP_SHUTDOWN, 1);
+	if (modify_mask & IB_PORT_INIT_TYPE)
+		port_attributes |= EHCA_BMASK_SET(H_MP_INIT_TYPE, init_type);
+	if (modify_mask & IB_PORT_RESET_QKEY_CNTR)
+		port_attributes |= EHCA_BMASK_SET(H_MP_RESET_QKEY_CTR, 1);
+
+	return ehca_plpar_hcall_norets(H_MODIFY_PORT,
+				       adapter_handle.handle, /* r4 */
+				       port_id, /* r5 */
+				       port_attributes, /* r6 */
+				       0, 0, 0, 0);
+}
+
 u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle,
 		     struct hipz_query_hca *query_hca_rblock)
 {
diff -urp a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h
--- a/drivers/infiniband/hw/ehca/hcp_if.h 2007-02-04 19:44:54.000000000 +0100
+++ b/drivers/infiniband/hw/ehca/hcp_if.h 2007-04-23 18:06:09.000000000 +0200
@@ -85,6 +85,10 @@ u64 hipz_h_query_port(const struct ipz_a
 		      const u8 port_id,
 		      struct hipz_query_port *query_port_response_block);
+u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle,
+		       const u8 port_id, const u32 port_cap,
+		       const u8 init_type, const int modify_mask);
+
 u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle,
 		     struct hipz_query_hca *query_hca_rblock);

From mst at dev.mellanox.co.il Mon Apr 23 09:35:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Apr 2007 19:35:37 +0300
Subject: [ofa-general] Re: [PATCH] IPoIB/cm: Convert spin_lock_irqsave() to spin_lock_irq()
In-Reply-To:
References:
Message-ID: <20070423163537.GP4579@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: [PATCH] IPoIB/cm: Convert spin_lock_irqsave() to spin_lock_irq()
>
> There are quite a few places in ipoib_cm.c where we know IRQs are
> enabled because we do something that sleeps in the same function, so
> we can convert several occurrences of spin_lock_irqsave() to a plain
> spin_lock_irq(). This cleans up the source a little and makes the
> code smaller too:
>
> add/remove: 0/0 grow/shrink: 1/5 up/down: 3/-51 (-48)
> function                                     old     new   delta
> ipoib_cm_tx_reap                             403     406      +3
> ipoib_cm_stale_task                          146     145      -1
> ipoib_cm_dev_stop                            173     172      -1
> ipoib_cm_tx_handler                          964     956      -8
> ipoib_cm_rx_handler                          956     937     -19
> ipoib_cm_skb_reap                            212     190     -22
>
> Signed-off-by: Roland Dreier

Makes sense to me.

Acked-by: Michael S. Tsirkin

-- 
MST

From mshefty at ichips.intel.com Mon Apr 23 09:38:10 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Apr 2007 09:38:10 -0700
Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache
In-Reply-To: <309a667c0704222356p72dbd887o605268889bfb6358@mail.gmail.com>
References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com>
 <309a667c0704192344gec04bd0uf9eec6c6413ea34@mail.gmail.com>
 <4628EB72.3080904@ichips.intel.com>
 <309a667c0704222356p72dbd887o605268889bfb6358@mail.gmail.com>
Message-ID: <462CE0F2.2070202@ichips.intel.com>

> If some client calls cma_resolve_ib_route(), and let's assume that its
> local cache miss, and cma_query_ib_route() is called, this will send a
> SA query to the SM node CMIIW, Now on SM node I am not able to figure
> out that who will respond this GMP, and how requested attribute info
> is collected?

The SA responds to the query. If opensm is running on the node, the query is
sent up through the MAD layer to user_mad, where opensm reads the MAD and
generates the response. Opensm knows the information from configuring the subnet.

- Sean

From halr at voltaire.com Mon Apr 23 09:55:59 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 Apr 2007 12:55:59 -0400
Subject: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb
In-Reply-To: <200704231823.48723.fenkes@de.ibm.com>
References: <200704231823.48723.fenkes@de.ibm.com>
Message-ID: <1177347358.28021.24294.camel@hal.voltaire.com>

Hi Joachim,

On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote:
> Add "Modify Port" verb support to eHCA driver.
> ib_cm needs this to initialize properly.
> > > Signed-off-by: Joachim Fenkes > --- > > ehca_hca.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- > hcp_if.c | 24 ++++++++++++++++++++++++ > hcp_if.h | 4 ++++ > 3 files changed, 74 insertions(+), 2 deletions(-) > > diff -urp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c > --- a/drivers/infiniband/hw/ehca/ehca_hca.c 2007-02-04 19:44:54.000000000 +0100 > +++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2007-04-23 18:09:38.000000000 +0200 > @@ -147,6 +147,7 @@ int ehca_query_port(struct ib_device *ib > break; > } > > + props->port_cap_flags = rblock->capability_mask; > props->gid_tbl_len = rblock->gid_tbl_len; > props->max_msg_sz = rblock->max_msg_sz; > props->bad_pkey_cntr = rblock->bad_pkey_cntr; > @@ -233,10 +234,53 @@ query_gid1: > return ret; > } > > +const u32 allowed_port_caps = ( > + IB_PORT_SM | IB_PORT_LED_INFO_SUP | IB_PORT_CM_SUP | > + IB_PORT_SNMP_TUNNEL_SUP | IB_PORT_DEVICE_MGMT_SUP | > + IB_PORT_VENDOR_CLASS_SUP); I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does this just fail later when it is attempted to be actually set ? 
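For readers following the mask handling in the quoted hunk, here is a minimal user-space sketch of the same check-then-merge logic. The DEMO_* names and bit positions are hypothetical and are not the kernel's IB_PORT_* values; only the two expressions are taken from the patch. In the patch as posted, IB_PORT_SM is part of the allowed mask, so a request to set it passes this check; whether the firmware call rejects it afterwards is not something this sketch can answer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions only; NOT the kernel's IB_PORT_* values. */
#define DEMO_CAP_SM        (UINT32_C(1) << 1)
#define DEMO_CAP_TRAP_SUP  (UINT32_C(1) << 3)   /* treated as read-only here */
#define DEMO_CAP_CM_SUP    (UINT32_C(1) << 16)

/* Hypothetical stand-in for the patch's allowed_port_caps. */
static const uint32_t demo_allowed_caps = DEMO_CAP_SM | DEMO_CAP_CM_SUP;

/*
 * Mirrors the two steps in the quoted ehca_modify_port():
 *   1. reject any set/clear bit outside the allowed mask,
 *   2. merge as (current | set) & ~clr, so a clear wins over a set.
 * Returns 0 and stores the merged mask in *out, or -EINVAL.
 */
static int demo_merge_port_caps(uint32_t current, uint32_t set_mask,
                                uint32_t clr_mask, uint32_t *out)
{
    assert(out != NULL);

    if ((set_mask | clr_mask) & ~demo_allowed_caps)
        return -EINVAL;

    *out = (current | set_mask) & ~clr_mask;
    return 0;
}
```

Calling demo_merge_port_caps(0, DEMO_CAP_SM, 0, &out) succeeds in this sketch, while asking to set DEMO_CAP_TRAP_SUP returns -EINVAL before anything would be sent to firmware.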
-- Hal > + > int ehca_modify_port(struct ib_device *ibdev, > u8 port, int port_modify_mask, > struct ib_port_modify *props) > { > - /* Not implemented yet */ > - return -EFAULT; > + int ret = 0; > + struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); > + struct hipz_query_port *rblock; > + u32 cap; > + u64 hret; > + > + if ((props->set_port_cap_mask | props->clr_port_cap_mask) > + & ~allowed_port_caps) { > + ehca_err(&shca->ib_device, "Non-changeable bits set in masks " > + "set=%x clr=%x allowed=%x", props->set_port_cap_mask, > + props->clr_port_cap_mask, allowed_port_caps); > + return -EINVAL; > + } > + > + rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL); > + if (!rblock) { > + ehca_err(&shca->ib_device, "Can't allocate rblock memory."); > + return -ENOMEM; > + } > + > + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { > + ehca_err(&shca->ib_device, "Can't query port properties"); > + ret = -EINVAL; > + goto modify_port1; > + } > + > + cap = (rblock->capability_mask | props->set_port_cap_mask) > + & ~props->clr_port_cap_mask; > + > + hret = hipz_h_modify_port(shca->ipz_hca_handle, port, > + cap, props->init_type, port_modify_mask); > + if (hret != H_SUCCESS) { > + ehca_err(&shca->ib_device, "Modify port failed hret=%lx", hret); > + ret = -EINVAL; > + } > + > +modify_port1: > + ehca_free_fw_ctrlblock(rblock); > + > + return ret; > } > diff -urp a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c > --- a/drivers/infiniband/hw/ehca/hcp_if.c 2007-02-04 19:44:54.000000000 +0100 > +++ b/drivers/infiniband/hw/ehca/hcp_if.c 2007-04-23 18:06:09.000000000 +0200 > @@ -70,6 +70,10 @@ > #define H_ALL_RES_QP_SQUEUE_SIZE_PAGES EHCA_BMASK_IBM(0, 31) > #define H_ALL_RES_QP_RQUEUE_SIZE_PAGES EHCA_BMASK_IBM(32, 63) > > +#define H_MP_INIT_TYPE EHCA_BMASK_IBM(44, 47) > +#define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48) > +#define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49) > + > /* direct access qp controls */ > #define 
DAQP_CTRL_ENABLE 0x01 > #define DAQP_CTRL_SEND_COMP 0x20 > @@ -364,6 +368,26 @@ u64 hipz_h_query_port(const struct ipz_a > return ret; > } > > +u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle, > + const u8 port_id, const u32 port_cap, > + const u8 init_type, const int modify_mask) > +{ > + u64 port_attributes = port_cap; > + > + if (modify_mask & IB_PORT_SHUTDOWN) > + port_attributes |= EHCA_BMASK_SET(H_MP_SHUTDOWN, 1); > + if (modify_mask & IB_PORT_INIT_TYPE) > + port_attributes |= EHCA_BMASK_SET(H_MP_INIT_TYPE, init_type); > + if (modify_mask & IB_PORT_RESET_QKEY_CNTR) > + port_attributes |= EHCA_BMASK_SET(H_MP_RESET_QKEY_CTR, 1); > + > + return ehca_plpar_hcall_norets(H_MODIFY_PORT, > + adapter_handle.handle, /* r4 */ > + port_id, /* r5 */ > + port_attributes, /* r6 */ > + 0, 0, 0, 0); > +} > + > u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, > struct hipz_query_hca *query_hca_rblock) > { > diff -urp a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h > --- a/drivers/infiniband/hw/ehca/hcp_if.h 2007-02-04 19:44:54.000000000 +0100 > +++ b/drivers/infiniband/hw/ehca/hcp_if.h 2007-04-23 18:06:09.000000000 +0200 > @@ -85,6 +85,10 @@ u64 hipz_h_query_port(const struct ipz_a > const u8 port_id, > struct hipz_query_port *query_port_response_block); > > +u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle, > + const u8 port_id, const u32 port_cap, > + const u8 init_type, const int modify_mask); > + > u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, > struct hipz_query_hca *query_hca_rblock); > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Arkady.Kanevsky at netapp.com Mon Apr 23 10:59:07 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: 
Mon, 23 Apr 2007 13:59:07 -0400 Subject: [ofa-general] Re: on the coexistance of uDAPLs In-Reply-To: <4623D067.1030005@ichips.intel.com> References: <20070411170431.GA25341@sgi.com><39C75744D164D948A170E9792AF8E7CA0CFE44@exil.voltaire.com><20070412195653.GA20252@sgi.com><20070413200415.GA15243@sgi.com><4623C0C0.9000505@ichips.intel.com><20070416184227.GA18016@sgi.com> <4623D067.1030005@ichips.intel.com> Message-ID: There are some signature differences between versions. Since redirection exposes signatures it is not trivial for a single redirection to support different signatures. Is this really needed? Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Monday, April 16, 2007 3:37 PM > To: Karl Feind > Cc: Brian Forbes; Edward Mascarenhas; Jeff Hanson; > general at lists.openfabrics.org > Subject: [ofa-general] Re: on the coexistance of uDAPLs > > Karl Feind wrote: > > >>comments? other suggestions? > >> > >>-arlin > >> > >> > > > >I'd really like to see a separate RPM (called something like > >dapl-infra) that installs: > > > > 1) /etc/dat.conf (empty) > > 2) a script that addes a provider to /etc/data.conf > > 3) a script that removes a provider from /etc/data.conf > > 4) libdat.so > > > >Any DAPL layer depends on this RPM, and invokes the scripts > (2) and (3) > >in the preinstall and postuninstall setep. > > > >This decouples the DAPL infrastructure from the DAPL instantiations. > > > >Just an idea. > > > > > > Do you see the need for different versions to co-exist (1.1, > 1.2, 2.0)? 
From sean.hefty at intel.com Mon Apr 23 11:14:49 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 23 Apr 2007 11:14:49 -0700
Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache
In-Reply-To: <20070422103704.GE26791@mellanox.co.il>
Message-ID: <000001c785d3$47ff8430$e598070a@amr.corp.intel.com>

>1. What happens on e.g. a heterogeneous network? It seems that
>path to a specific GID might change e.g. MTU without GID
>going in/out of service. How would this be handled?

If the path parameters change without a GID going in/out of service, then the
cached path records would be off. A forced cache refresh would be needed. This
could be done with an application (or manually) writing to the
'ib_local_sa/refresh' file.

I should note that I removed the time-based updates that are in OFED. This
seemed to be the most objectionable part of that implementation. (It could be
implemented in userspace if needed.) I also looked at the path record caching
behavior in ipoib as a starting point. From what I could tell, ipoib caches a
path record per DGID, and requires bringing the device down/up to refresh the
cache. (Someone tell me if my understanding is off here.)

>2. What will happen on a number of changes in the network?
>Would not the SA need to send a huge number of notices now?
>Should we be concerned?

This could be an issue, and I have a few thoughts on this. First, I can make
event registration tunable (yes/no) to avoid sending notices. (The labs, which
are requesting the caching, are only considering static configurations anyway.)
Second, if an administrator is going to make a large number of changes to the
network, he would be better off disabling the cache first, making the changes,
then re-enabling the cache. Finally, I don't think it's unreasonable to expect
an SA that claimed to support a subnet of size N to support event registration
to each node.

>3. Comments indicate that the main win from the patch is
>with all-to-all startup times on large MPI clusters. If that is so,
>and assuming a small number of MPI jobs is running on each node,
>isn't it true that the main win is not from *caching* as such
>(since all paths are requested at the beginning and never
>used after this), but rather from limiting the number of outstanding MADs to SA
>and from reusing multiple path queries in a single request.
>Could that be the case?

A definite benefit does come from using a GetTable query, versus a Get query.
However, the rdma_cm/socket-like interface doesn't readily lend itself to using
a GetTable query, so one could argue that the cache is what enables the use of
a GetTable query.

However, without the cache, you end up with duplicated SA queries between
processes. Currently, each process issues one query for each pair. Even if a
way were found to have each process use a GetTable query rather than a Get
query, we'd still have duplicated queries to the SA. With the latest systems,
we're looking at 8 cores per node, which would likely result in the SA
processing 8 identical queries per node. (Assuming 1 process per core,
all-to-all connection model.)

>4. Why do we need yet another API and yet another module to speed up just
>RDMA/CM path record queries? We now get 2 ways to do this (with/without the
>cache). Shouldn't there be just one?

I did consider this, but the cache operates synchronously, and ib_sa interface
is asynchronous. I tried to make the API make sense for the cache. The rdma_cm
doesn't really take advantage of the synchronous interface, but I believe that
ipoib could.
Converting the ib_local_sa to an asynchronous interface requires adding
registration calls, and an ability to cancel operations.

One potential benefit with a single interface is adding the ability to populate
the cache on an as-needed basis, similar to how ipoib works. Going this route
requires determining how long to maintain path records in the cache, and how to
configure the cache for this use. I didn't explore this option in a lot of
detail because it didn't match up with the lab's use.

>5. How will the user guess the correct value for paths_per_dest tunable,
>besides disabling the cache? I notice it is currently set to a value
>of 0x7F. Where does this value come from?

This sets the NumbPath field in the path record query. 0x7f is the maximum
value. Other useful values would be 0 (disable) and 1 (one path to each DGID).

>Since OFED includes a significantly different version of this code
>(without notices), and this is the first time the notices code
>makes an appearance, I think that targeting .23, and considering
>alternative options such as the above, would be more prudent.

I don't have any objections to waiting if suggestions cannot be incorporated by
the time 2.6.22 closes, or if we can't reach consensus. But if all changes are
in by 2.6.22, there's not much to be gained by letting it sit out of tree an
extra release. I can disable the cache by default, or mark it as experimental.

- Sean

From mst at dev.mellanox.co.il Mon Apr 23 12:04:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Apr 2007 22:04:49 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache
In-Reply-To: <000001c785d3$47ff8430$e598070a@amr.corp.intel.com>
References: <20070422103704.GE26791@mellanox.co.il> <000001c785d3$47ff8430$e598070a@amr.corp.intel.com>
Message-ID: <20070423190449.GR4579@mellanox.co.il>

> >1. What happens on e.g. a heterogeneous network? It seems that
> >path to a specific GID might change e.g.
MTU without GID
> >going in/out of service. How would this be handled?
> If the path parameters change without a GID going in/out of service, then the
> cached path records would be off. A forced cache refresh would be needed. This
> could be done with an application (or manually) writing to the
> 'ib_local_sa/refresh' file.

A straightforward approach would be to listen for port up/down events
rather than or in addition to GID in/out, and do network discovery by DR SMPs.

> I also looked at the path record caching
> behavior in ipoib as a starting point.

Hmm. IPoIB by design does not handle heterogeneous
networks too well (consider problems we have selecting bcast group rate).
MTU is also basically forced to 2K.
So this might not have been such a great example :)

> From what I could tell, ipoib caches a
> path record per DGID, and requires bringing the device down/up to refresh the
> cache. (Someone tell me if my understanding is off here.)

That's not entirely correct. For example, arp cache might get cleaned
by a timer, or by direct user request, or by other means.
When this happens, address handles and path records get freed.
Various IB events also trigger cache flush, such as the reregister event.

--
MST

From rdreier at cisco.com Mon Apr 23 12:17:48 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Apr 2007 12:17:48 -0700
Subject: [ofa-general] Re: [PATCH] ib_core: Add missing device link to class device
In-Reply-To: <200704231820.27752.fenkes@de.ibm.com> (Joachim Fenkes's message of "Mon, 23 Apr 2007 18:20:27 +0200")
References: <200704231820.27752.fenkes@de.ibm.com>
Message-ID:

Hmm, I have links like this on my system already:

    $ ls -l /sys/class/infiniband/mlx4_0/device
    lrwxrwxrwx 1 root root 0 2007-04-23 12:14 /sys/class/infiniband/mlx4_0/device -> ../../../devices/pci0000:00/0000:00:06.0/0000:0d:00.0

The patch actually looks sane but I don't understand why it's needed.
Could you explain?
From rdreier at cisco.com Mon Apr 23 12:20:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Apr 2007 12:20:14 -0700 Subject: [ofa-general] Re: [PATCH] eHCA: Add "Modify Port" verb In-Reply-To: <200704231823.48723.fenkes@de.ibm.com> (Joachim Fenkes's message of "Mon, 23 Apr 2007 18:23:48 +0200") References: <200704231823.48723.fenkes@de.ibm.com> Message-ID: > + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { > + ehca_err(&shca->ib_device, "Can't query port properties"); > + ret = -EINVAL; > + goto modify_port1; > + } > + > + cap = (rblock->capability_mask | props->set_port_cap_mask) > + & ~props->clr_port_cap_mask; > + > + hret = hipz_h_modify_port(shca->ipz_hca_handle, port, > + cap, props->init_type, port_modify_mask); Is this thread-safe? What if two different bits are set at the same time from two different threads? It seems that both calls could get the same result from hipz_h_query_port(), and then the second call to hipz_h_modify_port() would overwrite the first call. You could look at the implementation in mthca to see the locking I used there. - R. From sean.hefty at intel.com Mon Apr 23 12:38:56 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 12:38:56 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423190449.GR4579@mellanox.co.il> Message-ID: <000101c785df$0831ef80$e598070a@amr.corp.intel.com> >A straight-forward approach would be to listen for port up/down events >rather than or in addition to GID in/out, and do network discovery by DR SMPs. I'm not entirely following you. How would you listen for port up/down events? And are you suggesting that all nodes do network discovery using DR SMPs? > Hmm. IPoIB by design does not handle heterogenious > networks too well (consider problems we have selecting bcast group rate). > MTU is also basically forced to 2K. 
> So this might not have been such a great example :) I agree; I just didn't want to create something that was worse than what we have now. >That's not entirely correct. For example, arp cache might get cleaned >by a timer, or by direct user request, or by other means. >When this happens, address handles and path records get freed. >Various IB events also trigger cache flush, such as the reregister event. I'm not overly familiar with the code. Is that what ipoib_neigh_destructor ends up doing? (I need to walk through this to see where the path is freed.) - Sean From pradeep at us.ibm.com Mon Apr 23 12:58:24 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 23 Apr 2007 12:58:24 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: <20070419115119.GB918@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/19/2007 04:51:19 AM: > > Quoting Pradeep Satyanarayana : > > Subject: IPOIB CM (NOSRQ)[PATCH V2] patch for review > > > > Here is a second version of the IPOIB_CM_NOSRQ patch for review. This > > patch will benefit adapters that do not support shared receive queues. > > > > This patch incorporates the previous review comments: > > -#ifdefs removed and a single binary drives HCAs that do and do not > > support SRQs > > -avoids linear traversal through a list of QPs > > -extraneous code removed > > -compile time selection removed > > -No HTML version as part of this patch > > The patch is still severely line-wrapped, to the point of unreadability. > Look at it here: > http://article.gmane.org/gmane.linux.drivers.openib/38681 I expect to be able to address this the next time. > > > This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and > > IBM HCAs on ppc64 machines. I have run > > netperf between two IBM HCAs and two Topspin HCAs, as well as between IBM > > and Topspin HCA. > > > > Note 1: There was interesting discovery that I made when I ran netperf > > between Topsin and IBM HCA. 
I started to see
> > the IB_WC_RETRY_EXC_ERR error upon send completion. This may have been due
> > to the differences in the
> > processing speeds of the two HCAs. This was rectified by setting the
> > retry_count to a non-zero value in ipoib_cm_send_req().
> > I had to do this in spite of the comment --> /* RFC draft warns against
> > retries */
>
> This would only help if there are short bursts of high-speed activity
> on the receiving HCA: if the speed is different in the long run,
> the right thing to do is to drop some packets and have TCP adjust
> its window accordingly.
>
> But in that former case (short bursts), just increasing the number
> of pre-posted
> buffers on RQ should be enough, and looks like a much cleaner solution.

This was not an issue with running out of buffers (which was my original
suspicion too). This was probably due to missing ACKs; I am guessing this
happens because the two HCAs have very different processing speeds. This is
exacerbated by the fact that retry count (not RNR retry count) was 0. When I
changed the retry count to a small value like 3 it still works. Please see
below for additional details.

>
> Long-term, I think we should use the watermark event to dynamically
> adjust the number of RQ buffers with the incoming traffic.
> I'll try to work on such a patch probably for 2.6.23 timeframe.
>
> > Can someone point me to where this comment is in the RFC? I would like to
> > understand the reasoning.
>
> See "7.1 A Cautionary Note on IPoIB-RC".
> See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html

If we do this right, the above mentioned problems should not occur. In the
case we are dealing with, the RC timers are expected to be much smaller (than
TCP timers) and should not interfere with TCP timers. The IBM HCA uses a
default value of 0 for the Local CA Ack Delay, which is probably too small a
value, and with a retry count of 0, ACKs are missed.
I agree with Roland's assessment (this was in a separate thread) that this
should not be 0.

On the other hand, with the Topspin adapter (and mthca) that I have, the Local
CA Ack Delay is 0xf, which would imply a Local Ack Timeout of 4.096us * 2^15,
which is about 128ms. The IB spec says it can be up to 4 times this value,
which means up to 512 ms.

The smallest TCP retransmission timer is HZ/5, which is 200 ms on several
architectures. Yes, even with a retry count of 1 or 2, there is then a risk of
interfering with TCP timers.

If my understanding is correct, the way it should be done is to have a small
value for the Local CA Ack Delay, like say 3 or 4, which would imply a Timeout
value of 32-64us, with a small retry count of 2 or 3. This way the max Timeout
would still be only several hundreds of us, a factor of 1000 less than the
minimum TCP timeout. IB adapters are supposed to have a much smaller latency
than ethernet adapters, so I am guessing that this would be in the ballpark for
most HCAs.

Unfortunately I do not know how much of an effort it will take to change the
Local CA Ack Delay across the various HCAs (if need be). In the interim, the
only parameter we can control is the retry count and we could make this a
module parameter.

>
> By the way, as long as you are not using SRQ, why not use UC mode QPs?
> This would look like a cleaner solution.
>
> You can also try making the RNR condition cheaper to handle, by moving
> the QP to RST and back to RTR and then to RTS instead of re-initiating
> a new connection.
>
> Unfortunately, I haven't the time to review the patch thoroughly in the coming
> couple of weeks.
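The timeout arithmetic in the message above can be checked with a small sketch. It follows the thread's simplification that the timeout exponent equals the Local CA Ack Delay value, and it assumes (this is a modelling choice, not something stated in the thread) that each of the retry_count + 1 transmissions may wait up to four times the nominal timeout. The function names and TCP_MIN_RTO_NS are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* HZ/5 = 200 ms, the minimum TCP retransmission timer quoted in the thread. */
#define TCP_MIN_RTO_NS UINT64_C(200000000)

/*
 * IB timeout fields carry a 5-bit exponent n; the interval is
 * 4.096 us * 2^n. Nanoseconds keep the arithmetic exact.
 */
static uint64_t ib_timeout_ns(unsigned int n)
{
    assert(n < 32);
    return UINT64_C(4096) << n;
}

/*
 * Assumed worst case before the sender reports a retry-exceeded error:
 * (retry_count + 1) transmissions, each waiting up to 4x the nominal
 * timeout ("up to 4 times this value" in the message above).
 */
static uint64_t rc_worst_case_ns(unsigned int n, unsigned int retry_count)
{
    return 4 * ib_timeout_ns(n) * (retry_count + 1);
}
```

Under these assumptions an exponent of 15 gives roughly 134 ms nominal (the message rounds this to 128 ms) and roughly 537 ms worst case, which exceeds the 200 ms TCP minimum even with no retries. An exponent of 4 with a retry count of 3 comes to about 1 ms, well below it.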
A general comment however: > > > @@ -360,7 +489,16 @@ void ipoib_cm_handle_rx_wc(struct net_de > > return; > > } > > > > - skb = priv->cm.srq_ring[wr_id].skb; > > + if(priv->cm.srq) > > + skb = priv->cm.srq_ring[wr_id].skb; > > + else { > > + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) > > & NOSRQ_INDEX_MASK ; > > + spin_lock_irqsave(&priv->lock, flags); > > + rx_ptr = priv->cm.rx_index_ring[index]; > > + spin_unlock_irqrestore(&priv->lock, > > flags); > > + > > + skb = rx_ptr->rx_ring[wr_id].skb; > > + } /* NOSRQ */ > > > > if (unlikely(wc->status != IB_WC_SUCCESS)) { > > ipoib_dbg(priv, "cm recv error " > > In this, and other examples, you scatter "if priv->cm.srq" tests all over > the code. I think it would be much cleaner in most cases to separate > the non-SRQ code to separate functions. > > If there's common SRQ/non-SRQ code it can be factored out and reused in both > places. In cases such as the above this also has speed advantages: from both > cache footprint as well as branch prediction POV. > You can even have a separate event handler for SRQ/non-SRQ code, avoiding > mode tests on data path completely. I wanted to avoid a lot of duplicate code and also not have maintainability headaches in the future. I will factor out some of the common code and repost the patch. > > -- > MST Pradeep pradeep at us.ibm.com From mst at dev.mellanox.co.il Mon Apr 23 13:20:59 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 23 Apr 2007 23:20:59 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000101c785df$0831ef80$e598070a@amr.corp.intel.com> References: <20070423190449.GR4579@mellanox.co.il> <000101c785df$0831ef80$e598070a@amr.corp.intel.com> Message-ID: <20070423202051.GS4579@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache > > >A straight-forward approach would be to listen for port up/down events > >rather than or in addition to GID in/out, and do network discovery by DR SMPs. > > I'm not entirely following you. How would you listen for port up/down events? Isn't there a way to get notice for this? > And are you suggesting that all nodes do network discovery using DR SMPs? I haven't thought this through yet. Basically, I just note that caching the path until GID goes out of service isn't right - since path parameters such as MTU or rate might change without GID going out of service. So what to do? We could use DR SMPs to do network discovery and at least check that paths are valid - it's not too much code (ibnetdiscover is just 800 lines) and in a sense, that's actually putting an *SA* (not just cache) in each node. Combined with GID IN/OUT notices we could get away from querying path records completely. Alternatively, we could notice that port state changed, and I think we can figure out, from that, paths to which GIDs are affected, so that we can get these anew from the SA. > > Hmm. IPoIB by design does not handle heterogenious > > networks too well (consider problems we have selecting bcast group rate). > > MTU is also basically forced to 2K. > > So this might not have been such a great example :) > > I agree; I just didn't want to create something that was worse than what we have > now. > > >That's not entirely correct. For example, arp cache might get cleaned > >by a timer, or by direct user request, or by other means. 
> >When this happens, address handles and path records get freed. > >Various IB events also trigger cache flush, such as the reregister event. > > I'm not overly familiar with the code. Is that what ipoib_neigh_destructor ends > up doing? (I need to walk through this to see where the path is freed.) I re-checked and you are right - we keep the path record even after all neighbours using it are gone. So of course this has the same problem in that respect. At least the reregister/LID set events flush the cache. -- MST From mst at dev.mellanox.co.il Mon Apr 23 13:50:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Apr 2007 23:50:32 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: References: <20070419115119.GB918@mellanox.co.il> Message-ID: <20070423205032.GT4579@mellanox.co.il> > > > This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and > > > IBM HCAs on ppc64 machines. I have run > > > netperf between two IBM HCAs and two Topspin HCAs, as well as between IBM > > > and Topspin HCA. > > > > > > Note 1: There was interesting discovery that I made when I ran netperf > > > > between Topsin and IBM HCA. I started to see > > > the IB_WC_RETRY_EXC_ERR error upon send completion. This may have been due > > > to the differences in the > > > processing speeds of the two HCA. This was rectified by seting the > > > retry_count to a non-zero value in ipoib_cm_send_req(). > > > I had to do this inspite of the comment --> /* RFC draft warns against > > > retries */ > > > > This would only help if there are short bursts of high-speed activity > > on the receiving HCA: if the speed is different in the long run, > > the right thing to do is to drop some packets and have TCP adjust > > its window accordingly. > > > > But in that former case (short bursts), just increasing the number > > of pre-posted > > buffers on RQ should be enough, and looks like a much cleaner solution. 
> > This was not an issue with running out of buffers (which was my original > suspicion too). This was probably due to missing ACKs -I am guessing > this happens because the two HCAs have very different processing speeds. I don't see how different processing speeds could trigger missing ACKs. Do you? > This is exacerbated by the fact that retry count (not RNR retry count)was 0. > When I changed the retry count to a small values like 3 it still works. > Please see below for additional details. Looks like work-around for some breakage elsewhere. Maybe it's a good thing we don't retry in such cases - retries are not good for network performance, and this way we move the problem to it's root cause where it can be debugged and fixed instead of overloading the network. > > > Can someone point me to where this comment is in the RFC? I would like to > > > understand the reasoning. > > > > See "7.1 A Cautionary Note on IPoIB-RC". > > See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html > > > If we do this right, the above mentioned problems should not occur. In the case > we are dealing with the RC timers are expected to be much smaller (than TCP > timers) and > should not interfere with TCP timers. The IBM HCA uses a default value of 0 for > the Local CA Ack Delay; > which is probably too small a value and with a retry > count of 0, ACKs are missed. I agree with Roland's assessment (this was in a > seperate thread), that this should not be 0. So, it's an ehca bug then? I didn't really get the explanation. Who loses the ACKs? ehca? It is the case that ehca *reports* Local CA Ack Delay that is *below* what it actually provides? If so, it should be easy to fix in driver. > On the other hand with the Topspin adapter (and mthca) that I have the > Local CA Ack Delay is 0xf which would imply a Local Ack Timeout of 4.096us * 2^15 which > is about 128ms. The IB spec says it can be upto 4 times this value which means upto > 512 ms. 
> > The smallest TCP retransmission timer is HZ/5 which is 200 ms on several > architectures. > Yes, even with a retry count of 1 or 2, there is then a risk of > interfering with TCP timers. > > If my understanding is correct, the way its should be done is to have a small > value for the Local CA Ack Delay like say 3 or 4 which would imply a Timeout > value of 32-64us, with a small retry count of 2 or 3. This way the max Timeout > would be still be only several hundreds of us, a factor of 1000 less than the > minimum TCP timeout. IB adapters are supposed to have a much smaller latency > than ethernet adapters, so I am guessing that this would be in the ballpark for > most HCAs. > > Unfortunately I do not know how much of an effort it will take to change the > Local CA Delay Ack across the various HCAs (if need be). How about fixing ehca not to trigger ACK loss instead? > In the interim, the > only parameter we can control is the retry count and we could make this a module > parameter. Since both 0 and > 0 values might lead to problems, this does not look like a real solution. > > > > By the way, as long as you are not using SRQ, why not use UC mode QPs? > > This would look like a cleaner solution. You haven't addressed this, and this might be a better way out. Unreliable SRQ being only supported for RC QPs now is really one of the major reasons IPoIB CM uses RC rather than UC. -- MST From swise at opengridcomputing.com Mon Apr 23 14:15:03 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 23 Apr 2007 16:15:03 -0500 Subject: [ofa-general] build problem on kernel-ib-devel package Message-ID: <1177362903.23314.55.camel@stevo-desktop> Vlad, I'm trying to build the src tree that is installed when you install the kernel-ib-devel package and I'm hitting a problem. iw_cxgb3 fails to load because the ib_core module doesn't have the genalloc code included in it. 
I think the Makefile in drivers/infiniband/core didn't get patched by kernel_patches/backport/2.6.20/linux_genalloc_to_2.6.20.patch. Q: Should this tree be a fully configured tree that I can just do a 'make install' in? Because if it is, then something is broken... Steve. From jgunthorpe at obsidianresearch.com Mon Apr 23 14:25:10 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 23 Apr 2007 15:25:10 -0600 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423202051.GS4579@mellanox.co.il> References: <20070423190449.GR4579@mellanox.co.il> <000101c785df$0831ef80$e598070a@amr.corp.intel.com> <20070423202051.GS4579@mellanox.co.il> Message-ID: <20070423212509.GH20972@obsidianresearch.com> On Mon, Apr 23, 2007 at 11:20:59PM +0300, Michael S. Tsirkin wrote: > I haven't thought this through yet. Basically, I just note that > caching the path until GID goes out of service isn't right - since > path parameters such as MTU or rate might change without GID going > out of service. > > So what to do? Has anyone thought about using replication rather than caching to solve this problem? It seems to me it would be a lot faster for some single process in the network to fetch and keep a copy of the entire SA route database, format it into a binary format and use RC RDMA to transfer it to every node each time it changes. For, say, 10000 nodes you could compact an any-to-any path table into around 20 megabytes. The RDMA transfers would be arranged into a waterfall: the source transfers to 8 nodes, which then each transfer to 8, etc. Choosing a connection topology that overlays the switch topology would give this scheme a huge aggregate bandwidth so the total transfer time would be short.
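To get a feel for how deep such a waterfall would be, here is a minimal sketch. It assumes every node that receives the table forwards it to exactly `fanout` new peers once, and that the source counts as one of the nodes; `waterfall_rounds` is an illustrative name, not something from a posted patch.

```c
#include <stdint.h>

/*
 * Number of forwarding rounds needed before `nodes` hosts hold the table.
 * Assumption: only the hosts reached in the previous round forward, each
 * to `fanout` hosts that do not have the table yet.
 */
unsigned waterfall_rounds(uint64_t nodes, uint64_t fanout)
{
    uint64_t covered = 1;   /* the source already holds the table */
    uint64_t frontier = 1;  /* hosts that received it in the latest round */
    unsigned rounds = 0;

    if (fanout == 0)
        return 0;           /* nothing can be forwarded at all */

    while (covered < nodes) {
        frontier *= fanout; /* every frontier host feeds `fanout` new hosts */
        covered += frontier;
        rounds++;
    }
    return rounds;
}
```

Under these assumptions, 10000 nodes with a fan-out of 8 are covered after 5 rounds, so the transfer time is dominated by a handful of sequential table copies rather than by the node count.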
Unfortunately the SA protocol doesn't seem to have many provisions for cache-coherence so it seems any form of route caching is going to run into problems with stale data :< Replication adds a coherence mechanism and shifts the problem to the replication source, which, ideally, would ultimately be tightly connected to the SA. > We could use DR SMPs to do network discovery and at least check that > paths are valid - it's not too much code (ibnetdiscover is just 800 > lines) and in a sense, that's actually putting an *SA* (not just > cache) in each node. Combined with GID IN/OUT notices we could get > away from querying path records completely. I don't think you can find/check the SL like this, plus I doubt the little CPUs in the switches can handle that rate of SMPs. :< Jason From sean.hefty at intel.com Mon Apr 23 14:32:02 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 14:32:02 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423202051.GS4579@mellanox.co.il> Message-ID: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> >Isn't there a way to get notice for this? The closest trap I'm aware of is GID in/out of service. See 14.2.5.1 and 14.4.9. GID in/out of service is related to the existence of a path record between the SGID and DGID. If the path record parameters change, I'm not sure if the GID technically goes out, then back into service or not. Maybe Hal knows. >I haven't thought this through yet. Basically, I just note that >caching the path until GID goes out of service isn't right - >since path parameters such as MTU or rate might change >without GID going out of service. > >So what to do? > >We could use DR SMPs to do network discovery and at least check >that paths are valid - it's not too much code (ibnetdiscover is just 800 lines) >and in a sense, that's actually putting an *SA* (not just cache) in each node.
>Combined with GID IN/OUT notices we could get away from querying path records >completely. My guess (and it's only a guess at this point) is that the impact of each node sending stream of DR SMPs to continually discover the network will be worse than sending GetTable queries to the SA. One thought I had was to provide a feedback mechanism back into the cache to invalidate paths. For example, the ib_cm could invalidate a path if it times out trying to establish a connection or notices that path migration has occurred. I think implementing either of these is further out. >At least the reregister/LID set events flush the cache. The reregister/LID events will also flush the SA cache. - Sean From mst at dev.mellanox.co.il Mon Apr 23 15:01:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Apr 2007 01:01:32 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000001c785d3$47ff8430$e598070a@amr.corp.intel.com> References: <20070422103704.GE26791@mellanox.co.il> <000001c785d3$47ff8430$e598070a@amr.corp.intel.com> Message-ID: <20070423220132.GU4579@mellanox.co.il> > >4. Why do we need yet another API and yet another module to speed up just > >RDMA/CM path record queries? We now get 2 ways to do this (with/without the > >cache). Shouldn't there be just one? > > I did consider this, but the cache operates synchronously, and ib_sa interface > is asynchronous. I tried to make the API make sense for the cache. The rdma_cm > doesn't really take advantage of the synchronous interface, but I believe that > ipoib could. Converting the ib_local_sa to an asynchronous interface requires > adding registration calls, and an ability to cancel operations. Why can't it operate below the existing ib_sa interface? Does any client really care whether the record came from the SA or from the cache? 
-- MST From sean.hefty at intel.com Mon Apr 23 15:05:39 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 15:05:39 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423212509.GH20972@obsidianresearch.com> Message-ID: <000301c785f3$8768adc0$e598070a@amr.corp.intel.com> >Has anyone thought about using replication rather than caching to >solve this problem? It seems to me it would be alot faster for some >single process in the network to fetch and keep a copy of the entire >SA route database, format it into a binary format and use RC RDMA to >transfer it to every node each time it changes. I have given thought to using RC RDMA to distribute the data to all nodes, especially to eliminate the MAD protocol overhead. There are a couple issues with this: To work with existing SAs, we need to working within the defined SA interface (i.e. SA MADs), so something still needs to query for all path records. The GetTable query requires an SGID, which means that whatever node collects the path records must first collect all the GIDs. (And the most efficient way I've found to obtain a list of all GIDs is via a GetTable path record query...) This also means that the node collecting the path records will generate 1 query per GID. This has the same impact on the SA as each node issuing their own query. And the impact on the subnet is higher, since we still need to distribute that data to the end nodes. In short, until we can standardize on some new SA interface, or we have a distributed SA, I don't see where we can do much better than caching GetTable responses at the end nodes. 
- Sean From sean.hefty at intel.com Mon Apr 23 15:19:11 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 15:19:11 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423220132.GU4579@mellanox.co.il> Message-ID: <000401c785f5$6b0a5730$e598070a@amr.corp.intel.com> >Why can't it operate below the existing ib_sa interface? >Does any client really care whether the record came from the SA >or from the cache? It can, and I don't think so. Integrating the caching into the ib_sa is possible, but I didn't want to go this route without agreement first. One benefit this approach has is that the cache could store paths only after they were requested. (This would need to be a new feature, so would require more work.) I'd like to get more feedback before making this sort of change though. - Sean From mst at dev.mellanox.co.il Mon Apr 23 15:30:05 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Apr 2007 01:30:05 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423212509.GH20972@obsidianresearch.com> References: <20070423190449.GR4579@mellanox.co.il> <000101c785df$0831ef80$e598070a@amr.corp.intel.com> <20070423202051.GS4579@mellanox.co.il> <20070423212509.GH20972@obsidianresearch.com> Message-ID: <20070423223005.GV4579@mellanox.co.il> > Has anyone thought about using replication rather than caching to > solve this problem? It seems to me it would be alot faster for some > single process in the network to fetch and keep a copy of the entire > SA route database, format it into a binary format and use RC RDMA to > transfer it to every node each time it changes. > > For say, 10000 nodes you could compact an any-to-any path table into > around 20 megabytes. I wonder how do you propose to compact the path records that drastically? Number of records seems to be 10000 * 10000 / 2 = 50 million. 
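For scale, the uncompacted table behind the "50 million" figure can be estimated as follows. This is only a sketch: the 64-byte record size is an assumption about the SA PathRecord attribute, and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumed size of one PathRecord as carried in an SA MAD. */
#define PATH_RECORD_BYTES 64u

/* Unordered (source, destination) pairs among `nodes` single-port hosts. */
uint64_t unordered_pairs(uint64_t nodes)
{
    return nodes * (nodes - 1) / 2;
}

/* Bytes needed to store one full record per unordered pair. */
uint64_t naive_table_bytes(uint64_t nodes)
{
    return unordered_pairs(nodes) * PATH_RECORD_BYTES;
}
```

With 10000 nodes this is just under 50 million records and roughly 3.2 GB, which is why any replicated table has to share path parameters between pairs instead of storing one record per pair.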
> The RDMA transfers would be arranged into a waterfall, source > transfers to 8 nodes, who then each transfer to 8, etc. Choosing a > connection topology that overlays the switch topology would give this > scheme a huge aggregate bandwidth so the total transfer time would be > short. > > Unfortunately the SA protocol doesn't seem to have many provisions for > cache-coherence so it seems any form of route caching is going to run > into problems with stale data :< Replication adds a coherenece > mechanism and shifts the problem the replication source, which, > ideally, would ultimately be tightly connected to the SA. Yes, I do agree that replication solves some problems with caching, at least in theory, and I'd like to see this area explored, too. Things to cover before we can start implementation would be: the protocol between the replicas, how is the waterfall setup, how to handle errors/replicas going out of service. > > We could use DR SMPs to do network discovery and at least check that > > paths are valid - it's not too much code (ibnetdiscover is just 800 > > lines) and in a sense, that's actually putting an *SA* (not just > > cache) in each node. Combined with GID IN/OUT notices we could get > > away from querying path records completely. > > I don't think you can find/check the SL like this, Well, it seems configuring the SL needs to be done in some config file in the SM anyway. So now a script will have to be run to copy that file across all nodes. Might not be a big deal. -- MST From mst at dev.mellanox.co.il Mon Apr 23 15:35:33 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 24 Apr 2007 01:35:33 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000301c785f3$8768adc0$e598070a@amr.corp.intel.com> References: <20070423212509.GH20972@obsidianresearch.com> <000301c785f3$8768adc0$e598070a@amr.corp.intel.com> Message-ID: <20070423223533.GW4579@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache > > >Has anyone thought about using replication rather than caching to > >solve this problem? It seems to me it would be alot faster for some > >single process in the network to fetch and keep a copy of the entire > >SA route database, format it into a binary format and use RC RDMA to > >transfer it to every node each time it changes. > > I have given thought to using RC RDMA to distribute the data to all nodes, > especially to eliminate the MAD protocol overhead. There are a couple issues > with this: > > To work with existing SAs, we need to working within the defined SA interface > (i.e. SA MADs), so something still needs to query for all path records. > > The GetTable query requires an SGID, which means that whatever node collects the > path records must first collect all the GIDs. (And the most efficient way I've > found to obtain a list of all GIDs is via a GetTable path record query...) This > also means that the node collecting the path records will generate 1 query per > GID. This has the same impact on the SA as each node issuing their own query. > And the impact on the subnet is higher, since we still need to distribute that > data to the end nodes. We could solve this by implementing a process running on the same node as the SA. And it's probably not too hard to add a way for opensm to spit out the table into an external file when it gets a signal or something. 
-- MST From sean.hefty at intel.com Mon Apr 23 15:51:20 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 15:51:20 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache In-Reply-To: <20070423223533.GW4579@mellanox.co.il> Message-ID: <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> >We could solve this by implementing a process running on the same node as the >SA. >And it's probably not too hard to add a way for opensm to spit out >the table into an external file when it gets a signal or something. I agree that there are ways to solve this, but those solutions won't work with existing SAs and define a new SA interface. If we're willing to break compatibility or add extensions, we could also extend the SA to provide better support for caching. For example, add a new 'path updated' trap. IMO, I don't think that there's a huge issue initially populating the cache. The problems all seem to fall into keeping it updated.
I originally thought this would have been a bigger deal, but given that ipoib doesn't update its cache, it doesn't seem to be an issue in practice. - Sean From mst at dev.mellanox.co.il Mon Apr 23 16:09:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Apr 2007 02:09:06 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache In-Reply-To: <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> References: <20070423223533.GW4579@mellanox.co.il> <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> Message-ID: <20070423230906.GX4579@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache > > >We could solve this by implementing a process running on the same node as the > >SA. > >And it's probably not too hard to add a way for opensm to spit out > >the table into an external file when it gets a signal or something. > > I agree that there are ways to solve this, but those solutions won't work with > existing SAs and define a new SA interface. If we're willing to break > compatibility or add extensions, we could also extend the SA to provide better > support for caching. For example, add a new 'path updated' trap. One difficulty here > IMO, I don't think that there's a huge issue initially populating the cache. > The problems all seem to fall into keeping it updated. I originally thought > this would have been a bigger deal, but given that ipoib doesn't update its > cache, it doesn't seem to be an issue in practice. Maybe, but there might be several other reasons for this. One might be that IPoIB is slower than link speeds, so e.g. miscalculating the rate still does not cause network failures. Another might be that people run TCP mostly, which is very good at recovering from failures, so if you get the LID right, the rest of the values being off might not matter. 
In short, I'm not sure the fact that IPoIB works means we can copy its caching over safely. -- MST From lawver1 at llnl.gov Mon Apr 23 16:19:49 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Mon, 23 Apr 2007 16:19:49 -0700 Subject: [ofa-general] IPoIB forwarding Message-ID: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> I have a small test bed with 2 nodes with IB/OFED1.2/connected mode and a third node which has IP only and is connected to one of the IB nodes. In between are a DDR IB switch and a 10GE IP switch. The node with both IP and IB interfaces is simply an IP router in this test setup. The IB-only node has a subnet route to the router node and the IP-only node has a subnet route to the router node. When I launch an Iperf test from the IB (IPoIB) node to the IP node, I get very good throughput with no tuning (7.5 Gb/s). When I launch from IP to the IB node, I get virtually no throughput (2.5 Mb/s). When I dropped the window size to 8k (iperf -w8k) the throughput is 750 Mb/s. Any suggestions, ideas? thanks, bryan lawver llnl [ 4] local 192.168.120.3 port 33418 connected with 172.16.13.2 port 5001 [ 5] local 192.168.120.3 port 5001 connected with 172.16.13.2 port 1032 [ 4] 0.0-10.0 sec 4.72 GBytes 4.06 Gbits/sec [ 5] 0.0-10.3 sec 1.69 MBytes 1.38 Mbits/sec From jgunthorpe at obsidianresearch.com Mon Apr 23 16:23:12 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 23 Apr 2007 17:23:12 -0600 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423223005.GV4579@mellanox.co.il> References: <20070423190449.GR4579@mellanox.co.il> <000101c785df$0831ef80$e598070a@amr.corp.intel.com> <20070423202051.GS4579@mellanox.co.il> <20070423212509.GH20972@obsidianresearch.com> <20070423223005.GV4579@mellanox.co.il> Message-ID: <20070423232312.GI20972@obsidianresearch.com> On Tue, Apr 24, 2007 at 01:30:05AM +0300, Michael S.
Tsirkin wrote: > > For say, 10000 nodes you could compact an any-to-any path table into > > around 20 megabytes. > > I wonder how do you propose to compact the path records that drastically? > Number of records seems to be 10000 * 10000 / 2 = 50 million. I was imagining this kind of arrangement when I wrote the above (and I only imagined a little bit, so let's see if it works out):

    /* Layout sketch: GIDS and PATHS are run-time counts, so this is
     * pseudo-C describing the table format, not compilable code. */
    struct Path {
        uint16_t gid_idx;                 /* destination GID, as index into gid_table */
        uint16_t dlid;
        uint8_t  sl;
        // etc
        uint8_t  valid_sgid_map[GIDS/8];  /* one bit per source GID */
    };

    struct DataBase {
        unsigned int    GIDS;
        struct in6_addr gid_table[GIDS];
        unsigned int    PATHS;
        struct Path     path_table[PATHS];
    };

Lookup:

    bool match_it(struct Path *p)
    {
        const unsigned int src_gid_idx = ...;  // look in gid_table
        const unsigned int dest_gid_idx = ...; // look in gid_table
        if (p->gid_idx == dest_gid_idx &&
            p->valid_sgid_map[src_gid_idx/8] & (1 << (src_gid_idx % 8)))
            return true;
        return false;
    }

    possible_paths = bsearch(DataBase.path_table, match_it);

There would be at most NUM_LIDS*NUM_SL Path records (i.e. all valid DLID/SL combinations in the subnet). So for 10000 nodes (1 LID each) we have:

    GIDS                = 10000
    PATHS               = 10000*16 = 160000
    sizeof(gid_table)   = 10000*16 = 160000
    sizeof(struct Path) = 1258
    sizeof(path_table)  = 160000*1258 = 201280000
    sizeof(DataBase)    = 201440000 = 192.1 MB (worst case)

Broadly, this compresses the 128 bit GIDs by substituting a 16 bit unique identifier (via gid_table) and then compacts the path records by observing that in the N*N matrix there are only so many parameter combinations and they are shared based on the source GID. So we use a bitmap to indicate what destination parameters are valid for each source in the network. Path lookup is O(log(n)) since the path and gid tables would be sorted and have fixed element size. The whole thing could be moved with a single RDMA WQ. I think rates and pkeys can be coded similarly, but the upper bound moves up to NUM_LIDS*NUM_SL*NUM_PKEYS*NUM_RATES.
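The size estimate for this layout can be re-derived mechanically. The sketch below assumes 16-byte GIDs, 8 bytes of fixed fields per Path entry (gid_idx, dlid, sl plus the "etc" fields) and one bitmap bit per source GID; it ignores the two count fields, and all names are illustrative.

```c
#include <stdint.h>

#define GID_BYTES        16u /* one 128-bit GID in gid_table */
#define PATH_FIXED_BYTES  8u /* gid_idx, dlid, sl and "etc"; an assumption */

/* Bytes in valid_sgid_map: one bit per source GID, rounded up. */
uint64_t sgid_map_bytes(uint64_t gids)
{
    return (gids + 7) / 8;
}

/* Bytes in one Path entry for a subnet with `gids` GIDs. */
uint64_t path_entry_bytes(uint64_t gids)
{
    return PATH_FIXED_BYTES + sgid_map_bytes(gids);
}

/* Worst-case table size: one Path entry per (DLID, SL) combination. */
uint64_t database_bytes(uint64_t gids, uint64_t lids, uint64_t sls)
{
    uint64_t paths = lids * sls;

    return gids * GID_BYTES + paths * path_entry_bytes(gids);
}
```

Calling `database_bytes(10000, 10000, 16)` reproduces the worst case for 10000 single-LID nodes with 16 SLs; since the bitmap dominates each entry, the total grows roughly with the square of the node count.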
Off the top of my head I'd say that the pkey coding and valid_sgid_maps can probably be optimized further, but that would require actual study :P I'd also include a path trace database and some flags so that APM can be made to work sanely, but that is just fluff stuff. The only reason I mention this is because cache-coherency is a huge PITA. Even if the spec was updated to add new MADs for coherency it would still be a PITA. I think for true high-availability you need to get this stuff right. If the SM re-routes the fabric to cover for a broken link that data needs to get pushed out, QPs need to be updated, APM needs to be fiddled/etc. IB implementations don't do that today, and replication at least provides a scalable foundation to support this in huge clusters. Also, routers have the same basic set of problems :< Routers really want an accurate replica of the SA database to run well. Jason From sean.hefty at intel.com Mon Apr 23 16:36:59 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Apr 2007 16:36:59 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: addpathrecord cache In-Reply-To: <20070423230906.GX4579@mellanox.co.il> Message-ID: <000601c78600$49a6ab10$e598070a@amr.corp.intel.com> >Maybe, but there might be several other reasons for this. > >One might be that IPoIB is slower than link speeds, >so e.g. miscalculating the rate still does not cause network failures. >Another might be that people run TCP mostly, which >is very good at recovering from failures, so if you get the LID >right, the rest of the values being off might not matter. > >In short, I'm not sure the fact that IPoIB works >means we can copy it's caching over safely. I'm starting to get lost here. Are you concerned that the SM will change the MTU or rate of a path, and the caches will not be updated? What about existing connections? I see that either the SA can use events to notify end nodes when such a situation arises, or the end nodes can poll for changes. 
The former isn't spec'ed, but the later can be done using the ib_local_sa's /sys/class interface. - Sean From pradeep at us.ibm.com Mon Apr 23 16:53:31 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 23 Apr 2007 16:53:31 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: <20070423205032.GT4579@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/23/2007 01:50:32 PM: > > > > > > This would only help if there are short bursts of high-speed activity > > > on the receiving HCA: if the speed is different in the long run, > > > the right thing to do is to drop some packets and have TCP adjust > > > its window accordingly. > > > > > > But in that former case (short bursts), just increasing the number > > > of pre-posted > > > buffers on RQ should be enough, and looks like a much cleaner solution. > > > > This was not an issue with running out of buffers (which was my original > > suspicion too). This was probably due to missing ACKs -I am guessing > > this happens because the two HCAs have very different processing speeds. > > I don't see how different processing speeds could trigger missing ACKs. > Do you? Note: In the netperf tests errors were seen only when one side is ehca and the other side is mthca. When both sides are ehca or mthca no errors are seen. In the netperf tests I observed that ehca encountered lots of send completion errors. ehca encountered send completion errors whether it was the sender or the receiver (presumably sending Acks when it was the receiver). On the contrary mthca reported no errors -even when I changed /sys/module/ib_mthca/parameters/debug_level to 1 (that is the way to turn on debug on mthca -right?). With the Local CA Delay Ack set to 0 on ehca, I believe it is probably taking mthca more than 16us to deliver the Ack back to ehca. It might not be exactly 16 us, but I just assumed 4 times the Local CA Delay Ack (as per the spec) of 4us. 
That triggers the send completion error on ehca. On the other hand, when two ehca adapters use RC, no errors are encountered implying that the Ack is consistently delivered within 16us. Since mthca sets the Local CA Delay Ack value to 15, the timeouts between two mthcas are much larger (> 128 ms)and hence no problems are encountered. It is for that reason I stated that different processing speeds may be trigerring the missing Acks. > > > This is exacerbated by the fact that retry count (not RNR retry count)was 0. > > When I changed the retry count to a small values like 3 it still works. > > Please see below for additional details. > > Looks like work-around for some breakage elsewhere. > Maybe it's a good thing we don't retry in such cases - retries are not good > for network performance, and this way we move the problem to it's > root cause where it can be debugged and fixed instead of overloading > the network. There is no single value all HCAs can pick and provide optimal performance in all situations. The only way would be to select a certain value that is optimal for each HCA, and depend on a retry mechanism when the selected value does not meet the needs of interoperability. To depend on higher levels like TCP or even the application to do the retries will kill performance. > > > > > Can someone point me to where this comment is in the RFC? I > would like to > > > > understand the reasoning. > > > > > > See "7.1 A Cautionary Note on IPoIB-RC". > > > See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html > > > > > > If we do this right, the above mentioned problems should not > occur. In the case > > we are dealing with the RC timers are expected to be much smaller (than TCP > > timers) and > > should not interfere with TCP timers. The IBM HCA uses a default > value of 0 for > > the Local CA Ack Delay; > > which is probably too small a value and with a retry > > count of 0, ACKs are missed. 
I agree with Roland's assessment (this was in a > > seperate thread), that this should not be 0. > > So, it's an ehca bug then? > I didn't really get the explanation. Who loses the ACKs? ehca? > It is the case that ehca *reports* Local CA Ack Delay that is > *below* what it actually provides? If so, it should be easy to fix in driver. Yes, there is a problem with the IBM HCA, and we will address this. I stated as much, when I concurred with Roland's assessment. > > > On the other hand with the Topspin adapter (and mthca) that I have the > > Local CA Ack Delay is 0xf which would imply a Local Ack Timeout of > 4.096us * 2^15 which > > is about 128ms. The IB spec says it can be upto 4 times this value > which means upto > > 512 ms. > > > > The smallest TCP retransmission timer is HZ/5 which is 200 ms on several > > architectures. > > Yes, even with a retry count of 1 or 2, there is then a risk of > > interfering with TCP timers. > > > > If my understanding is correct, the way its should be done is to > have a small > > value for the Local CA Ack Delay like say 3 or 4 which would implya Timeout > > value of 32-64us, with a small retry count of 2 or 3. This way the > max Timeout > > would be still be only several hundreds of us, a factor of 1000 > less than the > > minimum TCP timeout. IB adapters are supposed to have a much smaller latency > > than ethernet adapters, so I am guessing that this would be in the > ballpark for > > most HCAs. > > > > Unfortunately I do not know how much of an effort it will take to change the > > Local CA Delay Ack across the various HCAs (if need be). > > How about fixing ehca not to trigger ACK loss instead? As previously stated, IBM HCA will address these issues. However, my understanding is that mthca/Topspin adapters also have a problem (too high a value for the Local CA Delay Ack). Both HCAs need to be fixed for good interoperability. 
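The timeout figures traded in this thread all follow from the same formula, 4.096 us * 2^n. A minimal sketch, taking the exponent the way the thread does (the advertised ack delay value used directly as n) and treating "up to 4 times" as the per-attempt upper bound; both function names are made up for illustration.

```c
#include <stdint.h>

/* Nominal timeout in microseconds for exponent n: 4.096 us * 2^n. */
double ack_timeout_us(unsigned exponent)
{
    return 4.096 * (double)(UINT64_C(1) << exponent);
}

/*
 * Rough upper bound on how long the sender waits before reporting a
 * retry-exceeded error: each attempt may take up to 4x the nominal
 * timeout, and there are retry_count + 1 attempts in total.
 */
double worst_case_wait_us(unsigned exponent, unsigned retry_count)
{
    return 4.0 * ack_timeout_us(exponent) * (double)(retry_count + 1);
}
```

For example, `ack_timeout_us(15)` is about 134 ms and `worst_case_wait_us(0, 0)` is about 16 us, matching the mthca and ehca numbers quoted above, while `worst_case_wait_us(4, 3)` stays near 1 ms, well below the 200 ms minimum TCP retransmission timer mentioned earlier.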
> > > In the interim, the > > only parameter we can control is the retry count and we could make > this a module > > parameter. > > Since both 0 and > 0 values might lead to problems, this does not > look like a real solution. > Please see previous reasoning as to why we need a retry mecahnism. > > > > > > By the way, as long as you are not using SRQ, why not use UC mode QPs? > > > This would look like a cleaner solution. > > You haven't addressed this, and this might be a better way out. > Unreliable SRQ > being only supported for RC QPs now is really one of the major > reasons IPoIB CM > uses RC rather than UC. > This is a good point you make. However, this will not address the core issue of missing Acks -the difference in processing speeds. What happens when the next version of IBM HCA (or for that matter HCA from any other vendor) supporting SRQ comes out? Pradeep pradeep at us.ibm.com
From RAISCH at de.ibm.com Tue Apr 24 14:48:48 2007 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Tue, 24 Apr 2007 23:48:48 +0200 Subject: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb In-Reply-To: <1177347358.28021.24294.camel@hal.voltaire.com> Message-ID: Hi Hal, you are correct, with the current firmware version it will fail later. Christoph R. general-bounces at lists.openfabrics.org wrote on 23.04.2007 18:55:59: > Hi Joachim, > > On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote: > > Add "Modify Port" verb support to eHCA driver. > > ib_cm needs this to initialize properly. > > I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does > this just fail later when it is attempted to be actually set ?
> > -- Hal From vlad at lists.openfabrics.org Tue Apr 24 02:37:21 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 24 Apr 2007 02:37:21 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070424-0200 daily build status Message-ID: <20070424093721.DCCD9E60824@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on 
ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed:
From FENKES at de.ibm.com Tue Apr 24 05:06:59 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Tue, 24 Apr 2007 14:06:59 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Add missing device link to class device In-Reply-To: Message-ID: Roland Dreier wrote on 23.04.2007 21:17:48: > Hmm, I have links like this on my system already: > > the patch actually looks sane but I don't understand why it's needed. > Could you explain? I had a look at the kernel code -- currently, all device drivers except ehca do this by themselves: cxgb3: 1110 dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); 1111 dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); amso1100: 798 dev->ibdev.dma_device = &dev->pcidev->dev; 799 dev->ibdev.class_dev.dev = &dev->pcidev->dev; ipath: 1559 dev->dma_device = &dd->pcidev->dev; 1560 dev->class_dev.dev = dev->dma_device; mthca: 1295 dev->ib_dev.dma_device = &dev->pdev->dev; 1296 dev->ib_dev.class_dev.dev = &dev->pdev->dev; So I think it makes a lot of sense to put the class_dev.dev assignment into generic ib_core code instead of repeating it in all the drivers. The respective lines could move out of the drivers in the future but won't hurt anyone until then. What's your take on this? Cheers, Joachim --- Joachim Fenkes -- eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH -- Dept. 3627 (I/O Firmware Dev.
2) Schoenaicher Strasse 220 -- 71032 Boeblingen -- Germany eMail: fenkes at de.ibm.com -- Phone: +49 7031 16 1239 From FENKES at de.ibm.com Tue Apr 24 05:25:34 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Tue, 24 Apr 2007 14:25:34 +0200 Subject: [ofa-general] Re: [PATCH] eHCA: Add "Modify Port" verb In-Reply-To: Message-ID: Roland Dreier wrote on 23.04.2007 21:20:14: > > + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { > > + ehca_err(&shca->ib_device, "Can't query port properties"); > > + ret = -EINVAL; > > + goto modify_port1; > > + } > > + > > + cap = (rblock->capability_mask | props->set_port_cap_mask) > > + & ~props->clr_port_cap_mask; > > + > > + hret = hipz_h_modify_port(shca->ipz_hca_handle, port, > > + cap, props->init_type, port_modify_mask); > > Is this thread-safe? What if two different bits are set at the same > time from two different threads? It seems that both calls could get > the same result from hipz_h_query_port(), and then the second call to > hipz_h_modify_port() would overwrite the first call. Yes, you're so right. > You could look at the implementation in mthca to see the locking I > used there. I'll do that, thanks for the hint! Joachim --- Joachim Fenkes -- eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH -- Dept. 3627 (I/O Firmware Dev. 2) Schoenaicher Strasse 220 -- 71032 Boeblingen -- Germany eMail: fenkes at de.ibm.com -- Phone: +49 7031 16 1239 From halr at voltaire.com Tue Apr 24 06:37:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 09:37:38 -0400 Subject: [ofa-general] Multicast Question In-Reply-To: <4624D086.3010100@dbresearch.net> References: <4624D086.3010100@dbresearch.net> Message-ID: <1177421855.28021.102598.camel@hal.voltaire.com> Hi Sean, On Tue, 2007-04-17 at 09:49, Sean Hubbell wrote: > Hello, > > I was wondering if I ping or ibping the 224.0.0.1 address should I be > receiving a list of the nodes that have multicast enabled? 
ibping takes a LID or GUID (and not an IP address) as an argument so this is not supported by ibping. 224.0.0.1 is the all (multicast) hosts on the subnet so I would expect only those hosts supporting IPmc to respond and they may indicate some other IP interface other than IPoIB. -- Hal > Thanks in advance, > > Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Apr 24 08:02:57 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 11:02:57 -0400 Subject: [ofa-general] Re: [PATCH] osm: fixing small memory leak In-Reply-To: <462B7D00.6060707@dev.mellanox.co.il> References: <462B7D00.6060707@dev.mellanox.co.il> Message-ID: <1177426976.12163.1060.camel@hal.voltaire.com> Hi Yevgeny, On Sun, 2007-04-22 at 11:19, Yevgeny Kliteynik wrote: > Hi Hal, > > This patch fixes a small memory leak - OpenSM was leaking ~200 bytes > or more for each guid in the fabric each time the guid2lid file was re-read. Good find. > Please apply to ofed_1_2 and trunk. > > Thanks. > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both master and ofed_1_2). -- Hal From halr at voltaire.com Tue Apr 24 08:28:08 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 11:28:08 -0400 Subject: [ofa-general] Re: [PATCH] osm: fixing log messages in case of uncompatible MTU/RATE In-Reply-To: <4624D5E7.9010401@dev.mellanox.co.il> References: <4624D5E7.9010401@dev.mellanox.co.il> Message-ID: <1177428487.12163.2586.camel@hal.voltaire.com> Hi Yevgeny, On Tue, 2007-04-17 at 10:12, Yevgeny Kliteynik wrote: > Hi Hal, > > Log messages that state that required mcast group > RATE/MTU doesn't match the RATE/MTU the request are > misleading. 
> Feel free to change the exact words of the log messages, > but the previous messages got me confused when I was > debugging mcast join failures. > > It's not really a bug, so please apply to master only. It does improve clarity (and there has been much discussion on the list about join failures) and is extremely low risk change so I think this is a good improvement for OFED 1.2. > Thanks. > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both master and ofed_1_2). -- Hal From halr at voltaire.com Tue Apr 24 08:36:56 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 11:36:56 -0400 Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <4625B538.3080207@dev.mellanox.co.il> References: <4624DCFE.9030904@dev.mellanox.co.il> <20070417233935.GB29254@sashak.voltaire.com> <4625B538.3080207@dev.mellanox.co.il> Message-ID: <1177429015.12163.3149.camel@hal.voltaire.com> On Wed, 2007-04-18 at 02:05, Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote: > >> Hi Hal, > >> > >> When parsing guid2lid file, invalid guid string > >> ended up unpacked as guid 0x0. Ignoring line with > >> invalid guid string. > >> > >> This bug doesn't look too important - don't think > >> that it should go to ofed_1_2. Anyway, your call. > > > > It looks like a safe change for me. > > > > BTW any reason to use strtouq() instead of more popular (IMHO) strtoul() > > or strtoull()? > > No particular reason. > It specifically says that the function "convert string to an unsigned > 64-bit integer" instead of unsigned long or unsigned long long, but > on the other hand it doesn't matter, because uint64_t is a typedef anyway. > If you have special sentiments about strtoul/strtoull - feel free to change it. Is strtouq supported in Windows ? 
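For reference, the key check discussed in this thread can also be written with standard C `strtoull()` rather than `strtouq()`. The following is only a minimal sketch: `parse_guid_key` is a hypothetical helper name, not a function from OpenSM, and whether `strtoull()` is available on a given Windows toolchain is not something this thread establishes.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of OpenSM): parse a key string as an
 * unsigned 64-bit number. Base 0 accepts decimal, octal and 0x-prefixed hex.
 * Returns 0 and stores the value on success; returns -1 if the string is
 * empty, contains no digits, has trailing characters, or is out of range.
 */
static int parse_guid_key(const char *s, uint64_t *out)
{
    char *end = NULL;
    unsigned long long v;

    if (s == NULL || *s == '\0')
        return -1;

    errno = 0;
    v = strtoull(s, &end, 0);

    /* end == s: nothing was converted; *end != '\0': trailing garbage */
    if (errno == ERANGE || end == s || *end != '\0')
        return -1;

    *out = (uint64_t)v;
    return 0;
}
```

Unlike a test on the returned value alone, this sketch accepts a key that is literally zero and rejects any string with unparsed trailing characters; whether a zero GUID should be accepted is a policy decision left to the caller.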
-- Hal > -- Yevgeny > > > > Sasha > > > >> -- Yevgeny > >> > >> Signed-off-by: Yevgeny Kliteynik > >> --- > >> osm/opensm/osm_db_files.c | 15 ++++++++++++--- > >> 1 files changed, 12 insertions(+), 3 deletions(-) > >> > >> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c > >> index dbadd68..23eaa0b 100644 > >> --- a/osm/opensm/osm_db_files.c > >> +++ b/osm/opensm/osm_db_files.c > >> @@ -294,6 +294,7 @@ osm_db_restore( > >> char *p_first_word, *p_rest_of_line, *p_last; > >> char *p_key = NULL; > >> char *p_prev_val, *p_accum_val = NULL; > >> + char *endptr = NULL; > >> unsigned int line_num; > >> > >> OSM_LOG_ENTER( p_log, osm_db_restore ); > >> @@ -415,12 +416,20 @@ osm_db_restore( > >> p_prev_val = NULL; > >> } > >> > >> - /* store our key and value */ > >> - st_insert(p_domain_imp->p_hash, > >> - (st_data_t)p_key, (st_data_t)p_accum_val); > >> osm_log( p_log, OSM_LOG_DEBUG, > >> "osm_db_restore: " > >> "Got key:%s value:%s\n", p_key, p_accum_val); > >> + > >> + /* check that the key is a number */ > >> + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') > >> + osm_log( p_log, OSM_LOG_ERROR, > >> + "osm_db_restore: ERR 610B: " > >> + "Key:%s is invalid\n", > >> + p_key); > >> + else > >> + /* store our key and value */ > >> + st_insert(p_domain_imp->p_hash, > >> + (st_data_t)p_key, (st_data_t)p_accum_val); > >> } > >> else > >> { > >> -- > >> 1.4.4.1.GIT > >> > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > > > From fenkes at de.ibm.com Tue Apr 24 08:44:31 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 24 Apr 2007 17:44:31 +0200 Subject: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb Message-ID: <200704241744.31691.fenkes@de.ibm.com> Add "Modify Port" verb support to eHCA driver. 
ib_cm needs this to initialize properly. Signed-off-by: Joachim Fenkes --- This is the shiny new version of this patch with proper locking. Tested and works. ehca_classes.h | 1 + ehca_hca.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- ehca_main.c | 1 + hcp_if.c | 24 ++++++++++++++++++++++++ hcp_if.h | 4 ++++ 5 files changed, 83 insertions(+), 2 deletions(-) diff -urp a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h --- a/drivers/infiniband/hw/ehca/ehca_classes.h 2007-02-04 19:44:54.000000000 +0100 +++ b/drivers/infiniband/hw/ehca/ehca_classes.h 2007-04-24 14:51:38.000000000 +0200 @@ -96,6 +96,7 @@ struct ehca_shca { struct ehca_mr *maxmr; struct ehca_pd *pd; struct h_galpas galpas; + struct mutex modify_mutex; }; struct ehca_pd { diff -urp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c --- a/drivers/infiniband/hw/ehca/ehca_hca.c 2007-02-04 19:44:54.000000000 +0100 +++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2007-04-24 14:50:43.000000000 +0200 @@ -147,6 +147,7 @@ int ehca_query_port(struct ib_device *ib break; } + props->port_cap_flags = rblock->capability_mask; props->gid_tbl_len = rblock->gid_tbl_len; props->max_msg_sz = rblock->max_msg_sz; props->bad_pkey_cntr = rblock->bad_pkey_cntr; @@ -233,10 +234,60 @@ query_gid1: return ret; } +const u32 allowed_port_caps = ( + IB_PORT_SM | IB_PORT_LED_INFO_SUP | IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP); + int ehca_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, struct ib_port_modify *props) { - /* Not implemented yet */ - return -EFAULT; + int ret = 0; + struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); + struct hipz_query_port *rblock; + u32 cap; + u64 hret; + + if ((props->set_port_cap_mask | props->clr_port_cap_mask) + & ~allowed_port_caps) { + ehca_err(&shca->ib_device, "Non-changeable bits set in masks " + "set=%x clr=%x 
allowed=%x", props->set_port_cap_mask, + props->clr_port_cap_mask, allowed_port_caps); + return -EINVAL; + } + + if (mutex_lock_interruptible(&shca->modify_mutex)) + return -ERESTARTSYS; + + rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL); + if (!rblock) { + ehca_err(&shca->ib_device, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto modify_port1; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + ehca_err(&shca->ib_device, "Can't query port properties"); + ret = -EINVAL; + goto modify_port2; + } + + cap = (rblock->capability_mask | props->set_port_cap_mask) + & ~props->clr_port_cap_mask; + + hret = hipz_h_modify_port(shca->ipz_hca_handle, port, + cap, props->init_type, port_modify_mask); + if (hret != H_SUCCESS) { + ehca_err(&shca->ib_device, "Modify port failed hret=%lx", hret); + ret = -EINVAL; + } + +modify_port2: + ehca_free_fw_ctrlblock(rblock); + +modify_port1: + mutex_unlock(&shca->modify_mutex); + + return ret; } diff -urp a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c --- a/drivers/infiniband/hw/ehca/ehca_main.c 2007-02-04 19:44:54.000000000 +0100 +++ b/drivers/infiniband/hw/ehca/ehca_main.c 2007-04-24 14:50:43.000000000 +0200 @@ -583,6 +583,7 @@ static int __devinit ehca_probe(struct i ehca_gen_err("Cannot allocate shca memory."); return -ENOMEM; } + mutex_init(&shca->modify_mutex); shca->ibmebus_dev = dev; shca->ipz_hca_handle.handle = *handle; diff -urp a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c --- a/drivers/infiniband/hw/ehca/hcp_if.c 2007-02-04 19:44:54.000000000 +0100 +++ b/drivers/infiniband/hw/ehca/hcp_if.c 2007-04-23 18:06:09.000000000 +0200 @@ -70,6 +70,10 @@ #define H_ALL_RES_QP_SQUEUE_SIZE_PAGES EHCA_BMASK_IBM(0, 31) #define H_ALL_RES_QP_RQUEUE_SIZE_PAGES EHCA_BMASK_IBM(32, 63) +#define H_MP_INIT_TYPE EHCA_BMASK_IBM(44, 47) +#define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48) +#define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49) + /* direct access 
qp controls */ #define DAQP_CTRL_ENABLE 0x01 #define DAQP_CTRL_SEND_COMP 0x20 @@ -364,6 +368,26 @@ u64 hipz_h_query_port(const struct ipz_a return ret; } +u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, const u32 port_cap, + const u8 init_type, const int modify_mask) +{ + u64 port_attributes = port_cap; + + if (modify_mask & IB_PORT_SHUTDOWN) + port_attributes |= EHCA_BMASK_SET(H_MP_SHUTDOWN, 1); + if (modify_mask & IB_PORT_INIT_TYPE) + port_attributes |= EHCA_BMASK_SET(H_MP_INIT_TYPE, init_type); + if (modify_mask & IB_PORT_RESET_QKEY_CNTR) + port_attributes |= EHCA_BMASK_SET(H_MP_RESET_QKEY_CTR, 1); + + return ehca_plpar_hcall_norets(H_MODIFY_PORT, + adapter_handle.handle, /* r4 */ + port_id, /* r5 */ + port_attributes, /* r6 */ + 0, 0, 0, 0); +} + u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, struct hipz_query_hca *query_hca_rblock) { diff -urp a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h --- a/drivers/infiniband/hw/ehca/hcp_if.h 2007-02-04 19:44:54.000000000 +0100 +++ b/drivers/infiniband/hw/ehca/hcp_if.h 2007-04-23 18:06:09.000000000 +0200 @@ -85,6 +85,10 @@ u64 hipz_h_query_port(const struct ipz_a const u8 port_id, struct hipz_query_port *query_port_response_block); +u64 hipz_h_modify_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, const u32 port_cap, + const u8 init_type, const int modify_mask); + u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, struct hipz_query_hca *query_hca_rblock); From halr at voltaire.com Tue Apr 24 09:02:54 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 12:02:54 -0400 Subject: [ofa-general] Re: [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <4624DCFE.9030904@dev.mellanox.co.il> References: <4624DCFE.9030904@dev.mellanox.co.il> Message-ID: <1177430573.12163.4755.camel@hal.voltaire.com> Hi Yevgeny, On Tue, 2007-04-17 at 10:43, Yevgeny Kliteynik wrote: 
> Hi Hal, > > When parsing guid2lid file, invalid guid string > ended up unpacked as guid 0x0. Ignoring line with > invalid guid string. Out of curiosity, was the invalid string hand edited as a test case ? > This bug doesn't look too important - don't think > that it should go to ofed_1_2. Anyway, your call. Agreed if I understood how this occurs (caused by an artificial test case). > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to master only). -- Hal From ossrosch at linux.vnet.ibm.com Tue Apr 24 09:13:22 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Tue, 24 Apr 2007 18:13:22 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3] ehca:fixes BUG 574-"openib-general@openib.org" Message-ID: <200704241813.22582.ossrosch@linux.vnet.ibm.com> Hi, this is the patch to fix BUG: 574. It serializes calls to register_mr() by using a spin_lock. Regards Stefan --- Signed-off-by: Stefan Roscher ehca_classes.h | 1 + ehca_main.c | 2 ++ hcp_if.c | 16 ++++++++++++---- 3 files changed, 15 insertions(+), 4 deletions(-) diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/ehca_classes.h --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-04-24 14:54:05.000000000 +0200 +++ ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/ehca_classes.h 2007-04-24 14:52:22.000000000 +0200 @@ -272,6 +272,7 @@ void ehca_cleanup_mrmw_cache(void); extern spinlock_t ehca_qp_idr_lock; extern spinlock_t ehca_cq_idr_lock; +extern spinlock_t hcall_lock; extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/ehca_main.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-04-24 14:54:05.000000000 +0200 +++ ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/ehca_main.c 2007-04-24 14:52:32.000000000 +0200 @@ -98,6 +98,7 @@ MODULE_PARM_DESC(scaling_code, spinlock_t
ehca_qp_idr_lock; spinlock_t ehca_cq_idr_lock; +spinlock_t hcall_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); @@ -815,6 +816,7 @@ int __init ehca_module_init(void) idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); spin_lock_init(&ehca_cq_idr_lock); + spin_lock_init(&hcall_lock); INIT_LIST_HEAD(&shca_list); spin_lock_init(&shca_list_lock); diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/hcp_if.c ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/hcp_if.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/hcp_if.c 2007-04-04 14:07:36.000000000 +0200 +++ ofa_kernel-1.2_fixed/drivers/infiniband/hw/ehca/hcp_if.c 2007-04-24 15:06:47.000000000 +0200 @@ -136,7 +136,6 @@ static long ehca_plpar_hcall_norets(unsi return H_BUSY; } - static long ehca_plpar_hcall9(unsigned long opcode, unsigned long *outs, /* array of 9 outputs */ unsigned long arg1, @@ -150,17 +149,27 @@ static long ehca_plpar_hcall9(unsigned l unsigned long arg9) { long ret; - int i, sleep_msecs; + int i, sleep_msecs, lock_is_set = 0; + unsigned long flags; + ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); - + for (i = 0; i < 5; i++) { + if ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)) { + spin_lock_irqsave(&hcall_lock, flags); + lock_is_set = 1; + } + ret = plpar_hcall9(opcode, outs, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); + + if (lock_is_set) + spin_unlock_irqrestore(&hcall_lock, flags); if (H_IS_LONG_BUSY(ret)) { sleep_msecs = get_longbusy_msecs(ret); @@ -189,7 +198,6 @@ static long ehca_plpar_hcall9(unsigned l opcode, ret, outs[0], outs[1], outs[2], outs[3], outs[4], outs[5], outs[6], outs[7], outs[8]); return ret; - } return H_BUSY; From rdreier at cisco.com Tue Apr 24 10:52:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 10:52:18 -0700 Subject: [ofa-general] Re: [PATCH] eHCA: Add "Modify Port" verb In-Reply-To: 
<200704241744.31691.fenkes@de.ibm.com> (Joachim Fenkes's message of "Tue, 24 Apr 2007 17:44:31 +0200") References: <200704241744.31691.fenkes@de.ibm.com> Message-ID: Looks good, applied for 2.6.22. From rdreier at cisco.com Tue Apr 24 11:27:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 11:27:18 -0700 Subject: [ofa-general] Re: [PATCH] ib_core: Add missing device link to class device In-Reply-To: (Joachim Fenkes's message of "Tue, 24 Apr 2007 14:06:59 +0200") References: Message-ID: > I had a look at the kernel code -- currently, all device drivers except > ehca do this by themselves: > So I think it makes a lot of sense to put the class_dev.dev assignment > into generic ib_core code instead of repeating it in all the drivers. > The respective lines could move out of the drivers in the future but > won't hurt anyone until then. Actually I think we should delete the duplicate code now while merging this. So I queued this up for 2.6.22: commit f19c8d7cbe3153d68f0a559afd02f66655310238 Author: Joachim Fenkes Date: Mon Apr 23 18:20:27 2007 +0200 IB: Set class_dev->dev in core for nice device symlink All RDMA drivers except ehca set class_dev->dev to their dma_device value (ehca leaves this unset). dma_device is the only value that makes any sense, so move this assignment to core/sysfs.c. This reduce the duplicated code in the rest of the drivers and gives ehca a nice /sys/class/infiniband/ehcaX/device symlink. 
Signed-off-by: Joachim Fenkes Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 000c086..08c299e 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -683,6 +683,7 @@ int ib_device_register_sysfs(struct ib_device *device) class_dev->class = &ib_class; class_dev->class_data = device; + class_dev->dev = device->dma_device; strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); INIT_LIST_HEAD(&device->port_list); diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index fef9727..607c09b 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -796,7 +796,6 @@ int c2_register_device(struct c2_dev *dev) memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6); dev->ibdev.phys_port_cnt = 1; dev->ibdev.dma_device = &dev->pcidev->dev; - dev->ibdev.class_dev.dev = &dev->pcidev->dev; dev->ibdev.query_device = c2_query_device; dev->ibdev.query_port = c2_query_port; dev->ibdev.modify_port = c2_modify_port; diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..af28a31 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -1108,7 +1108,6 @@ int iwch_register_device(struct iwch_dev *dev) memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); - dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); dev->ibdev.query_device = iwch_query_device; dev->ibdev.query_port = iwch_query_port; dev->ibdev.modify_port = iwch_modify_port; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index f5604b8..18c6df2 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ 
b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1559,7 +1559,6 @@ int ipath_register_ib_device(struct ipath_devdata *dd) dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = &dd->pcidev->dev; - dev->class_dev.dev = dev->dma_device; dev->query_device = ipath_query_device; dev->modify_device = ipath_modify_device; dev->query_port = ipath_query_port; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 0725ad7..47e6fd4 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1293,7 +1293,6 @@ int mthca_register_device(struct mthca_dev *dev) dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; - dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; dev->ib_dev.modify_device = mthca_modify_device; From rdreier at cisco.com Tue Apr 24 11:33:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 11:33:25 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: (Pradeep Satyanarayana's message of "Mon, 23 Apr 2007 16:53:31 -0700") References: Message-ID: > As previously stated, IBM HCA will address these issues. However, > my understanding is that mthca/Topspin adapters also have a problem > (too high a value for the Local CA Delay Ack). Both HCAs need to be > fixed for good interoperability. I think you're misunderstanding what local CA ack delay means. This is a property of an HCA that is not (necessarily) subject to tuning -- it is just a property of the HCA, namely the maximum amount of time it may take to generate an ACK. So if a certain HCA reports a value of 15, then that means that any remote HCA talking to it must be prepared for a delay of 4.096 * 2^15 usecs before receiving an ACK. 
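To put a number on that: 4.096 usec is 4096 nsec, so a reported value of 15 gives 4096 * 2^15 = 134217728 nsec, or roughly 134 msec. A small sketch of the arithmetic follows; `ack_delay_ns` is a hypothetical helper name, and masking the value to 5 bits is an assumption of this sketch, not something stated in the message above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: worst-case ACK delay in nanoseconds for a reported
 * local CA ACK delay value, using the 4.096 usec * 2^value encoding
 * described above. 4.096 usec == 4096 nsec, so this is a plain shift.
 * Assumption of this sketch: the value is treated as a 5-bit field.
 */
static uint64_t ack_delay_ns(unsigned int value)
{
    return (uint64_t)4096 << (value & 0x1f);
}
```

With this encoding every increment of the reported value doubles the time a remote side must allow for, which is why a high reported value has such a large effect on retry timeouts.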
If the ACK delays on both sides are not being taken into account properly when establishing a connection, then I guess that is a bug in our CM. - R. From rdreier at cisco.com Tue Apr 24 11:36:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 11:36:07 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: <000201c782df$8f002de0$07fd070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 19 Apr 2007 17:05:08 -0700") References: <000201c782df$8f002de0$07fd070a@amr.corp.intel.com> Message-ID: I still need to think about the overall approach, and read through Michael and Jason's comments, but one quick note on the code itself: > +static struct miscdevice local_sa_misc = { > + .minor = MISC_DYNAMIC_MINOR, > + .name = "ib_local_sa", > +}; I don't understand why you're registering a miscdevice etc. I don't see any implementation of a character device or indeed any userspace interface at all. So what's up here? From rdreier at cisco.com Tue Apr 24 11:37:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 11:37:47 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070422103704.GE26791@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 22 Apr 2007 13:37:04 +0300") References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> <20070422103704.GE26791@mellanox.co.il> Message-ID: > > I would like to get feedback on both the notice and local_sa patches for > > inclusion in 2.6.22 or 2.6.23 (if 2.6.22 is not possible). > > Since OFED includes a significantly different version of this code > (without notices), and this is the first time the notices code > makes an appearance, I think that targeting .23, and considering > alternative options such as the above, would be more prudent. Given that the 2.6.22 merge window will probably open any day now, I have to agree that holding off until 2.6.23 is a better idea. - R. 
From rowland at cse.ohio-state.edu Tue Apr 24 11:38:34 2007 From: rowland at cse.ohio-state.edu (Shaun Rowland) Date: Tue, 24 Apr 2007 14:38:34 -0400 Subject: [ofa-general] Re: [ewg] OFED 1.2 April 16 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> References: <46231441.6050507@mellanox.co.il> <46238FC0.40906@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> Message-ID: <462E4EAA.5090606@cse.ohio-state.edu> Tziporet Koren wrote: > OFED 1.2 April 16 meeting summary > > Main decisions: > 1. RC2 will be ready on the US morning of Wed April 18. (RC2 date is > derived from Intel schedule for the 256 nodes cluster) > 2. RC3 due date is April 26. > 3. All release notes and other documents should be ready for RC3 Hi Tziporet. I am not sure who handles the release notes, but we need to add release notes for MVAPICH2. I've attached an mvapich2_release_notes.txt file that is similar to the files in the docs/ subdirectory of the OFED RC2 release. Can this be added? -- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: mvapich2_release_notes.txt URL: From shubbell at dbresearch.net Tue Apr 24 11:42:32 2007 From: shubbell at dbresearch.net (Sean Hubbell) Date: Tue, 24 Apr 2007 13:42:32 -0500 Subject: [ofa-general] Multicast Question In-Reply-To: <1177421855.28021.102598.camel@hal.voltaire.com> References: <4624D086.3010100@dbresearch.net> <1177421855.28021.102598.camel@hal.voltaire.com> Message-ID: <462E4F98.1010307@dbresearch.net> Hal, Nice to hear from you again. Currently when I ping the multicast "all" address, I do not get a reply. I'll start checking here on my side for the problem. 
Thanks, Sean Hal Rosenstock wrote: > Hi Sean, > > On Tue, 2007-04-17 at 09:49, Sean Hubbell wrote: > >> Hello, >> >> I was wondering if I ping or ibping the 224.0.0.1 address should I be >> receiving a list of the nodes that have multicast enabled? >> > > ibping takes a LID or GUID (and not an IP address) as an argument so > this is not supported by ibping. > > 224.0.0.1 is the all (multicast) hosts on the subnet so I would expect > only those hosts supporting IPmc to respond and they may indicate some > other IP interface other than IPoIB. > > -- Hal > > >> Thanks in advance, >> >> Sean >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > > -- Sean Hubbell dBSTM Product Manager / Technical Director deciBel Research, Inc. (256) 489-6198 (Work) (256) 426-8957 (Cell) From rdreier at cisco.com Tue Apr 24 11:48:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 11:48:14 -0700 Subject: [ofa-general] Multicast Question In-Reply-To: <462E4F98.1010307@dbresearch.net> (Sean Hubbell's message of "Tue, 24 Apr 2007 13:42:32 -0500") References: <4624D086.3010100@dbresearch.net> <1177421855.28021.102598.camel@hal.voltaire.com> <462E4F98.1010307@dbresearch.net> Message-ID: > Nice to hear from you again. Currently when I ping the multicast > "all" address, I do not get a reply. I'll start checking here on my > side for the problem. Checking /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts is probably a good place to start. - R. 
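A minimal sketch of the check suggested above, assuming a Linux host with procfs mounted. The helper name `read_int_file` is hypothetical; it only reads one integer from a file such as /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single integer from a procfs/sysctl-style
 * file. Returns 0 on success and stores the value in *out; returns -1 if
 * the file cannot be opened or does not start with an integer.
 */
static int read_int_file(const char *path, int *out)
{
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);

    f = fopen(path, "r");
    if (!f)
        return -1;

    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling `read_int_file("/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts", &v)` and finding a non-zero `v` would mean the host ignores ICMP echo requests sent to broadcast and multicast addresses, which would explain the missing replies to a ping of 224.0.0.1.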
From mshefty at ichips.intel.com Tue Apr 24 11:55:26 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Apr 2007 11:55:26 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: References: <000201c782df$8f002de0$07fd070a@amr.corp.intel.com> Message-ID: <462E529E.2030604@ichips.intel.com> > > +static struct miscdevice local_sa_misc = { > > + .minor = MISC_DYNAMIC_MINOR, > > + .name = "ib_local_sa", > > +}; > > I don't understand why you're registering a miscdevice etc. I don't > see any implementation of a character device or indeed any userspace > interface at all. So what's up here? The cache creates the following files: /sys/class/misc/ib_local_sa/paths_per_dest /sys/class/misc/ib_local_sa/refresh /sys/class/misc/ib_local_sa/lookup_method The intent is to allow changing the cache parameters dynamically and for an administrator to force a manual refresh of the cache. The cache settings are currently global, rather than per port. - Sean From jsquyres at cisco.com Tue Apr 24 11:57:56 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 24 Apr 2007 14:57:56 -0400 Subject: [ofa-general] OFED 1.2 April 16 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> References: <46231441.6050507@mellanox.co.il> <46238FC0.40906@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> Message-ID: <952BEF00-5BBD-43E5-A297-C21FA3364773@cisco.com> On Apr 16, 2007, at 4:33 PM, Tziporet Koren wrote: > OFED 1.2 April 16 meeting summary > > Main decisions: > 1. RC2 will be ready on the US morning of Wed April 18. (RC2 date > is derived from Intel schedule for the 256 nodes cluster) > 2. RC3 due date is April 26. > 3. All release notes and other documents should be ready for RC3 Attached are the Open MPI v1.2.1 release notes with one OFED-specific note. 
Open MPI v1.2.1 is on track to be released tomorrow morning (some last minute issues came up preventing its release last week); I'll be uploading a new SRPM to openfabrics.org when it's available. -- Jeff Squyres Cisco Systems -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ofed-openmpi-release-notes.txt URL: -------------- next part -------------- From xma at us.ibm.com Tue Apr 24 12:25:57 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 24 Apr 2007 12:25:57 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: Message-ID: Hello Roland, > If the ACK delays on both sides are not being taken into account > properly when establishing a connection, then I guess that is a bug in > our CM. > > - R. So for each IPoIB connection, the ACK delays could be different from remote. Then how TCP retransmission timeout have a corresponding value? Thanks Shirley Ma IBM Linux Technology Center -------------- next part -------------- An HTML attachment was scrubbed... URL:
From suri at baymicrosystems.com Tue Apr 24 13:09:02 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Tue, 24 Apr 2007 16:09:02 -0400 Subject: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums ratherthan magic return codes In-Reply-To: <1176401777.4545.119096.camel@hal.voltaire.com> References: <1175527446.4436.16721.camel@localhost.localdomain> <1176401777.4545.119096.camel@hal.voltaire.com> Message-ID: <01f501c786ac$6b92fb00$1914a8c0@surioffice> Hal/Roland: Since we haven't heard any objections so far, can this go in to 2.6.22?
I would like to get the patch for switch operation (pending on this getting committed) to at least 2.6.23! Many thanks, Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Hal Rosenstock > Sent: Thursday, April 12, 2007 2:16 PM > To: Roland Dreier > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums ratherthan magic return codes > > On Thu, 2007-04-12 at 13:36, Roland Dreier wrote: > > Definitely a big improvement to readability. However, I don't like > > the "smi_type" name, since the enum is not really a type but rather an > > action: > > > > > +enum smi_type { > > > + IB_SMI_DISCARD, > > > + IB_SMI_HANDLE > > > +}; > > > + > > > +enum smi_forward_type { > > > + IB_SMI_LOCAL, /* SMP should be completed up the stack */ > > > + IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ > > > +}; > > > > Is it OK if I do s/smi_type/smi_action/ and s/smi_forward_type/smi_forward_action/ > > before applying this? > > Sure; that's an improvement. > > My one other comment with the patch was about testing relative to iPath > and perhaps eHCA. I think things should work but it would be best if > someone verify this. > > -- Hal > > > - R. 
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Apr 24 13:12:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 13:12:36 -0700 Subject: [ofa-general] [PATCHv2] IB/mad: Change SMI to use enums ratherthan magic return codes In-Reply-To: <01f501c786ac$6b92fb00$1914a8c0@surioffice> (Suresh Shelvapille's message of "Tue, 24 Apr 2007 16:09:02 -0400") References: <1175527446.4436.16721.camel@localhost.localdomain> <1176401777.4545.119096.camel@hal.voltaire.com> <01f501c786ac$6b92fb00$1914a8c0@surioffice> Message-ID: > Since we haven't heard any objections so far, can this go in to 2.6.22? Yes, as I said, I already queued this for 2.6.22. From pradeep at us.ibm.com Tue Apr 24 14:09:57 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 24 Apr 2007 14:09:57 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: Message-ID: Thanks for the clarifications Roland. There is something that I am still missing- I presume the Local CA Ack Delay is common across all QPs in the HCA and the Local Ack Timeout is specific to each QP. Is that correct? I tried to change the ib_qp_attr .timeout value (this is the Local Ack Timeout -right?) to 0xf as the QP transitions from RTR to RTS (page 569 IB Spec) . A subsequent ib_query_qp() tells me that timeout = 0. This happens on both ehca and mthca. There may be a CM bug, but I am guessing somthing else is incorrect too. I have not yet narrowed that down. Pradeep pradeep at us.ibm.com Roland Dreier wrote on 04/24/2007 11:33:25 AM: > > As previously stated, IBM HCA will address these issues. However, > > my understanding is that mthca/Topspin adapters also have a problem > > (too high a value for the Local CA Delay Ack). 
Both HCAs need to be > > fixed for good interoperability. > > I think you're misunderstanding what local CA ack delay means. This > is a property of an HCA that is not (necessarily) subject to tuning -- > it is just a property of the HCA, namely the maximum amount of time it > may take to generate an ACK. > > So if a certain HCA reports a value of 15, then that means that any > remote HCA talking to it must be prepared for a delay of 4.096 * 2^15 > usecs before receiving an ACK. > > If the ACK delays on both sides are not being taken into account > properly when establishing a connection, then I guess that is a bug in > our CM. > > - R. From kliteyn at dev.mellanox.co.il Tue Apr 24 14:29:18 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 00:29:18 +0300 Subject: [ofa-general] Re: [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <1177430573.12163.4755.camel@hal.voltaire.com> References: <4624DCFE.9030904@dev.mellanox.co.il> <1177430573.12163.4755.camel@hal.voltaire.com> Message-ID: <462E76AE.8060504@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Tue, 2007-04-17 at 10:43, Yevgeny Kliteynik wrote: >> Hi Hal, >> >> When parsing guid2lid file, invalid guid string >> ended up unpacked as guid 0x0. Ignoring line with >> invalid guid string. > > Out of curiosity, was the invalid string hand edited as a test case? Correct - the invalid string was generated by a test. -- Yevgeny >> This bug doesn't look too important - don't think >> that it should go to ofed_1_2. Anyway, your call. > > Agreed if I understood how this occurs (caused by an artificial test > case). > >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik > > Thanks. Applied (to master only).
> > -- Hal > > From mshefty at ichips.intel.com Tue Apr 24 14:49:31 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Apr 2007 14:49:31 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: References: Message-ID: <462E7B6B.5060405@ichips.intel.com> > If the ACK delays on both sides are not being taken into account > properly when establishing a connection, then I guess that is a bug in > our CM. I looked, and the cm does not take into account the ca ack delay. This can be worked around by bumping up the qp timeout value between calling ib_cm_init_qp_attr() and ib_modify_qp(), or by increasing the path record packet_life_time. - Sean From kliteyn at dev.mellanox.co.il Tue Apr 24 15:08:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 01:08:29 +0300 Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <1177429015.12163.3149.camel@hal.voltaire.com> References: <4624DCFE.9030904@dev.mellanox.co.il> <20070417233935.GB29254@sashak.voltaire.com> <4625B538.3080207@dev.mellanox.co.il> <1177429015.12163.3149.camel@hal.voltaire.com> Message-ID: <462E7FDD.2020508@dev.mellanox.co.il> Hal Rosenstock wrote: > On Wed, 2007-04-18 at 02:05, Yevgeny Kliteynik wrote: >> Sasha Khapyorsky wrote: >>> On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote: >>>> Hi Hal, >>>> >>>> When parsing guid2lid file, invalid guid string >>>> ended up unpacked as guid 0x0. Ignoring line with >>>> invalid guid string. >>>> >>>> This bug doesn't look too important - don't think >>>> that it should go to ofed_1_2. Anyway, your call. >>> It looks like a safe change for me. >>> >>> BTW any reason to use strtouq() instead of more popular (IMHO) strtoul() >>> or strtoull()? >> No particular reason. 
>> It specifically says that the function "convert string to an unsigned >> 64-bit integer" instead of unsigned long or unsigned long long, but >> on the other hand it doesn't matter, because uint64_t is a typedef anyway. >> If you have special sentiments about strtoul/strtoull - feel free to change it. > > Is strtouq supported in Windows ? Good point. I didn't see strtouq mentioned in MSDN - it has __strtoui64 that does the same job. I don't have the Windows machine to try and compile strtouq right now, but I think that we should stick to strtoul as Sasha has suggested - I found it in the MSDN, so it would work for sure. Hal, Can you change it as you apply the patch, or do you want me to issue a new one? -- Yevgeny > -- Hal > >> -- Yevgeny >> >> >>> Sasha >>> >>>> -- Yevgeny >>>> >>>> Signed-off-by: Yevgeny Kliteynik >>>> --- >>>> osm/opensm/osm_db_files.c | 15 ++++++++++++--- >>>> 1 files changed, 12 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c >>>> index dbadd68..23eaa0b 100644 >>>> --- a/osm/opensm/osm_db_files.c >>>> +++ b/osm/opensm/osm_db_files.c >>>> @@ -294,6 +294,7 @@ osm_db_restore( >>>> char *p_first_word, *p_rest_of_line, *p_last; >>>> char *p_key = NULL; >>>> char *p_prev_val, *p_accum_val = NULL; >>>> + char *endptr = NULL; >>>> unsigned int line_num; >>>> >>>> OSM_LOG_ENTER( p_log, osm_db_restore ); >>>> @@ -415,12 +416,20 @@ osm_db_restore( >>>> p_prev_val = NULL; >>>> } >>>> >>>> - /* store our key and value */ >>>> - st_insert(p_domain_imp->p_hash, >>>> - (st_data_t)p_key, (st_data_t)p_accum_val); >>>> osm_log( p_log, OSM_LOG_DEBUG, >>>> "osm_db_restore: " >>>> "Got key:%s value:%s\n", p_key, p_accum_val); >>>> + >>>> + /* check that the key is a number */ >>>> + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') >>>> + osm_log( p_log, OSM_LOG_ERROR, >>>> + "osm_db_restore: ERR 610B: " >>>> + "Key:%s is invalid\n", >>>> + p_key); >>>> + else >>>> + /* store our key and value */ >>>> + 
st_insert(p_domain_imp->p_hash, >>>> + (st_data_t)p_key, (st_data_t)p_accum_val); >>>> } >>>> else >>>> { >>>> -- >>>> 1.4.4.1.GIT >>>> >>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general > > From kliteyn at dev.mellanox.co.il Tue Apr 24 15:11:47 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 01:11:47 +0300 Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf() In-Reply-To: <20070423101738.GG4579@mellanox.co.il> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> Message-ID: <462E80A3.5060503@dev.mellanox.co.il> Michael S. Tsirkin wrote: >> Quoting Yevgeny Kliteynik : >> Subject: [PATCH] osm: source and destination strings overlap when using sprintf() >> >> Hi Hal, >> >> Fixing a problematic usage of sprintf() in osm_helper.c: >> >> When using sprintf(), source and destination strings should >> not overlap, otherwise the function behavior is undefined. >> >> Please apply to ofed_1_2 and to master. >> >> Thanks. >> >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> osm/opensm/osm_helper.c | 47 >> ++++++++++++++++++++++++++++++++++------------- >> 1 files changed, 34 insertions(+), 13 deletions(-) > > .... skip ... > >> for (i = 0; i < 32; i++) >> - sprintf( buf_line2,"%s 0x%01X |", >> - buf_line2, p_vla_tbl->vl_entry[i].weight); >> + { >> + sprintf( tmp_buf_line," 0x%01X |", >> + p_vla_tbl->vl_entry[i].weight); >> + strcat( buf_line2, tmp_buf_line ); >> + } >> osm_log( p_log, log_level, >> "VlArb dump:\n" >> "\t\t\tport_guid...........0x%016" PRIx64 "\n" > > These tmp-bufs are quite ugly, and bloat the code up. 
> Since you seem to do a strcat which does an anyway, how about, for example: > > - sprintf( buf_line1,"%s 0x%01x |", > - buf_line1, p_vla_tbl->vl_entry[i].vl); > + sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |", > + p_vla_tbl->vl_entry[i].vl); > > and so on in all the other places? Agree. I'll send a new patch later. -- Yevgeny From halr at voltaire.com Tue Apr 24 15:13:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2007 18:13:28 -0400 Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <462E7FDD.2020508@dev.mellanox.co.il> References: <4624DCFE.9030904@dev.mellanox.co.il> <20070417233935.GB29254@sashak.voltaire.com> <4625B538.3080207@dev.mellanox.co.il> <1177429015.12163.3149.camel@hal.voltaire.com> <462E7FDD.2020508@dev.mellanox.co.il> Message-ID: <1177452806.16495.21271.camel@hal.voltaire.com> On Tue, 2007-04-24 at 18:08, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > On Wed, 2007-04-18 at 02:05, Yevgeny Kliteynik wrote: > >> Sasha Khapyorsky wrote: > >>> On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote: > >>>> Hi Hal, > >>>> > >>>> When parsing guid2lid file, invalid guid string > >>>> ended up unpacked as guid 0x0. Ignoring line with > >>>> invalid guid string. > >>>> > >>>> This bug doesn't look too important - don't think > >>>> that it should go to ofed_1_2. Anyway, your call. > >>> It looks like a safe change for me. > >>> > >>> BTW any reason to use strtouq() instead of more popular (IMHO) strtoul() > >>> or strtoull()? > >> No particular reason. > >> It specifically says that the function "convert string to an unsigned > >> 64-bit integer" instead of unsigned long or unsigned long long, but > >> on the other hand it doesn't matter, because uint64_t is a typedef anyway. > >> If you have special sentiments about strtoul/strtoull - feel free to change it. > > > > Is strtouq supported in Windows ? > > Good point. 
> I didn't see strtouq mentioned in MSDN - it has __strtoui64 that does the same job. > I don't have the Windows machine to try and compile strtouq right now, but I think > that we should stick to strtoul as Sasha has suggested - I found it in the MSDN, so > it would work for sure. > > Hal, > Can you change it as you apply the patch, or do you want me to issue a new one? I've already applied it so the best thing would be an incremental patch from using strtouq to strtoul or strtoull. Thanks. -- Hal > -- Yevgeny > > > -- Hal > > > >> -- Yevgeny > >> > >> > >>> Sasha > >>> > >>>> -- Yevgeny > >>>> > >>>> Signed-off-by: Yevgeny Kliteynik > >>>> --- > >>>> osm/opensm/osm_db_files.c | 15 ++++++++++++--- > >>>> 1 files changed, 12 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c > >>>> index dbadd68..23eaa0b 100644 > >>>> --- a/osm/opensm/osm_db_files.c > >>>> +++ b/osm/opensm/osm_db_files.c > >>>> @@ -294,6 +294,7 @@ osm_db_restore( > >>>> char *p_first_word, *p_rest_of_line, *p_last; > >>>> char *p_key = NULL; > >>>> char *p_prev_val, *p_accum_val = NULL; > >>>> + char *endptr = NULL; > >>>> unsigned int line_num; > >>>> > >>>> OSM_LOG_ENTER( p_log, osm_db_restore ); > >>>> @@ -415,12 +416,20 @@ osm_db_restore( > >>>> p_prev_val = NULL; > >>>> } > >>>> > >>>> - /* store our key and value */ > >>>> - st_insert(p_domain_imp->p_hash, > >>>> - (st_data_t)p_key, (st_data_t)p_accum_val); > >>>> osm_log( p_log, OSM_LOG_DEBUG, > >>>> "osm_db_restore: " > >>>> "Got key:%s value:%s\n", p_key, p_accum_val); > >>>> + > >>>> + /* check that the key is a number */ > >>>> + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') > >>>> + osm_log( p_log, OSM_LOG_ERROR, > >>>> + "osm_db_restore: ERR 610B: " > >>>> + "Key:%s is invalid\n", > >>>> + p_key); > >>>> + else > >>>> + /* store our key and value */ > >>>> + st_insert(p_domain_imp->p_hash, > >>>> + (st_data_t)p_key, (st_data_t)p_accum_val); > >>>> } > >>>> else > >>>> { > 
>>>> -- > >>>> 1.4.4.1.GIT > >>>> > >>>> > >>>> _______________________________________________ > >>>> general mailing list > >>>> general at lists.openfabrics.org > >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>> > >>>> To unsubscribe, please visit > >>>> http://openib.org/mailman/listinfo/openib-general > > > > > From kliteyn at dev.mellanox.co.il Tue Apr 24 15:18:16 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 01:18:16 +0300 Subject: [ofa-general] [PATCH] osm: ignore line with invalid guid in guid2lid file In-Reply-To: <1177452806.16495.21271.camel@hal.voltaire.com> References: <4624DCFE.9030904@dev.mellanox.co.il> <20070417233935.GB29254@sashak.voltaire.com> <4625B538.3080207@dev.mellanox.co.il> <1177429015.12163.3149.camel@hal.voltaire.com> <462E7FDD.2020508@dev.mellanox.co.il> <1177452806.16495.21271.camel@hal.voltaire.com> Message-ID: <462E8228.7000302@dev.mellanox.co.il> Hal Rosenstock wrote: > On Tue, 2007-04-24 at 18:08, Yevgeny Kliteynik wrote: >> Hal Rosenstock wrote: >>> On Wed, 2007-04-18 at 02:05, Yevgeny Kliteynik wrote: >>>> Sasha Khapyorsky wrote: >>>>> On 17:43 Tue 17 Apr , Yevgeny Kliteynik wrote: >>>>>> Hi Hal, >>>>>> >>>>>> When parsing guid2lid file, invalid guid string >>>>>> ended up unpacked as guid 0x0. Ignoring line with >>>>>> invalid guid string. >>>>>> >>>>>> This bug doesn't look too important - don't think >>>>>> that it should go to ofed_1_2. Anyway, your call. >>>>> It looks like a safe change for me. >>>>> >>>>> BTW any reason to use strtouq() instead of more popular (IMHO) strtoul() >>>>> or strtoull()? >>>> No particular reason. >>>> It specifically says that the function "convert string to an unsigned >>>> 64-bit integer" instead of unsigned long or unsigned long long, but >>>> on the other hand it doesn't matter, because uint64_t is a typedef anyway. >>>> If you have special sentiments about strtoul/strtoull - feel free to change it. 
>>> Is strtouq supported in Windows ? >> Good point. >> I didn't see strtouq mentioned in MSDN - it has __strtoui64 that does the same job. >> I don't have the Windows machine to try and compile strtouq right now, but I think >> that we should stick to strtoul as Sasha has suggested - I found it in the MSDN, so >> it would work for sure. >> >> Hal, >> Can you change it as you apply the patch, or do you want me to issue a new one? > > I've already applied it so the best thing would be an incremental patch > from using strtouq to strtoul or strtoull. Right, I saw it after answering your prev. mail. Anyway, I'll issue an incremental patch later. Thanks. -- Yevgeny > Thanks. > > -- Hal > > >> -- Yevgeny >> >>> -- Hal >>> >>>> -- Yevgeny >>>> >>>> >>>>> Sasha >>>>> >>>>>> -- Yevgeny >>>>>> >>>>>> Signed-off-by: Yevgeny Kliteynik >>>>>> --- >>>>>> osm/opensm/osm_db_files.c | 15 ++++++++++++--- >>>>>> 1 files changed, 12 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c >>>>>> index dbadd68..23eaa0b 100644 >>>>>> --- a/osm/opensm/osm_db_files.c >>>>>> +++ b/osm/opensm/osm_db_files.c >>>>>> @@ -294,6 +294,7 @@ osm_db_restore( >>>>>> char *p_first_word, *p_rest_of_line, *p_last; >>>>>> char *p_key = NULL; >>>>>> char *p_prev_val, *p_accum_val = NULL; >>>>>> + char *endptr = NULL; >>>>>> unsigned int line_num; >>>>>> >>>>>> OSM_LOG_ENTER( p_log, osm_db_restore ); >>>>>> @@ -415,12 +416,20 @@ osm_db_restore( >>>>>> p_prev_val = NULL; >>>>>> } >>>>>> >>>>>> - /* store our key and value */ >>>>>> - st_insert(p_domain_imp->p_hash, >>>>>> - (st_data_t)p_key, (st_data_t)p_accum_val); >>>>>> osm_log( p_log, OSM_LOG_DEBUG, >>>>>> "osm_db_restore: " >>>>>> "Got key:%s value:%s\n", p_key, p_accum_val); >>>>>> + >>>>>> + /* check that the key is a number */ >>>>>> + if (!strtouq(p_key,&endptr,0) && *endptr != '\0') >>>>>> + osm_log( p_log, OSM_LOG_ERROR, >>>>>> + "osm_db_restore: ERR 610B: " >>>>>> + "Key:%s is invalid\n", 
>>>>>> + p_key); >>>>>> + else >>>>>> + /* store our key and value */ >>>>>> + st_insert(p_domain_imp->p_hash, >>>>>> + (st_data_t)p_key, (st_data_t)p_accum_val); >>>>>> } >>>>>> else >>>>>> { >>>>>> -- >>>>>> 1.4.4.1.GIT >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> general mailing list >>>>>> general at lists.openfabrics.org >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>> >>>>>> To unsubscribe, please visit >>>>>> http://openib.org/mailman/listinfo/openib-general >>> > > From rick.jones2 at hp.com Tue Apr 24 15:19:03 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Tue, 24 Apr 2007 15:19:03 -0700 Subject: [ofa-general] clueless noob and build probs with 1.2rc2 Message-ID: <462E8257.9090103@hp.com> So, to get started with SDP and perhaps RDS tests for netperf I installed RHEL5 on a pair of HP rx2660s - Itanium systems - with some dual-port 4x fabric adaptors installed. I grabbed 1.2rc2 bits, managed to get the pre-reqs (ostensibly) onto the system, and blythly typed-in ./install.sh. I took the defaults for everything but multi-thread support (I said 'y') and it churned away for a while and got grumpy with me. I have the build log (not attached just yet, 6K lines long) and noticed a few things: warning: user vlad does not exist - using root warning: group vlad does not exist - using root warning: user vlad does not exist - using root warning: group vlad does not exist - using root it then churned for a while and gave me: gcc -DHAVE_CONFIG_H -I. -I. -I. 
-I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -I../../dat/include -Wall -g -D_GNU_SOURCE -DOS_RELEASE=131078 -DREDHAT_EL4 -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -g -O2 -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF .deps/dapl_udapl_libdaplcma_la -dapl_init.Tpo -c dapl/udapl/dapl_init.c -fPIC -DPIC -o .libs/dapl_udapl_libdap lcma_la-dapl_init.o In file included from ./dapl/include/dapl.h:50, from dapl/udapl/dapl_init.c:39: ./dapl/udapl/linux/dapl_osd.h:82:24: error: asm/atomic.h: No such file or directory In file included from ./dapl/include/dapl.h:50, from dapl/udapl/dapl_init.c:39: ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_inc': ./dapl/udapl/linux/dapl_osd.h:163: warning: implicit declaration of function 'IA 64_FETCHADD' ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_assign': ./dapl/udapl/linux/dapl_osd.h:237: warning: implicit declaration of function 'ia64_cmpxchg' make[3]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make: *** [dapl] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.55971 (%install) Is this simply a case of: Patient: Doctor! Doctor! It hurts when I do this! Doctor: Then don't do that! and I should decline one of the options in ./install.sh, or have I stumbled over a meaningful problem? 
rick jones mr netperf From rdreier at cisco.com Tue Apr 24 15:23:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Apr 2007 15:23:05 -0700 Subject: [ofa-general] clueless noob and build probs with 1.2rc2 In-Reply-To: <462E8257.9090103@hp.com> (Rick Jones's message of "Tue, 24 Apr 2007 15:19:03 -0700") References: <462E8257.9090103@hp.com> Message-ID: > ./dapl/udapl/linux/dapl_osd.h:82:24: error: asm/atomic.h: No such file or directory > In file included from ./dapl/include/dapl.h:50, > from dapl/udapl/dapl_init.c:39: > ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_inc': > ./dapl/udapl/linux/dapl_osd.h:163: warning: implicit declaration of function 'IA64_FETCHADD' I seem to recall udapl does some very bogus things with and atomic operations in general. Probably the easiest solution on ia64 is just to disable udapl. (Not sure how to do that because I don't really work with the OFED build system) - R. From ardavis at ichips.intel.com Tue Apr 24 17:15:58 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 24 Apr 2007 17:15:58 -0700 Subject: [ofa-general] clueless noob and build probs with 1.2rc2 In-Reply-To: References: <462E8257.9090103@hp.com> Message-ID: <462E9DBE.1070803@ichips.intel.com> Roland Dreier wrote: > > ./dapl/udapl/linux/dapl_osd.h:82:24: error: asm/atomic.h: No such file or directory > > In file included from ./dapl/include/dapl.h:50, > > from dapl/udapl/dapl_init.c:39: > > ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_inc': > > ./dapl/udapl/linux/dapl_osd.h:163: warning: implicit declaration of function 'IA64_FETCHADD' > >I seem to recall udapl does some very bogus things with >and atomic operations in general. > > The built-in atomics in gcc had some ia64 issues until gcc 4.1.1 so we had no choice but to get down and do some bogus things. :^) Turns out Redhat 5 removed asm/atomic.h to force users back to the built-in atomics which is fine given that Redhat 5 provides gcc 4.1.1. 
We will go back and revisit all the hacks. In the short term, for OFED 1.2, we may have to simply add another conditional in the build (configure.in) to handle the Redhat 5 case. James, do you have any input here? -arlin From mst at dev.mellanox.co.il Tue Apr 24 21:05:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Apr 2007 07:05:00 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: <462E7B6B.5060405@ichips.intel.com> References: <462E7B6B.5060405@ichips.intel.com> Message-ID: <20070425040500.GB20023@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review > > >If the ACK delays on both sides are not being taken into account > >properly when establishing a connection, then I guess that is a bug in > >our CM. > > I looked, and the cm does not take into account the ca ack delay. This can > be worked around by bumping up the qp timeout value between calling > ib_cm_init_qp_attr() and ib_modify_qp(), or by increasing the path record > packet_life_time. What really should happen is that the field Local Ack Timeout in REQ should be (2 * PacketLifeTime + Local CA’s ACK delay) (see 12.7.34) and then the responder should use this for it's QP. This does not sound too hard - why can't we just fix CM to do this, then? -- MST From vlad at mellanox.co.il Tue Apr 24 23:52:35 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 25 Apr 2007 09:52:35 +0300 Subject: [ofa-general] build problem on kernel-ib-devel package In-Reply-To: <1177362903.23314.55.camel@stevo-desktop> References: <1177362903.23314.55.camel@stevo-desktop> Message-ID: <1177483955.6497.3.camel@vladsk-laptop> On Mon, 2007-04-23 at 16:15 -0500, Steve Wise wrote: > Vlad, > > I'm trying to build the src tree that is installed when you install the > kernel-ib-devel package and I'm hitting a problem. 
iw_cxgb3 fails to > load because the ib_core module doesn't have the genalloc code included > in it. I think the Makefile in drivers/infininband/core didn't get > patched by > kernel_patches/backport/2.6.20/linux_genalloc_to_2.6.20.patch. > > Q: Should this tree be a fully configured tree that I can just do a > 'make install' in? Because if it is, then something is broken... > > Steve. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general Hi Steve, You should run configure script first. It will apply required patches. You can take required options for configure script from /etc/infiniband/info. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From monil at voltaire.com Wed Apr 25 01:13:34 2007 From: monil at voltaire.com (Moni Levy) Date: Wed, 25 Apr 2007 11:13:34 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070419203705.GA613@mellanox.co.il> References: <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> Message-ID: <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> On 4/19/07, Michael S.
Tsirkin wrote: > > Quoting Roland Dreier : > > Subject: Re: pkey change handling patch > > > > > So since all this thread was started by Moni because of IPoIB, > > > the path is clear in that respect, and would already be a step in the > > > right direction: > > > > > > - a patch to add ib_find_pkey() and ib_find_gid() to core > > > - a patch to replace cache usage in IPoIB / SRP with uncached > > > hardware accesses on top of this > > > - pkey change handling patch on top of these > > > > Makes good sense to me. > > OK, let's do this for starters. Moni? Before getting to the implementation phase, I would like to get your opinion on two more things: 1. Direct access in ib_find_pkey will probably hurt the RC connections-per-second rate. 2. What do you think about OrG's opinion (I'm copying it from the other thread): Roland, Michael, Please note that there is quite a big difference between UD-based and RC-based IB ULPs with respect to how they are affected by using a wrong pkey index at their QP. In the RC case, the receiving side's transport level would not get any packets and hence would not send ACKs etc.; at some point the sending side would get a completion with error and retry the connection. In the UD case, nothing other than pkey-violation counters/traps etc. would happen unless both sides re-initiate their QPs (this is exactly what Moni is doing in IPoIB in the patch that followed). Hence, it is extremely important that UD-based ULPs react to the async event of a pkey change, and retry reading the pkey from the cache when getting ESTALE or any other error code from the cache. For the RC case, note that a) the connection would not break if the change did not involve the index of the pkey used for it, and b) once the connection breaks and is re-initiated by the ULP, the cache would be very much --already updated--. So the only case which might be problematic with a patch that does not change the RC ULPs (and CM) code is when the cache changes in the exact millisecond you set up your RC connection. I don't think the IB portion of the ULP code has to be changed other than sensing the ESTALE error and propagating it up. Higher layers would retry the connection and we are done. Anyway, thanks for bringing all this up! While thinking about it, I realized that the RDMA CM can (should) be enhanced to register for async events and, on a pkey change event, issue a disconnect event on the relevant UD unicast IDs and a multicast error event on the relevant UD multicast IDs. -- Moni > > -- > MST > From mst at dev.mellanox.co.il Wed Apr 25 01:23:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Apr 2007 11:23:43 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> References: <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> Message-ID: <20070425082343.GI28982@mellanox.co.il> > Quoting Moni Levy : > Subject: Re: pkey change handling patch > > On 4/19/07, Michael S. Tsirkin wrote: > >> Quoting Roland Dreier : > >> Subject: Re: pkey change handling patch > >> > >> > So since all this thread was started by Moni because of IPoIB, > >> > the path is clear in that respect, and would already be a step in the > >> > right direction: > >> > > >> > - a patch to add ib_find_pkey() and ib_find_gid() to core > >> > - a patch to replace cache usage in IPoIB / SRP with uncached > >> > hardware accesses on top of this > >> > - pkey change handling patch on top of these > >> > >> Makes good sense to me. > > > >OK, let's do this for starters. Moni?
> > Before getting to the implementation phase, I would like to get your > opinion on two more things: > > 1. Direct access in ib_find_pkey will probably heart RC connections > per second rate. I don't think this matters much for IPoIB CM (likely to be dwarfed by CM handshake times). Long-term, I think providers can cache the pkey (without coherency issues, since all events go through the provider) if necessary. > 2. What do you think about OrG's opinion (I'm copying it from the other > thread): skip > So the only case which might be problematic with a patch that does not > change the RC ULPs (and CM) code is when in the exact millisecond you > set your RC connection the cache changes. I don't think the IB portion > of the ULP code has to be changed other then sensing the ESTALE error > and propagating it up. Higher layers would retry the connection and we > are done. One can argue about this, but since we decided we want to get rid of the cache, the point is moot I think? -- MST From vlad at lists.openfabrics.org Wed Apr 25 02:36:17 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 25 Apr 2007 02:36:17 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070425-0200 daily build status Message-ID: <20070425093617.3D6AAE60822@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 
Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From kliteyn at dev.mellanox.co.il Wed Apr 25 05:10:25 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 15:10:25 +0300 Subject: [ofa-general] [PATCHv2] osm: source and destination strings overlap when using sprintf() In-Reply-To: <462C7C21.7010004@dev.mellanox.co.il> References: <462C7C21.7010004@dev.mellanox.co.il> Message-ID:
<462F4531.9060302@dev.mellanox.co.il> Hi Hal, [V2] - Fixing a problematic usage of sprintf() in osm_helper.c: When using sprintf(), source and destination strings should not overlap, otherwise the function behavior is undefined. Please apply to ofed_1_2 and to master. Thanks. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_helper.c | 26 +++++++++++++------------- 1 files changed, 13 insertions(+), 13 deletions(-) diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c index 14474e7..483b73b 100644 --- a/osm/opensm/osm_helper.c +++ b/osm/opensm/osm_helper.c @@ -1157,9 +1157,9 @@ osm_dump_multipath_record( { for (i = 0; i < p_mpr->sgid_count; i++) { - sprintf( buf_line, "%s\t\t\t\tsgid%02d.................." + sprintf( buf_line + strlen(buf_line), "\t\t\t\tsgid%02d.................." "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), + i + 1, cl_ntoh64( p_gid->unicast.prefix ), cl_ntoh64( p_gid->unicast.interface_id ) ); p_gid++; } @@ -1168,9 +1168,9 @@ osm_dump_multipath_record( { for (i = 0; i < p_mpr->dgid_count; i++) { - sprintf( buf_line, "%s\t\t\t\tdgid%02d.................." + sprintf( buf_line + strlen(buf_line), "\t\t\t\tdgid%02d.................." 
"0x%016" PRIx64 " : 0x%016" PRIx64 "\n", - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), + i + 1, cl_ntoh64( p_gid->unicast.prefix ), cl_ntoh64( p_gid->unicast.interface_id ) ); p_gid++; } @@ -1657,8 +1657,8 @@ osm_dump_pkey_block( { buf_line[0] = '\0'; for (i = 0; i < 32; i++) - sprintf( buf_line,"%s 0x%04x |", - buf_line, cl_ntoh16(p_pkey_tbl->pkey_entry[i])); + sprintf( buf_line + strlen(buf_line)," 0x%04x |", + cl_ntoh16(p_pkey_tbl->pkey_entry[i])); osm_log( p_log, log_level, "P_Key table dump:\n" @@ -1693,10 +1693,10 @@ osm_dump_slvl_map_table( buf_line1[0] = '\0'; buf_line2[0] = '\0'; for (i = 0; i < 16; i++) - sprintf( buf_line1,"%s %-2u |", buf_line1, i); + sprintf( buf_line1 + strlen(buf_line1)," %-2u |", i); for (i = 0; i < 16; i++) - sprintf( buf_line2,"%s0x%01X |", - buf_line2, ib_slvl_table_get(p_slvl_tbl, i)); + sprintf( buf_line2 + strlen(buf_line2),"0x%01X |", + ib_slvl_table_get(p_slvl_tbl, i)); osm_log( p_log, log_level, "SLtoVL dump:\n" "\t\t\tport_guid............0x%016" PRIx64 "\n" @@ -1730,11 +1730,11 @@ osm_dump_vl_arb_table( buf_line1[0] = '\0'; buf_line2[0] = '\0'; for (i = 0; i < 32; i++) - sprintf( buf_line1,"%s 0x%01X |", - buf_line1, p_vla_tbl->vl_entry[i].vl); + sprintf( buf_line1 + strlen(buf_line1)," 0x%01X |", + p_vla_tbl->vl_entry[i].vl); for (i = 0; i < 32; i++) - sprintf( buf_line2,"%s 0x%01X |", - buf_line2, p_vla_tbl->vl_entry[i].weight); + sprintf( buf_line2 + strlen(buf_line2)," 0x%01X |", + p_vla_tbl->vl_entry[i].weight); osm_log( p_log, log_level, "VlArb dump:\n" "\t\t\tport_guid...........0x%016" PRIx64 "\n" -- 1.4.4.1.GIT From mst at dev.mellanox.co.il Wed Apr 25 05:46:52 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 25 Apr 2007 15:46:52 +0300 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> Message-ID: <20070425124652.GG1624@mellanox.co.il> > Quoting Bryan Lawver : > Subject: IPoIB forwarding > > I have a small test bed with 2 nodes with IB/OFED1.2/connected mode and a > third node which has IP only and is connected to one of the IB nodes. In > between are a DDR IB switch and a 10GE IP switch. The node with both IP and IB > interfaces is simply an IP router in this test setup. The IB-only node has > a subnet route to the router node and the IP-only node has a subnet route to > the router node. > > When I launch an Iperf test from the IB (IPoIB) node to the IP node, I get > very good throughput with no tuning (7.5 Gb/s). > > When I launch from IP to the IB node, I get virtually no throughput > (2.5 Mb/s). When I dropped the window size to 8k (iperf -w8k) the throughput > is 750 Mb/s. > > Any suggestions, ideas? Some troubleshooting tips: Are some packets lost on the router? Checking packet counters might give you a clue. Do you see any errors on one of the IB nodes? Set the debug_level=1 module parameter for ib_ipoib, and check dmesg output while running the test.
-- MST From jlentini at netapp.com Wed Apr 25 05:45:46 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 25 Apr 2007 08:45:46 -0400 (EDT) Subject: [ofa-general] clueless noob and build probs with 1.2rc2 In-Reply-To: <462E9DBE.1070803@ichips.intel.com> References: <462E8257.9090103@hp.com> <462E9DBE.1070803@ichips.intel.com> Message-ID: On Tue, 24 Apr 2007, Arlin Davis wrote: > Roland Dreier wrote: > > > > ./dapl/udapl/linux/dapl_osd.h:82:24: error: asm/atomic.h: No such file or > > directory > > > In file included from ./dapl/include/dapl.h:50, > > > from dapl/udapl/dapl_init.c:39: > > > ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_inc': > > > ./dapl/udapl/linux/dapl_osd.h:163: warning: implicit declaration of > > function 'IA64_FETCHADD' > > > > I seem to recall udapl does some very bogus things with > > and atomic operations in general. > > > > The built-in atomics in gcc had some ia64 issues until gcc 4.1.1 so we had no > choice but to get down and do some bogus things. :^) > > Turns out Redhat 5 removed asm/atomic.h to force users back to the built-in > atomics which is fine given that Redhat 5 provides gcc 4.1.1. We will go back > and revisit all the hacks. In the short term, for OFED 1.2, we may have to > simply add another conditional in the build (configure.in) to handle the > Redhat 5 case. > > James, do you have any input here? No. That is a good summary. 
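The udapl discussion above comes down to replacing hand-rolled ia64 atomics with gcc's built-in atomics. A minimal sketch of what that could look like, assuming gcc 4.1 or later and its `__sync_*` builtins. The `sketch_` names are hypothetical and are not the actual dapl_osd.h code, and the compare-and-swap semantics shown for the "assign" helper are an assumption made for illustration:

```c
#include <stdint.h>

/* Hypothetical counter type; not the real udapl atomic type. */
typedef volatile int32_t sketch_atomic_t;

/* Atomically add 1 and return the new value. */
static inline int32_t sketch_atomic_inc(sketch_atomic_t *v)
{
    return __sync_add_and_fetch(v, 1);
}

/* Atomically subtract 1 and return the new value. */
static inline int32_t sketch_atomic_dec(sketch_atomic_t *v)
{
    return __sync_sub_and_fetch(v, 1);
}

/*
 * Compare-and-swap (assumed semantics): store new_value only if *v
 * currently equals match_value. Returns the value *v held before the
 * call, so the caller can tell whether the swap happened.
 */
static inline int32_t sketch_atomic_assign(sketch_atomic_t *v,
                                           int32_t match_value,
                                           int32_t new_value)
{
    return __sync_val_compare_and_swap(v, match_value, new_value);
}
```

Because the builtins are provided by the compiler, a sketch like this needs neither asm/atomic.h nor per-architecture macros, which is the property the thread is after.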
From halr at voltaire.com Wed Apr 25 05:57:33 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 08:57:33 -0400 Subject: [ofa-general] Re: [PATCHv2] osm: source and destination strings overlap when using sprintf() In-Reply-To: <462F4531.9060302@dev.mellanox.co.il> References: <462C7C21.7010004@dev.mellanox.co.il> <462F4531.9060302@dev.mellanox.co.il> Message-ID: <1177505852.16495.76351.camel@hal.voltaire.com> Hi Yevgeny, On Wed, 2007-04-25 at 08:10, Yevgeny Kliteynik wrote: > Hi Hal, > > [V2] - Fixing a problematic usage of sprintf() in osm_helper.c: > > When using sprintf(), source and destination strings should > not overlap, otherwise the function behavior is undefined. > Please apply to ofed_1_2 and to master. Is this a test case or does this currently occur with some OFED 1.2 code ? Just wondering... -- Hal > Thanks. > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik > --- > osm/opensm/osm_helper.c | 26 +++++++++++++------------- > 1 files changed, 13 insertions(+), 13 deletions(-) > > diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c > index 14474e7..483b73b 100644 > --- a/osm/opensm/osm_helper.c > +++ b/osm/opensm/osm_helper.c > @@ -1157,9 +1157,9 @@ osm_dump_multipath_record( > { > for (i = 0; i < p_mpr->sgid_count; i++) > { > - sprintf( buf_line, "%s\t\t\t\tsgid%02d.................." > + sprintf( buf_line + strlen(buf_line), "\t\t\t\tsgid%02d.................." > "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", > - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), > + i + 1, cl_ntoh64( p_gid->unicast.prefix ), > cl_ntoh64( p_gid->unicast.interface_id ) ); > p_gid++; > } > @@ -1168,9 +1168,9 @@ osm_dump_multipath_record( > { > for (i = 0; i < p_mpr->dgid_count; i++) > { > - sprintf( buf_line, "%s\t\t\t\tdgid%02d.................." > + sprintf( buf_line + strlen(buf_line), "\t\t\t\tdgid%02d.................." 
> "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", > - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), > + i + 1, cl_ntoh64( p_gid->unicast.prefix ), > cl_ntoh64( p_gid->unicast.interface_id ) ); > p_gid++; > } > @@ -1657,8 +1657,8 @@ osm_dump_pkey_block( > { > buf_line[0] = '\0'; > for (i = 0; i < 32; i++) > - sprintf( buf_line,"%s 0x%04x |", > - buf_line, cl_ntoh16(p_pkey_tbl->pkey_entry[i])); > + sprintf( buf_line + strlen(buf_line)," 0x%04x |", > + cl_ntoh16(p_pkey_tbl->pkey_entry[i])); > > osm_log( p_log, log_level, > "P_Key table dump:\n" > @@ -1693,10 +1693,10 @@ osm_dump_slvl_map_table( > buf_line1[0] = '\0'; > buf_line2[0] = '\0'; > for (i = 0; i < 16; i++) > - sprintf( buf_line1,"%s %-2u |", buf_line1, i); > + sprintf( buf_line1 + strlen(buf_line1)," %-2u |", i); > for (i = 0; i < 16; i++) > - sprintf( buf_line2,"%s0x%01X |", > - buf_line2, ib_slvl_table_get(p_slvl_tbl, i)); > + sprintf( buf_line2 + strlen(buf_line2),"0x%01X |", > + ib_slvl_table_get(p_slvl_tbl, i)); > osm_log( p_log, log_level, > "SLtoVL dump:\n" > "\t\t\tport_guid............0x%016" PRIx64 "\n" > @@ -1730,11 +1730,11 @@ osm_dump_vl_arb_table( > buf_line1[0] = '\0'; > buf_line2[0] = '\0'; > for (i = 0; i < 32; i++) > - sprintf( buf_line1,"%s 0x%01X |", > - buf_line1, p_vla_tbl->vl_entry[i].vl); > + sprintf( buf_line1 + strlen(buf_line1)," 0x%01X |", > + p_vla_tbl->vl_entry[i].vl); > for (i = 0; i < 32; i++) > - sprintf( buf_line2,"%s 0x%01X |", > - buf_line2, p_vla_tbl->vl_entry[i].weight); > + sprintf( buf_line2 + strlen(buf_line2)," 0x%01X |", > + p_vla_tbl->vl_entry[i].weight); > osm_log( p_log, log_level, > "VlArb dump:\n" > "\t\t\tport_guid...........0x%016" PRIx64 "\n" From kliteyn at dev.mellanox.co.il Wed Apr 25 06:11:42 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Apr 2007 16:11:42 +0300 Subject: [ofa-general] Re: [PATCHv2] osm: source and destination strings overlap when using sprintf() In-Reply-To: <1177505852.16495.76351.camel@hal.voltaire.com> 
References: <462C7C21.7010004@dev.mellanox.co.il> <462F4531.9060302@dev.mellanox.co.il> <1177505852.16495.76351.camel@hal.voltaire.com> Message-ID: <462F538E.1000802@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Wed, 2007-04-25 at 08:10, Yevgeny Kliteynik wrote: >> Hi Hal, >> >> [V2] - Fixing a problematic usage of sprintf() in osm_helper.c: >> >> When using sprintf(), source and destination strings should >> not overlap, otherwise the function behavior is undefined. > >> Please apply to ofed_1_2 and to master. > > Is this a test case or does this currently occur with some OFED 1.2 code > ? Just wondering... Valgrind has complained about an overlap between source and destination during the usual OSM execution. I didn't check whether the original code is functioning correctly, but it probably doesn't matter because the code needs to be fixed anyway. -- Yevgeny > -- Hal > >> Thanks. >> >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> osm/opensm/osm_helper.c | 26 +++++++++++++------------- >> 1 files changed, 13 insertions(+), 13 deletions(-) >> >> diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c >> index 14474e7..483b73b 100644 >> --- a/osm/opensm/osm_helper.c >> +++ b/osm/opensm/osm_helper.c >> @@ -1157,9 +1157,9 @@ osm_dump_multipath_record( >> { >> for (i = 0; i < p_mpr->sgid_count; i++) >> { >> - sprintf( buf_line, "%s\t\t\t\tsgid%02d.................." >> + sprintf( buf_line + strlen(buf_line), "\t\t\t\tsgid%02d.................." >> "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", >> - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), >> + i + 1, cl_ntoh64( p_gid->unicast.prefix ), >> cl_ntoh64( p_gid->unicast.interface_id ) ); >> p_gid++; >> } >> @@ -1168,9 +1168,9 @@ osm_dump_multipath_record( >> { >> for (i = 0; i < p_mpr->dgid_count; i++) >> { >> - sprintf( buf_line, "%s\t\t\t\tdgid%02d.................." >> + sprintf( buf_line + strlen(buf_line), "\t\t\t\tdgid%02d.................." 
>> "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", >> - buf_line, i + 1, cl_ntoh64( p_gid->unicast.prefix ), >> + i + 1, cl_ntoh64( p_gid->unicast.prefix ), >> cl_ntoh64( p_gid->unicast.interface_id ) ); >> p_gid++; >> } >> @@ -1657,8 +1657,8 @@ osm_dump_pkey_block( >> { >> buf_line[0] = '\0'; >> for (i = 0; i < 32; i++) >> - sprintf( buf_line,"%s 0x%04x |", >> - buf_line, cl_ntoh16(p_pkey_tbl->pkey_entry[i])); >> + sprintf( buf_line + strlen(buf_line)," 0x%04x |", >> + cl_ntoh16(p_pkey_tbl->pkey_entry[i])); >> >> osm_log( p_log, log_level, >> "P_Key table dump:\n" >> @@ -1693,10 +1693,10 @@ osm_dump_slvl_map_table( >> buf_line1[0] = '\0'; >> buf_line2[0] = '\0'; >> for (i = 0; i < 16; i++) >> - sprintf( buf_line1,"%s %-2u |", buf_line1, i); >> + sprintf( buf_line1 + strlen(buf_line1)," %-2u |", i); >> for (i = 0; i < 16; i++) >> - sprintf( buf_line2,"%s0x%01X |", >> - buf_line2, ib_slvl_table_get(p_slvl_tbl, i)); >> + sprintf( buf_line2 + strlen(buf_line2),"0x%01X |", >> + ib_slvl_table_get(p_slvl_tbl, i)); >> osm_log( p_log, log_level, >> "SLtoVL dump:\n" >> "\t\t\tport_guid............0x%016" PRIx64 "\n" >> @@ -1730,11 +1730,11 @@ osm_dump_vl_arb_table( >> buf_line1[0] = '\0'; >> buf_line2[0] = '\0'; >> for (i = 0; i < 32; i++) >> - sprintf( buf_line1,"%s 0x%01X |", >> - buf_line1, p_vla_tbl->vl_entry[i].vl); >> + sprintf( buf_line1 + strlen(buf_line1)," 0x%01X |", >> + p_vla_tbl->vl_entry[i].vl); >> for (i = 0; i < 32; i++) >> - sprintf( buf_line2,"%s 0x%01X |", >> - buf_line2, p_vla_tbl->vl_entry[i].weight); >> + sprintf( buf_line2 + strlen(buf_line2)," 0x%01X |", >> + p_vla_tbl->vl_entry[i].weight); >> osm_log( p_log, log_level, >> "VlArb dump:\n" >> "\t\t\tport_guid...........0x%016" PRIx64 "\n" > > From halr at voltaire.com Wed Apr 25 06:50:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 09:50:15 -0400 Subject: [ofa-general] {PATCH 1/2] OpenSM/ib_types.h: Clarify the proper usage of ib_get_node_type_str Message-ID: 
<1177509012.16495.79476.camel@hal.voltaire.com> osm/include/iba/ib_types.h: ib_get_node_type_str is to be used for decoding the node_type from the node info attribute only Signed-off-by: Ira K. Weiny Signed-off-by: Hal Rosenstock osm/include/iba/ib_types.h | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index c94b7fc..9bc2846 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -1829,8 +1829,7 @@ static const char* const __ib_node_type_ "UNKNOWN", "Channel Adapter", "Switch", - "Router", - "Subnet Management" + "Router" }; /****f* IBA Base: Types/ib_get_node_type_str @@ -1839,12 +1838,13 @@ static const char* const __ib_node_type_ * * DESCRIPTION * Returns a string for the specified node type. +* 14.2.5.3 NodeInfo * * SYNOPSIS */ static inline const char* OSM_API ib_get_node_type_str( - IN uint32_t node_type ) + IN uint8_t node_type ) { if( node_type > IB_NODE_TYPE_ROUTER ) node_type = 0; -- 1.4.4 From halr at voltaire.com Wed Apr 25 06:50:30 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 09:50:30 -0400 Subject: [ofa-general] {PATCH 2/2] OpenSM/osm_helper: Clarify the proper usage of osm_get_node_type_str_fixed_width Message-ID: <1177509028.16495.79478.camel@hal.voltaire.com> OpenSM/osm_helper: osm_get_node_type_str_fixed_width is to be used for decoding the node_type from the node info attribute only Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_helper.h b/osm/include/opensm/osm_helper.h index 3eab913..7164cc3 100644 --- a/osm/include/opensm/osm_helper.h +++ b/osm/include/opensm/osm_helper.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -538,7 +538,7 @@ osm_get_port_state_str_fixed_width( const char* osm_get_node_type_str_fixed_width( - IN uint32_t node_type ); + IN uint8_t node_type ); const char* osm_get_manufacturer_str( diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c index c4e8ec5..a47cb89 100644 --- a/osm/opensm/osm_helper.c +++ b/osm/opensm/osm_helper.c @@ -2338,7 +2338,7 @@ static const char* const __osm_node_type **********************************************************************/ const char* osm_get_node_type_str_fixed_width( - IN uint32_t node_type ) + IN uint8_t node_type ) { if( node_type > IB_NODE_TYPE_ROUTER ) node_type = 0; From eli at dev.mellanox.co.il Wed Apr 25 07:41:39 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 25 Apr 2007 17:41:39 +0300 Subject: [ofa-general] [PATCH] net/mlx4: modify sw reset Message-ID: <1177512129.10850.9.camel@mtls03> Modify SW reset to acquire the HW semaphore before hitting reset. This is recommended to prevent SW reset of the device while the flash is being programmed. Avoid delay after reset - this is not required. Instead, wait until the vendor ID is read. 
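The "wait until the vendor ID is read" step described above can be sketched in user space as a poll loop with a deadline. This is an illustration only, not the driver code: `wait_for_vendor_id`, `read_id_fn`, and the `fake_dev` stub are hypothetical names, and `time()` (one-second granularity) stands in for the kernel's jiffies-based deadline.

```c
#include <stdint.h>
#include <time.h>

#define SKETCH_VENDOR_ID 0x15b3 /* Mellanox PCI vendor ID */

/* Hypothetical reader callback: returns 0 on success and fills *out. */
typedef int (*read_id_fn)(void *ctx, uint16_t *out);

/*
 * Poll read_id until it reports the expected vendor ID or the deadline
 * passes. Returns 0 when the device answered, -1 otherwise.
 */
int wait_for_vendor_id(read_id_fn read_id, void *ctx, double timeout_sec)
{
    time_t start = time(NULL);
    uint16_t v = 0;

    while (difftime(time(NULL), start) < timeout_sec) {
        if (read_id(ctx, &v) == 0 && v == SKETCH_VENDOR_ID)
            return 0;
    }

    /* One final read after the deadline, mirroring the check after the loop. */
    if (read_id(ctx, &v) == 0 && v == SKETCH_VENDOR_ID)
        return 0;
    return -1;
}

/* Stub device for experimenting: reads back 0xffff until it is "ready". */
struct fake_dev {
    int reads_until_ready;
};

int fake_read_id(void *ctx, uint16_t *out)
{
    struct fake_dev *d = ctx;

    if (d->reads_until_ready > 0) {
        d->reads_until_ready--;
        *out = 0xffff; /* device not responding yet */
    } else {
        *out = SKETCH_VENDOR_ID;
    }
    return 0;
}
```

Compared with a fixed one-second sleep, a loop of this shape returns as soon as the device responds and still bounds the total wait.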
Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/net/mlx4/reset.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/reset.c 2007-04-25 12:04:23.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/reset.c 2007-04-25 18:52:16.000000000 +0300 @@ -42,13 +42,18 @@ { int i; int err = 0; - u32 *hca_header = NULL; + u32 *hca_header = NULL, sem = 1; int pcie_cap; + unsigned long end; - u16 devctl; + u16 devctl, v; u16 linkctl; + void __iomem *reset; -#define MLX4_RESET_OFFSET 0xf0010 +#define MLX4_RESET_BASE 0xf0000 +#define MLX4_RESET_SIZE 0x400 +#define MLX4_SEM_OFFSET 0x3fc +#define MLX4_RESET_OFFSET 0x10 #define MLX4_RESET_VALUE swab32(1) /* @@ -80,51 +85,51 @@ } } - /* actually hit reset */ - { - void __iomem *reset = ioremap(pci_resource_start(dev->pdev, 0) + - MLX4_RESET_OFFSET, 4); - - if (!reset) { - err = -ENOMEM; - mlx4_err(dev, "Couldn't map HCA reset register, " - "aborting.\n"); - goto out; - } + reset = ioremap(pci_resource_start(dev->pdev, 0) + MLX4_RESET_BASE, MLX4_RESET_SIZE); + if (!reset) { + err = -ENOMEM; + mlx4_err(dev, "Couldn't map HCA reset register, aborting.\n"); + goto out; + } - writel(MLX4_RESET_VALUE, reset); - iounmap(reset); + /* grab the semaphore */ + end = jiffies + 10 * HZ; + while (time_before(jiffies, end)) { + sem = readl(reset + MLX4_SEM_OFFSET); + if (!sem) + break; + + set_current_state(TASK_RUNNING); + schedule(); } - /* Docs say to wait one second before accessing device */ - msleep(1000); + if (sem) { + mlx4_err(dev, "failed to obtain semaphore, aborting\n"); + err = -EAGAIN; + iounmap(reset); + goto out; + } + /* actually hit reset - hw sem is released after reset */ + writel(MLX4_RESET_VALUE, reset + MLX4_RESET_OFFSET); + iounmap(reset); /* Now wait for PCI device to start responding again */ - { - u32 v; - int c = 0; - - for (c = 0; c < 100; ++c) { - if (pci_read_config_dword(dev->pdev, 0, &v)) { - err = -ENODEV; - mlx4_err(dev, "Couldn't access HCA 
after reset, " - "aborting.\n"); - goto out; - } - - if (v != 0xffffffff) - goto good; + end = jiffies + 2 * HZ; + while (time_before(jiffies, end)) { + if (!pci_read_config_word(dev->pdev, 0, &v) && + (v == PCI_VENDOR_ID_MELLANOX)) + break; - msleep(100); - } + set_current_state(TASK_RUNNING); + schedule(); + } + if (pci_read_config_word(dev->pdev, 0, &v) || v != PCI_VENDOR_ID_MELLANOX) { + mlx4_err(dev, "Couldn't read vendor ID after reset, aborting.\n"); err = -ENODEV; - mlx4_err(dev, "PCI device did not come back after reset, " - "aborting.\n"); goto out; } -good: /* Now restore the PCI headers */ if (pcie_cap) { devctl = hca_header[(pcie_cap + PCI_EXP_DEVCTL) / 4]; From swise at opengridcomputing.com Wed Apr 25 07:44:35 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 25 Apr 2007 09:44:35 -0500 Subject: [ofa-general] possible librdmacm bug? Message-ID: <1177512275.22094.13.camel@stevo-desktop> Sean, The Lamprey guys uncovered a potential bug in the RDMA CM: If an application binds to address 0, port 0 via a call to rdma_bind_addr(), and then calls rdma_get_src_addr(), the port number is still 0 even though the RDMA CM chose a port during the bind operation. However, If the application then calls rdma_listen() followed by another rdma_get_src_addr() call, then the actual port number is correctly returned. But it should be returned in the call to rdma_get_src_addr() immediately after the bind succeeds, yes? I poked around the code and couldn't see an obvious bug. I'm still looking. Shall I open a bug in the ofa bugzilla for this? Thanks, Steve. From vlad at mellanox.co.il Wed Apr 25 07:55:15 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 25 Apr 2007 17:55:15 +0300 Subject: [ofa-general] hotplug event handle question Message-ID: <1177512915.15940.17.camel@vladsk-laptop> Hi Sean, I have a question about hotplug event: I want to handle RDMA_CM_EVENT_DEVICE_REMOVAL event in RDS. 
When I get this event, I call the rdma_disconnect function in order to
modify the QP to the error state, but I get -EINVAL because of the
following code in cma.c:

	if (!cma_comp(id_priv, CMA_CONNECT) &&
	    !cma_comp(id_priv, CMA_DISCONNECT))
		return -EINVAL;

while the state is CMA_DEVICE_REMOVAL.

Should I move the QP to the error state in RDS, or should cma handle this
state too?

Thanks,

--
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From halr at voltaire.com Wed Apr 25 07:55:25 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 25 Apr 2007 10:55:25 -0400
Subject: [ofa-general] Re: [PATCHv2] osm: source and destination strings overlap when using sprintf()
In-Reply-To: <462F4531.9060302@dev.mellanox.co.il>
References: <462C7C21.7010004@dev.mellanox.co.il> <462F4531.9060302@dev.mellanox.co.il>
Message-ID: <1177512920.16495.83448.camel@hal.voltaire.com>

On Wed, 2007-04-25 at 08:10, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> [V2] - Fixing a problematic usage of sprintf() in osm_helper.c:
>
> When using sprintf(), source and destination strings should
> not overlap, otherwise the function behavior is undefined.
>
> Please apply to ofed_1_2 and to master.
>
> Thanks.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied (to both master and ofed_1_2).

-- Hal

From swise at opengridcomputing.com Wed Apr 25 08:34:31 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 25 Apr 2007 10:34:31 -0500
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order.
Message-ID: <1177515271.22094.33.camel@stevo-desktop>

Sean,

This patch enables rping between a BE and LE system. Tested on IBM PPC64
<-> AMD64.

Transfer rkey/addr/len information in network byte order.

Signed-off-by: Steve Wise
---

 examples/rping.c | 7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/examples/rping.c b/examples/rping.c
index 0441300..17b0000 100644
--- a/examples/rping.c
+++ b/examples/rping.c
@@ -47,6 +47,7 @@
 #include
 #include
 #include
+#include

 static int debug = 0;
 #define DEBUG_LOG if (debug) printf
@@ -239,9 +240,9 @@ static int server_recv(struct rping_cb *
 		return -1;
 	}

-	cb->remote_rkey = cb->recv_buf.rkey;
-	cb->remote_addr = cb->recv_buf.buf;
-	cb->remote_len = cb->recv_buf.size;
+	cb->remote_rkey = ntohl(cb->recv_buf.rkey);
+	cb->remote_addr = ntohll(cb->recv_buf.buf);
+	cb->remote_len = ntohl(cb->recv_buf.size);
 	DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n",
 		  cb->remote_rkey, cb->remote_addr, cb->remote_len);

From rdreier at cisco.com Wed Apr 25 08:46:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Apr 2007 08:46:04 -0700
Subject: [ofa-general] clueless noob and build probs with 1.2rc2
In-Reply-To: <462E9DBE.1070803@ichips.intel.com> (Arlin Davis's message of "Tue, 24 Apr 2007 17:15:58 -0700")
References: <462E8257.9090103@hp.com> <462E9DBE.1070803@ichips.intel.com>
Message-ID:

 > The built-in atomics in gcc had some ia64 issues until gcc 4.1.1 so we
 > had no choice but to get down and do some bogus things. :^)

Actually I think you've underestimated the depth of the bogosity.
Many of the uses of atomics in dapl that a quick grep turns up look to
be cargo cult uses where the atomic operations are used in a way that
doesn't protect against races.
eg dapl_cookie.c:

    new_head = (dapl_os_atomic_read (&buffer->head) + 1) % buffer->pool_size;

    if ( new_head == dapl_os_atomic_read (&buffer->tail) )
    {
	dat_status = DAT_INSUFFICIENT_RESOURCES;
	goto bail;
    }
    else
    {
	dapl_os_atomic_set (&buffer->head, new_head);

if there's no other locking on buffer->head, then there's a race
between the dapl_os_atomic_read() and the dapl_os_atomic_set(). And
if there is other locking, then there's no point in making
buffer->head be an atomic variable.

Since you only have atomic_inc and atomic_dec and not anything like
atomic_dec_and_test, then there's not really any race-free way to use
the atomic variables. I guess the uses like evd_ref_count are OK,
since that's really only a hint about whether something is free, but
there's not much point in taking the portability hassles of trying to
use atomics -- I think using a pthread mutex would be pretty much
equivalent in performance in the common uncontended case, and anyway
wherever you're doing the reference counting is not a hot path.

And don't even make me get started on the places where
dapl_os_atomic_assign() (really cmpxchg) is used, like dapls_rbuf_add()

So the best thing to do (assuming you're stuck with udapl) would
probably be to get rid of all the dapl_os_atomic_xxx junk and just use
portable pthreads locking.

From swise at opengridcomputing.com Wed Apr 25 09:24:29 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 25 Apr 2007 11:24:29 -0500
Subject: [ofa-general] [PATCH ofed_1_2] - chelsio bug fixes
Message-ID: <1177518269.22094.64.camel@stevo-desktop>

Vlad,

These changes are a set of bug fixes to the chelsio drivers as well as
support for their latest firmware. This is required for OFED-1.2.

Please pull from:

git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2

Thanks,

Steve.

---------
Shortlog:
---------

Divy Le Ray:
      Reuse the incoming skb when a clientless abort req is recieved.
      Remove assumption that PHY interrupts use GPIOs 3 and 5.
Steve Wise: Don't use physical addresses as mmap offsets. Support for new abort logic. Update required firmware revision. ------ Diffs: ------ commit a7e291a27cbd9488f5eb390e38a52ada2758b094 Author: Steve Wise Date: Tue Apr 24 12:57:51 2007 -0500 Update required firmware revision. Signed-off-by: Steve Wise diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h index bd7c4f7..17b9801 100644 --- a/drivers/net/cxgb3/version.h +++ b/drivers/net/cxgb3/version.h @@ -38,7 +38,7 @@ #define DRV_NAME "cxgb3" #define DRV_VERSION "1.0-ofed" /* Firmware version */ -#define FW_VERSION_MAJOR 3 -#define FW_VERSION_MINOR 3 +#define FW_VERSION_MAJOR 4 +#define FW_VERSION_MINOR 0 #define FW_VERSION_MICRO 0 #endif /* __CHELSIO_VERSION_H */ commit afa256e0aa01f03c1f56960c2af8124352c7c72b Author: Steve Wise Date: Tue Apr 24 10:31:24 2007 -0500 Support for new abort logic. The HW now posts 2 ABORT_RPL and/or PEER_ABORT_REQ messages. We need to handle them by silenty dropping the 1st but mark that we're ready for the final message. This plugs some close races between the uP and HW. Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 36ab39e..0d81e2f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1108,6 +1108,15 @@ static int abort_rpl(struct t3cdev *tdev PDBG("%s ep %p\n", __FUNCTION__, ep); + /* + * We get 2 abort replies from the HW. The first one must + * be ignored except for scribbling that we need one more. + */ + if (!(ep->flags & ABORT_REQ_IN_PROGRESS)) { + ep->flags |= ABORT_REQ_IN_PROGRESS; + return CPL_RET_BUF_DONE; + } + close_complete_upcall(ep); state_set(&ep->com, DEAD); release_ep_resources(ep); @@ -1475,6 +1484,15 @@ static int peer_abort(struct t3cdev *tde int ret; int state; + /* + * We get 2 peer aborts from the HW. The first one must + * be ignored except for scribbling that we need one more. 
+ */ + if (!(ep->flags & PEER_ABORT_IN_PROGRESS)) { + ep->flags |= PEER_ABORT_IN_PROGRESS; + return CPL_RET_BUF_DONE; + } + if (is_neg_adv_abort(req->status)) { PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 855f1ef..1d4a1a5 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -143,6 +143,11 @@ enum iwch_ep_state { DEAD, }; +enum iwch_ep_flags { + PEER_ABORT_IN_PROGRESS = (1 << 0), + ABORT_REQ_IN_PROGRESS = (1 << 1), +}; + struct iwch_ep_common { struct iw_cm_id *cm_id; struct iwch_qp *qp; @@ -181,6 +186,7 @@ struct iwch_ep { u16 plen; u32 ird; u32 ord; + u32 flags; }; static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) commit c7d9f4cc0d6a4695f3c96f7ae5a8310b2e8fa804 Author: Steve Wise Date: Tue Apr 24 10:31:19 2007 -0500 Don't use physical addresses as mmap offsets. Currently iw_cxgb3 uses the physical address as the key/offset to return to the user process for maping kernel memory into userspace. The user process then calls mmap() using this key as the offset. Because the physical address is 64 bits, this introduces a problem with 32-bit userspace, which might not be able to pass an arbitrary 64-bit address back into the kernel (since mmap2() is limited to a 32-bit number of pages for the offset, which limits it to 44-bit addresses). Change the mmap logic to use a u32 counter as the offset for mapping. 
Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index fe57d11..b0f7218 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -114,7 +114,7 @@ static struct ib_ucontext *iwch_alloc_uc struct iwch_dev *rhp = to_iwch_dev(ibdev); PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); - context = kmalloc(sizeof(*context), GFP_KERNEL); + context = kzalloc(sizeof(*context), GFP_KERNEL); if (!context) return ERR_PTR(-ENOMEM); cxio_init_ucontext(&rhp->rdev, &context->uctx); @@ -140,13 +140,14 @@ static int iwch_destroy_cq(struct ib_cq } static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, - struct ib_ucontext *context, + struct ib_ucontext *ib_context, struct ib_udata *udata) { struct iwch_dev *rhp; struct iwch_cq *chp; struct iwch_create_cq_resp uresp; struct iwch_create_cq_req ureq; + struct iwch_ucontext *ucontext = NULL; PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); rhp = to_iwch_dev(ibdev); @@ -154,12 +155,15 @@ static struct ib_cq *iwch_create_cq(stru if (!chp) return ERR_PTR(-ENOMEM); - if (context && !t3a_device(rhp)) { - if (ib_copy_from_udata(&ureq, udata, sizeof (ureq))) { - kfree(chp); - return ERR_PTR(-EFAULT); + if (ib_context) { + ucontext = to_iwch_ucontext(ib_context); + if (!t3a_device(rhp)) { + if (ib_copy_from_udata(&ureq, udata, sizeof (ureq))) { + kfree(chp); + return ERR_PTR(-EFAULT); + } + chp->user_rptr_addr = (u32 __user *)(unsigned long)ureq.user_rptr_addr; } - chp->user_rptr_addr = (u32 *)(unsigned long)ureq.user_rptr_addr; } if (t3a_device(rhp)) { @@ -189,7 +193,7 @@ static struct ib_cq *iwch_create_cq(stru init_waitqueue_head(&chp->wait); insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid); - if (context) { + if (ucontext) { struct iwch_mm_entry *mm; mm = kmalloc(sizeof *mm, GFP_KERNEL); @@ -199,16 +203,20 @@ static struct ib_cq *iwch_create_cq(stru } uresp.cqid = chp->cq.cqid; 
uresp.size_log2 = chp->cq.size_log2; - uresp.physaddr = virt_to_phys(chp->cq.queue); + spin_lock(&ucontext->mmap_lock); + uresp.key = ucontext->key; + ucontext->key += PAGE_SIZE; + spin_unlock(&ucontext->mmap_lock); if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { kfree(mm); iwch_destroy_cq(&chp->ibcq); return ERR_PTR(-EFAULT); } - mm->addr = uresp.physaddr; + mm->key = uresp.key; + mm->addr = virt_to_phys(chp->cq.queue); mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * sizeof (struct t3_cqe)); - insert_mmap(to_iwch_ucontext(context), mm); + insert_mmap(ucontext, mm); } PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", chp->cq.cqid, chp, (1 << chp->cq.size_log2), @@ -315,14 +323,15 @@ static int iwch_arm_cq(struct ib_cq *ibc static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) { int len = vma->vm_end - vma->vm_start; - u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + u32 key = vma->vm_pgoff << PAGE_SHIFT; struct cxio_rdev *rdev_p; int ret = 0; struct iwch_mm_entry *mm; struct iwch_ucontext *ucontext; + u64 addr; - PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, - pgaddr, len); + PDBG("%s pgoff 0x%lx key 0x%x len %d\n", __FUNCTION__, vma->vm_pgoff, + key, len); if (vma->vm_start & (PAGE_SIZE-1)) { return -EINVAL; @@ -331,13 +340,14 @@ static int iwch_mmap(struct ib_ucontext rdev_p = &(to_iwch_dev(context->device)->rdev); ucontext = to_iwch_ucontext(context); - mm = remove_mmap(ucontext, pgaddr, len); + mm = remove_mmap(ucontext, key, len); if (!mm) return -EINVAL; + addr = mm->addr; kfree(mm); - if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && - (pgaddr < (rdev_p->rnic_info.udbell_physbase + + if ((addr >= rdev_p->rnic_info.udbell_physbase) && + (addr < (rdev_p->rnic_info.udbell_physbase + rdev_p->rnic_info.udbell_len))) { /* @@ -350,15 +360,17 @@ static int iwch_mmap(struct ib_ucontext vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; 
vma->vm_flags &= ~VM_MAYREAD; - ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - len, vma->vm_page_prot); + ret = io_remap_pfn_range(vma, vma->vm_start, + addr >> PAGE_SHIFT, + len, vma->vm_page_prot); } else { /* * Map WQ or CQ contig dma memory... */ - ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - len, vma->vm_page_prot); + ret = remap_pfn_range(vma, vma->vm_start, + addr >> PAGE_SHIFT, + len, vma->vm_page_prot); } return ret; @@ -838,18 +850,24 @@ static struct ib_qp *iwch_create_qp(stru uresp.size_log2 = qhp->wq.size_log2; uresp.sq_size_log2 = qhp->wq.sq_size_log2; uresp.rq_size_log2 = qhp->wq.rq_size_log2; - uresp.physaddr = virt_to_phys(qhp->wq.queue); - uresp.doorbell = qhp->wq.udb; + spin_lock(&ucontext->mmap_lock); + uresp.key = ucontext->key; + ucontext->key += PAGE_SIZE; + uresp.db_key = ucontext->key; + ucontext->key += PAGE_SIZE; + spin_unlock(&ucontext->mmap_lock); if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { kfree(mm1); kfree(mm2); iwch_destroy_qp(&qhp->ibqp); return ERR_PTR(-EFAULT); } - mm1->addr = uresp.physaddr; + mm1->key = uresp.key; + mm1->addr = virt_to_phys(qhp->wq.queue); mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); insert_mmap(ucontext, mm1); - mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->key = uresp.db_key; + mm2->addr = qhp->wq.udb & PAGE_MASK; mm2->len = PAGE_SIZE; insert_mmap(ucontext, mm2); } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index 998b323..ae57478 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -183,6 +183,7 @@ struct ib_qp *iwch_get_qp(struct ib_devi struct iwch_ucontext { struct ib_ucontext ibucontext; struct cxio_ucontext uctx; + u32 key; spinlock_t mmap_lock; struct list_head mmaps; }; @@ -195,11 +196,12 @@ static inline struct iwch_ucontext *to_i struct iwch_mm_entry { struct list_head entry; u64 addr; + u32 key; unsigned len; }; static inline struct 
iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext, - u64 addr, unsigned len) + u32 key, unsigned len) { struct list_head *pos, *nxt; struct iwch_mm_entry *mm; @@ -208,11 +210,11 @@ static inline struct iwch_mm_entry *remo list_for_each_safe(pos, nxt, &ucontext->mmaps) { mm = list_entry(pos, struct iwch_mm_entry, entry); - if (mm->addr == addr && mm->len == len) { + if (mm->key == key && mm->len == len) { list_del_init(&mm->entry); spin_unlock_irq(&ucontext->mmap_lock); - PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, - mm->len); + PDBG("%s addr 0x%llx key 0x%x len %d\n", + __FUNCTION__, mm->addr, mm->key, mm->len); return mm; } } @@ -224,7 +226,8 @@ static inline void insert_mmap(struct iw struct iwch_mm_entry *mm) { spin_lock_irq(&ucontext->mmap_lock); - PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + PDBG("%s addr 0x%llx key 0x%x len %d\n", + __FUNCTION__, mm->addr, mm->key, mm->len); list_add_tail(&mm->entry, &ucontext->mmaps); spin_unlock_irq(&ucontext->mmap_lock); } diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h index bf0a2f6..4d7526d 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_user.h +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -46,14 +46,14 @@ struct iwch_create_cq_req { }; struct iwch_create_cq_resp { - __u64 physaddr; + __u64 key; __u32 cqid; __u32 size_log2; }; struct iwch_create_qp_resp { - __u64 physaddr; - __u64 doorbell; + __u64 key; + __u64 db_key; __u32 qpid; __u32 size_log2; __u32 sq_size_log2; commit d0d41ed85d44dfafbe66f59ae0ad802409a115e7 Author: Divy Le Ray Date: Tue Apr 24 10:31:15 2007 -0500 Remove assumption that PHY interrupts use GPIOs 3 and 5. Deal with PHY interrupts connected to any GPIO pins. 
Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c index d83f075..fb485d0 100644 --- a/drivers/net/cxgb3/t3_hw.c +++ b/drivers/net/cxgb3/t3_hw.c @@ -1523,19 +1523,25 @@ static int mac_intr_handler(struct adapt */ int t3_phy_intr_handler(struct adapter *adapter) { - static const int intr_gpio_bits[] = { 8, 0x20 }; - + u32 mask, gpi = adapter_info(adapter)->gpio_intr; u32 i, cause = t3_read_reg(adapter, A_T3DBG_INT_CAUSE); for_each_port(adapter, i) { - if (cause & intr_gpio_bits[i]) { - struct cphy *phy = &adap2pinfo(adapter, i)->phy; - int phy_cause = phy->ops->intr_handler(phy); + struct port_info *p = adap2pinfo(adapter, i); + + mask = gpi - (gpi & (gpi - 1)); + gpi -= mask; + + if (!(p->port_type->caps & SUPPORTED_IRQ)) + continue; + + if (cause & mask) { + int phy_cause = p->phy.ops->intr_handler(&p->phy); if (phy_cause & cphy_cause_link_change) t3_link_changed(adapter, i); if (phy_cause & cphy_cause_fifo_error) - phy->fifo_errors++; + p->phy.fifo_errors++; } } commit 918f98dc61e30a55c45086d2602bbe6187a3782c Author: Divy Le Ray Date: Tue Apr 24 10:31:11 2007 -0500 Reuse the incoming skb when a clientless abort req is recieved. The release of RDMA connections HW resources might be deferred in low memory situations. Ensure that no further activity is passed up to the RDMA driver for these connections. Signed-off-by: Divy Le Ray diff --git a/drivers/net/cxgb3/cxgb3_defs.h b/drivers/net/cxgb3/cxgb3_defs.h old mode 100755 new mode 100644 index e14862b..483a594 --- a/drivers/net/cxgb3/cxgb3_defs.h +++ b/drivers/net/cxgb3/cxgb3_defs.h @@ -67,7 +67,10 @@ static inline union listen_entry *stid2e static inline struct t3c_tid_entry *lookup_tid(const struct tid_info *t, unsigned int tid) { - return tid < t->ntids ? &(t->tid_tab[tid]) : NULL; + struct t3c_tid_entry *t3c_tid = tid < t->ntids ? + &(t->tid_tab[tid]) : NULL; + + return (t3c_tid && t3c_tid->client) ? 
t3c_tid : NULL; } /* diff --git a/drivers/net/cxgb3/cxgb3_offload.c b/drivers/net/cxgb3/cxgb3_offload.c index 3353171..9db428d 100644 --- a/drivers/net/cxgb3/cxgb3_offload.c +++ b/drivers/net/cxgb3/cxgb3_offload.c @@ -506,6 +506,7 @@ void cxgb3_queue_tid_release(struct t3cd spin_lock_bh(&td->tid_release_lock); p->ctx = (void *)td->tid_release_list; + p->client = NULL; td->tid_release_list = p; if (!p->ctx) schedule_work(&td->tid_release_task); @@ -621,7 +622,8 @@ static int do_act_open_rpl(struct t3cdev struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); - if (t3c_tid->ctx && t3c_tid->client && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client && + t3c_tid->client->handlers && t3c_tid->client->handlers[CPL_ACT_OPEN_RPL]) { return t3c_tid->client->handlers[CPL_ACT_OPEN_RPL] (dev, skb, t3c_tid-> @@ -640,7 +642,7 @@ static int do_stid_rpl(struct t3cdev *de struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_stid(&(T3C_DATA(dev))->tid_maps, stid); - if (t3c_tid->ctx && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client->handlers && t3c_tid->client->handlers[p->opcode]) { return t3c_tid->client->handlers[p->opcode] (dev, skb, t3c_tid->ctx); @@ -658,7 +660,7 @@ static int do_hwtid_rpl(struct t3cdev *d struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); - if (t3c_tid->ctx && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client->handlers && t3c_tid->client->handlers[p->opcode]) { return t3c_tid->client->handlers[p->opcode] (dev, skb, t3c_tid->ctx); @@ -687,6 +689,28 @@ static int do_cr(struct t3cdev *dev, str } } +/* + * Returns an sk_buff for a reply CPL message of size len. If the input + * sk_buff has no other users it is trimmed and reused, otherwise a new buffer + * is allocated. The input skb must be of size at least len. 
Note that this + * operation does not destroy the original skb data even if it decides to reuse + * the buffer. + */ +static struct sk_buff *cxgb3_get_cpl_reply_skb(struct sk_buff *skb, size_t len, + int gfp) +{ + if (likely(!skb_cloned(skb))) { + BUG_ON(skb->len < len); + __skb_trim(skb, len); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + if (skb) + __skb_put(skb, len); + } + return skb; +} + static int do_abort_req_rss(struct t3cdev *dev, struct sk_buff *skb) { union opcode_tid *p = cplhdr(skb); @@ -694,30 +718,39 @@ static int do_abort_req_rss(struct t3cde struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); - if (t3c_tid->ctx && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client->handlers && t3c_tid->client->handlers[p->opcode]) { return t3c_tid->client->handlers[p->opcode] (dev, skb, t3c_tid->ctx); } else { struct cpl_abort_req_rss *req = cplhdr(skb); struct cpl_abort_rpl *rpl; + struct sk_buff *reply_skb; + unsigned int tid = GET_TID(req); + u8 cmd = req->status; + + if (req->status == CPL_ERR_RTX_NEG_ADVICE || + req->status == CPL_ERR_PERSIST_NEG_ADVICE) + goto out; - struct sk_buff *skb = - alloc_skb(sizeof(struct cpl_abort_rpl), GFP_ATOMIC); - if (!skb) { + reply_skb = cxgb3_get_cpl_reply_skb(skb, + sizeof(struct + cpl_abort_rpl), + GFP_ATOMIC); + + if (!reply_skb) { printk("do_abort_req_rss: couldn't get skb!\n"); goto out; } - skb->priority = CPL_PRIORITY_DATA; - __skb_put(skb, sizeof(struct cpl_abort_rpl)); - rpl = cplhdr(skb); + reply_skb->priority = CPL_PRIORITY_DATA; + __skb_put(reply_skb, sizeof(struct cpl_abort_rpl)); + rpl = cplhdr(reply_skb); rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); - rpl->wr.wr_lo = htonl(V_WR_TID(GET_TID(req))); - OPCODE_TID(rpl) = - htonl(MK_OPCODE_TID(CPL_ABORT_RPL, GET_TID(req))); - rpl->cmd = req->status; - cxgb3_ofld_send(dev, skb); + rpl->wr.wr_lo = htonl(V_WR_TID(tid)); + OPCODE_TID(rpl) = 
htonl(MK_OPCODE_TID(CPL_ABORT_RPL, tid)); + rpl->cmd = cmd; + cxgb3_ofld_send(dev, reply_skb); out: return CPL_RET_BUF_DONE; } @@ -730,7 +763,7 @@ static int do_act_establish(struct t3cde struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); - if (t3c_tid->ctx && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client->handlers && t3c_tid->client->handlers[CPL_ACT_ESTABLISH]) { return t3c_tid->client->handlers[CPL_ACT_ESTABLISH] (dev, skb, t3c_tid->ctx); @@ -760,7 +793,7 @@ static int do_term(struct t3cdev *dev, s struct t3c_tid_entry *t3c_tid; t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); - if (t3c_tid->ctx && t3c_tid->client->handlers && + if (t3c_tid && t3c_tid->ctx && t3c_tid->client->handlers && t3c_tid->client->handlers[opcode]) { return t3c_tid->client->handlers[opcode] (dev, skb, t3c_tid->ctx); @@ -959,7 +992,7 @@ void cxgb_redirect(struct dst_entry *old for (tid = 0; tid < ti->ntids; tid++) { te = lookup_tid(ti, tid); BUG_ON(!te); - if (te->ctx && te->client && te->client->redirect) { + if (te && te->ctx && te->client && te->client->redirect) { update_tcb = te->client->redirect(te->ctx, old, new, e); if (update_tcb) { l2t_hold(L2DATA(tdev), e); From mshefty at ichips.intel.com Wed Apr 25 09:54:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 09:54:19 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review In-Reply-To: <20070425040500.GB20023@mellanox.co.il> References: <462E7B6B.5060405@ichips.intel.com> <20070425040500.GB20023@mellanox.co.il> Message-ID: <462F87BB.3010508@ichips.intel.com> > What really should happen is that the field Local Ack Timeout in REQ > should be (2 * PacketLifeTime + Local CA’s ACK delay) (see 12.7.34) > and then the responder should use this for it's QP. Just to clarify, the value is _based_ on (2 * PacketLifeTime + local CA ack delay). 
For example, if local CA ack delay is 0, then local ack timeout =
PacketLifeTime + 1.

> This does not sound too hard - why can't we just fix CM to do this, then?

The work-arounds were only suggestions to use until a fix is in place and
to verify that this really is the problem. I do plan on submitting a fix.

- Sean

From rick.jones2 at hp.com Wed Apr 25 10:03:03 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 25 Apr 2007 10:03:03 -0700
Subject: [ofa-general] clueless noob and build probs with 1.2rc2
In-Reply-To:
References: <462E8257.9090103@hp.com>
Message-ID: <462F89C7.4010105@hp.com>

Roland Dreier wrote:
> > ./dapl/udapl/linux/dapl_osd.h:82:24: error: asm/atomic.h: No such file or directory
> > In file included from ./dapl/include/dapl.h:50,
> >                  from dapl/udapl/dapl_init.c:39:
> > ./dapl/udapl/linux/dapl_osd.h: In function 'dapl_os_atomic_inc':
> > ./dapl/udapl/linux/dapl_osd.h:163: warning: implicit declaration of function 'IA64_FETCHADD'
>
> I seem to recall udapl does some very bogus things with
> and atomic operations in general.
>
> Probably the easiest solution on ia64 is just to disable udapl. (Not
> sure how to do that because I don't really work with the OFED build system)

Looks like if I don't ask it to build "everything" it doesn't try to build
the part(s) with the problems.

Sooo, with that as my workaround I can run some IPoIB tests on my cards,
which has shown me that the MTU has gotten a _lot_ bigger than it was
before :)

Meanwhile, in all the stuff in docs/ at which I have glanced thusfar, none
of it seems to describe how to go about building a "native" SDP
application - one which creates an AF_INET_SDP socket explicitly. Ie which
include file(s) to include and any libraries against which it should be
linked. Is that actually part of 1.2rc2?
happy benchmarking,

rick jones

From rdreier at cisco.com Wed Apr 25 10:05:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Apr 2007 10:05:40 -0700
Subject: [ofa-general] clueless noob and build probs with 1.2rc2
In-Reply-To: <462F89C7.4010105@hp.com> (Rick Jones's message of "Wed, 25 Apr 2007 10:03:03 -0700")
References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com>
Message-ID:

 > Meanwhile, in all the stuff in docs/ at which I have glanced thusfar,
 > none of it seems to describe how to go about building a "native" SDP
 > application - one which creates an AF_INET_SDP socket explicitly. Ie
 > which include file(s) to include and any libraries against which it
 > should be linked. Is that actually part of 1.2rc2?

AFAIK there's nothing special to do to use SDP. Just create a socket
with AF_INET_SDP instead of AF_INET. I'm not sure if there's a header
available with the define of AF_INET_SDP but that's the only thing you
would need.

From sean.hefty at intel.com Wed Apr 25 10:07:44 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Apr 2007 10:07:44 -0700
Subject: [ofa-general] RE: possible librdmacm bug?
In-Reply-To: <1177512275.22094.13.camel@stevo-desktop>
Message-ID: <000001c7875c$3e1960a0$8698070a@amr.corp.intel.com>

>But it should be returned in the call to rdma_get_src_addr() immediately
>after the bind succeeds, yes?
>
>I poked around the code and couldn't see an obvious bug. I'm still
>looking. Shall I open a bug in the ofa bugzilla for this?

I'll look into this this week. Feel free to open a bug for it if you want
to track it.

- Sean

From sean.hefty at intel.com Wed Apr 25 10:09:45 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Apr 2007 10:09:45 -0700
Subject: [ofa-general] RE: hotplug event handle question
In-Reply-To: <1177512915.15940.17.camel@vladsk-laptop>
Message-ID: <000101c7875c$85f3b1f0$8698070a@amr.corp.intel.com>

>Should I move the QP to the error state in RDS or cma should handle this
>state too.
Let me think about what to do here. I think the cma should perform this
transition if it makes sense.

- Sean

From sweitzen at cisco.com Wed Apr 25 10:13:32 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 25 Apr 2007 10:13:32 -0700
Subject: [ofa-general] clueless noob and build probs with 1.2rc2
In-Reply-To:
References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com>
Message-ID:

There is no header, see https://bugs.openfabrics.org/show_bug.cgi?id=25.

Scott

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Roland Dreier (rdreier)
> Sent: Wednesday, April 25, 2007 10:06 AM
> To: Rick Jones
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] clueless noob and build probs with 1.2rc2
>
> > Meanwhile, in all the stuff in docs/ at which I have
> glanced thusfar,
> > none of it seems to describe how to go about building a
> "native" SDP
> > application - one which creates an AF_INET_SDP socket
> explicitly. Ie
> > which include file(s) to include and any libraries against which it
> > should be linked. Is that actually part of 1.2rc2?
>
> AFAIK there's nothing special to do to use SDP. Just create a socket
> with AF_INET_SDP instead of AF_INET. I'm not sure if there's a header
> available with the define of AF_INET_SDP but that's the only thing you
> would need.
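
The advice quoted above (create the socket with AF_INET_SDP instead of
AF_INET, and supply the define yourself since no header exports it) could
look roughly like the sketch below. It is an illustration only, not code
from OFED: the helper name open_sdp_or_tcp is hypothetical, and the value
27 is an assumption based on the OFED SDP module and libsdp sources, so
check it against the stack you actually have installed.

```c
#include <sys/socket.h>

/*
 * Assumption: no public header defines AF_INET_SDP, so define it locally.
 * 27 is the value believed to be used by the OFED SDP module; verify it
 * before relying on it.
 */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/*
 * Hypothetical helper: try to open a stream socket in the SDP address
 * family and fall back to plain TCP if the kernel does not support SDP.
 * On success the family actually used is stored in *family and the
 * descriptor is returned; on failure -1 is returned.
 */
static int open_sdp_or_tcp(int *family)
{
	int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);

	if (fd >= 0) {
		*family = AF_INET_SDP;
		return fd;
	}

	/* Typically fails with EAFNOSUPPORT when SDP is not loaded. */
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd >= 0)
		*family = AF_INET;
	return fd;
}
```

Per the advice above, nothing else should need to change: the returned
descriptor would be used with the usual bind()/connect() calls. The
fallback to AF_INET is a design choice of this sketch so that the same
binary still runs on hosts without SDP.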

From rdreier at cisco.com Wed Apr 25 10:16:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 25 Apr 2007 10:16:51 -0700
Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching
In-Reply-To: <462E529E.2030604@ichips.intel.com> (Sean Hefty's message of "Tue, 24 Apr 2007 11:55:26 -0700")
References: <000201c782df$8f002de0$07fd070a@amr.corp.intel.com> <462E529E.2030604@ichips.intel.com>
Message-ID:

 > > > +static struct miscdevice local_sa_misc = {
 > > > +	.minor = MISC_DYNAMIC_MINOR,
 > > > +	.name = "ib_local_sa",
 > > > +};
 > > I don't understand why you're registering a miscdevice etc. I don't
 > > see any implementation of a character device or indeed any userspace
 > > interface at all. So what's up here?
 >
 > The cache creates the following files:
 >
 > /sys/class/misc/ib_local_sa/paths_per_dest
 > /sys/class/misc/ib_local_sa/refresh
 > /sys/class/misc/ib_local_sa/lookup_method

That seems like an abuse of the miscdevice stuff, since you don't
actually have a device. Why not just use module parameters? The only
difference would be that the paths start /sys/module/ib_local_sa/parameters
instead. Or if you really wanted to, I guess a sysctl would be
appropriate. But I don't think creating a device you don't use just
to get a sysfs directory is a good idea.

From sean.hefty at intel.com Wed Apr 25 10:18:04 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Apr 2007 10:18:04 -0700
Subject: [ofa-general] RE: [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order.
In-Reply-To: <1177515271.22094.33.camel@stevo-desktop>
Message-ID: <000201c7875d$af6ef020$8698070a@amr.corp.intel.com>

Thanks - I've pulled this in.
- Sean From rick.jones2 at hp.com Wed Apr 25 10:21:11 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 25 Apr 2007 10:21:11 -0700 Subject: [ofa-general] clueless noob and build probs with 1.2rc2 In-Reply-To: References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com> Message-ID: <462F8E07.4050000@hp.com> Scott Weitzenkamp (sweitzen) wrote: > There is no header, see https://bugs.openfabrics.org/show_bug.cgi?id=25. Thanks - BTW my browser got a trifle cranky accessing that site, complaining that the cert presented for bugs.openfabrics.org belongs to ":staging.openfabrics.org" I guess that until it is resolved I'll just kludge around it with my own define. rick jones From sean.hefty at intel.com Wed Apr 25 10:22:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 10:22:28 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: Message-ID: <000301c7875e$4cd26180$8698070a@amr.corp.intel.com> >That seems like an abuse of the miscdevice stuff, since you don't >actually have a device. Why not just use module parameters? The only >difference would be that the paths start /sys/module/ib_local_sa/parameters >instead. Or if you really wanted to, I guess a sysctl would be >appropriate. But I don't think creating a device you don't use just >to get a sysfs directory is a good idea. I want changes to these values to force a cache update. Can you do that with module parameters? - Sean From rdreier at cisco.com Wed Apr 25 10:37:34 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 10:37:34 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: <000301c7875e$4cd26180$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Wed, 25 Apr 2007 10:22:28 -0700") References: <000301c7875e$4cd26180$8698070a@amr.corp.intel.com> Message-ID: > I want changes to these values to force a cache update. Can you do that with > module parameters?
Sure... you'll have to implement your own set method but that's no different from putting attributes under your miscdevice. Just look at module_param_call() -- it's exactly what you want I think. From swise at opengridcomputing.com Wed Apr 25 10:39:28 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 25 Apr 2007 12:39:28 -0500 Subject: [ofa-general] [PATCH ofed-1.2 docs] Updates for iWARP and Chelsio Message-ID: <1177522768.11727.11.camel@stevo-desktop> Tziporet, Below is a patch to your docs git tree. It adds a cxgb3 release notes file, and updates the OFED release and install docs. --- Updates for chelsio. - added cxgb3 release notes file - updated ofed release notes and installation guide. Signed-off-by: Steve Wise --- OFED_Installation_Guide.txt | 8 +-- OFED_release_notes.txt | 13 +++- cxgb3_release_notes.txt | 127 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 140 insertions(+), 8 deletions(-) diff --git a/OFED_Installation_Guide.txt b/OFED_Installation_Guide.txt index ccc01d5..8c096b9 100644 --- a/OFED_Installation_Guide.txt +++ b/OFED_Installation_Guide.txt @@ -25,9 +25,9 @@ Table of contents ============================================================================== This is the OpenFabrics Enterprise Distribution (OFED) version 1.2 -software package supporting InfiniBand fabrics. It is composed of -several software modules intended for use on a computer cluster -constructed as an InfiniBand subnet. +software package supporting InfiniBand and iWARP fabrics. It is composed +of several software modules intended for use on a computer cluster +constructed as an InfiniBand subnet or iWARP network. This document describes how to install the various modules and test them in a Linux environment. @@ -76,7 +76,7 @@ The OFED Distribution package generates 3. 
HW and SW Requirements ============================================================================== -1) Server platform with InfiniBand HCA or iWARP NIC (see OFED Distribution +1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution Release Notes for details) 2) Linux OS (see OFED Distribution Release Notes for details) diff --git a/OFED_release_notes.txt b/OFED_release_notes.txt index e302ea5..e4f780d 100644 --- a/OFED_release_notes.txt +++ b/OFED_release_notes.txt @@ -11,7 +11,7 @@ Table of Contents 1. Overview, which includes: - OFED Distribution Rev 1.2 Contents - Supported Platforms and Operating Systems - - Supported HCA Adapter Cards and Firmware Versions + - Supported HCA and RNIC Adapter Cards and Firmware Versions - Tested Switch Platforms - Third party Test Packages - OFED sources @@ -36,7 +36,7 @@ all of its nodes to this new version. --------------------- The OFED package contains the following components: o OpenFabrics core and ULPs: - - HCA drivers (mthca, ipath, ehca) + - IB HCA and RNIC drivers (mthca, ipath, ehca, cxgb3) - core - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host and uDAPL o OpenFabrics utilities: @@ -57,6 +57,7 @@ Notes: 3. All other OFED components are of production quality. 4. See release notes for each package in the docs directory. 5. Any Topspin copyright belongs to Cisco Systems, Inc. +6. cxgb3 driver is in technology preview state. 1.2 Supported Platforms and Operating Systems --------------------------------------------- @@ -73,9 +74,10 @@ Notes: - SLES10: 2.6.16.21-0.8-smp - kernel.org: 2.6.17.x and 2.6.18.x -1.3 HCAs Supported +1.3 HCAs and RNICs Supported ------------------ -This release supports HCAs by Mellanox Technologies, Qlogic and IBM. +This release supports IB HCAs by Mellanox Technologies, Qlogic and IBM as +well as iWARP RNICs by Chelsio Communications. 
o Mellanox Technologies HCAs: - InfiniHost (fw-23108 Rev 3.5.000) @@ -96,6 +98,9 @@ This release supports HCAs by Mellanox T - GX Dual-port 4x IB HCA - GX Dual-port 12x IB HCA + o Chelsio RNICs: + - S310/S320 10GbE Storage Accelerators + - R310E 10GbE iWARP Adapters 1.4 Switches Supported ---------------------- diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt new file mode 100644 index 0000000..85bc774 --- /dev/null +++ b/cxgb3_release_notes.txt @@ -0,0 +1,127 @@ + + CHELSIO T3 RNIC RELEASE NOTES + +Author: Steve Wise +Last Updated: April, 2007 + +The iw_cxgb3 and cxgb3 modules provide iWARP and NIC support for the +Chelsio S310, S320, and R310 adapters. Make sure you choose the 'cxgb3' +options when generating your ofed-1.2 rpms. + +This release is a technology preview. + +============================================ +Loadable Module options: +============================================ + +The following options can be used when loading the iw_cxgb3 module to +tune the iWARP driver: + +cong_flavor - set the congestion congtrol algorithm. Default is 1. + 0 == Reno + 1 == Tahoe + 2 == NewReno + 3 == HighSpeed + +snd_win - set the TCP send window in bytes. Default is 32KB. + +rcv_win - set the TCP receive window in bytes. Default is 256KB. + +crc_enabled - set whether MPA CRC should be negotiated. Default is 1. + +markers_enabled - set whether to request receiving MPA markers. Default is + 0; do not request to receive markers. + + NOTE: The Chelsio RNIC fully supports markers, but + the current OFA RDMA-CM doesn't provide an API for + requesting either markers or crc to be negotiated. Thus + this functionality is provided via module parameters. + +mpa_rev - set the MPA revision to be used. Default is 1, which is + spec compliant. Set to 0 to connect with the Ammasso 1100 + rnic. + +ep_timeout_secs - set the number of seconds for timing out MPA start up + negotiation and normal close. Default is 10. 
+ +The following options can be used when loading the cxgb3 module to +tune the NIC driver: + +msi - whether to use MSI or MSI-X. Default is 2. + 0 = only pin + 1 = only MSI or pin + 2 = use MSI/X, MSI, or pin, based on system + +============================================ +Updating Firmware: +============================================ + +Contact chelsio to obtain the latest firmware and cxgbtool source. + +To build cxgbtool: + +# cd +# make && make install + +Then load the cxgb3 driver: + +# modprobe cxgb3 + +Now note the ethernet interface name for the T3 device. This can be +done by typing 'ifconfig -a' and noting the interface name for the +interface with a HW address that begins with "00:07:43". Then load the +new firmware: + +# cxgbtool ethxx loadfw +# reboot + +============================================ +Testing connectivity with ping and rping: +============================================ + +Configure the ethernet interfaces for your cxgb3 device. After you +modprobe iw_cxgb3 you will see one or two ethernet interfaces for the +T3 device. Configure them with an appropriate ip address, netmask, etc. +You can use the Linux ping command to test basic connectivity via the +T3 interface. 
+ +To test RDMA, use the rping command that is included in the librdmacm-utils +rpm: + +On the server machine: + +# rping -s -a 0.0.0.0 -p 9999 + +On the client machine: + +# rping -c -VvC10 -a server_ip_addr -p 9999 + +You should see ping data like this on the client: + +ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr +ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs +ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst +ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu +ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv +ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw +ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx +ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy +ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz +ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA +client DISCONNECT EVENT... +# + +============================================ +Addition Notes and Issues +============================================ + +1) To run uDAPL over the chelsio device, you must export this environment +variable: + + export DAPL_MAX_INLINE=64 + +2) If you have a multi-homed host and the physical ethernet networks are +bridged, then you need to configure arp to only send replies on the +interface with the target ip address: + + sysctl -w net.ipv4.conf.all.arp_ignore=2 From sean.hefty at intel.com Wed Apr 25 10:39:47 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 10:39:47 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching In-Reply-To: Message-ID: <000401c78760$b7ef1e20$8698070a@amr.corp.intel.com> >Sure... you'll have to implement your own set method but that's no >different from putting attributes under your miscdevice. 
Just look at >module_param_call() -- it's exactly what you want I think. Thanks - I'll update this. From rdreier at cisco.com Wed Apr 25 10:46:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 10:46:31 -0700 Subject: [ofa-general] Re: [PATCH] net/mlx4: modify sw reset In-Reply-To: <1177512129.10850.9.camel@mtls03> (Eli Cohen's message of "Wed, 25 Apr 2007 17:41:39 +0300") References: <1177512129.10850.9.camel@mtls03> Message-ID: Thanks, I applied a version of this that checks the vendor field is != 0xffff instead of checking it against the Mellanox value, because someone may build HCAs with a different vendor value. Also is it worth taking the semaphore in the mthca reset function? - R. From rdreier at cisco.com Wed Apr 25 10:50:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 10:50:59 -0700 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> (Moni Levy's message of "Wed, 25 Apr 2007 11:13:34 +0300") References: <20070328093345.GD11695@mellanox.co.il> <20070328200906.GJ4253@mellanox.co.il> <460ACED8.20605@gmail.com> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> Message-ID: > 1. Direct access in ib_find_pkey will probably hurt RC connections > per second rate. I think it's probably in the noise. And anyway I don't think the connection rate of ipoib CM is particularly important. And we can always optimize ib_find_pkey as part of the longer-term plan to get rid of ib_find_cached_pkey. If you want to tackle more of the cache elimination plan we discussed that would be great though. > 2. What do you think about OrG's opinion (I'm copying it from the other thread): He seems to be saying that it's OK to introduce a window where things fail spuriously. I disagree.
From or.gerlitz at gmail.com Wed Apr 25 11:57:35 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 25 Apr 2007 21:57:35 +0300 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: References: <20070328093345.GD11695@mellanox.co.il> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> Message-ID: <15ddcffd0704251157y7f09d208kf8b5a23e7394a11b@mail.gmail.com> On 4/25/07, Roland Dreier wrote: > > > 1. Direct access in ib_find_pkey will probably hurt RC connections > > per second rate. > > I think it's probably in the noise. And anyway I don't think the > connection rate of ipoib CM is particularly important. And we can > always optimize ib_find_pkey as part of the longer-term plan to get > rid of ib_find_cached_pkey. The rate of connections per second might be an interesting feature in the context of doing TCP offload, e.g. with SDP, when you want to see how many connections a web server or database can establish with clients per unit of time. With the cache elimination every new connection will consume two more IB commands (port query and pkey table read). > > 2. What do you think about OrG's opinion (I'm copying it from the other > thread): > > He seems to be saying that it's OK to introduce a window where things > fail spuriously. I disagree. Let it be. It makes sense to me to eliminate the cache. Or.
From tziporet at mellanox.co.il Wed Apr 25 12:49:38 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 25 Apr 2007 22:49:38 +0300 Subject: [ofa-general] OFED 1.2 April 25 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> References: <46231441.6050507@mellanox.co.il> <46238FC0.40906@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901563174@mtlexch01.mtl.com> OFED 1.2 April 25 meeting summary Main decisions: 1. RC was delayed to end of next week due to many critical open bugs 2. RC3 due date is May 3. 3. All code changes should be ready for May 2 4. Bug fixes after RC3 will have to be approved by the RM (Tziporet) Cluster testing: - Intel will test IPoIB, Intel MPI and MVAPICH on 256 nodes cluster - was not started yet - The labs will test Open MPI and MVAPICH on 256 nodes cluster - status unknown Note: There will be no coordination meeting next week - CU at Sonoma. Tziporet From robert.j.woodruff at intel.com Wed Apr 25 12:59:42 2007 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 25 Apr 2007 12:59:42 -0700 Subject: [ofa-general] OFED 1.2 April 25 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563174@mtlexch01.mtl.com> Message-ID: > - Intel will test IPoIB, Intel MPI and MVAPICH on 256 nodes cluster - was not started yet This will begin starting tomorrow night. We had to delay a week to allow them to upgrade to EL4-u4, since OFED 1.2 did not support EL4-U2, which is what they were running. woody ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Wednesday, April 25, 2007 12:50 PM To: EWG Cc: OPENIB Subject: [ofa-general] OFED 1.2 April 25 meeting summary OFED 1.2 April 25 meeting summary Main decisions: 1.
RC was delayed to end of next week due to many critical open bugs 2. RC3 due date is May 3. 3. All code changes should be ready for May 2 4. Bug fixes after RC3 will have to be approved by the RM (Tziporet) Cluster testing: - Intel will test IPoIB, Intel MPI and MVAPICH on 256 nodes cluster - was not started yet - The labs will test Open MPI and MVAPICH on 256 nodes cluster - status unknown Note: There will be no coordination meeting next week - CU at Sonoma. Tziporet From johnip at sgi.com Wed Apr 25 13:05:26 2007 From: johnip at sgi.com (John Partridge) Date: Wed, 25 Apr 2007 15:05:26 -0500 Subject: [ofa-general] opensmd init.d script question Message-ID: <462FB486.9090403@sgi.com> Hi Hal, I am working on a new SGI product that will have two separate InfiniBand fabrics. Each of these fabrics may have a different topology and could be running one of a number of routing engines (i.e. lash or up/dn) the Subnet Management for both fabrics will run on one host (leader node). Out of the box OFED-1.2 does not have a good way to achieve managing this. Ideally I would like to have the flexibility of chkconfig controlling each fabric (i.e., ib0 ib1), but, I have found that the insserv mechanism has severe limitations. BTW we are running SuSE Sles10. I just wonder if you had come across this kind of config and if you have any ideas about how you see this working. It looks like I need to have more than one opensmd (one for each fabric) but that is looking like it will not work either because of the insserv limitations. Any help or advice you have would be appreciated. 
Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From lawver1 at llnl.gov Wed Apr 25 13:12:06 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Wed, 25 Apr 2007 13:12:06 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070425124652.GG1624@mellanox.co.il> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> Message-ID: <6.1.2.0.2.20070425074340.134248b8@mail.llnl.gov> Thanks for the input. I had found the debug switch and it pointed out some issues. Drops at the routing node are many. It seems to work at debug rate so I hope it is not a buffer pool issue. At 05:46 AM 4/25/2007, Michael S. Tsirkin wrote: > > Quoting Bryan Lawver : > > Subject: IPoIB forwarding > > > > I have a small test bed with 2 nodes with IB/OFED1.2/connected mode and a > > third node which has IP only and is connected to one of the IB nodes. In > > between are DDR IB switch and 10GE IP switch. The node with both IP > and IB > > interfaces is simply a IP router in this test setup. The IB only node has > > a subnet route to router node and the IP only node has a subnet route to > > the router node. > > > > When I launch an Iperf test from the IB (IPoIB) node to the IP node, I get > > very good throughput with no tuning (7.5gbs). > > > > When I launch from IP to the IB node, I get virtually no thorughput > > (2.5mbs). When I dropped the window size to 8k (iperf -w8k) the > throughput > > is 750mbs. > > > > Any suggestions, ideas? > >Some troubleshooting tips: > >Are some packets lost on the router? Checking packet counters >might give you a clue. > >Do you see some errors on one of the IB nodes? >Set debug_level=1 module parameter for ib_ipoib, and check >dmesg output while running the test. 
> >-- >MST From halr at voltaire.com Wed Apr 25 13:24:01 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:24:01 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070423212509.GH20972@obsidianresearch.com> References: <20070423190449.GR4579@mellanox.co.il> <000101c785df$0831ef80$e598070a@amr.corp.intel.com> <20070423202051.GS4579@mellanox.co.il> <20070423212509.GH20972@obsidianresearch.com> Message-ID: <1177532641.12542.7153.camel@hal.voltaire.com> On Mon, 2007-04-23 at 17:25, Jason Gunthorpe wrote: > On Mon, Apr 23, 2007 at 11:20:59PM +0300, Michael S. Tsirkin wrote: > > > I haven't thought this through yet. Basically, I just note that > > caching the path until GID goes out of service isn't right - since > > path parameters such as MTU or rate might change without GID going > > out of service. > > > > So what to do? > > Has anyone thought about using replication rather than caching to > solve this problem? Unfortunately, IMO, the IBTA punted on database replication for SA. > It seems to me it would be alot faster for some > single process in the network to fetch and keep a copy of the entire > SA route database, format it into a binary format and use RC RDMA to > transfer it to every node each time it changes. Not sure one can rely on RC RDMA. Not all SMs are built on top of ports capable of this. I think UD is the only requirement there (switch port 0). One could have a CA based server node intermediary though. -- Hal > For say, 10000 nodes you could compact an any-to-any path table into > around 20 megabytes. > > The RDMA transfers would be arranged into a waterfall, source > transfers to 8 nodes, who then each transfer to 8, etc. Choosing a > connection topology that overlays the switch topology would give this > scheme a huge aggregate bandwidth so the total transfer time would be > short. 
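The waterfall distribution described above (the source transfers to 8 nodes, each of which transfers to 8 more, and so on) implies only a handful of transfer stages even for 10000 nodes. The following is a rough arithmetic sketch, not anything from the proposal itself: the function names are hypothetical, and it assumes a strict tree in which only the nodes that received the table in the previous stage forward it in the next one.

```c
/* Hypothetical helper: number of nodes holding the table (including the
 * source) after `stages` rounds, when only the newest receivers forward
 * it to `fanout` further nodes each. */
static unsigned long nodes_reached(unsigned fanout, unsigned stages)
{
    unsigned long total = 1; /* the source itself */
    unsigned long level = 1; /* nodes added in the most recent stage */
    for (unsigned i = 0; i < stages; i++) {
        level *= fanout;
        total += level;
    }
    return total;
}

/* Hypothetical helper: smallest number of stages needed to cover `nodes`
 * nodes. Returns 0 when fanout is 0, since nothing can be forwarded. */
static unsigned stages_needed(unsigned long nodes, unsigned fanout)
{
    unsigned stages = 0;
    unsigned long total = 1, level = 1;
    if (fanout == 0)
        return 0;
    while (total < nodes) {
        level *= fanout;
        total += level;
        stages++;
    }
    return stages;
}
```

With a fan-out of 8, four stages reach 1 + 8 + 64 + 512 + 4096 = 4681 nodes, so stages_needed(10000, 8) evaluates to 5. If earlier receivers also kept forwarding in later stages, coverage would grow faster, so the count under this strict-tree assumption is an upper bound.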
> > Unfortunately the SA protocol doesn't seem to have many provisions for > cache-coherence so it seems any form of route caching is going to run > into problems with stale data :< Replication adds a coherence > mechanism and shifts the problem to the replication source, which, > ideally, would ultimately be tightly connected to the SA. > > We could use DR SMPs to do network discovery and at least check that > > paths are valid - it's not too much code (ibnetdiscover is just 800 > > lines) and in a sense, that's actually putting an *SA* (not just > > cache) in each node. Combined with GID IN/OUT notices we could get > > away from querying path records completely. > > I don't think you can find/check the SL like this, plus I doubt the > little CPUs in the switches can handle that rate of SMPs. :< > > Jason From halr at voltaire.com Wed Apr 25 13:24:49 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:24:49 -0400 Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> References: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> Message-ID: <1177532687.12542.7240.camel@hal.voltaire.com> On Mon, 2007-04-23 at 17:32, Sean Hefty wrote: > >Isn't there a way to get notice for this? > > The closest trap I'm aware of is GID in/out of service. See 14.2.5.1 and > 14.4.9. GID in/out of service is related to the existence of a path record > between the SGID and DGID. If the path record parameters change, I'm not sure > if the GID technically goes out, then back into service or not. Maybe Hal > knows. No; there is no way to do this that I'm aware of. -- Hal [snip...]
From halr at voltaire.com Wed Apr 25 13:27:19 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:27:19 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 1/3] 2.6.22 or 23 ib/sa: add registration for sa events In-Reply-To: <000101c782df$43b2fcf0$07fd070a@amr.corp.intel.com> References: <000101c782df$43b2fcf0$07fd070a@amr.corp.intel.com> Message-ID: <1177532692.12542.7242.camel@hal.voltaire.com> On Thu, 2007-04-19 at 20:03, Sean Hefty wrote: > IB/sa: Add InformInfo/Notice support. > > From: Sean Hefty [snip...] > diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h > index 5e26b2f..46b52fd 100644 > --- a/include/rdma/ib_sa.h > +++ b/include/rdma/ib_sa.h > @@ -254,6 +254,126 @@ struct ib_sa_service_rec { > u64 data64[2]; > }; > > +enum { > + IB_SA_EVENT_TYPE_FATAL = 0x0, > + IB_SA_EVENT_TYPE_URGENT = 0x1, > + IB_SA_EVENT_TYPE_SECURITY = 0x2, > + IB_SA_EVENT_TYPE_SM = 0x3, > + IB_SA_EVENT_TYPE_INFO = 0x4, > + IB_SA_EVENT_TYPE_EMPTY = 0x7F, > + IB_SA_EVENT_TYPE_ALL = 0xFFFF > +}; > + > +enum { > + IB_SA_EVENT_PRODUCER_TYPE_CA = 0x1, > + IB_SA_EVENT_PRODUCER_TYPE_SWITCH = 0x2, > + IB_SA_EVENT_PRODUCER_TYPE_ROUTER = 0x3, > + IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER = 0x4, > + IB_SA_EVENT_PRODUCER_TYPE_ALL = 0xFFFFFF > +}; > + > +enum { > + IB_SA_SM_TRAP_GID_IN_SERVICE = 64, > + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65, > + IB_SA_SM_TRAP_CREATE_MC_GROUP = 66, > + IB_SA_SM_TRAP_DELETE_MC_GROUP = 67, > + IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128, > + IB_SA_SM_TRAP_LINK_INTEGRITY = 129, > + IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130, > + IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131, > + IB_SA_SM_TRAP_BAD_M_KEY = 256, > + IB_SA_SM_TRAP_BAD_P_KEY = 257, > + IB_SA_SM_TRAP_BAD_Q_KEY = 258, > + IB_SA_SM_TRAP_ALL = 0xFFFF > +}; Just a nit question: Any reason trap 259 was omitted here ? -- Hal [snip...] 
From halr at voltaire.com Wed Apr 25 13:27:25 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:27:25 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000301c785f3$8768adc0$e598070a@amr.corp.intel.com> References: <000301c785f3$8768adc0$e598070a@amr.corp.intel.com> Message-ID: <1177532839.12542.7412.camel@hal.voltaire.com> On Mon, 2007-04-23 at 18:05, Sean Hefty wrote: > >Has anyone thought about using replication rather than caching to > >solve this problem? It seems to me it would be alot faster for some > >single process in the network to fetch and keep a copy of the entire > >SA route database, format it into a binary format and use RC RDMA to > >transfer it to every node each time it changes. > > I have given thought to using RC RDMA to distribute the data to all nodes, > especially to eliminate the MAD protocol overhead. There are a couple issues > with this: > > To work with existing SAs, we need to working within the defined SA interface > (i.e. SA MADs), so something still needs to query for all path records. > > The GetTable query requires an SGID, which means that whatever node collects the > path records must first collect all the GIDs. (And the most efficient way I've > found to obtain a list of all GIDs is via a GetTable path record query...) This > also means that the node collecting the path records will generate 1 query per > GID. This has the same impact on the SA as each node issuing their own query. > And the impact on the subnet is higher, since we still need to distribute that > data to the end nodes. > > In short, until we can standardize on some new SA interface, or we have a > distributed SA, I don't see where we can do much better than caching GetTable > responses at the end nodes. Or perhaps an additional standard for replication. I think distributed SA is a separate related issue. 
-- Hal > - Sean From mst at dev.mellanox.co.il Wed Apr 25 13:29:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Apr 2007 23:29:03 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <1177532687.12542.7240.camel@hal.voltaire.com> References: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> <1177532687.12542.7240.camel@hal.voltaire.com> Message-ID: <20070425202903.GC5217@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache > > On Mon, 2007-04-23 at 17:32, Sean Hefty wrote: > > >Isn't there a way to get notice for this? > > > > The closest trap I'm aware of is GID in/out of service. See 14.2.5.1 and > > 14.4.9. GID in/out of service is related to the existence of a path record > > between the SGID and DGID. If the path record parameters change, I'm not sure > > if the GID technically goes out, then back into service or not. Maybe Hal > > knows. > > No; there is no way to do this that I'm aware of. We can get notice on port state changes though, can't we? [snip...] -- MST From mst at dev.mellanox.co.il Wed Apr 25 13:31:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Apr 2007 23:31:18 +0300 Subject: [ofa-general] Re: [PATCH] net/mlx4: modify sw reset In-Reply-To: References: <1177512129.10850.9.camel@mtls03> Message-ID: <20070425203118.GD5217@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] net/mlx4: modify sw reset > > Thanks, I applied a version of this that checks the vendor field is != > 0xffff instead of checking it against the Mellanox value, because > someone may build HCAs with a different vendor value.
> > Also is it worth taking the semaphore in the mthca reset function? Why not? PRM says we should - the point of this is protecting against someone accidentally touching flash while device is reset, which will result in hangs, system errors or even data corruption. -- MST From halr at voltaire.com Wed Apr 25 13:30:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:30:45 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache In-Reply-To: <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> References: <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> Message-ID: <1177533043.12542.7580.camel@hal.voltaire.com> On Mon, 2007-04-23 at 18:51, Sean Hefty wrote: > >We could solve this by implementing a process running on the same node as the > >SA. > >And it's probably not too hard to add a way for opensm to spit out > >the table into an external file when it gets a signal or something. > > I agree that there are ways to solve this, but those solutions won't work with > existing SAs and define a new SA interface. If we're willing to break > compatibility or add extensions, we could also extend the SA to provide better > support for caching. For example, add a new 'path updated' trap. There is ongoing work here. Stay tuned for IBA 1.2.1 hopefully coming soon... -- Hal > IMO, I don't think that there's a huge issue initially populating the cache. > The problems all seem to fall into keeping it updated. I originally thought > this would have been a bigger deal, but given that ipoib doesn't update its > cache, it doesn't seem to be an issue in practice.
> > - Sean From sean.hefty at intel.com Wed Apr 25 13:31:56 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 13:31:56 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 1/3] 2.6.22 or 23 ib/sa: add registration for saevents In-Reply-To: <1177532692.12542.7242.camel@hal.voltaire.com> Message-ID: <000001c78778$c4b9e190$8258180a@amr.corp.intel.com> >Any reason trap 259 was omitted here ? Just an oversight. - Sean From rdreier at cisco.com Wed Apr 25 13:33:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 13:33:46 -0700 Subject: [ofa-general] Re: [PATCH] net/mlx4: modify sw reset In-Reply-To: <20070425203118.GD5217@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 25 Apr 2007 23:31:18 +0300") References: <1177512129.10850.9.camel@mtls03> <20070425203118.GD5217@mellanox.co.il> Message-ID: > > Also is it worth taking the semaphore in the mthca reset function? > > Why not? PRM says we should - the point of this is protecting against someone > accidentally touching flash while device is reset, > which will result in hangs, system errors or even data corruption. OK... that was just my subtle way of asking for a patch to do that.
From rdreier at cisco.com Wed Apr 25 13:36:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 13:36:39 -0700 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <15ddcffd0704251157y7f09d208kf8b5a23e7394a11b@mail.gmail.com> (Or Gerlitz's message of "Wed, 25 Apr 2007 21:57:35 +0300") References: <20070328093345.GD11695@mellanox.co.il> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <15ddcffd0704251157y7f09d208kf8b5a23e7394a11b@mail.gmail.com> Message-ID: > rate of connections per second might be an interesting feature in the context > of doing TCP offload eg with SDP, when you want to see how many connections > can a web server or database establish with clients in unit of time. With > the cache elimination every new connection will consume two more IB commands > (port query and pkey table read). OK, but that seems irrelevant to the discussion at hand. In a case like that I would expect you to keep track of the P_Key index to use and not look it up again for every connection. ...in fact, why is ipoib_cm.c looking up the pkey index for every connection? Shouldn't it use the pkey index that datagram mode already looked up? - R. From sean.hefty at intel.com Wed Apr 25 13:38:29 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 13:38:29 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070425202903.GC5217@mellanox.co.il> Message-ID: <000101c78779$aeb441f0$8258180a@amr.corp.intel.com> >We can get notice on port state changes though, can't we? Trap 128 (sent by switches) indicates that the "link state at least one port of switch at has changed". I think it would be difficult for all nodes to determine which paths were affected. 
- Sean From halr at voltaire.com Wed Apr 25 13:40:23 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:40:23 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <20070425202903.GC5217@mellanox.co.il> References: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> <1177532687.12542.7240.camel@hal.voltaire.com> <20070425202903.GC5217@mellanox.co.il> Message-ID: <1177533621.12542.8248.camel@hal.voltaire.com> On Wed, 2007-04-25 at 16:29, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > Subject: RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache > > > > On Mon, 2007-04-23 at 17:32, Sean Hefty wrote: > > > >Isn't there a way to get notice for this? > > > > > > The closest trap I'm aware of is GID in/out of service. See 14.2.5.1 and > > > 14.4.9. GID in/out of service is related to the existence of a path record > > > between the SGID and DGID. If the path record parameters change, I'm not sure > > > if the GID technically goes out, then back into service or not. Maybe Hal > > > knows. > > > > No; there is no way to do this that I'm aware of. > > We can get notic on port state changes though, can't we? Sure but those were intended for SM. The end node would then need to do DR SMP stuff (not just diagnostics) and there would be MKey assumptions as well as other things. DR SMPs are always slow pathed and there is no flow control on VL15. -- Hal > [snip...] From swise at opengridcomputing.com Wed Apr 25 13:50:19 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 25 Apr 2007 15:50:19 -0500 Subject: [ofa-general] [PATCH 6/6] [RFC]mlx4 build system stuff In-Reply-To: <20074201532.jhCN1hLvxaAengJm@cisco.com> References: <20074201532.jhCN1hLvxaAengJm@cisco.com> Message-ID: <1177534219.29222.0.camel@stevo-desktop> On Fri, 2007-04-20 at 15:32 -0700, Roland Dreier wrote: > Hook up mlx4_core and mlx4_ib drivers to Kconfig and Makefiles. 
> > Signed-off-by: Roland Dreier > > --- > > infiniband/Kconfig | 2 ++ > infiniband/Makefile | 1 + > infiniband/hw/mlx4/Kconfig | 9 +++++++++ > infiniband/hw/mlx4/Makefile | 3 +++ > net/Kconfig | 14 ++++++++++++++ > net/Makefile | 1 + > net/mlx4/Makefile | 4 ++++ > 7 files changed, 34 insertions(+) > > diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig > index 82afba5..37deaae 100644 > --- a/drivers/infiniband/Kconfig > +++ b/drivers/infiniband/Kconfig > @@ -45,6 +45,8 @@ source "drivers/infiniband/hw/ehca/Kconfig" > source "drivers/infiniband/hw/amso1100/Kconfig" > source "drivers/infiniband/hw/cxgb3/Kconfig" > > +source "drivers/infiniband/hw/mlx4/Kconfig" > + > source "drivers/infiniband/ulp/ipoib/Kconfig" > > source "drivers/infiniband/ulp/srp/Kconfig" > diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile > index da2066c..75f325e 100644 > --- a/drivers/infiniband/Makefile > +++ b/drivers/infiniband/Makefile > @@ -4,6 +4,7 @@ obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ > obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ > obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ > obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ > +obj-$(CONFIG_MLX4_INFINIBAND) += hw/mlx4/ > obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ > obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ > obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ > diff --git a/drivers/infiniband/hw/mlx4/Kconfig b/drivers/infiniband/hw/mlx4/Kconfig > new file mode 100644 > index 0000000..b8912cd > --- /dev/null > +++ b/drivers/infiniband/hw/mlx4/Kconfig > @@ -0,0 +1,9 @@ > +config MLX4_INFINIBAND > + tristate "Mellanox ConnectX HCA support" > + depends on INFINIBAND > + select MLX4_CORE > + ---help--- > + This driver provides low-level InfiniBand support for > + Mellanox ConnectX PCI Express host channel adapters (HCAs). > + This is required to use InfiniBand protocols such as > + IP-over-IB or SRP with these devices. 
> diff --git a/drivers/infiniband/hw/mlx4/Makefile b/drivers/infiniband/hw/mlx4/Makefile > new file mode 100644 > index 0000000..70f09c7 > --- /dev/null > +++ b/drivers/infiniband/hw/mlx4/Makefile > @@ -0,0 +1,3 @@ > +obj-$(CONFIG_MLX4_INFINIBAND) += mlx4_ib.o > + > +mlx4_ib-y := ah.o cq.o doorbell.o mad.o main.o mr.o qp.o srq.o > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > index c3f9f59..842f020 100644 > --- a/drivers/net/Kconfig > +++ b/drivers/net/Kconfig > @@ -2493,6 +2493,20 @@ config PASEMI_MAC > This driver supports the on-chip 1/10Gbit Ethernet controller on > PA Semi's PWRficient line of chips. > > +config MLX4_CORE > + tristate > + depends on PCI > + default n > + No help menu for the core module? > +config MLX4_DEBUG > + bool "Verbose debugging output" if (MLX4_CORE && EMBEDDED) > + default y > + ---help--- > + This option causes debugging code to be compiled into the > + mlx4_core driver. The output can be turned on via the > + debug_level module parameter (which can also be set after > + the driver is loaded through sysfs). 
> + > endmenu > > source "drivers/net/tokenring/Kconfig" > diff --git a/drivers/net/Makefile b/drivers/net/Makefile > index 33af833..1604e1a 100644 > --- a/drivers/net/Makefile > +++ b/drivers/net/Makefile > @@ -197,6 +197,7 @@ obj-$(CONFIG_SMC911X) += smc911x.o > obj-$(CONFIG_DM9000) += dm9000.o > obj-$(CONFIG_FEC_8XX) += fec_8xx/ > obj-$(CONFIG_PASEMI_MAC) += pasemi_mac.o > +obj-$(CONFIG_MLX4_CORE) += mlx4/ > > obj-$(CONFIG_MACB) += macb.o > > diff --git a/drivers/net/mlx4/Makefile b/drivers/net/mlx4/Makefile > new file mode 100644 > index 0000000..4f18889 > --- /dev/null > +++ b/drivers/net/mlx4/Makefile > @@ -0,0 +1,4 @@ > +obj-$(CONFIG_MLX4_CORE) += mlx4_core.o > + > +mlx4_core-y := alloc.o cmd.o cq.o eq.o fw.o icm.o intf.o main.o mcg.o mr.o \ > + pd.o profile.o qp.o reset.o srq.o > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Apr 25 13:54:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 16:54:15 -0400 Subject: [ofa-general] Re: opensmd init.d script question In-Reply-To: <462FB486.9090403@sgi.com> References: <462FB486.9090403@sgi.com> Message-ID: <1177534453.12542.9088.camel@hal.voltaire.com> Hi John, On Wed, 2007-04-25 at 16:05, John Partridge wrote: > Hi Hal, > > I am working on a new SGI product that will have two separate InfiniBand fabrics. > Each of these fabrics may have a different topology and could be running one of > a number of routing engines (i.e. lash or up/dn) the Subnet Management for both > fabrics will run on one host (leader node). Out of the box OFED-1.2 does > not have a good way to achieve managing this. Ideally I would like to have the > flexibility of chkconfig controlling each fabric (i.e., ib0 ib1), but, I have > found that the insserv mechanism has severe limitations. 
BTW we are running > SuSE Sles10. > > I just wonder if you had come across this kind of config and if you have any ideas > about how you see this working. It looks like I need to have more than one opensmd > (one for each fabric) but that is looking like it will not work either because > of the insserv limitations. > > Any help or advice you have would be appreciated. One can run 2 OpenSMs on different CA/ports on a single machine. The main things in doing this is setting them up to use different directories. This is accomplished via setting OSM_CACHE_DIR. You will want to configure dump_files_dir and log_file to be different. Also, you will likely want different subnet_prefix configured in each subnet (opensm.opts). There may be other configuration files different as well based on what your requirements are. Hope this helps. As to how to do this with opensmd, I'm not sure as I don't work with that. When you figure this out, it would be useful if you posted the information. Thanks. -- Hal > Thanks > John From rdreier at cisco.com Wed Apr 25 13:57:20 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 13:57:20 -0700 Subject: [ofa-general] [PATCH 6/6] [RFC]mlx4 build system stuff In-Reply-To: <1177534219.29222.0.camel@stevo-desktop> (Steve Wise's message of "Wed, 25 Apr 2007 15:50:19 -0500") References: <20074201532.jhCN1hLvxaAengJm@cisco.com> <1177534219.29222.0.camel@stevo-desktop> Message-ID: > > +config MLX4_CORE > > + tristate > > + depends on PCI > > + default n > > + > > No help menu for the core module? It's an invisible option. You get it if you enable anything that uses it (mlx4_ib or mlx4_eth), and you never even have to know about the option. - R. 
From rick.jones2 at hp.com Wed Apr 25 14:23:18 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 25 Apr 2007 14:23:18 -0700 Subject: [ofa-general] initial set of "direct" SDP tests in netperf In-Reply-To: <462F8E07.4050000@hp.com> References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com> <462F8E07.4050000@hp.com> Message-ID: <462FC6C6.9050700@hp.com> > I guess that until it is resolved I'll just kludge around it with my own > define. Soo, I did a bunch of cut and paste in netperf, where I take what getaddrinfo() returns and replace the ->ai_family with AF_INET_SDP and set the ->ai_protocol to 0 (a guess since there isn't much in the way of docs on the subject I could find). I have implemented the SDP_STREAM, SDP_MAERTS and SDP_RR tests thus far, they are enabled via a configure option of --enable-sdp . You can grab the bits from the top of trunk of the netperf2 repository at: http://www.netperf.org/svn/netperf2/trunk/ I've done some initial, cursory testing with the bits that ship with RHEL5 IA64 - I seem to have some sort of overlapping problem still even using the install.sh that purports to remove previous bits - and I've tried the rpm -e command roland (?) posted a few days ago - it says that none of those things are installed. I tried to modprobe ib_sdp (some additional guesswork) and got symbol version mismatches. Still, even with OFED 1.2rc2 removed there are ib_mumble modules loaded by RHEL5, and .ko's in the standard (?) modules place. So clearly I have some remaining clueless noob issues wrt proper installation of 1.2 bits :( happy benchmarking, rick jones BTW, speaking of cluelessness - you will probably have an initial make failure involving netperf_version.h - I've got something still slightly botched there I've not been able to figure out just yet (make and autotools aren't exactly my forte). You can work around that by cd'ing to src/ and doing a "make netperf-version.h" and then go back up and do the make again. 
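[Editorial sketch, not part of the original message and not netperf's actual code.] The getaddrinfo() override described at the top of this message can be pictured roughly as below. The value 27 for AF_INET_SDP is an assumption (it stands in for the "own define" mentioned above), and resolve_for_sdp() is a hypothetical helper name. The sockaddr inside each result is left untouched; whether its sin_family also needs changing is not something this sketch tries to answer.

```c
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumption: 27 is the SDP address family number; defined locally
 * because the system headers may not provide it. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/*
 * Hypothetical helper: resolve a numeric IPv4 host/port exactly as for a
 * TCP test, then retarget every result at SDP by overriding ai_family
 * and clearing ai_protocol (the "guess" described in the message).
 * Returns 0 on success, or the getaddrinfo() error code.
 * The caller releases the list with freeaddrinfo().
 */
static int resolve_for_sdp(const char *host, const char *port,
                           struct addrinfo **res)
{
    struct addrinfo hints;
    struct addrinfo *ai;
    int rc;

    assert(res != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* resolve as plain IPv4 first */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;  /* keep the sketch free of DNS lookups */

    rc = getaddrinfo(host, port, &hints, res);
    if (rc != 0)
        return rc;

    for (ai = *res; ai != NULL; ai = ai->ai_next) {
        ai->ai_family = AF_INET_SDP;  /* was AF_INET */
        ai->ai_protocol = 0;          /* was IPPROTO_TCP */
    }
    return 0;
}
```

A caller would then pass ai_family, ai_socktype and ai_protocol from the result straight to socket(); on a machine without a working ib_sdp module that call is expected to fail, which is a quick way to tell whether the "direct" path is usable at all.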
PPS - the only remaining TCP tests of note I could bring-over would be: TCP_SENDFILE - can one use sendfile() against an SDP socket? TCP_CRR - like TCP_RR but includes time to call connect() TCP_CC - like TCP_CRR, but without the RR :) feedback on which of those, if any, would be of interest would be most welcome. After that I suppose RDS would be next on the list? Is that "real" at this point? Pointers on programming to it would be welcome. From sweitzen at cisco.com Wed Apr 25 14:39:08 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 25 Apr 2007 14:39:08 -0700 Subject: [ofa-general] RE: initial set of "direct" SDP tests in netperf In-Reply-To: <462FC6C6.9050700@hp.com> References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com> <462F8E07.4050000@hp.com> <462FC6C6.9050700@hp.com> Message-ID: Rick, I still think this is unnecessary copy-paste code duplication, just so the report says SDP instead of TCP. Part of the advantage of libsdp.so is you can use SDP w/o having to recode your application. Do you really want to maintain duplicate code for all the TCP_STREAM-vs-SDP_STREAM, etc. tests? Yuck! Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Rick Jones [mailto:rick.jones2 at hp.com] > Sent: Wednesday, April 25, 2007 2:23 PM > To: general at lists.openfabrics.org > Cc: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier) > Subject: initial set of "direct" SDP tests in netperf > > > I guess that until it is resolved I'll just kludge around > it with my own > > define. > > Soo, I did a bunch of cut and paste in netperf, where I take what > getaddrinfo() returns and replace the ->ai_family with > AF_INET_SDP and > set the ->ai_protocol to 0 (a guess since there isn't much in > the way of > docs on the subject I could find). > > I have implemented the SDP_STREAM, SDP_MAERTS and SDP_RR > tests thus far, > they are enabled via a configure option of --enable-sdp . 
> You can grab > the bits from the top of trunk of the netperf2 repository at: > > http://www.netperf.org/svn/netperf2/trunk/ > > I've done some initial, cursory testing with the bits that ship with > RHEL5 IA64 - I seem to have some sort of overlapping problem > still even > using the install.sh that purports to remove previous bits - and I've > tried the rpm -e command roland (?) posted a few days ago - > it says that > none of those things are installed. I tried to modprobe ib_sdp (some > additional guesswork) and got symbol version mismatches. Still, even > with OFED 1.2rc2 removed there are ib_mumble modules loaded by RHEL5, > and .ko's in the standard (?) modules place. So clearly I have some > remaining clueless noob issues wrt proper installation of 1.2 bits :( > > happy benchmarking, > > rick jones > > BTW, speaking of cluelessness - you will probably have an > initial make > failure involving netperf_version.h - I've got something > still slightly > botched there I've not been able to figure-out just yet (make and > autotools aren't exactly my forte). You can work around that > by cd'ing > to src/ and doing a "make netperf-version.h" and then go back > up and to > themake again. > > PPS - the only remaining TCP tests of note I could bring-over > would be: > > TCP_SENDFILE - can one use sendfile() against an SDP socket? > TCP_CRR - like TCP_RR but includes time to call connect() > TCP_CC - like TCP_CRR, but without the RR :) > > feedback on which of those, if any, would be of interest > would be most > welcome. > > After that I suppose RDS would be next on the list? Is that > "real" at > this point? Pointers on programming to it would be welcome. 
> From sean.hefty at intel.com Wed Apr 25 14:49:43 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 14:49:43 -0700 Subject: [ofa-general] autotools question Message-ID: <000201c78783$a2841860$8258180a@amr.corp.intel.com> Has anyone run into an issue with autotools not generating the .so extension to built library files, or know how to fix such an issue? - Sean From halr at voltaire.com Wed Apr 25 14:56:44 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Apr 2007 17:56:44 -0400 Subject: [ofa-general] [RFC] IB management changes proposal In-Reply-To: <462C7F17.3040707@cea.fr> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> Message-ID: <1177538202.12542.13079.camel@hal.voltaire.com> On Mon, 2007-04-23 at 05:40, GREGOIRE Philippe wrote: > Hal Rosenstock a écrit : > > The following changes are proposed for IB management (master branch of > > my management git tree): > > > > In order to better match package names, the following directory names to > > be changed from->to: > > osm->opensm > > diags->openib-diags > > > > Since opensm is a system daemon, opensm to be moved from /usr/bin to /usr/sbin > > > > For consistency with the package name, /var/cache/osm moved to > > /var/cache/opensm > > > > Also, for consistency with the package name, all config, log, and dump files named osm* > > to be changed to opensm* > > > > To avoid confusion and possible conflicts in configuring daemon options, > > only have 1 configuration file (existence of both /etc/sysconfig/opensm > > and /etc/opensm.conf is problematic). Remove the /etc/sysconfig/opensm > > file and only use opensm.conf. Move opensm.conf to /etc/rdma (as > > discussed in the thread labeled "Location and naming of RDMA enablement > > stack rpm" on general at lists.openfabrics.org. > > > > Any comments ? 
> > > > -- Hal > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > commands provided by openib-diags should be installed also in /usr/sbin > as they are privileged system administrator commands. OK; that's the general usage based on umad access. > There also some few commands (ib*.pl) that are using a file > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology I'm not sure about this one. I need to think about this more. -- Hal > Philippe From rick.jones2 at hp.com Wed Apr 25 14:58:29 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 25 Apr 2007 14:58:29 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com> <462F8E07.4050000@hp.com> <462FC6C6.9050700@hp.com> Message-ID: <462FCF05.5030900@hp.com> Scott Weitzenkamp (sweitzen) wrote: > Rick, I still think this is unecessary copy-paste code duplication, just > so the report says SDP instead of TCP. Part of the advantage of > libsdp.so is you can use SDP w/o having to recode your application. Understood. > Do you really want to maintain duplicate code for all the > TCP_STREAM-vs-SDP_STREAM, etc. tests? Yuck! If it gets too heinous it will simply, finally force me to further modularize the code so there isn't as much overlap. After about 14 years of "netperf-in-the-wild" now I have seen enough that I want to make sure that if SDP is being used, that SDP is what appears in the test banners. Or at least make it possible to be so, clearly I cannot (and likely wouldn't) do anything in netperf to try to preclude using libsdp. 
happy benchmarking, rick jones From mike.heffner at evergrid.com Wed Apr 25 15:19:35 2007 From: mike.heffner at evergrid.com (Mike Heffner) Date: Wed, 25 Apr 2007 18:19:35 -0400 Subject: [ofa-general] Requesting CQ notifications Message-ID: <462FD3F7.1010304@evergrid.com> Is there a differentiation between multiple CQE's being in the CQ vs. CQE's being arriving into the CQ when using completion notifications? For example, assume I have the following order of events: 2 CQEs arrive select() returns readable for comp. channel ibv_get_cq_event() returns event ibv_req_notify_cq(cq, 0) ibv_poll_cq(cq, 1, &cqe) returns 1 ibv_ack_cq_events(cq, 1) Will the comp. channel receive another event for the second CQE even if it had arrived before ibv_req_notify_cq() was called? Mike -- Mike Heffner EverGrid Software Blacksburg, VA USA Voice: (540) 443-3500 x603 From sean.hefty at intel.com Wed Apr 25 15:24:34 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 15:24:34 -0700 Subject: [ofa-general] autotools question In-Reply-To: <000201c78783$a2841860$8258180a@amr.corp.intel.com> Message-ID: <000001c78788$80ab23f0$8258180a@amr.corp.intel.com> >Has anyone run into an issue with autotools not generating the .so extension to >built library files, or know how to fix such an issue? This turned out to be an issue with two versions of libtool installed. From sean.hefty at intel.com Wed Apr 25 15:47:50 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 15:47:50 -0700 Subject: [ofa-general] RE: [Bug 581] rdma_get_src_port() not returning the correct port. In-Reply-To: <20070425223124.9FB5DE60826@openfabrics.org> Message-ID: <000101c7878b$c0b575b0$8258180a@amr.corp.intel.com> Can you give this a try? The source address was being overwritten by whatever the user passed into rdma_bind_addr. 
Signed-off-by: Sean Hefty --- diff --git a/src/cma.c b/src/cma.c index c5f8cd9..fdadb69 100644 --- a/src/cma.c +++ b/src/cma.c @@ -509,12 +509,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) if (ret != size) return (ret > 0) ? -ENODATA : ret; - ret = ucma_query_route(id); - if (ret) - return ret; - - memcpy(&id->route.addr.src_addr, addr, addrlen); - return 0; + return ucma_query_route(id); } int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, From rick.jones2 at hp.com Wed Apr 25 15:51:23 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 25 Apr 2007 15:51:23 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <462FCF05.5030900@hp.com> References: <462E8257.9090103@hp.com> <462F89C7.4010105@hp.com> <462F8E07.4050000@hp.com> <462FC6C6.9050700@hp.com> <462FCF05.5030900@hp.com> Message-ID: <462FDB6B.8000806@hp.com> Giving netperf some "direct" SDP tests is also (IMO) a usability enhancement (from the standpoint of netperf users) - no need to remember to do the LD_PRELOAD bit, nor to get the config file right, no need to switch back and forth between netperf/netserver binaries and/or run concurrent netservers at different control port numbers. happy benchmarking, rick jones From swise at opengridcomputing.com Wed Apr 25 16:00:30 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 25 Apr 2007 18:00:30 -0500 Subject: [ofa-general] RE: [Bug 581] rdma_get_src_port() not returning the correct port. In-Reply-To: <000101c7878b$c0b575b0$8258180a@amr.corp.intel.com> References: <000101c7878b$c0b575b0$8258180a@amr.corp.intel.com> Message-ID: <1177542030.9276.0.camel@stevo-desktop> That works! Acked-by: Steve Wise On Wed, 2007-04-25 at 15:47 -0700, Sean Hefty wrote: > Can you give this a try? > > The source address was being overwritten by whatever the user passed into > rdma_bind_addr. 
> > Signed-off-by: Sean Hefty > --- > diff --git a/src/cma.c b/src/cma.c > index c5f8cd9..fdadb69 100644 > --- a/src/cma.c > +++ b/src/cma.c > @@ -509,12 +509,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr > *addr) > if (ret != size) > return (ret > 0) ? -ENODATA : ret; > > - ret = ucma_query_route(id); > - if (ret) > - return ret; > - > - memcpy(&id->route.addr.src_addr, addr, addrlen); > - return 0; > + return ucma_query_route(id); > } > > int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > From mshefty at ichips.intel.com Wed Apr 25 16:17:37 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 16:17:37 -0700 Subject: [ofa-general] hotplug event handle question In-Reply-To: <1177512915.15940.17.camel@vladsk-laptop> References: <1177512915.15940.17.camel@vladsk-laptop> Message-ID: <462FE191.8060201@ichips.intel.com> > if (!cma_comp(id_priv, CMA_CONNECT) && > !cma_comp(id_priv, CMA_DISCONNECT)) > return -EINVAL; This check only ensures that we have a valid underlying cm_id (cm_id.ib or cm_id.iw) and are bound to a device, with the underlying cm's providing the synchronization that we need. To allow rdma_disconnect() to be called after the device has been removed will likely take a slight re-working of the states. (I'm more concerned about userspace clients calling rdma_disconnect at the wrong time and crashing the kernel than a misuse from kernel clients.) I don't think that the device removal code should transition the QP into the error state underneath the user, so fixing rdma_disconnect seems like the way to go. I will work on a fix for this. In the meantime, the alternatives are either to remove the check or have the ULP transitioning QP into the error state. Steve, I don't see where the iwarp code transitions the cm_id to CMA_DISCONNECT. Does iwarp keep the cm_id in the CMA_CONNECT state until it is destroyed? 
- Sean From rdreier at cisco.com Wed Apr 25 18:58:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 18:58:25 -0700 Subject: [ofa-general] Requesting CQ notifications In-Reply-To: <462FD3F7.1010304@evergrid.com> (Mike Heffner's message of "Wed, 25 Apr 2007 18:19:35 -0400") References: <462FD3F7.1010304@evergrid.com> Message-ID: > Is there a differentiation between multiple CQE's being in the CQ > vs. CQE's being arriving into the CQ when using completion > notifications? > > For example, assume I have the following order of events: > > > 2 CQEs arrive > > select() returns readable for comp. channel > > ibv_get_cq_event() returns event > > ibv_req_notify_cq(cq, 0) > > ibv_poll_cq(cq, 1, &cqe) returns 1 > > ibv_ack_cq_events(cq, 1) > > > Will the comp. channel receive another event for the second CQE even > if it had arrived before ibv_req_notify_cq() was called? This is really an ill-posed question: according to the semantics defined by the verbs spec, the presence or absence of the second CQE is not defined until you poll the CQ again. In practice we can look at what real hardware does, and the answer is "it depends." Some adapters (eg mthca, mlx4) will generate an event immediately if ibv_req_notify_cq() is called for a CQ that contains an unpolled CQE, while other adapters (eg ipath, ehca) will only generate an event when a CQE is added after the call to ibv_req_notify_cq(). - R. From mike.heffner at evergrid.com Wed Apr 25 19:19:50 2007 From: mike.heffner at evergrid.com (Mike Heffner) Date: Wed, 25 Apr 2007 22:19:50 -0400 Subject: [ofa-general] Requesting CQ notifications In-Reply-To: References: <462FD3F7.1010304@evergrid.com> Message-ID: <46300C46.4000106@evergrid.com> Roland Dreier wrote: > > Is there a differentiation between multiple CQE's being in the CQ > > vs. CQE's being arriving into the CQ when using completion > > notifications? 
> > > > For example, assume I have the following order of events: > > > > > > 2 CQEs arrive > > > > select() returns readable for comp. channel > > > > ibv_get_cq_event() returns event > > > > ibv_req_notify_cq(cq, 0) > > > > ibv_poll_cq(cq, 1, &cqe) returns 1 > > > > ibv_ack_cq_events(cq, 1) > > > > > > Will the comp. channel receive another event for the second CQE even > > if it had arrived before ibv_req_notify_cq() was called? > > This is really an ill-posed question: according to the semantics > defined by the verbs spec, the presence or absence of the second CQE > is not defined until you poll the CQ again. > > In practice we can look at what real hardware does, and the answer is > "it depends." Some adapters (eg mthca, mlx4) will generate an event > immediately if ibv_req_notify_cq() is called for a CQ that contains an > unpolled CQE, while other adapters (eg ipath, ehca) will only generate > an event when a CQE is added after the cal to ibv_req_notify_cq(). Ok. The reason I asked was that I am noticing the latter behavior with the mthca adaptor. If in the above code I poll for two CQEs instead of one I get them both back and can handle them. However, if I poll for just one and go back into the select, it doesn't return the comp. channel file descriptor as readable so I never handle the second CQE. I expected that this could be because I had called ibv_req_notify_cq() after both CQEs had already arrived. In this case it brings up an interesting question. If the adaptor will only generate an event for a "new" CQE added, how do you tell that you've successfully polled all the CQEs that triggered the first event? You could continuously call ibv_poll_cq() until it returned zero, but this has the side effect of potentially starving other CQs in the case of a high CQE rate. 
Mike -- Mike Heffner EverGrid Software Blacksburg, VA USA Voice: (540) 443-3500 #603 From rdreier at cisco.com Wed Apr 25 20:01:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Apr 2007 20:01:09 -0700 Subject: [ofa-general] Requesting CQ notifications In-Reply-To: <46300C46.4000106@evergrid.com> (Mike Heffner's message of "Wed, 25 Apr 2007 22:19:50 -0400") References: <462FD3F7.1010304@evergrid.com> <46300C46.4000106@evergrid.com> Message-ID: > Ok. The reason I asked was that I am noticing the latter behavior with > the mthca adaptor. If in the above code I poll for two CQEs instead of > one I get them both back and can handle them. However, if I poll for > just one and go back into the select, it doesn't return the > comp. channel file descriptor as readable so I never handle the second > CQE. I expected that this could be because I had called > ibv_req_notify_cq() after both CQEs had already arrived. Something doesn't add up. Mellanox adapters definitely will generate an event after ibv_req_notify_cq() even if the CQE is already in the CQ. There's no window where you can miss an event. > In this case it brings up an interesting question. If the adaptor will > only generate an event for a "new" CQE added, how do you tell that > you've successfully polled all the CQEs that triggered the first > event? You could continuously call ibv_poll_cq() until it returned > zero, but this has the side effect of potentially starving other CQs > in the case of a high CQE rate. The simplest thing to do is to call ibv_req_notify_cq() before you poll the CQ. This leaves no window for missing events. Or as you say, you could poll the CQ until it was empty. If you're worried about starvation, then limit the # of polls you do, and if the CQ is not empty, put it on a list of CQs to come back to again. One other option would be to implement a "maybe missed event" hint returned from ibv_req_notify_cq(), the way I did at the kernel level as part of the IPoIB NAPI patches. 
But that would be a longer term thing that would require changing the user-kernel ABI and the libibverbs API (and hence would have to be in libibverbs 1.2). - R. From mst at dev.mellanox.co.il Wed Apr 25 21:05:53 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 07:05:53 +0300 Subject: [ofa-general] hotplug event handle question In-Reply-To: <462FE191.8060201@ichips.intel.com> References: <1177512915.15940.17.camel@vladsk-laptop> <462FE191.8060201@ichips.intel.com> Message-ID: <20070426040553.GE5217@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] hotplug event handle question > > > if (!cma_comp(id_priv, CMA_CONNECT) && > > !cma_comp(id_priv, CMA_DISCONNECT)) > > return -EINVAL; > > This check only ensures that we have a valid underlying cm_id (cm_id.ib or > cm_id.iw) and are bound to a device, with the underlying cm's providing the > synchronization that we need. To allow rdma_disconnect() to be called > after the device has been removed will likely take a slight re-working of > the states. (I'm more concerned about userspace clients calling > rdma_disconnect at the wrong time and crashing the kernel than a misuse > from kernel clients.) I don't think that the device removal code should > transition the QP into the error state underneath the user, so fixing > rdma_disconnect seems like the way to go. I think the problem is that cma_remove_id_dev overrides the current state, losing state information in the process. Why do we need CMA_DEVICE_REMOVAL at all? Everything seems to work fine just by forwarding RDMA_CM_EVENT_DEVICE_REMOVAL to user, without touching state. > > I will work on a fix for this. In the meantime, the alternatives are > either to remove the check or have the ULP transitioning QP into the error > state. > > > Steve, I don't see where the iwarp code transitions the cm_id to > CMA_DISCONNECT. Does iwarp keep the cm_id in the CMA_CONNECT state until > it is destroyed? 
Don't lose state on RDMA_CM_EVENT_DEVICE_REMOVAL. Without this patch, rdma_disconnect won't move the QP to error. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index fde92ce..9f37eac 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -92,7 +92,6 @@ enum cma_state { CMA_DISCONNECT, CMA_ADDR_BOUND, CMA_LISTEN, - CMA_DEVICE_REMOVAL, CMA_DESTROYING }; @@ -2689,20 +2688,13 @@ static void cma_add_one(struct ib_device *device) static int cma_remove_id_dev(struct rdma_id_private *id_priv) { struct rdma_cm_event event; - enum cma_state state; - /* Record that we want to remove the device */ - state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); - if (state == CMA_DESTROYING) + if (cma_comp(id_priv, CMA_DESTROYING)) return 0; - cma_cancel_operation(id_priv, state); + cma_cancel_operation(id_priv, id_priv->state); wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); - /* Check for destruction from another callback. */ - if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) - return 0; - memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_DEVICE_REMOVAL; return id_priv->id.event_handler(&id_priv->id, &event); -- MST From mst at dev.mellanox.co.il Wed Apr 25 21:40:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 07:40:59 +0300 Subject: [ofa-general] Re: Requesting CQ notifications In-Reply-To: References: <462FD3F7.1010304@evergrid.com> <46300C46.4000106@evergrid.com> Message-ID: <20070426044059.GF5217@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Requesting CQ notifications > > > Ok. The reason I asked was that I am noticing the latter behavior with > > the mthca adaptor. If in the above code I poll for two CQEs instead of > > one I get them both back and can handle them. However, if I poll for > > just one and go back into the select, it doesn't return the > > comp. 
channel file descriptor as readable so I never handle the second > > CQE. I expected that this could be because I had called > > ibv_req_notify_cq() after both CQEs had already arrived. > > Something doesn't add up. Mellanox adapters definitely will generate > an event after ibv_req_notify_cq() even if the CQE is already in the > CQ. There's no window where you can miss an event. > > > In this case it brings up an interesting question. If the adaptor will > > only generate an event for a "new" CQE added, how do you tell that > > you've successfully polled all the CQEs that triggered the first > > event? You could continuously call ibv_poll_cq() until it returned > > zero, but this has the side effect of potentially starving other CQs > > in the case of a high CQE rate. > > The simplest thing to do is to call ibv_req_notify_cq() before you > poll the CQ. This leaves no window for missing events. Or as you > say, you could poll the CQ until it was empty. There's no "Or", actually. Even if you call ibv_req_notify_cq before polling, you still must drain the CQ of completions generated before the event (or queue it for polling later using some alternative mechanism). > If you're worried > about starvation, then limit the # of polls you do, and if the CQ is > not empty, put it on a list of CQs to come back to again. > > One other option would be to implement a "maybe missed event" hint > returned from ibv_req_notify_cq(), the way I did at the kernel level > as part of the IPoIB NAPI patches. But that would be a longer term > thing that would require changing the user-kernel ABI and the > libibverbs API (and hence would have to be in libibverbs 1.2). Note that "maybe missed event" only checks for completions that arrived after the event was generated but before you requested notification (the hardware does not need to "know" which completions you polled). And of course if a missed event is reported, you'll still need to requeue the CQ for polling. So not much changes.
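For readers following the thread, the pattern discussed above (request notification first, then drain with a bounded number of polls, and requeue the CQ if the budget runs out) can be sketched roughly as below. This is only an illustration under stated assumptions: `struct mock_cq`, `mock_req_notify()`, `mock_poll_one()` and `handle_cq_event()` are hypothetical stand-ins invented here so the logic can be compiled without hardware. They are not the libibverbs types or calls; a real program would use ibv_req_notify_cq() and ibv_poll_cq() in the corresponding places.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a completion queue; NOT a libibverbs type. */
struct mock_cq {
    int pending;  /* CQEs currently sitting in the queue */
    bool armed;   /* set once a notification has been requested */
};

/* Stand-in for ibv_req_notify_cq(): ask for an event on the next new CQE. */
void mock_req_notify(struct mock_cq *cq)
{
    cq->armed = true;
}

/* Stand-in for polling a single entry: returns 1 if a CQE was reaped, else 0. */
int mock_poll_one(struct mock_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/*
 * Handle one completion event. Notification is requested before polling,
 * then at most `budget` CQEs are reaped so other CQs are not starved.
 * Returns true when the budget ran out, meaning the CQ may still hold
 * entries and the caller should put it on a list to revisit; returns
 * false when the CQ was seen empty and it is safe to wait for an event.
 */
bool handle_cq_event(struct mock_cq *cq, int budget, int *handled)
{
    assert(budget > 0);

    mock_req_notify(cq);
    *handled = 0;
    while (*handled < budget) {
        if (!mock_poll_one(cq))
            return false;   /* drained */
        (*handled)++;
    }
    return true;            /* budget exhausted, requeue this CQ */
}
```

A caller would keep invoking handle_cq_event() on requeued CQs in round-robin fashion and only go back to select() on the completion channel once every CQ has reported empty.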
-- MST From mst at dev.mellanox.co.il Wed Apr 25 21:54:09 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 07:54:09 +0300 Subject: [ofa-general] Re: Requesting CQ notifications In-Reply-To: <46300C46.4000106@evergrid.com> References: <462FD3F7.1010304@evergrid.com> <46300C46.4000106@evergrid.com> Message-ID: <20070426045409.GG5217@mellanox.co.il> > In this case it brings up an interesting question. If the adaptor will > only generate an event for a "new" CQE added, how do you tell that > you've successfully polled all the CQEs that triggered the first event? > You could continuously call ibv_poll_cq() until it returned zero, but > this has the side effect of potentially starving other CQs in the case > of a high CQE rate. You can limit the number of polls by CQ size. So request notification solves the starvation problem, and will guarantee that no event is missed. -- MST From mst at dev.mellanox.co.il Wed Apr 25 21:58:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 07:58:55 +0300 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <462FC6C6.9050700@hp.com> References: <462E8257.9090103@hp.com> <462F8E07.4050000@hp.com> <462FC6C6.9050700@hp.com> Message-ID: <20070426045855.GH5217@mellanox.co.il> > Quoting Rick Jones : > Subject: initial set of "direct" SDP tests in netperf > > >I guess that until it is resolved I'll just kludge around it with my own > >define. > > Soo, I did a bunch of cut and paste in netperf, where I take what > getaddrinfo() returns and replace the ->ai_family with AF_INET_SDP and > set the ->ai_protocol to 0 (a guess since there isn't much in the way of > docs on the subject I could find). Please note that you should *only* ever stick the SDP family value in the socket(3) call. All addresses for connect, bind etc are AF_INET, since SDP uses IP addresses for everything. 
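A minimal sketch of that rule, for illustration only: the SDP family goes into the socket() call and nowhere else, while the sockaddr handed to connect() or bind() stays AF_INET. The helper names `fill_ipv4_addr()` and `open_sdp_socket()` are hypothetical, the value 27 for AF_INET_SDP is an assumption based on what OFED headers commonly define (check your own installation), and protocol 0 is the same guess made in the quoted message rather than something confirmed in this thread.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: OFED installations commonly define this as 27; verify locally. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/*
 * Fill an ordinary IPv4 address for connect()/bind().
 * The family stays AF_INET even when the socket itself is an SDP socket.
 * Returns 0 on success, -1 if `ip` is not a valid dotted-quad string.
 */
int fill_ipv4_addr(struct sockaddr_in *sin, const char *ip, unsigned short port)
{
    assert(sin != NULL && ip != NULL);

    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;          /* never AF_INET_SDP here */
    sin->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}

/*
 * Only the socket() domain selects SDP. This returns -1 on hosts where
 * the kernel has no SDP support, so callers must check the result.
 */
int open_sdp_socket(void)
{
    return socket(AF_INET_SDP, SOCK_STREAM, 0);
}
```

With this split, code that takes addresses from getaddrinfo() can pass them through unchanged and only swap the family argument given to socket().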
-- MST From mst at dev.mellanox.co.il Wed Apr 25 22:00:44 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 08:00:44 +0300 Subject: [ofa-general] Re: [Bug 581] rdma_get_src_port() not returning the correct port. In-Reply-To: <1177542030.9276.0.camel@stevo-desktop> References: <000101c7878b$c0b575b0$8258180a@amr.corp.intel.com> <1177542030.9276.0.camel@stevo-desktop> Message-ID: <20070426050044.GI5217@mellanox.co.il> So is this for OFED? For 2.6.21? Quoting Steve Wise : Subject: RE: [Bug 581] rdma_get_src_port() not returning the correct port. That works! Acked-by: Steve Wise On Wed, 2007-04-25 at 15:47 -0700, Sean Hefty wrote: > Can you give this a try? > > The source address was being overwritten by whatever the user passed into > rdma_bind_addr. > > Signed-off-by: Sean Hefty > --- > diff --git a/src/cma.c b/src/cma.c > index c5f8cd9..fdadb69 100644 > --- a/src/cma.c > +++ b/src/cma.c > @@ -509,12 +509,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr > *addr) > if (ret != size) > return (ret > 0) ? -ENODATA : ret; > > - ret = ucma_query_route(id); > - if (ret) > - return ret; > - > - memcpy(&id->route.addr.src_addr, addr, addrlen); > - return 0; > + return ucma_query_route(id); > } > > int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From mst at dev.mellanox.co.il Wed Apr 25 22:02:30 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 26 Apr 2007 08:02:30 +0300 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <1177538202.12542.13079.camel@hal.voltaire.com> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> Message-ID: <20070426050230.GJ5217@mellanox.co.il> > > There also some few commands (ib*.pl) that are using a file > > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology > > I'm not sure about this one. I need to think about this more. Not sure about the best placement, but surely a predictable name in a world-writeable directory is a security risk? -- MST From mst at dev.mellanox.co.il Wed Apr 25 22:22:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 08:22:42 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache In-Reply-To: <1177533043.12542.7580.camel@hal.voltaire.com> References: <000501c785f9$e9015cc0$e598070a@amr.corp.intel.com> <1177533043.12542.7580.camel@hal.voltaire.com> Message-ID: <20070426052242.GL5217@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache > > On Mon, 2007-04-23 at 18:51, Sean Hefty wrote: > > >We could solve this by implementing a process running on the same node as the > > >SA. > > >And it's probably not too hard to add a way for opensm to spit out > > >the table into an external file when it gets a signal or something. > > > > I agree that there are ways to solve this, but those solutions won't work with > > existing SAs and define a new SA interface. If we're willing to break > > compatibility or add extensions, we could also extend the SA to provide better > > support for caching. For example, add a new 'path updated' trap. > > There is ongoing work here. Stay tuned for IBA 1.2.1 hopefully coming > soon... Maybe you guys want to discuss this in Sonoma. 
We wouldn't want to merge something non-standard upstream, creating compatibility problems for everyone. -- MST From mst at dev.mellanox.co.il Wed Apr 25 22:26:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 08:26:28 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <1177533621.12542.8248.camel@hal.voltaire.com> References: <000201c785ee$d54be520$e598070a@amr.corp.intel.com> <1177532687.12542.7240.camel@hal.voltaire.com> <20070425202903.GC5217@mellanox.co.il> <1177533621.12542.8248.camel@hal.voltaire.com> Message-ID: <20070426052628.GM5217@mellanox.co.il> > The end node would then need to do > DR SMP stuff (not just diagnostics) and there would be MKey assumptions > as well as other things. DR SMPs are always slow pathed and there is no > flow control on VL15. That's a concern. How about using LinkRecord SA queries to figure out the topology? -- MST From sean.hefty at intel.com Wed Apr 25 22:38:55 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 22:38:55 -0700 Subject: [ofa-general] RE: [Bug 581] rdma_get_src_port() not returning the correct port. In-Reply-To: <20070426050044.GI5217@mellanox.co.il> Message-ID: <000001c787c5$2e4ddbb0$2ad8180a@amr.corp.intel.com> >So is this for OFED? For 2.6.21? This is a fix for the librdmacm only. I plan on queuing it for OFED 1.2 when I freeze it for a 1.0 release. (I intend to freeze before rc3.) - Sean From sean.hefty at intel.com Wed Apr 25 22:43:32 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 22:43:32 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: addpathrecord cache In-Reply-To: <20070426052242.GL5217@mellanox.co.il> Message-ID: <000101c787c5$d3393fc0$2ad8180a@amr.corp.intel.com> >Maybe you guys want to discuss this in Sonoma. >We wouldn't want to merge something non-standard upstream, >creating compatibility problems for everyone. I agree. 
Nothing in the current patches is non-standard, and SA events are only optimizations for updating the cache. (I did add the ability to disable InformInfo/Notice registration.) If a new trap type is added to the spec, the cache can be modified to take advantage of it, but I think such a feature can be added after an initial submission. - Sean From mst at dev.mellanox.co.il Wed Apr 25 23:02:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 09:02:55 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: addpathrecord cache In-Reply-To: <000101c787c5$d3393fc0$2ad8180a@amr.corp.intel.com> References: <20070426052242.GL5217@mellanox.co.il> <000101c787c5$d3393fc0$2ad8180a@amr.corp.intel.com> Message-ID: <20070426060255.GO5217@mellanox.co.il> > Nothing in the current patches is non-standard. That's a bit hard to say without knowing what the standard is :) But maybe you do know. -- MST From sean.hefty at intel.com Wed Apr 25 23:07:01 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 23:07:01 -0700 Subject: [ofa-general] hotplug event handle question In-Reply-To: <20070426040553.GE5217@mellanox.co.il> Message-ID: <000201c787c9$1b393a20$2ad8180a@amr.corp.intel.com> >I think the problem is that cma_remove_id_dev overrides the current state, >losing state information in the process. Why do we need CMA_DEVICE_REMOVAL >at all? Everything seems to work fine just by forwarding >RDMA_CM_EVENT_DEVICE_REMOVAL >to user, without touching state. I need to read back over the code. The problem is that device removal can come at any time. The user could have called rdma_destroy_id, be about to call it, or be destroying the id by returning a non-zero value from a callback. We need to synchronize with all cases, and in the latter case, we cannot perform the callback to notify the user of the device removal.
Similarly, if the user destroys the id from a device removal event callback, then callbacks for other events should not be called. If we can do this by removing the device removal state, that would seem to be the simplest approach, but I need to verify that we can cover all corner cases. >@@ -2689,20 +2688,13 @@ static void cma_add_one(struct ib_device *device) > static int cma_remove_id_dev(struct rdma_id_private *id_priv) > { > struct rdma_cm_event event; >- enum cma_state state; > >- /* Record that we want to remove the device */ >- state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); Since we're not changing the state, callbacks to the same id may be invoked from another thread. The cma guarantees that all callbacks to a single id are serialized. >- if (state == CMA_DESTROYING) >+ if (cma_comp(id_priv, CMA_DESTROYING)) > return 0; > >- cma_cancel_operation(id_priv, state); >+ cma_cancel_operation(id_priv, id_priv->state); > wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); > >- /* Check for destruction from another callback. */ >- if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) >- return 0; At the very least we need to repeat the check: if (!cma_comp(id_priv, CMA_DESTROYING)) return 0; here to avoid calling the user after they've tried to destroy their id from another callback. See comment above. - Sean From sean.hefty at intel.com Wed Apr 25 23:11:02 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 23:11:02 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecord cache In-Reply-To: <20070426060255.GO5217@mellanox.co.il> Message-ID: <000301c787c9$aac82c50$2ad8180a@amr.corp.intel.com> >That's a bit hard to say without knowing what the standard is :) >But maybe you do know. The cache does PathRecord GetTable queries and InformInfo/Notice subscriptions according to the wire protocol given in the spec (minus any potential bugs).
- Sean From mst at dev.mellanox.co.il Wed Apr 25 23:14:11 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 09:14:11 +0300 Subject: [ofa-general] hotplug event handle question In-Reply-To: <000201c787c9$1b393a20$2ad8180a@amr.corp.intel.com> References: <20070426040553.GE5217@mellanox.co.il> <000201c787c9$1b393a20$2ad8180a@amr.corp.intel.com> Message-ID: <20070426061411.GP5217@mellanox.co.il> Quoting Sean Hefty : Subject: RE: [ofa-general] hotplug event handle question > >I think the problem is that cma_remove_id_dev overrides the current state, > >losing state information in the process. Why do we need CMA_DEVICE_REMOVAL > >at all? Everything seems to work fine just by forwarding > >RDMA_CM_EVENT_DEVICE_REMOVAL > >to user, without touching state. > > I need to read back over the code. The problem is that device removal can come > at anytime. The user could have called rdma_destroy_id, be about to call it, or > be destroying the id by returning a non-zero value from a callback. We need to > synchronize with all cases, and in the later case, we cannot perform the > callback to notify the user of the device removal. Similarly, if the user > destroys the id from a device removal event callback, then callbacks for others > event should not be called. > > If we can do this by removing the device removal state, that would seem to be > the simplest approach, but I need to verify that we can cover all corner cases. My point is that we shouldn't be losing state just because we got hotplug event - device is not yet going away until we return from the remove event callback. ... > At the very least we need to repeat the check: > > if (!cma_comp(id_priv, CMA_DESTROYING)) > return 0; > > here to avoid calling the user after they've tried to destroy their id from > another callback. See comment above. OK. Would that be enough? -- MST From mst at dev.mellanox.co.il Wed Apr 25 23:16:46 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 26 Apr 2007 09:16:46 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecord cache In-Reply-To: <000301c787c9$aac82c50$2ad8180a@amr.corp.intel.com> References: <20070426060255.GO5217@mellanox.co.il> <000301c787c9$aac82c50$2ad8180a@amr.corp.intel.com> Message-ID: <20070426061646.GQ5217@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecord cache > > >That's a bit hard to say without knowing what the standard is :) > >But maybe you do know. > > The cache does PathRecord GetTable queries and InformInfo/Notice subscriptions > according to the wire protocol given in the spec (minus any potential bugs). Sure. But how that will interact with whatever extensions are going into 1.2.1 is hard for me to guess. -- MST From sean.hefty at intel.com Wed Apr 25 23:20:13 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 23:20:13 -0700 Subject: [ofa-general] hotplug event handle question In-Reply-To: <20070426061411.GP5217@mellanox.co.il> Message-ID: <000401c787ca$f37d7ee0$2ad8180a@amr.corp.intel.com> >> At the very least we need to repeat the check: >> >> if (!cma_comp(id_priv, CMA_DESTROYING)) >> return 0; >> >> here to avoid calling the user after they've tried to destroy their id from >> another callback. See comment above. > >OK. Would that be enough? Off the top of my head, I don't think so. Since the state is staying the same, we now have the potential of another thread invoking a callback to the same id. For example, the ib_cm could callback with a connect or reject event, which gets propagated to the user. The user will now see two callbacks for the same id. Depending on the execution of the threads, one could completely run, with the user wanting to destroy the associated id. The second callback would then be invoked after the id was destroyed. The state combined with the dev_remove counter were used to serialize the callbacks. 
So we still need something to serialize the callbacks. - Sean From sean.hefty at intel.com Wed Apr 25 23:25:11 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Apr 2007 23:25:11 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecordcache In-Reply-To: <20070426061646.GQ5217@mellanox.co.il> Message-ID: <000501c787cb$a4c60460$2ad8180a@amr.corp.intel.com> >Sure. But how that will interact with whatever extensions are going into >1.2.1 is hard for me to guess. I agree. My point is that the cache should be spec compliant now, with support for any potential extensions in 1.2.1 coming later. Even once 1.2.1 is released SAs will need time to incorporate any new features. - Sean From kliteyn at dev.mellanox.co.il Thu Apr 26 00:27:48 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 26 Apr 2007 10:27:48 +0300 Subject: [ofa-general] [PATCH] osm: fixing two memory leaks in fat-tree routing Message-ID: <46305474.6060004@dev.mellanox.co.il> Hi Hal, This patch fixes two similar memory leaks in fat-tree routing. 
Please apply to ofed_1_2 and to master Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 655a821..7b6a6a5 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -1552,6 +1552,7 @@ __osm_ftree_fabric_make_indexing( /* Done assigning indexes to all the switches that are directly connected to the current switch - go to the next switch in the BFS queue */ } + cl_list_destroy(&bfs_list); /* sort array of leaf switches by index */ qsort(p_ftree->leaf_switches, /* array */ @@ -2488,6 +2489,7 @@ __osm_ftree_rank_from_switch( &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); } } + cl_list_destroy(&bfs_list); } /* __osm_ftree_rank_from_switch() */ -- 1.4.4.1.GIT
From vlad at lists.openfabrics.org Thu Apr 26 02:37:46 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 26 Apr 2007 02:37:46 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070426-0200 daily build status Message-ID: <20070426093747.58546E6082C@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ia64
with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed:
From halr at voltaire.com Thu Apr 26 03:34:47 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 06:34:47 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecord cache In-Reply-To: <20070426061646.GQ5217@mellanox.co.il> References: <20070426060255.GO5217@mellanox.co.il> <000301c787c9$aac82c50$2ad8180a@amr.corp.intel.com> <20070426061646.GQ5217@mellanox.co.il> Message-ID: <1177583577.12542.61153.camel@hal.voltaire.com> On Thu, 2007-04-26 at 02:16, Michael S. Tsirkin wrote: > > Quoting Sean Hefty : > > Subject: RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecord cache > > > > >That's a bit hard to say without knowing what the standard is :) > > >But maybe you do know. > > > > The cache does PathRecord GetTable queries and InformInfo/Notice subscriptions > > according to the wire protocol given in the spec (minus any potential bugs). > > Sure. But how that will interact with whatever extensions are going into > 1.2.1 is hard for me to guess. Anything in 1.2.1 is backward compatible with 1.2 and all new features are options and are identified by option bits as is the "standard" for adding features in a backward compatible manner. -- Hal From halr at voltaire.com Thu Apr 26 03:35:00 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 06:35:00 -0400 Subject: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: addpathrecord cache In-Reply-To: <20070426060255.GO5217@mellanox.co.il> References: <20070426052242.GL5217@mellanox.co.il> <000101c787c5$d3393fc0$2ad8180a@amr.corp.intel.com> <20070426060255.GO5217@mellanox.co.il> Message-ID: <1177583686.12542.61320.camel@hal.voltaire.com> On Thu, 2007-04-26 at 02:02, Michael S. Tsirkin wrote: > > Nothing in the current patches is non-standard. > > That's a bit hard to say without knowing what the standard is :) Sean's patches follow IBA 1.2 currently.
I was just saying there may be the possibility to do more after 1.2.1 comes out. -- Hal > But maybe you do know. From eli at dev.mellanox.co.il Thu Apr 26 03:52:15 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 26 Apr 2007 13:52:15 +0300 Subject: [ofa-general] Re: [PATCH] net/mlx4: modify sw reset In-Reply-To: References: <1177512129.10850.9.camel@mtls03> <20070425203118.GD5217@mellanox.co.il> Message-ID: <4e6a6b3c0704260352h1605ccacy59ff8be6b46a0420@mail.gmail.com> On 4/25/07, Roland Dreier wrote: > > > > Also is it worth taking the semaphore in the mthca reset function? > > > > Why not? PRM says we should - the point of this is protecting against > someone > > accidentally touching flash while device is reset, > > which will result is hangs, system errors or even data corruption. > > OK... that was just my subtle way of asking for a patch to do that. I don't follow you: the code that uses the semaphore is already in the patch I sent. Apart from the change in the vendor id field, I think it can be applied as is. From mst at dev.mellanox.co.il Thu Apr 26 04:02:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 14:02:02 +0300 Subject: [ofa-general] Fwd: Re: using stgit/guilt for public branches Message-ID: <20070426110202.GI32513@mellanox.co.il> FYI. I posted a question on git mailing list, asking about best ways to manage ofed repository.
http://article.gmane.org/gmane.comp.version-control.git/45519 The conclusion so far seems to be that what we are doing (keeping patches under version control) is basically the right way to do it: http://article.gmane.org/gmane.comp.version-control.git/45569 -- MST -------------- next part -------------- An embedded message was scrubbed... From: Josef Sipek Subject: Re: using stgit/guilt for public branches Date: Wed, 25 Apr 2007 15:18:39 -0400 Size: 2836 URL: From halr at voltaire.com Thu Apr 26 04:03:00 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 07:03:00 -0400 Subject: [ofa-general] Re: [PATCH] osm: fixing two memory leaks in fat-tree routing In-Reply-To: <46305474.6060004@dev.mellanox.co.il> References: <46305474.6060004@dev.mellanox.co.il> Message-ID: <1177585379.12542.63039.camel@hal.voltaire.com> Hi Yevgeny, On Thu, 2007-04-26 at 03:27, Yevgeny Kliteynik wrote: > Hi Hal, > > This patch fixes two similar memory leaks in fat-tree routing. > Please apply to ofed_1_2 and to master > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to both master and ofed_1_2). -- Hal From monil at voltaire.com Thu Apr 26 04:49:10 2007 From: monil at voltaire.com (Moni Levy) Date: Thu, 26 Apr 2007 14:49:10 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: References: <20070328093345.GD11695@mellanox.co.il> <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> Message-ID: <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> On 4/25/07, Roland Dreier wrote: > > If you want to tackle more of the cache elimination plan we discussed > that would be great though. > One more issue I looked into that in my opinion needs to be discussed is that we do not have an easy api that should provide us with the whole PKEY table and one for the whole GID table for a specific port. 
I know that ib_process_mad can be used, but it's just not user friendly. The only thing we have now is ib_query_pkey, which gets us one pkey from a specific index and is implemented to fetch the pkeys in chunks of 32 under the hood (and something similar for gids). Do we need something like ib_get_pkey_table & ib_get_gid_table calls? Just an interesting fact: ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands instead of the 5 that are really needed :) -Moni From eli at dev.mellanox.co.il Thu Apr 26 05:27:28 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 26 Apr 2007 15:27:28 +0300 Subject: [ofa-general] Re: [PATCH] net/mlx4: modify sw reset In-Reply-To: <4e6a6b3c0704260352h1605ccacy59ff8be6b46a0420@mail.gmail.com> References: <1177512129.10850.9.camel@mtls03> <20070425203118.GD5217@mellanox.co.il> <4e6a6b3c0704260352h1605ccacy59ff8be6b46a0420@mail.gmail.com> Message-ID: <4e6a6b3c0704260527p5be49445y50a3c6b3c574ae85@mail.gmail.com> Oops... I see you asked that for mthca. Well, I might do it at a later time. On 4/26/07, Eli Cohen wrote: > > > > On 4/25/07, Roland Dreier wrote: > > > > > > Also is it worth taking the semaphore in the mthca reset function? > > > > > > Why not? PRM says we should - the point of this is protecting against > > someone > > > accidentally touching flash while device is reset, > > > which will result in hangs, system errors or even data corruption. > > > > OK... that was just my subtle way of asking for a patch to do that. > > > I don't follow you: the code that uses the semaphore is already in the > patch I sent. Apart from the change in the vendor id field, I think > it can be applied as is.
From mst at dev.mellanox.co.il Thu Apr 26 06:34:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 16:34:42 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> References: <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> Message-ID: <20070426133442.GJ32513@mellanox.co.il> > Quoting Moni Levy : > Subject: Re: pkey change handling patch > > On 4/25/07, Roland Dreier wrote: > > > >If you want to tackle more of the cache elimination plan we discussed > >that would be great though. > > > > One more issue I looked into that in my opinion needs to be discussed > is that we do not have an easy api that should provide us with the > whole PKEY table and one for the whole GID table for a specific port. > I know that ib_process_mad can be used it's just not user friendly. > The only thing we have now is ib_query_pkey that gets us one pkey from > a specific index and is implemented to get the 32 pkeys chunks under > the hood (and something similar for gids). Do we need something like > ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > instead the really needed 5 :) If the intended usage is to speed up ib_cache_update, the point is moot I think since we agreed we are getting rid of it.
-- MST From monil at voltaire.com Thu Apr 26 06:36:40 2007 From: monil at voltaire.com (Moni Levy) Date: Thu, 26 Apr 2007 16:36:40 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070426133442.GJ32513@mellanox.co.il> References: <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> Message-ID: <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> On 4/26/07, Michael S. Tsirkin wrote: > > Quoting Moni Levy : > > Subject: Re: pkey change handling patch > > > > On 4/25/07, Roland Dreier wrote: > > > > > >If you want to tackle more of the cache elimination plan we discussed > > >that would be great though. > > > > > > > One more issue I looked into that in my opinion needs to be discussed > > is that we do not have an easy api that should provide us with the > > whole PKEY table and one for the whole GID table for a specific port. > > I know that ib_process_mad can be used it's just not user friendly. > > The only thing we have now is ib_query_pkey that gets us one pkey from > > a specific index and is implemented to get the 32 pkeys chunks under > > the hood (and something similar for gids). Do we need something like > > ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > > ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > > instead the really needed 5 :) > > If the intended usage is to speed up ib_cache_update, the point is > moot I think since we agreed we are getting rid of it. Do you suggest to implement ib_find_pkey & ib_find_gid by using ib_process_mad ? -- Moni > > -- > MST > From mst at dev.mellanox.co.il Thu Apr 26 06:43:31 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 26 Apr 2007 16:43:31 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> References: <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> Message-ID: <20070426134331.GL32513@mellanox.co.il> > Quoting Moni Levy : > Subject: Re: pkey change handling patch > > On 4/26/07, Michael S. Tsirkin wrote: > >> Quoting Moni Levy : > >> Subject: Re: pkey change handling patch > >> > >> On 4/25/07, Roland Dreier wrote: > >> > > >> >If you want to tackle more of the cache elimination plan we discussed > >> >that would be great though. > >> > > >> > >> One more issue I looked into that in my opinion needs to be discussed > >> is that we do not have an easy api that should provide us with the > >> whole PKEY table and one for the whole GID table for a specific port. > >> I know that ib_process_mad can be used it's just not user friendly. > >> The only thing we have now is ib_query_pkey that gets us one pkey from > >> a specific index and is implemented to get the 32 pkeys chunks under > >> the hood (and something similar for gids). Do we need something like > >> ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > >> ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > >> instead the really needed 5 :) > > > >If the intended usage is to speed up ib_cache_update, the point is > >moot I think since we agreed we are getting rid of it. > > Do you suggest to implement ib_find_pkey & ib_find_gid by using > ib_process_mad ? Oh, I see what you mean. Let's do it over query_pkey/query_port for now. Long term providers will just optimize these I think. 
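A rough sketch of what a lookup layered on per-index queries, as suggested above, could look like. This is a standalone illustration and not kernel code: mock_query_pkey() and its fixed table are hypothetical stand-ins for a per-index query such as ib_query_pkey, find_pkey() is an assumed name, and an exact 16-bit match is assumed (no special handling of the membership bit).

```c
#include <stdint.h>

/* Hypothetical table standing in for a port's P_Key table. */
#define PKEY_TBL_LEN 4
const uint16_t mock_pkey_tbl[PKEY_TBL_LEN] = { 0xffff, 0x8001, 0x0000, 0x7002 };

/* Hypothetical per-index query: returns 0 and fills *pkey, or -1 if the
 * index is out of range. A real provider would issue a command here. */
int mock_query_pkey(int index, uint16_t *pkey)
{
	if (index < 0 || index >= PKEY_TBL_LEN)
		return -1;
	*pkey = mock_pkey_tbl[index];
	return 0;
}

/* Linear search over the table, one query per index.
 * Returns 0 and sets *index on an exact match, -1 otherwise. */
int find_pkey(uint16_t pkey, int tbl_len, int *index)
{
	int i;

	for (i = 0; i < tbl_len; i++) {
		uint16_t cur;

		if (mock_query_pkey(i, &cur))
			return -1;	/* query failed, give up */
		if (cur == pkey) {
			*index = i;
			return 0;
		}
	}
	return -1;			/* not present in the table */
}
```

The cost is one query per table entry in the worst case, which is why the thread expects providers to optimize the per-index query in the long term.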
-- MST From monil at voltaire.com Thu Apr 26 06:53:27 2007 From: monil at voltaire.com (Moni Levy) Date: Thu, 26 Apr 2007 16:53:27 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <20070426134331.GL32513@mellanox.co.il> References: <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> Message-ID: <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> On 4/26/07, Michael S. Tsirkin wrote: > > Quoting Moni Levy : > > Subject: Re: pkey change handling patch > > > > On 4/26/07, Michael S. Tsirkin wrote: > > >> Quoting Moni Levy : > > >> Subject: Re: pkey change handling patch > > >> > > >> On 4/25/07, Roland Dreier wrote: > > >> > > > >> >If you want to tackle more of the cache elimination plan we discussed > > >> >that would be great though. > > >> > > > >> > > >> One more issue I looked into that in my opinion needs to be discussed > > >> is that we do not have an easy api that should provide us with the > > >> whole PKEY table and one for the whole GID table for a specific port. > > >> I know that ib_process_mad can be used it's just not user friendly. > > >> The only thing we have now is ib_query_pkey that gets us one pkey from > > >> a specific index and is implemented to get the 32 pkeys chunks under > > >> the hood (and something similar for gids). Do we need something like > > >> ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > > >> ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > > >> instead the really needed 5 :) > > > > > >If the intended usage is to speed up ib_cache_update, the point is > > >moot I think since we agreed we are getting rid of it. 
> > > > Do you suggest to implement ib_find_pkey & ib_find_gid by using > > ib_process_mad ? > > Oh, I see what you mean. > > Let's do it over query_pkey/query_port for now. > Long term providers will just optimize these I think. How ? Caching at device driver level ? > > > -- > MST > From sean.hefty at intel.com Thu Apr 26 07:10:06 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Apr 2007 07:10:06 -0700 Subject: [ofa-general] Fwd: Re: using stgit/guilt for public branches In-Reply-To: <20070426110202.GI32513@mellanox.co.il> Message-ID: <000001c7880c$97b86e20$9b248686@amr.corp.intel.com> >FYI. I posted a question on git mailing list, asking about >best ways to manage ofed repository. > >http://article.gmane.org/gmane.comp.version-control.git/45519 > >The conclusion so far seems to be that what we are doing (keeping patches >under version control) is basically the right way to do it: > >http://article.gmane.org/gmane.comp.version-control.git/45569 I strongly prefer that the patches be applied to the actual code. Someone who wants to use stgit can generate their own set of patches and manage them off the tree if they want. Right now it's way too difficult to see what code is there, switch 'branches', generate patches, etc. We're doing everything manually and losing the majority of the benefits that git gives us. - Sean From yosefe at voltaire.com Thu Apr 26 07:17:15 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 26 Apr 2007 17:17:15 +0300 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <20070426133442.GJ32513@mellanox.co.il> References: <20070417220214.GG25314@mellanox.co.il> <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> Message-ID: <4630B46B.1000006@voltaire.com> Michael S. 
Tsirkin wrote: >>Quoting Moni Levy : >>Subject: Re: pkey change handling patch >> >>On 4/25/07, Roland Dreier wrote: >> >>>If you want to tackle more of the cache elimination plan we discussed >>>that would be great though. >>> >> >>One more issue I looked into that in my opinion needs to be discussed >>is that we do not have an easy api that should provide us with the >>whole PKEY table and one for the whole GID table for a specific port. >>I know that ib_process_mad can be used it's just not user friendly. >>The only thing we have now is ib_query_pkey that gets us one pkey from >>a specific index and is implemented to get the 32 pkeys chunks under >>the hood (and something similar for gids). Do we need something like >>ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: >>ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands >>instead the really needed 5 :) > > > If the intended usage is to speed up ib_cache_update, the point is > moot I think since we agreed we are getting rid of it. > Which source file should the non-caching functions go in? --Yossi From halr at voltaire.com Thu Apr 26 07:27:02 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 10:27:02 -0400 Subject: [ofa-general] [PATCH] OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in Notice attribute Message-ID: <1177597613.12542.75505.camel@hal.voltaire.com> OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in Notice attribute Update Notice Producer Type macros names Signed-off-by: Ira K.
Weiny
Signed-off-by: Hal Rosenstock

diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index 94141ba..7bb81a8 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -1557,54 +1557,52 @@ ib_class_is_rmpp(
 #define IB_NODE_TYPE_ROUTER 0x03
 /**********/

-/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CA
+/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_CA
 * NAME
-* IB_NOTICE_NODE_TYPE_CA
+* IB_NOTICE_PRODUCER_TYPE_CA
 *
 * DESCRIPTION
-* Encoded generic node type used in MAD attributes (13.4.8.2)
+* Encoded generic producer type used in Notice attribute (13.4.8.2)
 *
 * SOURCE
 */
-#define IB_NOTICE_NODE_TYPE_CA (CL_NTOH32(0x000001))
+#define IB_NOTICE_PRODUCER_TYPE_CA (CL_NTOH32(0x000001))
 /**********/

-/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SWITCH
+/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_SWITCH
 * NAME
-* IB_NOTICE_NODE_TYPE_SWITCH
+* IB_NOTICE_PRODUCER_TYPE_SWITCH
 *
 * DESCRIPTION
-* Encoded generic node type used in MAD attributes (13.4.8.2)
+* Encoded generic producer type used in Notice attribute (13.4.8.2)
 *
 * SOURCE
 */
-#define IB_NOTICE_NODE_TYPE_SWITCH (CL_NTOH32(0x000002))
+#define IB_NOTICE_PRODUCER_TYPE_SWITCH (CL_NTOH32(0x000002))
 /**********/

-/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_ROUTER
+/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_ROUTER
 * NAME
-* IB_NOTICE_NODE_TYPE_ROUTER
+* IB_NOTICE_PRODUCER_TYPE_ROUTER
 *
 * DESCRIPTION
-* Encoded generic node type used in MAD attributes (13.4.8.2)
+* Encoded generic producer type used in Notice attribute (13.4.8.2)
 *
 * SOURCE
 */
-#define IB_NOTICE_NODE_TYPE_ROUTER (CL_NTOH32(0x000003))
+#define IB_NOTICE_PRODUCER_TYPE_ROUTER (CL_NTOH32(0x000003))
 /**********/

-/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SUBN_MGMT
+/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CLASS_MGR
 * NAME
-* IB_NOTICE_NODE_TYPE_SUBN_MGMT
+* IB_NOTICE_NODE_TYPE_CLASS_MGR
 *
 * DESCRIPTION
-* Encoded generic node type used in MAD attributes (13.4.8.2).
-* Note that this value is not defined for the NodeType field
-* of the NodeInfo attribute (14.2.5.3).
+* Encoded generic producer type used in Notice attribute (13.4.8.2)
 *
 * SOURCE
 */
-#define IB_NOTICE_NODE_TYPE_SUBN_MGMT (CL_NTOH32(0x000004))
+#define IB_NOTICE_NODE_TYPE_CLASS_MGR (CL_NTOH32(0x000004))
 /**********/

 /****d* IBA Base: Constants/IB_MTU_LEN_TYPE

From kliteyn at dev.mellanox.co.il Thu Apr 26 07:36:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 26 Apr 2007 17:36:29 +0300 Subject: [ofa-general] [PATCH] OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in Notice attribute In-Reply-To: <1177597613.12542.75505.camel@hal.voltaire.com> References: <1177597613.12542.75505.camel@hal.voltaire.com> Message-ID: <4630B8ED.9030308@dev.mellanox.co.il> Hal Rosenstock wrote: > OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in > Notice attribute > > Update Notice Producer Type macros names Looks fine (and also makes sense). This is for master only, right? -- Yevgeny > Signed-off-by: Ira K.
Weiny > Signed-off-by: Hal Rosenstock > > diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h > index 94141ba..7bb81a8 100644 > --- a/osm/include/iba/ib_types.h > +++ b/osm/include/iba/ib_types.h > @@ -1557,54 +1557,52 @@ ib_class_is_rmpp( > #define IB_NODE_TYPE_ROUTER 0x03 > /**********/ > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CA > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_CA > * NAME > -* IB_NOTICE_NODE_TYPE_CA > +* IB_NOTICE_PRODUCER_TYPE_CA > * > * DESCRIPTION > -* Encoded generic node type used in MAD attributes (13.4.8.2) > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > * > * SOURCE > */ > -#define IB_NOTICE_NODE_TYPE_CA (CL_NTOH32(0x000001)) > +#define IB_NOTICE_PRODUCER_TYPE_CA (CL_NTOH32(0x000001)) > /**********/ > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SWITCH > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_SWITCH > * NAME > -* IB_NOTICE_NODE_TYPE_SWITCH > +* IB_NOTICE_PRODUCER_TYPE_SWITCH > * > * DESCRIPTION > -* Encoded generic node type used in MAD attributes (13.4.8.2) > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > * > * SOURCE > */ > -#define IB_NOTICE_NODE_TYPE_SWITCH (CL_NTOH32(0x000002)) > +#define IB_NOTICE_PRODUCER_TYPE_SWITCH (CL_NTOH32(0x000002)) > /**********/ > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_ROUTER > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_ROUTER > * NAME > -* IB_NOTICE_NODE_TYPE_ROUTER > +* IB_NOTICE_PRODUCER_TYPE_ROUTER > * > * DESCRIPTION > -* Encoded generic node type used in MAD attributes (13.4.8.2) > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > * > * SOURCE > */ > -#define IB_NOTICE_NODE_TYPE_ROUTER (CL_NTOH32(0x000003)) > +#define IB_NOTICE_PRODUCER_TYPE_ROUTER (CL_NTOH32(0x000003)) > /**********/ > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SUBN_MGMT > +/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CLASS_MGR > * NAME > -* IB_NOTICE_NODE_TYPE_SUBN_MGMT > +* 
IB_NOTICE_NODE_TYPE_CLASS_MGR > * > * DESCRIPTION > -* Encoded generic node type used in MAD attributes (13.4.8.2). > -* Note that this value is not defined for the NodeType field > -* of the NodeInfo attribute (14.2.5.3). > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > * > * SOURCE > */ > -#define IB_NOTICE_NODE_TYPE_SUBN_MGMT (CL_NTOH32(0x000004)) > +#define IB_NOTICE_NODE_TYPE_CLASS_MGR (CL_NTOH32(0x000004)) > /**********/ > > /****d* IBA Base: Constants/IB_MTU_LEN_TYPE > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Apr 26 07:41:22 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 10:41:22 -0400 Subject: [ofa-general] [PATCH] OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in Notice attribute In-Reply-To: <4630B8ED.9030308@dev.mellanox.co.il> References: <1177597613.12542.75505.camel@hal.voltaire.com> <4630B8ED.9030308@dev.mellanox.co.il> Message-ID: <1177598480.12542.76391.camel@hal.voltaire.com> On Thu, 2007-04-26 at 10:36, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > OpenSM/ib_types.h: Rename IB_NOTICE_* macros for Producer Type field in > > Notice attribute > > > > Update Notice Producer Type macros names > > Looks fine (and also makes sense). > This is for master only, right? Right. -- Hal > -- Yevgeny > > > Signed-off-by: Ira K. 
Weiny > > Signed-off-by: Hal Rosenstock > > > > diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h > > index 94141ba..7bb81a8 100644 > > --- a/osm/include/iba/ib_types.h > > +++ b/osm/include/iba/ib_types.h > > @@ -1557,54 +1557,52 @@ ib_class_is_rmpp( > > #define IB_NODE_TYPE_ROUTER 0x03 > > /**********/ > > > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CA > > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_CA > > * NAME > > -* IB_NOTICE_NODE_TYPE_CA > > +* IB_NOTICE_PRODUCER_TYPE_CA > > * > > * DESCRIPTION > > -* Encoded generic node type used in MAD attributes (13.4.8.2) > > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > > * > > * SOURCE > > */ > > -#define IB_NOTICE_NODE_TYPE_CA (CL_NTOH32(0x000001)) > > +#define IB_NOTICE_PRODUCER_TYPE_CA (CL_NTOH32(0x000001)) > > /**********/ > > > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SWITCH > > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_SWITCH > > * NAME > > -* IB_NOTICE_NODE_TYPE_SWITCH > > +* IB_NOTICE_PRODUCER_TYPE_SWITCH > > * > > * DESCRIPTION > > -* Encoded generic node type used in MAD attributes (13.4.8.2) > > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > > * > > * SOURCE > > */ > > -#define IB_NOTICE_NODE_TYPE_SWITCH (CL_NTOH32(0x000002)) > > +#define IB_NOTICE_PRODUCER_TYPE_SWITCH (CL_NTOH32(0x000002)) > > /**********/ > > > > -/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_ROUTER > > +/****d* IBA Base: Constants/IB_NOTICE_PRODUCER_TYPE_ROUTER > > * NAME > > -* IB_NOTICE_NODE_TYPE_ROUTER > > +* IB_NOTICE_PRODUCER_TYPE_ROUTER > > * > > * DESCRIPTION > > -* Encoded generic node type used in MAD attributes (13.4.8.2) > > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > > * > > * SOURCE > > */ > > -#define IB_NOTICE_NODE_TYPE_ROUTER (CL_NTOH32(0x000003)) > > +#define IB_NOTICE_PRODUCER_TYPE_ROUTER (CL_NTOH32(0x000003)) > > /**********/ > > > > -/****d* IBA Base: 
Constants/IB_NOTICE_NODE_TYPE_SUBN_MGMT > > +/****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_CLASS_MGR > > * NAME > > -* IB_NOTICE_NODE_TYPE_SUBN_MGMT > > +* IB_NOTICE_NODE_TYPE_CLASS_MGR > > * > > * DESCRIPTION > > -* Encoded generic node type used in MAD attributes (13.4.8.2). > > -* Note that this value is not defined for the NodeType field > > -* of the NodeInfo attribute (14.2.5.3). > > +* Encoded generic producer type used in Notice attribute (13.4.8.2) > > * > > * SOURCE > > */ > > -#define IB_NOTICE_NODE_TYPE_SUBN_MGMT (CL_NTOH32(0x000004)) > > +#define IB_NOTICE_NODE_TYPE_CLASS_MGR (CL_NTOH32(0x000004)) > > /**********/ > > > > /****d* IBA Base: Constants/IB_MTU_LEN_TYPE > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Thu Apr 26 07:46:35 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 10:46:35 -0400 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <20070426050230.GJ5217@mellanox.co.il> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> Message-ID: <1177598794.12542.76717.camel@hal.voltaire.com> On Thu, 2007-04-26 at 01:02, Michael S. Tsirkin wrote: > > > There also some few commands (ib*.pl) that are using a file > > > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology > > > > I'm not sure about this one. I need to think about this more. > > Not sure about the best placement, but surely a predictable name > in a world-writeable directory is a security risk? Is /var/cache world writeable ? I thought it was just world readable. 
If this were to be done, I would think the opensm directory underneath this would be more appropriate but I'm not leaning towards doing this since I think the current approach is more flexible and the topology can be supplied to all needed commands/scripts. -- Hal From philippe.gregoire at cea.fr Thu Apr 26 08:22:22 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Thu, 26 Apr 2007 17:22:22 +0200 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <1177598794.12542.76717.camel@hal.voltaire.com> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> Message-ID: <4630C3AE.6060200@cea.fr> Hal Rosenstock wrote: > On Thu, 2007-04-26 at 01:02, Michael S. Tsirkin wrote: > >>>> There also some few commands (ib*.pl) that are using a file >>>> /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology >>>> >>> I'm not sure about this one. I need to think about this more. >>> >> Not sure about the best placement, but surely a predictable name >> in a world-writeable directory is a security risk? >> > > Is /var/cache world writeable ? I thought it was just world readable. If > this were to be done, I would think the opensm directory underneath this > would be more appropriate but I'm not leaning towards doing this since I > think the current approach is more flexible and the topology can be > supplied to all needed commands/scripts. > > -- Hal > > > /var/cache is world readable. But the perl commands which generate /tmp/ibnetdiscover.topology use the ibnetdiscover command, which requires root privileges to work. So you don't need /var/cache to be world writeable. Anyway, putting the file in /var/cache does not prevent making it world readable.
grego $ ls -ld /var/cache
drwxr-xr-x 7 root root 4096 Feb 13 18:00 /var/cache
grego$ /usr/bin/ibnetdiscover -g
ibpanic: [22849] madrpc_init: can't open UMAD port ((null):0): (Permission denied)
grego$ ibprintswitch.pl -l
Execution of ibnetdiscover failed with errors

Phil

From mst at dev.mellanox.co.il Thu Apr 26 08:26:13 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 18:26:13 +0300 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> Message-ID: <20070426152613.GA15540@mellanox.co.il> > Quoting Moni Levy : > Subject: Re: pkey change handling patch > > On 4/26/07, Michael S. Tsirkin wrote: > >> Quoting Moni Levy : > >> Subject: Re: pkey change handling patch > >> > >> On 4/26/07, Michael S. Tsirkin wrote: > >> >> Quoting Moni Levy : > >> >> Subject: Re: pkey change handling patch > >> >> > >> >> On 4/25/07, Roland Dreier wrote: > >> >> > > >> >> >If you want to tackle more of the cache elimination plan we discussed > >> >> >that would be great though. > >> >> > > >> >> > >> >> One more issue I looked into that in my opinion needs to be discussed > >> >> is that we do not have an easy api that should provide us with the > >> >> whole PKEY table and one for the whole GID table for a specific port. > >> >> I know that ib_process_mad can be used it's just not user friendly.
> >> >> The only thing we have now is ib_query_pkey that gets us one pkey from > >> >> a specific index and is implemented to get the 32 pkeys chunks under > >> >> the hood (and something similar for gids). Do we need something like > >> >> ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > >> >> ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > >> >> instead the really needed 5 :) > >> > > >> >If the intended usage is to speed up ib_cache_update, the point is > >> >moot I think since we agreed we are getting rid of it. > >> > >> Do you suggest to implement ib_find_pkey & ib_find_gid by using > >> ib_process_mad ? > > > >Oh, I see what you mean. > > > >Let's do it over query_pkey/query_port for now. > >Long term providers will just optimize these I think. > > How ? Caching at device driver level ? Snooping MADs. provider can make sure it's synced wrt port events. -- MST From sean.hefty at intel.com Thu Apr 26 08:33:39 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Apr 2007 08:33:39 -0700 Subject: [ofa-general] bug in cma_iw_handler? (was hotplug event handle question) In-Reply-To: <000401c787ca$f37d7ee0$2ad8180a@amr.corp.intel.com> Message-ID: <000101c78818$45236a50$9b248686@amr.corp.intel.com> >Off the top of my head, I don't think so. Since the state is staying the same, >we now have the potential of another thread invoking a callback to the same id. >For example, the ib_cm could callback with a connect or reject event, which >gets >propagated to the user. The user will now see two callbacks for the same id. >Depending on the execution of the threads, one could completely run, with the >user wanting to destroy the associated id. The second callback would then be >invoked after the id was destroyed. > >The state combined with the dev_remove counter were used to serialize the >callbacks. So we still need something to serialize the callbacks. 
Steve,

Looking at the cma code, I see the following in cma_ib_handler:

	atomic_inc(&id_priv->dev_remove);
	if (!cma_comp(id_priv, CMA_CONNECT))
		goto out;

The cma_iw_handler only has:

	atomic_inc(&id_priv->dev_remove);

Without the state check, the cma_iw_handler can start running after we've received a device removal event, which can result in multiple callbacks or a callback after destruction. If you agree, I will add the state check to the cma_iw_handler.

- Sean

From suri at baymicrosystems.com Thu Apr 26 08:41:22 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 26 Apr 2007 11:41:22 -0400 Subject: [ofa-general] error installing ofed_1.2-rc2 on RHEL5 In-Reply-To: <20070426152613.GA15540@mellanox.co.il> References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> <20070426152613.GA15540@mellanox.co.il> Message-ID: <028b01c78819$5b10b510$1914a8c0@surioffice>

Folks: I just upgraded my system to RHEL5 and tried to install ofed_1.2-rc2.tgz (dated 18-April) and am getting errors. I picked the basic install+defaults for All selections.
uname -a prints:
Linux ib-interop1host 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Here is a partial output from the log file:
-----------------------------------------------------------
cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include"
configure: creating cache /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... gcc
checking for C compiler default output file name... configure: error: C compiler cannot create executables
See `config.log' for more details.
Failed to execute: cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs && env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include"
error: Bad exit status from /var/tmp/rpm-tmp.58894 (%install)

RPM build errors:
user vlad does not exist - using root
group vlad does not exist - using root
user vlad does not exist - using root
group vlad does not exist - using root
Bad exit status from /var/tmp/rpm-tmp.58894 (%install)
ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-libcxgb3 --with-libibcm --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm --with-mstflint --with-perftest --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-libcxgb3 --with-libibcm --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /root/ofed_1.2/OFED-1.2-rc2/SRPMS/ofa_user-1.2-rc2.src.rpm"
-------------------------------------------------------------

Is the "unknown-linux-gnu" against host linux the problem?
Many thanks, Suri From lawver1 at llnl.gov Thu Apr 26 08:44:10 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Thu, 26 Apr 2007 08:44:10 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070425124652.GG1624@mellanox.co.il> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> Message-ID: <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> I don't think this sequence should occur. My IP links have MTU of 9k and I have set the ib0 links to the same MTU. This sequence of 17k size packet every eighth packet seems to match my retransmit sequence as captured by tcpdump on the IP end of an iperf. The MSS in the IP header is 8960 so this would seem to be one complete IP packet and the data payload from a second. Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x177 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x178 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x179 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17a length 17964 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17b length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17c length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17d length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17e length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17f length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x180 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x181 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x182 length 17964 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x183 length 9004 connection 0x340406 Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x184 length 9004 connection 0x340406 At 05:46 AM 
4/25/2007, Michael S. Tsirkin wrote: > > Quoting Bryan Lawver : > > Subject: IPoIB forwarding > > > > I have a small test bed with 2 nodes with IB/OFED1.2/connected mode and a > > third node which has IP only and is connected to one of the IB nodes. In > > between are DDR IB switch and 10GE IP switch. The node with both IP > and IB > > interfaces is simply a IP router in this test setup. The IB only node has > > a subnet route to router node and the IP only node has a subnet route to > > the router node. > > > > When I launch an Iperf test from the IB (IPoIB) node to the IP node, I get > > very good throughput with no tuning (7.5gbs). > > > > When I launch from IP to the IB node, I get virtually no thorughput > > (2.5mbs). When I dropped the window size to 8k (iperf -w8k) the > throughput > > is 750mbs. > > > > Any suggestions, ideas? > >Some troubleshooting tips: > >Are some packets lost on the router? Checking packet counters >might give you a clue. > >Do you see some errors on one of the IB nodes? >Set debug_level=1 module parameter for ib_ipoib, and check >dmesg output while running the test. > >-- >MST From mst at dev.mellanox.co.il Thu Apr 26 08:58:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 18:58:55 +0300 Subject: [ofa-general] Fwd: Re: using stgit/guilt for public branches In-Reply-To: <000001c7880c$97b86e20$9b248686@amr.corp.intel.com> References: <20070426110202.GI32513@mellanox.co.il> <000001c7880c$97b86e20$9b248686@amr.corp.intel.com> Message-ID: <20070426155855.GB15540@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Fwd: Re: using stgit/guilt for public branches > > >FYI. I posted a question on git mailing list, asking about > >best ways to manage ofed repository. 
> >
> >http://article.gmane.org/gmane.comp.version-control.git/45519
> >
> >The conclusion so far seems to be that what we are doing (keeping patches
> >under version control) is basically the right way to do it:
> >
> >http://article.gmane.org/gmane.comp.version-control.git/45569
>
> I strongly prefer that the patches be applied to the actual code.

It looks nice on the surface, but won't work well in practice I'm afraid.
There's only one way to get close to this - agree that OFED, as a rule,
should not include code that is not upstream on kernel.org (we could add
out of kernel modules without touching core, but that would be all).
I would be fine with this rule, but would you?

> Someone who wants to use stgit can generate their own set of patches and
> manage them off the tree if they want.

This managing of two parallel trees would fall on the shoulders of OFED
maintainers, and I don't think it's practical to keep it up in parallel with
OFED integration which is also a full time job.

> Right now it's way too difficult to see what code is there,

How is it difficult? "What code is there" is not well-defined, since it
actually depends on the distro. For a specific distro - get the tarball,
run ./configure, and look.

> switch 'branches',

What does this mean? Do you want to branch off ofed 1.2?
git-branch and off you go - build scripts *already* can get a branch name.
Again, what's so difficult?

> generate patches,

Did you read the howto's? It's *really* easy to do using quilt, I do it all
the time:

	quilt new foo.patch
	quilt add
	quilt refresh

> etc.

I have to ask - did you read the actual thread? You seem to ignore all
arguments why using stg won't work for a public tree. In this thread, git
guys told us keeping patches under git as we already do seems to be the best
available option.

> We're doing everything manually and losing the majority
> of the benefits that git gives us.

This is clearly not true.
We are pulling code from multiple people completely automatically by git pull. We have a build system that can be pointed at any git tree and will generate a tarball from there. There's a hash based checksum that identifies each build in a unique way. All these are main benefits of git. And while git does not include tools to apply patches for you, patches are also applied by quilt, not manually. So - What is done manually? What benefits did we lose? -- MST From mst at dev.mellanox.co.il Thu Apr 26 09:00:29 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 19:00:29 +0300 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <4630B46B.1000006@voltaire.com> References: <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <4630B46B.1000006@voltaire.com> Message-ID: <20070426160029.GC15540@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [ewg] Re: pkey change handling patch > > Michael S. Tsirkin wrote: > >>Quoting Moni Levy : > >>Subject: Re: pkey change handling patch > >> > >>On 4/25/07, Roland Dreier wrote: > >> > >>>If you want to tackle more of the cache elimination plan we discussed > >>>that would be great though. > >>> > >> > >>One more issue I looked into that in my opinion needs to be discussed > >>is that we do not have an easy api that should provide us with the > >>whole PKEY table and one for the whole GID table for a specific port. > >>I know that ib_process_mad can be used it's just not user friendly. > >>The only thing we have now is ib_query_pkey that gets us one pkey from > >>a specific index and is implemented to get the 32 pkeys chunks under > >>the hood (and something similar for gids). Do we need something like > >>ib_get_pkey_table & ib_get_gid_table calls ? 
Just an interesting fact: > >>ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > >>instead the really needed 5 :) > > > > > > If the intended usage is to speed up ib_cache_update, the point is > > moot I think since we agreed we are getting rid of it. > > > > Which source file the without-caching functions should go to? verbs.c? -- MST From mst at dev.mellanox.co.il Thu Apr 26 09:08:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 19:08:25 +0300 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <1177598794.12542.76717.camel@hal.voltaire.com> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> Message-ID: <20070426160825.GD15540@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: [RFC] IB management changes proposal > > On Thu, 2007-04-26 at 01:02, Michael S. Tsirkin wrote: > > > > There also some few commands (ib*.pl) that are using a file > > > > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology > > > > > > I'm not sure about this one. I need to think about this more. > > > > Not sure about the best placement, but surely a predictable name > > in a world-writeable directory is a security risk? > > Is /var/cache world writeable ? I thought it was just world readable. If > this were to be done, I would think the opensm directory underneath this > would be more appropriate but I'm not leaning towards doing this since I > think the current approach is more flexible and the topology can be > supplied to all needed commands/scripts. I'm sorry, I'm not familiar with the code. I was just saying that using /tmp/ibnetdiscover.topology is clearly a security risk since /tmp is world-writeable. Isn't it? 
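
[Editorial sketch, not part of the original message.] The concern above is that a fixed file name in a world-writable directory lets another local user create that file, or a symlink with that name, in advance. A minimal illustration in Python of the usual mitigation is to let the system pick an unpredictable name and create the file exclusively. The function name write_topology_cache and the "ibnetdiscover." prefix are hypothetical; this does not describe how the ib*.pl scripts actually work.

```python
import os
import stat
import tempfile


def write_topology_cache(data: bytes, prefix: str = "ibnetdiscover.") -> str:
    """Write data to a newly created file with an unpredictable name.

    tempfile.mkstemp() creates the file exclusively (it fails rather than
    reusing an existing path) and, on POSIX, with owner-only permissions,
    so a file or symlink planted in advance is never opened.
    """
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".topology")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def is_world_writable(directory: str) -> bool:
    """Return True if users other than owner and group may write to directory."""
    return bool(os.stat(directory).st_mode & stat.S_IWOTH)


if __name__ == "__main__":
    cache = write_topology_cache(b"# hypothetical topology data\n")
    print("wrote", cache)
    print("temp dir world-writable:", is_world_writable(tempfile.gettempdir()))
    os.unlink(cache)
```

The caller then has to pass the returned path to whichever commands need the topology, which is close to the "supply the topology to the commands" approach Hal describes.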
-- MST From yosefe at voltaire.com Thu Apr 26 09:08:55 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 26 Apr 2007 19:08:55 +0300 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <20070426160029.GC15540@mellanox.co.il> References: <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <4630B46B.1000006@voltaire.com> <20070426160029.GC15540@mellanox.co.il> Message-ID: <4630CE97.1060602@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [ewg] Re: pkey change handling patch >> >>Michael S. Tsirkin wrote: >> >>>>Quoting Moni Levy : >>>>Subject: Re: pkey change handling patch >>>> >>>>On 4/25/07, Roland Dreier wrote: >>>> >>>> >>>>>If you want to tackle more of the cache elimination plan we discussed >>>>>that would be great though. >>>>> >>>> >>>>One more issue I looked into that in my opinion needs to be discussed >>>>is that we do not have an easy api that should provide us with the >>>>whole PKEY table and one for the whole GID table for a specific port. >>>>I know that ib_process_mad can be used it's just not user friendly. >>>>The only thing we have now is ib_query_pkey that gets us one pkey from >>>>a specific index and is implemented to get the 32 pkeys chunks under >>>>the hood (and something similar for gids). Do we need something like >>>>ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: >>>>ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands >>>>instead the really needed 5 :) >>> >>> >>>If the intended usage is to speed up ib_cache_update, the point is >>>moot I think since we agreed we are getting rid of it. >>> >> >>Which source file the without-caching functions should go to? > > > verbs.c? 
> Shouldn't it be in device.c along with ib_query_gid and ib_query_pkey? From mst at dev.mellanox.co.il Thu Apr 26 09:14:09 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Apr 2007 19:14:09 +0300 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> Message-ID: <20070426161409.GF15540@mellanox.co.il> > Quoting Bryan Lawver : > Subject: Re: IPoIB forwarding > > I don't think this sequence should occur. My IP links have MTU of 9k and I > have set the ib0 links to the same MTU. Not sure about this: IPoIB has different encapsulation header size. Maybe make IPoIB MTU bigger? > This sequence of 17k size packet > every eighth packet seems to match my retransmit sequence as captured by > tcpdump on the IP end of an iperf. The MSS in the IP header is 8960 so > this would seem to be one complete IP packet and the data payload from a > second. 
> > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x177 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x178 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x179 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17a length 17964 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17b length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17c length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17d length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17e length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17f length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x180 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x181 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x182 length 17964 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x183 length 9004 > connection 0x340406 > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x184 length 9004 > connection 0x340406 > I see you have enabled DEBUG_DATA for IPoIB CM - this will hurt performance. -- MST From mst at dev.mellanox.co.il Thu Apr 26 09:16:32 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 26 Apr 2007 19:16:32 +0300 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <4630CE97.1060602@voltaire.com> References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <4630B46B.1000006@voltaire.com> <20070426160029.GC15540@mellanox.co.il> <4630CE97.1060602@voltaire.com> Message-ID: <20070426161632.GG15540@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [ewg] Re: pkey change handling patch > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [ewg] Re: pkey change handling patch > >> > >>Michael S. Tsirkin wrote: > >> > >>>>Quoting Moni Levy : > >>>>Subject: Re: pkey change handling patch > >>>> > >>>>On 4/25/07, Roland Dreier wrote: > >>>> > >>>> > >>>>>If you want to tackle more of the cache elimination plan we discussed > >>>>>that would be great though. > >>>>> > >>>> > >>>>One more issue I looked into that in my opinion needs to be discussed > >>>>is that we do not have an easy api that should provide us with the > >>>>whole PKEY table and one for the whole GID table for a specific port. > >>>>I know that ib_process_mad can be used it's just not user friendly. > >>>>The only thing we have now is ib_query_pkey that gets us one pkey from > >>>>a specific index and is implemented to get the 32 pkeys chunks under > >>>>the hood (and something similar for gids). Do we need something like > >>>>ib_get_pkey_table & ib_get_gid_table calls ? Just an interesting fact: > >>>>ib_cache_update on a Mellanox card performs 64 + 64 + 1 = 129 commands > >>>>instead the really needed 5 :) > >>> > >>> > >>>If the intended usage is to speed up ib_cache_update, the point is > >>>moot I think since we agreed we are getting rid of it. > >>> > >> > >>Which source file the without-caching functions should go to? 
> >
> >
> > verbs.c?
> >
> Shouldn't it be in device.c along with ib_query_gid and ib_query_pkey?

No idea why ib_query_gid is in device.c. Roland?

-- 
MST

From mst at dev.mellanox.co.il Thu Apr 26 09:28:56 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Apr 2007 19:28:56 +0300
Subject: [ofa-general] FYI: 2.6.21 kernel is out
Message-ID: <20070426162856.GH15540@mellanox.co.il>

http://kerneltrap.org/node/8103

-- 
MST

From Arkady.Kanevsky at netapp.com Thu Apr 26 09:34:42 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 26 Apr 2007 12:34:42 -0400
Subject: [ofa-general] Sockets over RDMA
Message-ID:

Colleagues,
Monday at 4:00pm at Sonoma workshop we will have a discussion
on how to proceed on sockets over RDMA.
All are welcome.
Thanks,
Arkady

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

From rick.jones2 at hp.com Thu Apr 26 09:46:46 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 26 Apr 2007 09:46:46 -0700
Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf
In-Reply-To: <20070426045855.GH5217@mellanox.co.il>
References: <462E8257.9090103@hp.com> <462F8E07.4050000@hp.com> <462FC6C6.9050700@hp.com> <20070426045855.GH5217@mellanox.co.il>
Message-ID: <4630D776.5060704@hp.com>

> Please note that you should *only* ever stick the SDP family value
> in the socket(3) call. All addresses for connect, bind etc
> are AF_INET, since SDP uses IP addresses for everything.

Sounds like something trying to be just a little bit pregnant.

Thankfully, I'm only munging the getaddrinfo() data for the local endpoint.
rick jones From lawver1 at llnl.gov Thu Apr 26 09:58:19 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Thu, 26 Apr 2007 09:58:19 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070426161409.GF15540@mellanox.co.il> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> Message-ID: <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it appears that two payloads are queued at ipoib which combines them into a single 17920 payload with assumingly correct IP header (40) and IB header (4). The application or TCP stack does not acknowledge this double packet ie. it does not ACK until each of the 8960 packets are resent individually. Being an IB newbie, I am guessing this combining is allowable but may violate TCP protocol. 09:44:04.767653 IP 172.16.13.2.1751 > wopr1.5001: P 80665:89625(8960) ack 1 win 35 09:44:04.775729 IP 172.16.13.2.1751 > wopr1.5001: . 89625:107545(17920) ack 1 win 35 09:44:04.775751 IP wopr1.5001 > 172.16.13.2.1751: . ack 89625 win 257 09:44:04.803046 IP 172.16.13.2.1751 > wopr1.5001: . 107545:116505(8960) ack 1 win 35 09:44:04.803069 IP wopr1.5001 > 172.16.13.2.1751: . ack 89625 win 257 09:44:04.830370 IP 172.16.13.2.1751 > wopr1.5001: P 116505:125465(8960) ack 1 win 35 09:44:04.830392 IP wopr1.5001 > 172.16.13.2.1751: . ack 89625 win 257 09:44:04.857685 IP 172.16.13.2.1751 > wopr1.5001: . 89625:98585(8960) ack 1 win 35 09:44:04.857712 IP wopr1.5001 > 172.16.13.2.1751: . ack 98585 win 257 09:44:05.126062 IP 172.16.13.2.1751 > wopr1.5001: P 98585:107545(8960) ack 1 win 35 09:44:05.126086 IP wopr1.5001 > 172.16.13.2.1751: . ack 125465 win 257 At 09:14 AM 4/26/2007, Michael S. Tsirkin wrote: > > Quoting Bryan Lawver : > > Subject: Re: IPoIB forwarding > > > > I don't think this sequence should occur. 
My IP links have MTU of 9k > and I > > have set the ib0 links to the same MTU. > >Not sure about this: IPoIB has different encapsulation header size. >Maybe make IPoIB MTU bigger? > > > This sequence of 17k size packet > > every eighth packet seems to match my retransmit sequence as captured by > > tcpdump on the IP end of an iperf. The MSS in the IP header is 8960 so > > this would seem to be one complete IP packet and the data payload from a > > second. > > > > > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x177 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x178 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x179 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17a length 17964 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17b length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17c length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17d length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17e length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x17f length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x180 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x181 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x182 length 17964 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x183 length 9004 > > connection 0x340406 > > Apr 26 08:30:27 wopr0 ib0: sending packet: head 0x184 length 9004 > > connection 0x340406 > > > >I see you have enabled DEBUG_DATA for IPoIB CM - this will hurt performance. 
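
[Editorial sketch, not part of the original messages.] A quick arithmetic check of the two packet lengths that appear in the debug log above, using the sizes quoted in this thread: a TCP MSS of 8960, 40 bytes of IP and TCP headers, and a 4-byte IPoIB header. The constant and function names are made up for illustration, and the header sizes are taken from the thread rather than measured.

```python
# Sizes as stated in the thread; treat them as assumptions, not measurements.
TCP_MSS = 8960        # TCP payload bytes per segment
IP_TCP_HEADERS = 40   # combined IP and TCP headers ("IP header (40)" in the thread)
IPOIB_HEADER = 4      # IPoIB encapsulation header ("IB header (4)" in the thread)


def ipoib_packet_length(segments: int) -> int:
    """Length of one packet carrying `segments` MSS-sized payloads
    behind a single set of IP/TCP headers and one IPoIB header."""
    return segments * TCP_MSS + IP_TCP_HEADERS + IPOIB_HEADER


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} segment(s): {ipoib_packet_length(n)} bytes")
```

One segment gives 9004 and two segments give 17964, which matches the two lengths in the log. That is consistent with the observation that the long packets carry exactly two MSS-sized payloads behind one set of headers; it says nothing about which layer produced them.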
>
>-- 
>MST

From gregkh at suse.de Thu Apr 26 09:55:02 2007
From: gregkh at suse.de (Greg KH)
Date: Thu, 26 Apr 2007 09:55:02 -0700
Subject: [ofa-general] [patch 02/33] IB/mthca: Fix data corruption after FMR unmap on Sinai
In-Reply-To: <20070426165445.GA1898@kroah.com>
References: <20070426165111.393445007@mini.kroah.org>
Message-ID: <20070426165502.GC1898@kroah.com>

-stable review patch. If anyone has any objections, please let us know.

------------------
From: Michael S. Tsirkin

In mthca_arbel_fmr_unmap(), the high bits of the key are masked off.
This gets rid of the effect of adjust_key(), which makes sure that
bits 3 and 23 of the key are equal when the Sinai throughput
optimization is enabled, and so it may happen that an FMR will end up
with bits 3 and 23 in the key being different. This causes data
corruption, because when enabling the throughput optimization, the
driver promises the HCA firmware that bits 3 and 23 of all memory keys
will always be equal.

Fix by re-applying adjust_key() after masking the key.

Thanks to Or Gerlitz for reproducing the problem, and Ariel Shahar for
help in debug.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier
Signed-off-by: Greg Kroah-Hartman

---
 drivers/infiniband/hw/mthca/mthca_mr.c | 1 +
 1 file changed, 1 insertion(+)

--- a/drivers/infiniband/hw/mthca/mthca_mr.c
+++ b/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -751,6 +751,7 @@ void mthca_arbel_fmr_unmap(struct mthca_
 
 	key = arbel_key_to_hw_index(fmr->ibmr.lkey);
 	key &= dev->limits.num_mpts - 1;
+	key = adjust_key(dev, key);
 	fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key);
 
 	fmr->maps = 0;

-- 

From mst at dev.mellanox.co.il Thu Apr 26 11:06:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Apr 2007 21:06:18 +0300
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>
Message-ID: <20070426180618.GJ15540@mellanox.co.il>

> Quoting Bryan Lawver :
> Subject: Re: IPoIB forwarding
>
> Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it appears
> that two payloads are queued at ipoib which combines them into a single
> 17920 payload with assumingly correct IP header (40) and IB header
> (4). The application or TCP stack does not acknowledge this double packet
> ie. it does not ACK until each of the 8960 packets are resent
> individually. Being an IB newbie, I am guessing this combining is
> allowable but may violate TCP protocol.

IPoIB does nothing like this - it's just a network device so it sends all
packets out as is.

-- 
MST

From rdreier at cisco.com Thu Apr 26 11:20:48 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 26 Apr 2007 11:20:48 -0700
Subject: [ofa-general] What's in infiniband.git for 2.6.22
Message-ID:

Here's a short summary of what my plans for 2.6.22 are. For
reference, everything is in my git tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git

Please let me know if you have any thoughts on these plans, or if
there is something that you feel is missing from this list.

 * mlx4 driver for new Mellanox ConnectX HCAs. This is the connectx
   branch in git. I will merge this soon, after a few more cleanups
   and one final posting for review. There are actually two parts
   here:

   - "IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules"
     This touches the core and all drivers, but I think it is a
     better design and actually helps other drivers too in addition
     to being a prerequisite for the mlx4 driver. I haven't heard
     anyone speak out against it so I plan to go ahead and merge it.

   - "IB/mlx4: Add driver for Mellanox ConnectX HCAs"
     I'll fold all the mlx4_core and mlx4_ib code into this patch
     and merge it.

   - "mlx4_eth: Add 10 gigabit ethernet driver for Mellanox ConnectX"
     This will NOT be merged for 2.6.22 at least. For one thing it
     is pretty much just a stub that doesn't do anything useful.
     When there is working 10 gig support, I'll post this to lkml
     and netdev for review, but this is 2.6.23 stuff at the soonest.

 * IPoIB NAPI work. This is the ipoib branch in git. Again, there
   are really two parts here:

   - "IB: Return "maybe missed event" hint from ib_req_notify_cq()"
     This extends the API in a way that lets us implement NAPI, but
     may be useful for other things too. It touches all the
     drivers, and I still need to finish updating cxgb3 to work
     correctly. I haven't heard anything negative about this, so
     I'll fix it up, post it one more time for review, and plan on
     merging it.

   - "IPoIB: Convert to NAPI"
     This is the actual conversion of IPoIB to use NAPI, based on
     the previous extension to ib_req_notify_cq(). There seems to
     be a need to merge this, based on people's experiences with
     congestion collapse under high load. So I'm planning on
     merging this too.

 * I also have the following bunch of more minor patches queued, and
   I will ask Linus to pull them soon. The majority of them are
   ipath fixes (and I hope Qlogic will send fixes for the two other
   bugs that I know of, namely corrupting the list of pending mmaps
   if an object is destroyed before userspace mmaps it, and doing
   spin_lock_irq() from interrupt context). There are a few other
   cleanups and minor fixes scattered around.
Here's the shortlog of the for-2.6.22 branch: Arthur Jones (2): IB/ipath: Call free_irq() on chip specific initialization failure IB/ipath: Force PIOAvail update entry point Bryan O'Sullivan (17): IB/ipath: Add ability to set and clear IB local loopback IB/ipath: Fix user memory region creation when IOMMU present IB/ipath: Definitions of two RXE parity err bits were reversed IB/ipath: Fix up some debug messages IB/ipath: Change packet problems vs chip errors handling and reporting IB/ipath: Fix bad argument to clear_bit() IB/ipath: Fix CQ flushing when QP is modified to error state IB/ipath: Remove unused ipath_read_kreg64_port() IB/ipath: Fix calculation for number of kernel PIO buffers IB/ipath: Discard multicast packets without a GRH IB/ipath: Print better error messages if kernel is misconfigured IB/ipath: Improve handling and reporting of parity errors IB/ipath: On unrecoverable errors, force link down, LEDs off IB/ipath: Prevent random program use of diags interface IB/ipath: Disable IB link earlier in shutdown sequence IB/ipath: Don't allow QPs 0 and 1 to be opened multiple times IB/ipath: Fix unit selection when all CPU affinity bits set Hal Rosenstock (3): IB/umad: Fix declaration of dev_map[] IB/mad: Change SMI to use enums rather than magic return codes IB/umad: Clarify documentation of transaction ID Joachim Fenkes (2): IB/ehca: Implement modify_port IB: Set class_dev->dev in core for nice device symlink Mark Debbage (1): IB/ipath: Allow receive ports mapped into userspace to be shared Michael Albaugh (1): IB/ipath: Fix driver crash (in interrupt or during unload) after chip reset Ralph Campbell (8): IB/ipath: Don't initialize port memory for subports IB/ipath: Fix SRQ limit event causing dropped CQ entry IB/ipath: NMI cpu lockup if local loopback used IB/ipath: Support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC IB/ipath: Fix QP error completion queue entries IB/ipath: Fix PSN update for RC retries IB/ipath: Fix port sharing on powerpc 
IB/ipath: Fix RDMA reads of length zero and error handling Robert Walsh (4): IB/ipath: Check reserved memory keys IB/ipath: Remove duplicate stuff from ipath_verbs.h IB/ipath: Check that a UD work request's address handle is valid IB/ipath: Fix WC format drift between user and kernel space Roland Dreier (6): IB: Remove reference to obsolete CONFIG_IPATH_CORE IPoIB: Remove pointless opcode field from debugging output IB/mthca: Update HCA firmware revisions IB/mthca: Fix mthca_write_mtt() on HCAs with hidden memory IB/mthca: Simplify CQ cleaning in mthca_free_qp() IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements Sean Hefty (5): RDMA/ucma: Simplify ucma_get_event() IB/ucm: Simplify ib_ucm_event() IB/sa: Set src_path_bits correctly in ib_init_ah_from_path() IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr IB/umad: Implement GRH handling for sent/received MADs From rdreier at cisco.com Thu Apr 26 11:42:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 11:42:18 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This is the first batch of merges for 2.6.22 -- mostly ipath driver fixes: Arthur Jones (2): IB/ipath: Call free_irq() on chip specific initialization failure IB/ipath: Force PIOAvail update entry point Bryan O'Sullivan (17): IB/ipath: Add ability to set and clear IB local loopback IB/ipath: Fix user memory region creation when IOMMU present IB/ipath: Definitions of two RXE parity err bits were reversed IB/ipath: Fix up some debug messages IB/ipath: Change packet problems vs chip errors handling and reporting IB/ipath: Fix bad argument to clear_bit() IB/ipath: Fix CQ flushing when QP is modified to error state IB/ipath: Remove unused ipath_read_kreg64_port() 
IB/ipath: Fix calculation for number of kernel PIO buffers IB/ipath: Discard multicast packets without a GRH IB/ipath: Print better error messages if kernel is misconfigured IB/ipath: Improve handling and reporting of parity errors IB/ipath: On unrecoverable errors, force link down, LEDs off IB/ipath: Prevent random program use of diags interface IB/ipath: Disable IB link earlier in shutdown sequence IB/ipath: Don't allow QPs 0 and 1 to be opened multiple times IB/ipath: Fix unit selection when all CPU affinity bits set Hal Rosenstock (3): IB/umad: Fix declaration of dev_map[] IB/mad: Change SMI to use enums rather than magic return codes IB/umad: Clarify documentation of transaction ID Joachim Fenkes (2): IB/ehca: Implement modify_port IB: Set class_dev->dev in core for nice device symlink Mark Debbage (1): IB/ipath: Allow receive ports mapped into userspace to be shared Michael Albaugh (1): IB/ipath: Fix driver crash (in interrupt or during unload) after chip reset Ralph Campbell (8): IB/ipath: Don't initialize port memory for subports IB/ipath: Fix SRQ limit event causing dropped CQ entry IB/ipath: NMI cpu lockup if local loopback used IB/ipath: Support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC IB/ipath: Fix QP error completion queue entries IB/ipath: Fix PSN update for RC retries IB/ipath: Fix port sharing on powerpc IB/ipath: Fix RDMA reads of length zero and error handling Robert Walsh (4): IB/ipath: Check reserved memory keys IB/ipath: Remove duplicate stuff from ipath_verbs.h IB/ipath: Check that a UD work request's address handle is valid IB/ipath: Fix WC format drift between user and kernel space Roland Dreier (6): IB: Remove reference to obsolete CONFIG_IPATH_CORE IPoIB: Remove pointless opcode field from debugging output IB/mthca: Update HCA firmware revisions IB/mthca: Fix mthca_write_mtt() on HCAs with hidden memory IB/mthca: Simplify CQ cleaning in mthca_free_qp() IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements Sean 
Hefty (5): RDMA/ucma: Simplify ucma_get_event() IB/ucm: Simplify ib_ucm_event() IB/sa: Set src_path_bits correctly in ib_init_ah_from_path() IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr IB/umad: Implement GRH handling for sent/received MADs Documentation/infiniband/user_mad.txt | 8 + drivers/Makefile | 1 - drivers/infiniband/core/mad.c | 34 +- drivers/infiniband/core/sa_query.c | 24 +- drivers/infiniband/core/smi.c | 86 ++-- drivers/infiniband/core/smi.h | 34 +- drivers/infiniband/core/sysfs.c | 1 + drivers/infiniband/core/ucm.c | 23 +- drivers/infiniband/core/ucma.c | 22 +- drivers/infiniband/core/user_mad.c | 20 +- drivers/infiniband/hw/amso1100/c2_provider.c | 1 - drivers/infiniband/hw/cxgb3/iwch_provider.c | 1 - drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_hca.c | 55 ++- drivers/infiniband/hw/ehca/ehca_main.c | 1 + drivers/infiniband/hw/ehca/hcp_if.c | 24 + drivers/infiniband/hw/ehca/hcp_if.h | 4 + drivers/infiniband/hw/ipath/ipath_common.h | 23 +- drivers/infiniband/hw/ipath/ipath_cq.c | 38 +- drivers/infiniband/hw/ipath/ipath_debug.h | 1 + drivers/infiniband/hw/ipath/ipath_diag.c | 11 +- drivers/infiniband/hw/ipath/ipath_driver.c | 123 +++-- drivers/infiniband/hw/ipath/ipath_eeprom.c | 4 + drivers/infiniband/hw/ipath/ipath_file_ops.c | 287 +++++---- drivers/infiniband/hw/ipath/ipath_iba6110.c | 152 +++-- drivers/infiniband/hw/ipath/ipath_iba6120.c | 73 ++- drivers/infiniband/hw/ipath/ipath_init_chip.c | 86 ++- drivers/infiniband/hw/ipath/ipath_intr.c | 100 ++- drivers/infiniband/hw/ipath/ipath_kernel.h | 10 +- drivers/infiniband/hw/ipath/ipath_keys.c | 14 +- drivers/infiniband/hw/ipath/ipath_mr.c | 12 +- drivers/infiniband/hw/ipath/ipath_qp.c | 133 +++-- drivers/infiniband/hw/ipath/ipath_rc.c | 920 ++++++++++++++----------- drivers/infiniband/hw/ipath/ipath_registers.h | 22 +- drivers/infiniband/hw/ipath/ipath_ruc.c | 63 +- drivers/infiniband/hw/ipath/ipath_stats.c | 16 +- drivers/infiniband/hw/ipath/ipath_uc.c | 
6 +- drivers/infiniband/hw/ipath/ipath_ud.c | 8 +- drivers/infiniband/hw/ipath/ipath_verbs.c | 15 +- drivers/infiniband/hw/ipath/ipath_verbs.h | 57 +- drivers/infiniband/hw/mthca/mthca_main.c | 10 +- drivers/infiniband/hw/mthca/mthca_mr.c | 6 +- drivers/infiniband/hw/mthca/mthca_provider.c | 1 - drivers/infiniband/hw/mthca/mthca_qp.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 64 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 8 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 +- 47 files changed, 1620 insertions(+), 1002 deletions(-) From chu11 at llnl.gov Thu Apr 26 12:01:54 2007 From: chu11 at llnl.gov (Al Chu) Date: Thu, 26 Apr 2007 12:01:54 -0700 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <1177598794.12542.76717.camel@hal.voltaire.com> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> Message-ID: <1177614114.11219.10.camel@cardanus.llnl.gov> On Thu, 2007-04-26 at 10:46 -0400, Hal Rosenstock wrote: > On Thu, 2007-04-26 at 01:02, Michael S. Tsirkin wrote: > > > > There also some few commands (ib*.pl) that are using a file > > > > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology > > > > > > I'm not sure about this one. I need to think about this more. > > > > Not sure about the best placement, but surely a predictable name > > in a world-writeable directory is a security risk? I've seen a number of user tools place cache files/temporary files to the user's home directory. For example: ~/.ofed/ibnetdiscover.topology It should protect against the world readable/writeable issues I think. Just a thought. Al > Is /var/cache world writeable ? I thought it was just world readable. 
If > this were to be done, I would think the opensm directory underneath this > would be more appropriate but I'm not leaning towards doing this since I > think the current approach is more flexible and the topology can be > supplied to all needed commands/scripts. > > -- Hal > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From rdreier at cisco.com Thu Apr 26 12:20:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 12:20:58 -0700 Subject: [ofa-general] Re: pkey change handling patch In-Reply-To: <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> (Moni Levy's message of "Thu, 26 Apr 2007 16:53:27 +0300") References: <20070417223547.GI25314@mellanox.co.il> <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> Message-ID: > > Let's do it over query_pkey/query_port for now. > > Long term providers will just optimize these I think. > > How ? Caching at device driver level ? Yes... for the most part, it should be much easier to do within the driver. For example mthca, mlx4 and ipath at least know exactly when the P_Key table is being changed and can just snoop the operation without needing to worry about deferring things to a workqueue, etc. ehca seems to have a hypercall that returns the whole P_Key table in one go. 
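As a rough illustration of the driver-level caching described above, a minimal user-space C sketch might look like the following. It is not code from mthca, mlx4, ipath or ehca: the names pkey_cache, pkey_cache_init, pkey_cache_snoop_write and pkey_cache_query, and the table length of 8, are all hypothetical. The only assumption is that the driver sees every P_Key table write and can record it as it happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PKEY_TABLE_LEN 8	/* hypothetical; a real HCA reports its own table size */

/* Hypothetical per-port cache, not a structure from any real driver. */
struct pkey_cache {
	uint16_t table[PKEY_TABLE_LEN];
};

static void pkey_cache_init(struct pkey_cache *c)
{
	assert(c != NULL);
	memset(c->table, 0, sizeof(c->table));
}

/*
 * Would be called from wherever the driver observes a P_Key table write,
 * so the cache is updated in place rather than refreshed from a workqueue.
 * Returns 0 on success, -1 if the index is out of range.
 */
static int pkey_cache_snoop_write(struct pkey_cache *c, unsigned int index,
				  uint16_t pkey)
{
	if (index >= PKEY_TABLE_LEN)
		return -1;
	c->table[index] = pkey;
	return 0;
}

/* Serves a query from the cache; returns 0 on success, -1 on a bad index. */
static int pkey_cache_query(const struct pkey_cache *c, unsigned int index,
			    uint16_t *pkey)
{
	if (index >= PKEY_TABLE_LEN)
		return -1;
	*pkey = c->table[index];
	return 0;
}
```

The sketch leaves out locking entirely; a real driver would presumably need to protect the table against concurrent snoops and queries.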
I think it would be fine to change the interface to something like query_pkey(struct ib_device *dev, u8 port, u16 start_index, u16 num_pkeys, u16 *pkey) that returns a block of P_Keys in one go, but I don't see it as that critical. From rdreier at cisco.com Thu Apr 26 12:22:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 12:22:13 -0700 Subject: [ofa-general] Re: [ewg] Re: pkey change handling patch In-Reply-To: <20070426161632.GG15540@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 26 Apr 2007 19:16:32 +0300") References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <4630B46B.1000006@voltaire.com> <20070426160029.GC15540@mellanox.co.il> <4630CE97.1060602@voltaire.com> <20070426161632.GG15540@mellanox.co.il> Message-ID: > > Shouldn't it be in device.c along with ib_query_gid and ib_query_pkey? > > No idea why is ib_query_gid in device.c. Roland? I think my reasoning was that ib_query_gid / ib_query_pkey were generic device operations, and the stuff in verbs.c was for things operating on a specific object like a PD, QP, etc. But I don't think it matters much either way. - R. From johnip at sgi.com Thu Apr 26 12:23:04 2007 From: johnip at sgi.com (John Partridge) Date: Thu, 26 Apr 2007 14:23:04 -0500 Subject: [ofa-general] Re: opensmd init.d script question In-Reply-To: <1177534453.12542.9088.camel@hal.voltaire.com> References: <462FB486.9090403@sgi.com> <1177534453.12542.9088.camel@hal.voltaire.com> Message-ID: <4630FC18.6090304@sgi.com> Hal Rosenstock wrote: > Hi John, > One can run 2 OpenSMs on different CA/ports on a single machine. Yes, in fact this is exactly what we have been doing. Although we have been running OpenSM by hand to achieve this. > The main things in doing this is setting them up to use different > directories. 
This is accomplished via setting OSM_CACHE_DIR. You will > want to configure dump_files_dir and log_file to be different. Yes, in fact that is what we have done. > Also, you > will likely want different subnet_prefix configured in each subnet > (opensm.opts). There may be other configuration files different as well > based on what your requirements are. > > Hope this helps. Yes it has thank you. > > As to how to do this with opensmd, I'm not sure as I don't work with > that. When you figure this out, it would be useful if you posted the > information. Thanks. OK, I will feed back any changes to opensmd and the /etc/opensm.conf files. Who should these go to ? Regards John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From halr at voltaire.com Thu Apr 26 12:31:03 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2007 15:31:03 -0400 Subject: [ofa-general] Re: opensmd init.d script question In-Reply-To: <4630FC18.6090304@sgi.com> References: <462FB486.9090403@sgi.com> <1177534453.12542.9088.camel@hal.voltaire.com> <4630FC18.6090304@sgi.com> Message-ID: <1177615862.12542.93878.camel@hal.voltaire.com> On Thu, 2007-04-26 at 15:23, John Partridge wrote: > Hal Rosenstock wrote: > > Hi John, > > One can run 2 OpenSMs on different CA/ports on a single machine. > > Yes, in fact this is exactly what we have been doing. Although we > have been running OpenSM by hand to achieve this. > > > The main things in doing this is setting them up to use different > > directories. This is accomplished via setting OSM_CACHE_DIR. You will > > want to configure dump_files_dir and log_file to be different. > > Yes, in fact that is what we have done. > > > Also, you > > will likely want different subnet_prefix configured in each subnet > > (opensm.opts). There may be other configuration files different as well > > based on what your requirements are. > > > > Hope this helps. > > Yes it has thank you. 
> >
> > As to how to do this with opensmd, I'm not sure as I don't work with
> > that. When you figure this out, it would be useful if you posted the
> > information. Thanks.
>
> OK, I will feed back any changes to opensmd and the /etc/opensm.conf
> files. Who should these go to?

Vlad (and me; I'd at least like to see them).

-- Hal

> Regards
> John

From rdreier at cisco.com Thu Apr 26 12:53:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 26 Apr 2007 12:53:35 -0700
Subject: [ofa-general] Re: IPOIB NAPI
In-Reply-To: <20070228071706.GA22246@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 28 Feb 2007 09:17:06 +0200")
References: <20070228071706.GA22246@mellanox.co.il>
Message-ID:

 > > poll-cq
 > > notify-cq, if missed_event && netif_rx_reschedule()
 > >         return 1
 > >
 > > vs.
 > > poll-cq,
 > > notify-cq, if missed_event && netif_rx_reschedule()
 > >         poll again
 > >         return 0
 > >
 > > It seems ehca delivers packets much faster than other HCAs, so "poll again"
 > > would stay in the loop many, many times. Since the above change doesn't impact
 > > other HCAs, I would recommend it. I saw the same implementation in other Ethernet
 > > drivers.
 >
 > I have not benchmarked this, but actually the "return 1" version makes sense to
 > me too: since a new completion was observed after notify-cq, we likely currently
 > have HCA writing new completions into the CQ at a high rate, so it makes sense
 > to delay polling by a few cycles, and reduce the number of interrupts in this
 > way.

Yes, this does make sense. It's kind of a cheap way to hold off a
little and try to get more work to do before we poll again, without
trying something more complex that is likely not to work well.
So just to confirm, the version that everyone likes is:

int ipoib_poll(struct net_device *dev, int *budget)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	int max = min(*budget, dev->quota);
	int done;
	int t;
	int empty;
	int n, i;

repoll:
	done = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			// [completion handling deleted for clarity]
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget -= done;

	if (empty) {
		netif_rx_complete(dev);
		if (unlikely(ib_req_notify_cq(priv->cq,
					      IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS))) {
			netif_rx_reschedule(dev, 0);
			return 1;
		}

		return 0;
	}

	return 1;
}

with the significant part being that we return 1 instead of repolling
after we reschedule the polling routine.

 - R.

From swise at opengridcomputing.com Thu Apr 26 13:20:57 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Apr 2007 15:20:57 -0500
Subject: [ofa-general] [PATCH 2.6.22 0/5] iw_cxgb3: Bug Fixes + Firmware update
Message-ID: <20070426202057.24234.56383.stgit@dell3.ogc.int>

Hey Roland,

Here are some bug fixes to the iw_cxgb3 driver that I'd like merged for
2.6.22. The 1st patch has been posted before, but I didn't see it in
your for-2.6.22 branch, so I'm posting it again.

Jeff,

The last patch updates the cxgb3 required firmware version. It is
included in this series because it's required by the patch preceding it
in the series.

Steve.

Shortlog:

Steve Wise:
      Fix TERM codes.
      Fail qp creation if the requested max_inline is too large.
      Initialize cpu_idx field in cpl_close_listserv_req message.
      Support for new abort logic.
      Update required firmware revision to 4.0.0.

From swise at opengridcomputing.com Thu Apr 26 13:21:02 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Apr 2007 15:21:02 -0500
Subject: [ofa-general] [PATCH 2.6.22 1/5] iw_cxgb3: Fix TERM codes.
In-Reply-To: <20070426202057.24234.56383.stgit@dell3.ogc.int> References: <20070426202057.24234.56383.stgit@dell3.ogc.int> Message-ID: <20070426202102.24234.87832.stgit@dell3.ogc.int> Fix TERM codes. Fix TERMINATE layer, type, and ecode values based on conformance testing. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 69 ++++++++++++++++++--------------- 1 files changed, 38 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 0a472c9..714dddb 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -471,43 +471,62 @@ int iwch_bind_mw(struct ib_qp *qp, return err; } -static void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, int tagged) +static inline void build_term_codes(struct respQ_msg_t *rsp_msg, + u8 *layer_type, u8 *ecode) { - switch (t3err) { + int status = TPT_ERR_INTERNAL_ERR; + int tagged = 0; + int opcode = -1; + int rqtype = 0; + int send_inv = 0; + + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + opcode = CQE_OPCODE(rsp_msg->cqe); + rqtype = RQ_TYPE(rsp_msg->cqe); + send_inv = (opcode == T3_SEND_WITH_INV) || + (opcode == T3_SEND_WITH_SE_INV); + tagged = (opcode == T3_RDMA_WRITE) || + (rqtype && (opcode == T3_READ_RESP)); + } + + switch (status) { case TPT_ERR_STAG: - if (tagged == 1) { - *layer_type = LAYER_DDP|DDP_TAGGED_ERR; - *ecode = DDPT_INV_STAG; - } else if (tagged == 2) { + if (send_inv) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + } else { *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; *ecode = RDMAP_INV_STAG; } break; case TPT_ERR_PDID: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + if ((opcode == T3_SEND_WITH_INV) || + (opcode == T3_SEND_WITH_SE_INV)) + *ecode = RDMAP_CANT_INV_STAG; + else + *ecode = RDMAP_STAG_NOT_ASSOC; + break; case TPT_ERR_QPID: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + break; case 
TPT_ERR_ACCESS: - if (tagged == 1) { - *layer_type = LAYER_DDP|DDP_TAGGED_ERR; - *ecode = DDPT_STAG_NOT_ASSOC; - } else if (tagged == 2) { - *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; - *ecode = RDMAP_STAG_NOT_ASSOC; - } + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_ACC_VIOL; break; case TPT_ERR_WRAP: *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; *ecode = RDMAP_TO_WRAP; break; case TPT_ERR_BOUND: - if (tagged == 1) { + if (tagged) { *layer_type = LAYER_DDP|DDP_TAGGED_ERR; *ecode = DDPT_BASE_BOUNDS; - } else if (tagged == 2) { + } else { *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; *ecode = RDMAP_BASE_BOUNDS; - } else { - *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; - *ecode = DDPU_MSG_TOOBIG; } break; case TPT_ERR_INVALIDATE_SHARED_MR: @@ -591,8 +610,6 @@ int iwch_post_terminate(struct iwch_qp * { union t3_wr *wqe; struct terminate_message *term; - int status; - int tagged = 0; struct sk_buff *skb; PDBG("%s %d\n", __FUNCTION__, __LINE__); @@ -610,17 +627,7 @@ int iwch_post_terminate(struct iwch_qp * /* immediate data starts here. */ term = (struct terminate_message *)wqe->send.sgl; - if (rsp_msg) { - status = CQE_STATUS(rsp_msg->cqe); - if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) - tagged = 1; - if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || - (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) - tagged = 2; - } else { - status = TPT_ERR_INTERNAL_ERR; - } - build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_term_codes(rsp_msg, &term->layer_etype, &term->ecode); build_fw_riwrh((void *)wqe, T3_WR_SEND, T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, qhp->ep->hwtid, 5); From swise at opengridcomputing.com Thu Apr 26 13:21:09 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Apr 2007 15:21:09 -0500 Subject: [ofa-general] [PATCH 2.6.22 2/5] iw_cxgb3: Fail qp creation if the requested max_inline is too large. 
In-Reply-To: <20070426202057.24234.56383.stgit@dell3.ogc.int> References: <20070426202057.24234.56383.stgit@dell3.ogc.int> Message-ID: <20070426202107.24234.91018.stgit@dell3.ogc.int> Fail qp creation if the requested max_inline is too large. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_wr.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 3 +++ 2 files changed, 4 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index 90d7b89..ff7290e 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -38,6 +38,7 @@ #include #include "firmware_exports.h" #define T3_MAX_SGE 4 +#define T3_MAX_INLINE 64 #define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) #define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..b1128ec 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -780,6 +780,9 @@ static struct ib_qp *iwch_create_qp(stru if (rqsize > T3_MAX_RQ_SIZE) return ERR_PTR(-EINVAL); + if (attrs->cap.max_inline_data > T3_MAX_INLINE) + return ERR_PTR(-EINVAL); + /* * NOTE: The SQ and total WQ sizes don't need to be * a power of two. However, all the code assumes From swise at opengridcomputing.com Thu Apr 26 13:21:15 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Apr 2007 15:21:15 -0500 Subject: [ofa-general] [PATCH 2.6.22 3/5] iw_cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message. In-Reply-To: <20070426202057.24234.56383.stgit@dell3.ogc.int> References: <20070426202057.24234.56383.stgit@dell3.ogc.int> Message-ID: <20070426202114.24234.18730.stgit@dell3.ogc.int> Initialize cpu_idx field in cpl_close_listserv_req message. 
Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 2d2de9b..a990423 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1187,6 +1187,7 @@ static int listen_stop(struct iwch_liste
 	}
 	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
 	ep->com.tdev->send(ep->com.tdev, skb);

From swise at opengridcomputing.com Thu Apr 26 13:21:20 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Apr 2007 15:21:20 -0500
Subject: [ofa-general] [PATCH 2.6.22 4/5] iw_cxgb3: Support for new abort logic.
In-Reply-To: <20070426202057.24234.56383.stgit@dell3.ogc.int>
References: <20070426202057.24234.56383.stgit@dell3.ogc.int>
Message-ID: <20070426202120.24234.62250.stgit@dell3.ogc.int>

Support for new abort logic.

The HW now posts 2 ABORT_RPL and/or PEER_ABORT_REQ messages. We need to
handle them by silently dropping the 1st but marking that we're ready for
the final message. This plugs some close races between the uP and HW.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   18 ++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h |    6 ++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index a990423..3a46a97 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1107,6 +1107,15 @@ static int abort_rpl(struct t3cdev *tdev
 	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	/*
+	 * We get 2 abort replies from the HW. The first one must
+	 * be ignored except for scribbling that we need one more.
+ */ + if (!(ep->flags & ABORT_REQ_IN_PROGRESS)) { + ep->flags |= ABORT_REQ_IN_PROGRESS; + return CPL_RET_BUF_DONE; + } + close_complete_upcall(ep); state_set(&ep->com, DEAD); release_ep_resources(ep); @@ -1474,6 +1483,15 @@ static int peer_abort(struct t3cdev *tde int ret; int state; + /* + * We get 2 peer aborts from the HW. The first one must + * be ignored except for scribbling that we need one more. + */ + if (!(ep->flags & PEER_ABORT_IN_PROGRESS)) { + ep->flags |= PEER_ABORT_IN_PROGRESS; + return CPL_RET_BUF_DONE; + } + if (is_neg_adv_abort(req->status)) { PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 0c6f281..21a388c 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -143,6 +143,11 @@ enum iwch_ep_state { DEAD, }; +enum iwch_ep_flags { + PEER_ABORT_IN_PROGRESS = (1 << 0), + ABORT_REQ_IN_PROGRESS = (1 << 1), +}; + struct iwch_ep_common { struct iw_cm_id *cm_id; struct iwch_qp *qp; @@ -181,6 +186,7 @@ struct iwch_ep { u16 plen; u32 ird; u32 ord; + u32 flags; }; static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) From swise at opengridcomputing.com Thu Apr 26 13:21:26 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Apr 2007 15:21:26 -0500 Subject: [ofa-general] [PATCH 2.6.22 5/5] iw_cxgb3: Update required firmware revision to 4.0.0. In-Reply-To: <20070426202057.24234.56383.stgit@dell3.ogc.int> References: <20070426202057.24234.56383.stgit@dell3.ogc.int> Message-ID: <20070426202126.24234.71523.stgit@dell3.ogc.int> Update required firmware revision to 4.0.0. 
Signed-off-by: Steve Wise --- drivers/net/cxgb3/version.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h index 042e27e..b112317 100644 --- a/drivers/net/cxgb3/version.h +++ b/drivers/net/cxgb3/version.h @@ -38,7 +38,7 @@ #define DRV_NAME "cxgb3" #define DRV_VERSION "1.0-ko" /* Firmware version */ -#define FW_VERSION_MAJOR 3 -#define FW_VERSION_MINOR 3 +#define FW_VERSION_MAJOR 4 +#define FW_VERSION_MINOR 0 #define FW_VERSION_MICRO 0 #endif /* __CHELSIO_VERSION_H */ From johnip at sgi.com Thu Apr 26 13:21:50 2007 From: johnip at sgi.com (John Partridge) Date: Thu, 26 Apr 2007 15:21:50 -0500 Subject: [ofa-general] Re: opensmd init.d script question In-Reply-To: <1177615862.12542.93878.camel@hal.voltaire.com> References: <462FB486.9090403@sgi.com> <1177534453.12542.9088.camel@hal.voltaire.com> <4630FC18.6090304@sgi.com> <1177615862.12542.93878.camel@hal.voltaire.com> Message-ID: <463109DE.1000509@sgi.com> Hal Rosenstock wrote: >> OK, I will feed back any changes to opensmd and the /etc/opensm.conf >> files. Who should these go to ? > > Vlad (and me; I'd at least like to see them). OK will do thanks for your help John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From swise at opengridcomputing.com Thu Apr 26 14:11:36 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Apr 2007 16:11:36 -0500 Subject: [ofa-general] Re: bug in cma_iw_handler? (was hotplug event handle question) In-Reply-To: <000101c78818$45236a50$9b248686@amr.corp.intel.com> References: <000101c78818$45236a50$9b248686@amr.corp.intel.com> Message-ID: <1177621896.30865.119.camel@stevo-desktop> On Thu, 2007-04-26 at 08:33 -0700, Sean Hefty wrote: > >Off the top of my head, I don't think so. Since the state is staying the same, > >we now have the potential of another thread invoking a callback to the same id. 
> >For example, the ib_cm could callback with a connect or reject event, which > >gets > >propagated to the user. The user will now see two callbacks for the same id. > >Depending on the execution of the threads, one could completely run, with the > >user wanting to destroy the associated id. The second callback would then be > >invoked after the id was destroyed. > > > >The state combined with the dev_remove counter were used to serialize the > >callbacks. So we still need something to serialize the callbacks. > > Steve, > > Looking at the cma code, I see the following in cma_ib_handler: > > atomic_inc(&id_priv->dev_remove); > if (!cma_comp(id_priv, CMA_CONNECT)) > goto out; > > The cma_iw_handler only has: > > atomic_inc(&id_priv->dev_remove); > > without the state check, the cma_iw_handler can start running after we've > received a device removal event, which can result in multiple callbacks or a > callback after destruction. > > If you agree, I will add the state check to the cma_iw_handler. > I think you're right. We need the same logic in cma_iw_handler()... From suri at baymicrosystems.com Thu Apr 26 14:48:16 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 26 Apr 2007 17:48:16 -0400 Subject: [ofa-general] RE: error installing ofed_1.2-rc2 on RHEL5 References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il><6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com><6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com><20070426133442.GJ32513@mellanox.co.il><6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com><20070426134331.GL32513@mellanox.co.il><6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> <20070426152613.GA15540@mellanox.co.il> Message-ID: <029901c7884c$9c14cf50$1914a8c0@surioffice> After some digging around on the net figured out that the "error" actually meant no "c" program would compile when supplied with the same arguments as found on the error line in "config.log". 
In my case it happens to be:
gcc -m32 -g -O2 -L/usr/lib -I../libibverbs/include -L. conftest.c >&5

The -m32 instead of -m64 seems to be the culprit! Looking at the configure file and running /usr/bin/file on the xxx.o
seems to yield 64_bit though....

Any ideas.....

Many thanks in advance,
Suri

> -----Original Message-----
> From: Suresh Shelvapille [mailto:suri at baymicrosystems.com]
> Sent: Thursday, April 26, 2007 11:41 AM
> To: 'general at lists.openfabrics.org'
> Cc: 'Doug Ledford'
> Subject: error installing ofed_1.2-rc2 on RHEL5
>
> Folks:
>
> I just upgraded my system to RHEL5 and tried to install ofed_1.2-rc2.tgz
> (dated 18-April) and am getting errors. I picked the basic install+defaults for
> All selections.
>
> uname -a prints:
> Linux ib-interop1host 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64
> x86_64 x86_64 GNU/Linux
>
>
> Here is a partial output from the log file:
> -----------------------------------------------------------
> cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
> Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes
> ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes
> ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-
> file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir
> /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include"
> configure: creating cache /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
> checking for a BSD-compatible install... /usr/bin/install -c
> checking whether build environment is sane... yes
> checking for gawk... gawk
> checking whether make sets $(MAKE)... yes
> checking build system type... x86_64-unknown-linux-gnu
> checking host system type... x86_64-unknown-linux-gnu
> checking for style of include used by make... GNU
> checking for gcc...
gcc > checking for C compiler default output file name... configure: error: C compiler cannot create > executables > See `config.log' for more details. > Failed to execute: cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs && env > ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes > ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes > ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache- > file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir > /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include" > error: Bad exit status from /var/tmp/rpm-tmp.58894 (%install) > > > RPM build errors: > user vlad does not exist - using root > group vlad does not exist - using root > user vlad does not exist - using root > group vlad does not exist - using root > Bad exit status from /var/tmp/rpm-tmp.58894 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix > /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-libcxgb3 --with-libibcm > --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm --with-mstflint --with- > perftest --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-libcxgb3 -- > with-libibcm --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm -- > sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' > /root/ofed_1.2/OFED-1.2-rc2/SRPMS/ofa_user-1.2-rc2.src.rpm" > ------------------------------------------------------------- > > > Is the "unknown-linux-gnu" against host linux the problem? 
> > Many thanks, > Suri From mshefty at ichips.intel.com Thu Apr 26 15:00:27 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Apr 2007 15:00:27 -0700 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache In-Reply-To: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> References: <000001c782de$bbf69b00$07fd070a@amr.corp.intel.com> Message-ID: <463120FB.9080203@ichips.intel.com> I've updated these patches based on the feedback received so far: * Added definition for missing trap 259. * Replaced miscdevice usage with module parameters. * Added module parameter to control SA event registration. There is still one issue wrt the API that I'd like to get more opinions on. Should the cache be integrated into the ib_sa and reside below the existing ib_sa_path_rec_get()? - Sean From abhinav.vishnu at gmail.com Thu Apr 26 15:34:16 2007 From: abhinav.vishnu at gmail.com (Abhinav Vishnu) Date: Thu, 26 Apr 2007 18:34:16 -0400 Subject: [ofa-general] APM Example Message-ID: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> Hi List, I was wondering whether there is an example of using APM with OpenFabrics. I do not see an example in the examples directory. With OFED 1.2 rc2 too, I have not seen such an example. Mellanox VAPI used to have support for completion events with different APM states. However, with OpenFabrics, I see that the support for the MIGRATED -> ARMED event is not there (verbs.h include file). Is there any specific reason for this? Thanks much, -- Abhinav Vishnu Graduate Student Computer Science and Engineering The Ohio State University
From rdreier at cisco.com Thu Apr 26 15:37:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 15:37:53 -0700 Subject: [ofa-general] Re: [ewg] APM Example In-Reply-To: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> (Abhinav Vishnu's message of "Thu, 26 Apr 2007 18:34:16 -0400") References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> Message-ID: > Mellanox VAPI used to have support for completion events with > different APM states. However, with OpenFabrics, I see that the > support for the MIGRATED -> ARMED event is not there (verbs.h include > file). Is there any specific reason for this? I don't know what you're referring to. Of course a transition from MIGRATED -> ARMED is not done through a work queue and hence doesn't generate a completion. There is an affiliated async event defined, IB_EVENT_PATH_MIG, that will be generated when a migration takes place, but AFAIK the only thing that happens when arming APM is that the modify QP operation succeeds. What specific VAPI thing are you thinking of? 
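[Editorial sketch] The distinction drawn in the reply above (arming reports success only through the modify QP return code, while an actual migration raises an async event such as IB_EVENT_PATH_MIG) can be illustrated with a toy model. This is a deliberately simplified two-state sketch that follows the wording of the message, not the full state machine of the InfiniBand specification; every name below (toy_qp, toy_arm, toy_migrate) is hypothetical and none of it is the verbs API.

```c
#include <stddef.h>

/* Toy path-migration states, simplified to the two named in the thread. */
enum toy_mig_state { TOY_MIGRATED, TOY_ARMED };

/* Hypothetical stand-in for a consumer's async event handler. */
typedef void (*toy_async_cb)(void *ctx);

struct toy_qp {
	enum toy_mig_state state;
	toy_async_cb on_path_mig;	/* models delivery of a path-migrated event */
	void *ctx;
};

/* Models arming through modify QP: the return code is the only feedback. */
static int toy_arm(struct toy_qp *qp)
{
	if (qp->state != TOY_MIGRATED)
		return -1;
	qp->state = TOY_ARMED;		/* no event and no completion here */
	return 0;
}

/* Models a migration to the alternate path: this is where the event fires. */
static int toy_migrate(struct toy_qp *qp)
{
	if (qp->state != TOY_ARMED)
		return -1;
	qp->state = TOY_MIGRATED;
	if (qp->on_path_mig != NULL)
		qp->on_path_mig(qp->ctx);
	return 0;
}

/* Example handler: counts how many migration events were delivered. */
static void toy_count_event(void *ctx)
{
	++*(int *)ctx;
}
```

In this model a caller learns that arming worked from toy_arm() returning 0, and only toy_migrate() invokes the handler.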
From dledford at redhat.com Thu Apr 26 15:42:23 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 26 Apr 2007 18:42:23 -0400 Subject: [ofa-general] Re: error installing ofed_1.2-rc2 on RHEL5 In-Reply-To: <029901c7884c$9c14cf50$1914a8c0@surioffice> References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il><6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com><6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com><20070426133442.GJ32513@mellanox.co.il><6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com><20070426134331.GL32513@mellanox.co.il><6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> <20070426152613.GA15540@mellanox.co.il> <029901c7884c$9c14cf50$1914a8c0@surioffice> Message-ID: <46312ACF.5010403@redhat.com> Suresh Shelvapille wrote: > After some digging around on the net figured out that the "error" actually meant > no "c" program would compile when supplied with the same arguments as found on > the error line in "config.log". > > In my case it happens to be: > gcc -m32 -g -O2 -L/usr/lib -I../libibverbs/include -L. conftest.c >&5 > > The -m32 instead of -m64 seems to be the culprit! > > Looking at the configure file and running /usr/bin/file on the xxx.o seems to yield > 64_bit though.... > > Any ideas..... This means you haven't installed the 32-bit version of gcc, yet ./configure is attempting to use it. Install the 32-bit gcc (and the other associated rpms necessary to support 32-bit development on your 64-bit machine) and it should work fine then. > Many thanks in advance, > Suri > >> -----Original Message----- >> From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] >> Sent: Thursday, April 26, 2007 11:41 AM >> To: 'general at lists.openfabrics.org' >> Cc: 'Doug Ledford' >> Subject: error installing ofed_1.2-rc2 on RHEL5 >> >> Folks: >> >> I just upgraded my system to RHEL5 and tried to install ofed_1.2-rc2.tgz >> (dated 18-April) and am getting errors. 
>> >> Many thanks, >> Suri > -- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs can be found at http://people.redhat.com/dledford/Infiniband From rdreier at cisco.com Thu Apr 26 15:43:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 15:43:19 -0700 Subject: [ofa-general] [PATCH][RFC] IB: Return "maybe missed event" hint from ib_req_notify_cq() In-Reply-To: (Roland Dreier's message of "Thu, 26 Apr 2007 11:20:48 -0700") References: Message-ID: > - "IB: Return "maybe missed event" hint from ib_req_notify_cq()" > This extends the API in a way that lets us implement NAPI, but may > be useful for other things too. It touches all the drivers, and I > still need to finish updating cxgb3 to work correctly. I haven't > heard anything negative about this, so I'll fix it up, post it one > more time for review, and plan on merging it. As promised, here is that patch for review, with a cxgb3 implementation included. --- The semantics defined by the InfiniBand specification say that completion events are only generated when a completion is added to a completion queue (CQ) after completion notification is requested. In other words, this means that the following race is possible: while (CQ is not empty) ib_poll_cq(CQ); // new completion is added after while loop is exited ib_req_notify_cq(CQ); // no event is generated for the existing completion To close this race, the IB spec recommends doing another poll of the CQ after requesting notification. However, it is not always possible to arrange code this way (for example, we have found that NAPI for IPoIB cannot poll after requesting notification). Also, some hardware (e.g. Mellanox HCAs) actually will generate an event for completions added before the call to ib_req_notify_cq() -- which is allowed by the spec, since there's no way for any upper-layer consumer to know exactly when a completion was really added -- so the extra poll of the CQ is just a waste. 
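[Editorial sketch] The race described above, and the extra poll that closes it, can be reproduced with a small user-space model. This is only a sketch under stated assumptions: the toy CQ is a counter rather than a real completion queue, it models the strict behavior in which an event fires only for completions added after notification was requested, and all names (toy_cq, toy_poll, toy_req_notify, toy_drain_and_arm) are hypothetical rather than part of the verbs API.

```c
#include <stdbool.h>

/* Toy completion queue: a pending counter plus a one-shot armed flag. */
struct toy_cq {
	int pending;	/* completions waiting to be polled */
	bool armed;	/* notification has been requested */
	int events;	/* completion events generated so far */
};

/* A completion arrives; an event fires only if the CQ was armed first. */
static void toy_add_completion(struct toy_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = false;	/* notification is one-shot */
		cq->events++;
	}
}

/* Poll one completion: returns 1 if one was consumed, 0 if the CQ is empty. */
static int toy_poll(struct toy_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Request notification for the next completion. */
static void toy_req_notify(struct toy_cq *cq)
{
	cq->armed = true;
}

/*
 * Drain the CQ, request notification, then poll once more.
 * 'late' completions are injected between the drain and the request
 * to reproduce the race window. Returns the number handled.
 */
static int toy_drain_and_arm(struct toy_cq *cq, int late)
{
	int handled = 0;

	while (toy_poll(cq))
		handled++;

	while (late-- > 0)
		toy_add_completion(cq);	/* lands in the window: no event */

	toy_req_notify(cq);

	while (toy_poll(cq))		/* the extra poll closes the race */
		handled++;

	return handled;
}
```

Without the final loop, a completion injected in the window would stay in the queue with no event pending to announce it; with the loop it is picked up, at the cost of one extra poll even when nothing was missed.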
Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for ib_req_notify_cq() so that it can return a hint about whether a completion may have been added before the request for notification. The return value of ib_req_notify_cq() is extended as follows: < 0 means an error occurred while requesting notification == 0 means notification was requested successfully, and if IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events were missed and it is safe to wait for another event. > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed in. It means that the consumer must poll the CQ again to make sure it is empty to avoid the race described above. We add a flag to enable this behavior rather than turning it on unconditionally, because checking for missed events may incur significant overhead for some low-level drivers, and consumers that don't care about the results of this test shouldn't be forced to pay for the test. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 16 ++++++++--- drivers/infiniband/hw/cxgb3/cxio_hal.c | 3 ++ drivers/infiniband/hw/cxgb3/iwch_provider.c | 8 +++-- drivers/infiniband/hw/ehca/ehca_iverbs.h | 2 +- drivers/infiniband/hw/ehca/ehca_reqs.c | 14 +++++++-- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 8 +++++ drivers/infiniband/hw/ipath/ipath_cq.c | 15 +++++++--- drivers/infiniband/hw/ipath/ipath_verbs.h | 2 +- drivers/infiniband/hw/mthca/mthca_cq.c | 12 +++++--- drivers/infiniband/hw/mthca/mthca_dev.h | 4 +- include/rdma/ib_verbs.h | 40 +++++++++++++++++++++------ 12 files changed, 93 insertions(+), 33 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 04a9db5..fa58200 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void 
c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 5175c99..d2b3366 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,17 +217,19 @@ int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; + unsigned long flags; + int ret = 0; cq = to_c2cq(ibcq); shared = cq->mq.peer; - if (notify == IB_CQ_NEXT_COMP) + if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_NEXT_COMP) writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); - else if (notify == IB_CQ_SOLICITED) + else if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, &shared->notification_type); else return -EINVAL; @@ -241,7 +243,13 @@ int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) */ readb(&shared->armed); - return 0; + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { + spin_lock_irqsave(&cq->lock, flags); + ret = !c2_mq_empty(&cq->mq); + spin_unlock_irqrestore(&cq->lock, flags); + } + + return ret; } static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index f5e9aee..76049af 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -114,7 +114,10 @@ int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, return -EIO; } } + + 
return 1; } + return 0; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 24e0df0..e89957f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -292,7 +292,7 @@ static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) #endif } -static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct iwch_dev *rhp; struct iwch_cq *chp; @@ -303,7 +303,7 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) chp = to_iwch_cq(ibcq); rhp = chp->rhp; - if (notify == IB_CQ_SOLICITED) + if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) cq_op = CQ_ARM_SE; else cq_op = CQ_ARM_AN; @@ -317,9 +317,11 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); spin_unlock_irqrestore(&chp->lock, flag); - if (err) + if (err < 0) printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, chp->cq.cqid); + if (err > 0 && !(flags & IB_CQ_REPORT_MISSED_EVENTS)) + err = 0; return err; } diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..9e5460d 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,7 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc); int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 08d3f89..caec9de 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c 
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,11 +634,13 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + unsigned long spl_flags; + int ret = 0; - switch (cq_notify) { + switch (notify_flags & IB_CQ_SOLICITED_MASK) { case IB_CQ_SOLICITED: hipz_set_cqx_n0(my_cq, 1); break; @@ -649,5 +651,11 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) return -EINVAL; } - return 0; + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { + spin_lock_irqsave(&my_cq->spinlock, spl_flags); + ret = ipz_qeit_is_valid(&my_cq->ipz_queue); + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + } + + return ret; } diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index 8199c45..57f141a 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -140,6 +140,14 @@ static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) return cqe; } +static inline int ipz_qeit_is_valid(struct ipz_queue *queue) +{ + struct ehca_cqe *cqe = ipz_qeit_get(queue); + u32 cqe_flags = cqe->cqe_flags; + + return cqe_flags >> 7 == (queue->toggle_state & 1); +} + /* * returns and resets Queue Entry iterator * returns address (kv) of first Queue Entry diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 87462e0..9582145 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -306,17 +306,18 @@ int ipath_destroy_cq(struct ib_cq *ibcq) /** * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue - * @notify: the type of notification to request + * @notify_flags: the type of notification to request * * Returns 0 for success. 
* * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; + int ret = 0; spin_lock_irqsave(&cq->lock, flags); /* @@ -324,9 +325,15 @@ int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). */ if (cq->notify != IB_CQ_NEXT_COMP) - cq->notify = notify; + cq->notify = notify_flags & IB_CQ_SOLICITED_MASK; + + if ((notify_flags & IB_CQ_REPORT_MISSED_EVENTS) && + cq->queue->head != cq->queue->tail) + ret = 1; + spin_unlock_irqrestore(&cq->lock, flags); - return 0; + + return ret; } /** diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index c0c8d5b..6b3b770 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -716,7 +716,7 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index efd79ef..cf0868f 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -726,11 +726,12 @@ repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) { __be32 doorbell[2]; - doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ? 
+ doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : MTHCA_TAVOR_CQ_DB_REQ_NOT) | to_mcq(cq)->cqn); @@ -743,7 +744,7 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; @@ -755,7 +756,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) doorbell[0] = ci; doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | - (notify == IB_CQ_SOLICITED ? 1 : 2)); + ((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? 1 : 2)); mthca_write_db_rec(doorbell, cq->arm_db); @@ -766,7 +768,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) wmb(); doorbell[0] = cpu_to_be32((sn << 28) | - (notify == IB_CQ_SOLICITED ? + ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? 
MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : MTHCA_ARBEL_CQ_DB_REQ_NOT) | cq->cqn); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index b7e42ef..9bae3cc 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -495,8 +495,8 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev); int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..529a69d 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -431,9 +431,11 @@ struct ib_wc { u8 port_num; /* valid only for DR SMPs on switches */ }; -enum ib_cq_notify { - IB_CQ_SOLICITED, - IB_CQ_NEXT_COMP +enum ib_cq_notify_flags { + IB_CQ_SOLICITED = 1 << 0, + IB_CQ_NEXT_COMP = 1 << 1, + IB_CQ_SOLICITED_MASK = IB_CQ_SOLICITED | IB_CQ_NEXT_COMP, + IB_CQ_REPORT_MISSED_EVENTS = 1 << 2, }; enum ib_srq_attr_mask { @@ -987,7 +989,7 @@ struct ib_device { struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); int (*req_notify_cq)(struct ib_cq *cq, - enum ib_cq_notify cq_notify); + enum ib_cq_notify_flags flags); int (*req_ncomp_notif)(struct ib_cq *cq, int wc_cnt); struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, @@ -1414,14 +1416,34 @@ int ib_peek_cq(struct ib_cq *cq, int wc_cnt); /** * ib_req_notify_cq - Request completion notification on a CQ. * @cq: The CQ to generate an event for. - * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will - * occur on the next solicited event. 
If set to %IB_CQ_NEXT_COMP, - * notification will occur on the next completion. + * @flags: + * Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP + * to request an event on the next solicited event or next work + * completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS + * may also be |ed in to request a hint about missed events, as + * described below. + * + * Return Value: + * < 0 means an error occurred while requesting notification + * == 0 means notification was requested successfully, and if + * IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events + * were missed and it is safe to wait for another event. In + * this case is it guaranteed that any work completions added + * to the CQ since the last CQ poll will trigger a completion + * notification event. + * > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed + * in. It means that the consumer must poll the CQ again to + * make sure it is empty to avoid missing an event because of a + * race between requesting notification and an entry being + * added to the CQ. This return value means it is possible + * (but not guaranteed) that a work completion has been added + * to the CQ since the last poll without triggering a + * completion notification event. */ static inline int ib_req_notify_cq(struct ib_cq *cq, - enum ib_cq_notify cq_notify) + enum ib_cq_notify_flags flags) { - return cq->device->req_notify_cq(cq, cq_notify); + return cq->device->req_notify_cq(cq, flags); } /** -- 1.5.1.2 From rdreier at cisco.com Thu Apr 26 15:45:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 15:45:44 -0700 Subject: [ofa-general] [PATCH][RFC] IPoIB: Convert to NAPI In-Reply-To: (Roland Dreier's message of "Thu, 26 Apr 2007 15:43:19 -0700") References: Message-ID: And here's the patch to convert IPoIB over to using NAPI... --- Convert the IP-over-InfiniBand network device driver over to using NAPI to handle all completions (both receive and send). 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 89 ++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 + 4 files changed, 74 insertions(+), 20 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index fd55826..15867af 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -311,6 +311,7 @@ extern struct workqueue_struct *ipoib_workqueue; /* functions */ +int ipoib_poll(struct net_device *dev, int *budget); void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 2b242a4..e1fdae1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -418,7 +418,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; - netif_rx_ni(skb); + netif_receive_skb(skb); repost: if (unlikely(ipoib_cm_post_receive(dev, wr_id))) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index ba0ee5c..e3cc241 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -226,7 +226,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; - netif_rx_ni(skb); + netif_receive_skb(skb); } else { ipoib_dbg_data(priv, "dropping loopback packet\n"); dev_kfree_skb_any(skb); @@ -280,28 +280,65 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) wc->status, wr_id, wc->vendor_err); } -static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc) +int 
ipoib_poll(struct net_device *dev, int *budget) { - if (wc->wr_id & IPOIB_CM_OP_SRQ) - ipoib_cm_handle_rx_wc(dev, wc); - else if (wc->wr_id & IPOIB_OP_RECV) - ipoib_ib_handle_rx_wc(dev, wc); - else - ipoib_ib_handle_tx_wc(dev, wc); + struct ipoib_dev_priv *priv = netdev_priv(dev); + int max = min(*budget, dev->quota); + int done; + int t; + int empty; + int n, i; + +repoll: + done = 0; + empty = 0; + + while (max) { + t = min(IPOIB_NUM_WC, max); + n = ib_poll_cq(priv->cq, t, priv->ibwc); + + for (i = 0; i < n; ++i) { + struct ib_wc *wc = priv->ibwc + i; + + if (wc->wr_id & IPOIB_CM_OP_SRQ) { + ++done; + --max; + ipoib_cm_handle_rx_wc(dev, wc); + } else if (wc->wr_id & IPOIB_OP_RECV) { + ++done; + --max; + ipoib_ib_handle_rx_wc(dev, wc); + } else + ipoib_ib_handle_tx_wc(dev, wc); + } + + if (n != t) { + empty = 1; + break; + } + } + + dev->quota -= done; + *budget -= done; + + if (empty) { + netif_rx_complete(dev); + if (unlikely(ib_req_notify_cq(priv->cq, + IB_CQ_NEXT_COMP | + IB_CQ_REPORT_MISSED_EVENTS))) { + netif_rx_reschedule(dev, 0); + return 1; + } + + return 0; + } + + return 1; } void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) { - struct net_device *dev = (struct net_device *) dev_ptr; - struct ipoib_dev_priv *priv = netdev_priv(dev); - int n, i; - - ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); - do { - n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); - for (i = 0; i < n; ++i) - ipoib_ib_handle_wc(dev, priv->ibwc + i); - } while (n == IPOIB_NUM_WC); + netif_rx_schedule(dev_ptr); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -514,9 +551,10 @@ int ipoib_ib_dev_stop(struct net_device *dev) struct ib_qp_attr qp_attr; unsigned long begin; struct ipoib_tx_buf *tx_req; - int i; + int i, n; clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); + netif_poll_disable(dev); ipoib_cm_dev_stop(dev); @@ -568,6 +606,16 @@ int ipoib_ib_dev_stop(struct net_device *dev) goto timeout; } + do { + n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < 
n; ++i) { + if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) + ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); + else + ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); + } + } while (n == IPOIB_NUM_WC); + msleep(1); } @@ -596,6 +644,9 @@ timeout: msleep(1); } + netif_poll_enable(dev); + ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP); + return 0; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index f2a40ae..a69c472 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -952,6 +952,8 @@ static void ipoib_setup(struct net_device *dev) dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; dev->neigh_setup = ipoib_neigh_setup_dev; + dev->poll = ipoib_poll; + dev->weight = 100; dev->watchdog_timeo = HZ; -- 1.5.1.2 From abhinav.vishnu at gmail.com Thu Apr 26 15:49:36 2007 From: abhinav.vishnu at gmail.com (Abhinav Vishnu) Date: Thu, 26 Apr 2007 18:49:36 -0400 Subject: [ofa-general] Re: [ewg] APM Example In-Reply-To: References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> Message-ID: <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> Roland, Thanks for your mail. On 4/26/07, Roland Dreier wrote: > > > Mellanox VAPI used to have support for completion events with > > different APM states. However, with Openfabrics, i see that the > > support for MIGRATED -> ARMED event is not there (verbs.h include > > file). is there any specific reason for the same? > > I don't know what you're referring to. Of course a transition from > MIGRATED -> ARMED is not done through a work queue and hence doesn't > generate a completion. There is an affiliated async event defined, > IB_EVENT_PATH_MIG, that will be generated when a migration takes > place, but AFAIK the only thing that happens when arming APM is that > the modify QP operation succeeds. 
This event will be generated when the alternate path has successfully transitioned to the primary path. However, VAPI has an event which specifies the successful transition of MIGRATED -> ARMED (I know very well that it is done through modify_qp). But just the success of modify_qp does not explicitly tell the time at which the transition successfully occurred, does it? Specifically: VAPI_PATH_MIG_ARMED would make my day. I believe that VAPI_QP_PATH_MIGRATED is similar to IB_EVENT_PATH_MIG. Please correct me if I am wrong. Also, do you have a simple example showing the APM functionality? Thanks much, :- Abhinav What specific VAPI thing are you thinking of? > -- Abhinav Vishnu Graduate Student Computer Science and Engineering The Ohio State University From mshefty at ichips.intel.com Thu Apr 26 15:55:42 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Apr 2007 15:55:42 -0700 Subject: [ofa-general] APM Example In-Reply-To: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> Message-ID: <46312DEE.9070501@ichips.intel.com> > I was wondering whether there is an example of using APM with > OpenFabrics. I do not > see an example in the examples directory. With OFED 1.2 rc2 too, I > have not seen > such an example. I have a kernel ib_cm test program (cmpost) that I've used to test APM. 
It's available from: git://git.openfabrics.org/~shefty/rdma-dev.git test-apps - Sean From rdreier at cisco.com Thu Apr 26 15:58:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 15:58:33 -0700 Subject: [ofa-general] Re: [ewg] APM Example In-Reply-To: <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> (Abhinav Vishnu's message of "Thu, 26 Apr 2007 18:49:36 -0400") References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> Message-ID: Abhinav> However, VAPI has an event which specifies the successful Abhinav> transition of MIGRATED -> ARMED (I know very well that Abhinav> it is done through modify_qp). But just the success of Abhinav> modify_qp does not explicitly tell the time at which the Abhinav> transition successfully occurred, does it? You don't know the time that the transition occurred, except that it is between when you called modify QP and when it returned. But an asynchronous event doesn't really help, does it? All an event would tell you is that the transition occurred some time before the event was generated, which is some time before the event was delivered to you. Abhinav> Specifically: Abhinav> VAPI_PATH_MIG_ARMED would make my day. I believe that Abhinav> VAPI_QP_PATH_MIGRATED is similar to Abhinav> IB_EVENT_PATH_MIG. Please correct me if I am wrong. I see... VAPI_PATH_MIG_ARMED is a new event that was added only in VAPI 4.1.0, which was why I didn't know about it. Only Mellanox HCAs support it; it is not specified by the InfiniBand architecture, and I don't really see the point of it (as I tried to explain above). - R. From rick.jones2 at hp.com Thu Apr 26 18:21:54 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 26 Apr 2007 18:21:54 -0700 Subject: [ofa-general] why is CPU util/service demand so much higher with SDP than TCP?
Message-ID: <46315032.9060903@hp.com>

So, while playing around with my new netperf SDP_RR test I've noticed that a single-byte _RR test over SDP has a much higher transactions per second (ie lower latency) than over TCP over the same HCA, but the CPU utilization is _very_ much higher and the service demand (cpu per transaction) as well. CPU util being higher makes sense with a higher transaction rate, but not the increased service demand - well at least not to my experience thusfar.

[root at hpcpc106 ~]# for i in SDP_RR TCP_RR; do netperf -t $i -l 60 -c -C -H 192.168.0.107;done
SDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.107 (192.168.0.107) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

126976 126976 1       1      60.00   37868.61  28.02  27.65  29.598  29.210
126976 126976
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.107 (192.168.0.107) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

87380  87380  1       1      60.00   19281.49  3.40   3.90   7.049   8.089
87380  87380

The systems here are running RHEL5:

[root at hpcpc106 ~]# uname -a
Linux hpcpc106.cup.hp.com 2.6.18-8.el5 #1 SMP Fri Jan 26 14:16:09 EST 2007 ia64 ia64 ia64 GNU/Linux

and whatever bits come with that (this is not OFED 1.2 rc bits - I still don't know how to remove enough of what ships with RHEL5 to put all of OFED 1.2 (well, the modules I want) on there without conflict. I'm not sure how to check the versions - normally I'd use ethtool, but that doesn't work against an ibN device. Someone elsewhere suggested that the bits in RHEL5 might be OFED 1.1.
These systems have four real cores, and no HW threads enabled, so 25% CPU util means that the equivalent of an entire CPU core is being consumed. Before I start trying to hit the system with a profiler I thought I would ask if this was expected with SDP. Normally a single-instance, single-byte _RR test between otherwise identical systems consumes at most 50% of a core ( a bit handwaving, but that has been my experience thusfar) rick jones From weiny2 at llnl.gov Thu Apr 26 18:20:19 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 26 Apr 2007 18:20:19 -0700 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <20070426160825.GD15540@mellanox.co.il> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> <20070426160825.GD15540@mellanox.co.il> Message-ID: <20070426182019.311c8b84.weiny2@llnl.gov> On Thu, 26 Apr 2007 19:08:25 +0300 "Michael S. Tsirkin" wrote: > > Quoting Hal Rosenstock : > > Subject: Re: [RFC] IB management changes proposal > > > > On Thu, 2007-04-26 at 01:02, Michael S. Tsirkin wrote: > > > > > There also some few commands (ib*.pl) that are using a file > > > > > /tmp/ibnetdiscover.topology. I suggest /var/cache/ibnetdiscover.topology > > > > > > > > I'm not sure about this one. I need to think about this more. > > > > > > Not sure about the best placement, but surely a predictable name > > > in a world-writeable directory is a security risk? > > > > Is /var/cache world writeable ? I thought it was just world readable. If > > this were to be done, I would think the opensm directory underneath this > > would be more appropriate but I'm not leaning towards doing this since I > > think the current approach is more flexible and the topology can be > > supplied to all needed commands/scripts. > > I'm sorry, I'm not familiar with the code. 
> I was just saying that using /tmp/ibnetdiscover.topology is clearly > a security risk since /tmp is world-writeable. Isn't it? > However, I think the risk is pretty low. The scripts only use this information to report other information about the subnet. The only damage would be if an admin misinterpreted this information and did something bad to the net. Finally, once the file is created it should have an appropriate umask: 18:05:21 > ls -la /tmp/ibnetdiscover.topology -rw------- 1 root root 689670 Apr 24 19:44 /tmp/ibnetdiscover.topology Therefore from this time forward it can't be modified by users other than root. (Even a bad umask value set in /var/cache would result in the file being writable.) All that being said, generically I think Michael has a point and /var/cache is probably a better place put it. Frankly, I never intended the file to be supplied by an outside program (although when testing the scripts I did do this from time to time). /tmp Seemed like a good idea at the time. ;-) Ira From rdreier at cisco.com Thu Apr 26 19:47:04 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 19:47:04 -0700 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: <20070426182019.311c8b84.weiny2@llnl.gov> (Ira Weiny's message of "Thu, 26 Apr 2007 18:20:19 -0700") References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> <20070426160825.GD15540@mellanox.co.il> <20070426182019.311c8b84.weiny2@llnl.gov> Message-ID: > > I'm sorry, I'm not familiar with the code. > > I was just saying that using /tmp/ibnetdiscover.topology is clearly > > a security risk since /tmp is world-writeable. Isn't it? > > However, I think the risk is pretty low. The scripts only use this information > to report other information about the subnet. 
The only damage would be if an > admin misinterpreted this information and did something bad to the net. You're not being devious enough. Look up "symlink attack" to see one idea of something evil that an attacker could do. From weiny2 at llnl.gov Thu Apr 26 20:06:18 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 26 Apr 2007 20:06:18 -0700 Subject: [ofa-general] Re: [RFC] IB management changes proposal In-Reply-To: References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> <20070426160825.GD15540@mellanox.co.il> <20070426182019.311c8b84.weiny2@llnl.gov> Message-ID: <20070426200618.4d19be47.weiny2@llnl.gov> On Thu, 26 Apr 2007 19:47:04 -0700 Roland Dreier wrote: > > > I'm sorry, I'm not familiar with the code. > > > I was just saying that using /tmp/ibnetdiscover.topology is clearly > > > a security risk since /tmp is world-writeable. Isn't it? > > > > However, I think the risk is pretty low. The scripts only use this information > > to report other information about the subnet. The only damage would be if an > > admin misinterpreted this information and did something bad to the net. > > You're not being devious enough. Look up "symlink attack" to see one > idea of something evil that an attacker could do. 0:-) I sit corrected. Ira From rdreier at cisco.com Thu Apr 26 20:12:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Apr 2007 20:12:42 -0700 Subject: [ofa-general] Re: [PATCH 2.6.22 5/5] iw_cxgb3: Update required firmware revision to 4.0.0. In-Reply-To: <20070426202126.24234.71523.stgit@dell3.ogc.int> (Steve Wise's message of "Thu, 26 Apr 2007 15:21:26 -0500") References: <20070426202057.24234.56383.stgit@dell3.ogc.int> <20070426202126.24234.71523.stgit@dell3.ogc.int> Message-ID: > Update required firmware revision to 4.0.0. Hmm... 
should we fold this into the earlier patch, which actually needs this new FW? Or at least merge this patch first? Also, is it cool with everyone to require a new FW, even for users who might not be using (or even building) the RDMA driver? I'm not sure what a good solution would be really, so maybe the pain of forcing everyone to update FW is the least bad thing to do. - R. From weiny2 at llnl.gov Thu Apr 26 20:52:03 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 26 Apr 2007 20:52:03 -0700 Subject: [PATCH] Remove all uses of "/tmp" from perl diag (Was Re: [ofa-general] Re: [RFC] IB management changes proposal) In-Reply-To: References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> <20070426160825.GD15540@mellanox.co.il> <20070426182019.311c8b84.weiny2@llnl.gov> Message-ID: <20070426205203.6d90b759.weiny2@llnl.gov> On Thu, 26 Apr 2007 19:47:04 -0700 Roland Dreier wrote: > > > I'm sorry, I'm not familiar with the code. > > > I was just saying that using /tmp/ibnetdiscover.topology is clearly > > > a security risk since /tmp is world-writeable. Isn't it? > > > > However, I think the risk is pretty low. The scripts only use this information > > to report other information about the subnet. The only damage would be if an > > admin misinterpreted this information and did something bad to the net. > > You're not being devious enough. Look up "symlink attack" to see one > idea of something evil that an attacker could do. Ok, you scared me. ;-) How about the following patch? Would an autoconf option be better? Ira >From 4f3c4c69bf7920284ea9894246abc540b4d99cfb Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 26 Apr 2007 20:40:50 -0700 Subject: [PATCH] Remove all uses of "/tmp" from perl diags Remove all the uses of /tmp for cached application data. 
Replace with a global defined to /var/cache/infiniband-diags. Signed-off-by: Ira K. Weiny --- diags/scripts/IBswcountlimits.pm | 17 ++++++++++++++--- diags/scripts/ibfindnodesusing.pl | 4 ++-- diags/scripts/ibprintca.pl | 6 +++--- diags/scripts/ibprintswitch.pl | 6 +++--- diags/scripts/ibqueryerrors.pl | 4 ++-- diags/scripts/ibswportwatch.pl | 7 ++++--- 6 files changed, 28 insertions(+), 16 deletions(-) diff --git a/diags/scripts/IBswcountlimits.pm b/diags/scripts/IBswcountlimits.pm index e214f67..1c884e9 100755 --- a/diags/scripts/IBswcountlimits.pm +++ b/diags/scripts/IBswcountlimits.pm @@ -43,6 +43,7 @@ use strict; @IBswcountlimits::suppress_errors = (); $IBswcountlimits::link_ends = undef; $IBswcountlimits::pause_time = 10; +$IBswcountlimits::cache_dir = "/var/cache/infiniband-diags"; # all the PM counters @IBswcountlimits::counters = ( @@ -204,9 +205,19 @@ sub any_counts # ========================================================================= # +sub ensure_cache_dir +{ + if (!(-d "$IBswcountlimits::cache_dir")) { + mkdir $IBswcountlimits::cache_dir, 0700; + } +} + +# ========================================================================= +# sub generate_ibnetdiscover_topology { - `ibnetdiscover -g > /tmp/ibnetdiscover.topology`; + ensure_cache_dir; + `ibnetdiscover -g > $IBswcountlimits::cache_dir/ibnetdiscover.topology`; if ($? 
!= 0) { die "Execution of ibnetdiscover failed with errors\n"; } @@ -216,8 +227,8 @@ sub generate_ibnetdiscover_topology # sub get_link_ends { - if (!(-f "/tmp/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } - open IBNET_TOPO, ") diff --git a/diags/scripts/ibprintswitch.pl b/diags/scripts/ibprintswitch.pl index 2ce3bbe..5ab8f65 100755 --- a/diags/scripts/ibprintswitch.pl +++ b/diags/scripts/ibprintswitch.pl @@ -62,11 +62,11 @@ if (defined $Getopt::Std::opt_l) { $list my $target_switch = $ARGV[0]; -if ($regenerate_map || !(-f "/tmp/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } +if ($regenerate_map || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } if ($list_switches) { - system ("ibswitches /tmp/ibnetdiscover.topology"); + system ("ibswitches $IBswcountlimits::cache_dir/ibnetdiscover.topology"); exit 1; } @@ -80,7 +80,7 @@ if ($target_switch eq "") sub main { my $found_switch = undef; - open IBNET_TOPO, ") diff --git a/diags/scripts/ibqueryerrors.pl b/diags/scripts/ibqueryerrors.pl index e894eb8..9343fcf 100755 --- a/diags/scripts/ibqueryerrors.pl +++ b/diags/scripts/ibqueryerrors.pl @@ -113,7 +113,7 @@ sub get_counts my %switches = (); sub get_switches { - my $data = `ibswitches /tmp/ibnetdiscover.topology`; + my $data = `ibswitches $IBswcountlimits::cache_dir/ibnetdiscover.topology`; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) @@ -164,7 +164,7 @@ sub main my $msg = join(",", @IBswcountlimits::suppress_errors); print "Suppressing: $msg\n"; } - if ($regenerate_map || !(-f "/tmp/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } + if ($regenerate_map || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } get_switches; get_link_ends; foreach my $sw_addr (keys %switches) { diff --git a/diags/scripts/ibswportwatch.pl b/diags/scripts/ibswportwatch.pl index 
e844acb..e16d15e 100755 --- a/diags/scripts/ibswportwatch.pl +++ b/diags/scripts/ibswportwatch.pl @@ -111,13 +111,14 @@ sub get_new_counts my $addr = $_[0]; my $port = $_[1]; mv_counts; - if (system("perfquery $GUID $addr $port > /tmp/perfquery.out")) + ensure_cache_dir; + if (system("perfquery $GUID $addr $port > $IBswcountlimits::cache_dir/perfquery.out")) { print "perfquery failed : \"perfquery $GUID $addr $port\"\n"; - system("cat /tmp/perfquery.out"); + system("cat $IBswcountlimits::cache_dir/perfquery.out"); exit 1; } - open PERF_QUERY, ") { foreach my $count (@IBswcountlimits::counters) -- 1.4.4 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0001-Remove-all-uses-of-tmp-from-perl-diags.txt URL: From mst at dev.mellanox.co.il Thu Apr 26 21:23:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Apr 2007 07:23:28 +0300 Subject: [ofa-general] Re: why is CPU util/service demand so much higher with SDP than TCP? In-Reply-To: <46315032.9060903@hp.com> References: <46315032.9060903@hp.com> Message-ID: <20070427042328.GK15540@mellanox.co.il> > Quoting Rick Jones : > Subject: why is CPU util/service demand so much higher with SDP than TCP? > > So, while playing around with my new netperf SDP_RR test I've noticed that > a single-byte _RR test over SDP has a much higher transactions per second > (ie lower latency) than over TCP over the same HCA, but the CPU utilization > is _very_ much higher and the service demand (cpu per transaction) as well. > CPU util being higher makes sense with a higher transaction rate, but not > the increased service demand - well at least not to my experience thusfar. That's expected. SDP by default uses polling aggressively to trade off service demand for latency. You can play with recv_poll module parameter to tune that. 
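For reference, the service demand figures in the netperf output quoted in this thread can be reproduced from the reported CPU utilization and transaction rate. The following is only a minimal sketch, assuming that the utilization netperf prints is averaged over all four cores of these systems (as Rick's "25% CPU util means ... an entire CPU core" remark suggests); the function name and signature are hypothetical and this is not netperf's actual code.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of netperf): estimate service demand,
 * i.e. microseconds of CPU time consumed per transaction.
 *
 *   cpu_util_pct   system-wide CPU utilization in percent, assumed to be
 *                  averaged over all cores
 *   ncpus          number of cores the utilization is averaged over
 *   trans_per_sec  measured transaction rate
 */
double service_demand_us(double cpu_util_pct, int ncpus, double trans_per_sec)
{
    assert(ncpus > 0);
    assert(trans_per_sec > 0.0);

    /* CPU-seconds burned per wall-clock second across all cores. */
    double cpu_sec_per_sec = cpu_util_pct / 100.0 * ncpus;

    /* Divide by the transaction rate and convert seconds to microseconds. */
    return cpu_sec_per_sec / trans_per_sec * 1e6;
}
```

Under that assumption, service_demand_us(28.02, 4, 37868.61) comes out at about 29.6 us per transaction for SDP and service_demand_us(3.40, 4, 19281.49) at about 7.05 us for TCP, which agrees with the S.dem columns in the quoted output to within rounding. So SDP roughly doubles the transaction rate while costing about four times the CPU per transaction, which is the trade-off described above.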
-- MST From vlad at lists.openfabrics.org Fri Apr 27 02:37:39 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 27 Apr 2007 02:37:39 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070427-0200 daily build status Message-ID: <20070427093739.D8AAAE6082B@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 
with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Failed: From mst at dev.mellanox.co.il Fri Apr 27 03:04:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Apr 2007 13:04:41 +0300 Subject: [ofa-general] [PATCH] ipoib/cm: fix error handling when out of memory Message-ID: <20070427100441.GA23552@mellanox.co.il> If skb allocation fails when we start the device, we call ipoib_cm_dev_stop even though ipoib_cm_dev_open did not run to completion, so we pass an invalid pointer to ib_destroy_cm_id and get an oops. Fix by clearing cm.id on error, and testing it during cm_dev_stop. This fixes Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 2b242a4..10c105e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -594,7 +594,9 @@ int ipoib_cm_dev_open(struct net_device *dev) priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); - return IS_ERR(priv->cm.id); + ret = PTR_ERR(priv->cm.id); + priv->cm.id = NULL; + return ret; } ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), @@ -603,6 +605,7 @@ int ipoib_cm_dev_open(struct net_device *dev) printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, IPOIB_CM_IETF_ID | priv->qp->qp_num); ib_destroy_cm_id(priv->cm.id); + priv->cm.id = NULL; return ret; } return 0; @@ -614,10 +617,11 @@ void ipoib_cm_dev_stop(struct net_device *dev) struct ipoib_cm_rx *p; unsigned long flags; - if 
(!IPOIB_CM_SUPPORTED(dev->dev_addr)) + if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id) return; ib_destroy_cm_id(priv->cm.id); + priv->cm.id = NULL; spin_lock_irqsave(&priv->lock, flags); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); -- MST From p.kovacs at holografika.com Fri Apr 27 03:37:47 2007 From: p.kovacs at holografika.com (Kovacs Peter Tamas) Date: Fri, 27 Apr 2007 12:37:47 +0200 Subject: [ofa-general] OFED 1.2 RC2 <-> WinIB 1.3 Message-ID: <4631D27B.10301@holografika.com> Dear all, I've tried to do some speed tests between a Linux and a Windows box using InfiniBand. I've installed OFED 1.2 RC2 on a Fedora Core 6 x64 box, and connected it to a Windows XP x64 box with WinIB 1.3, both machines having a Mellanox MHES-14XTC. Unfortunately, ib_read_bw (which works perfectly between two Linux boxes) fails in this case. Here is what I got: On the Linux box: [pkovacs at localhost ~]$ ib_read_bw 1.2.3.6 ------------------------------------------------------------------ RDMA_Read BW Test Connection type : RC local address: LID 0x01, QPN 0x20405, PSN 0x62d39e RKey 0xc002700 VAddr 0x002aaaab517000 remote address: LID 0x400, QPN 0x5040000, PSN 0x0877, RKey 0x210000 VAddr 0x000000004a4390 Mtu : 2048 ------------------------------------------------------------------ #bytes #iterations BW peak[MB/sec] BW average[MB/sec] Completion wth error at client: Failed status 12: wr_id 1 syndrom 0x81 scnt=100, ccnt=0 [pkovacs at localhost ~]$ On the Windows box: C:\Program Files\Mellanox\WinIB\Tools>ib_read_bw.exe ------------------------------------------------------------------ RDMA_Read BW Test Connection type : RC max inline size 28 local address: LID 0x400, QPN 0x5040000, PSN 0x0877, RKey 0x210000 VAddr 0x000000004a4390 remote address: LID 0x01, QPN 0x20405, PSN 0x62d39e, RKey 0xc002700 VAddr 0x000000ab517000 Mtu : 2048 ------------------------------------------------------------------ #bytes #iterations BW
peak[MB/sec] BW average[MB/sec] pp_read_keys: No error Couldn't read remote address Unknown error C:\Program Files\Mellanox\WinIB\Tools> Do you know why does it fail? I suspect some version incompatibilities, but then, who can tell which OFED goes with which WinIB? Thanks for your kind help in advance, Peter From amitk at mellanox.co.il Fri Apr 27 04:16:10 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Fri, 27 Apr 2007 14:16:10 +0300 Subject: [ofa-general] OFED 1.2 Mellanox test report In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1E04FCD@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C1D4C8D8@mtiexch01.mti.com> <1E3DCD1C63492545881FACB6063A57C1E04FCD@mtiexch01.mti.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9015FD600@mtlexch01.mtl.com> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED1.2 RC2 Mellanox test report.xls Type: application/vnd.ms-excel Size: 86016 bytes Desc: OFED1.2 RC2 Mellanox test report.xls URL: From postmaster at persistent.co.in Fri Apr 27 04:54:35 2007 From: postmaster at persistent.co.in (postmaster at persistent.co.in) Date: Fri, 27 Apr 2007 17:24:35 +0530 (IST) Subject: [ofa-general] Auth Required Message-ID: <20070427115435.D992B528F60@bmapps.persistent.co.in> This is an automatically generated Delivery Status Notification. Delivery to the intended recipients failed. Diagnostic-Code: SMTP; 530 5.0.0 Authentication required -------------- next part -------------- An embedded message was scrubbed... From: Investor Milagros Subject: Daily News 1346333292956 Date: Fri, 27 Apr 2007 17:24:34 +0530 (IST) Size: 5850 URL: From jkabelitz at sysgen.de Fri Apr 27 05:42:43 2007 From: jkabelitz at sysgen.de (=?iso-8859-1?Q?J=FCrgen_Kabelitz?=) Date: Fri, 27 Apr 2007 14:42:43 +0200 Subject: [ofa-general] Problems with building OFED 1.1. 
on SuSE SLES10 Message-ID: <20070427125016.F26A4E60803@openfabrics.org> Hello, I have some problems with building the OFED 1.1 software on an Opteron system with SLES 10: uname -a output: Linux mserv0001 2.6.16.27-0.9-smp_lustre #2 SMP Fri Apr 27 11:49:24 CEST 2007 x86_64 x86_64 x86_64 GNU/Linux When I run the install.sh script with the inputs 2 and then 3, some minutes later I get the following output: ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.16.27-0.9-smp_lustre' --define 'KSRC /lib/modules/2.6.16.27-0.9-smp_lustre/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /usr/src/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.419.log The log file shows: /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs Running: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix
/usr/local/ofed --libdir /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" configure: creating cache /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu checking for style of include used by make... GNU checking for gcc... gcc checking for C compiler default output file name... configure: error: C compiler cannot create executables See `config.log' for more details. Failed to execute: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" error: Bad exit status from /var/tmp/rpm-tmp.51589 (%install) When I go into the directory and type in the configure line by hand, no errors occur and I can run the make command. Does anybody have a hint for me? J. Kabelitz sysGen GmbH Support and Engineering, Cluster Systems Am Hallacker 48 28327 Bremen Tel (0421) 40966 -28 Fax (0421) 40966 -66 mailto:jkabelitz at sysgen.de www.sysgen.de Managing Director: Gabriele Nikisch. Registered at Amtsgericht Walsrode, HRB 121943 From swise at opengridcomputing.com Fri Apr 27 06:03:56 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 27 Apr 2007 08:03:56 -0500 Subject: [ofa-general] Re: [PATCH 2.6.22 5/5] iw_cxgb3: Update required firmware revision to 4.0.0. In-Reply-To: References: <20070426202057.24234.56383.stgit@dell3.ogc.int> <20070426202126.24234.71523.stgit@dell3.ogc.int> Message-ID: <1177679036.10490.15.camel@stevo-desktop> On Thu, 2007-04-26 at 20:12 -0700, Roland Dreier wrote: > > Update required firmware revision to 4.0.0. > > Hmm...
should we fold this into the earlier patch, which actually > needs this new FW? Or at least merge this patch first? > I separated it only because cxgb3 is maintained by Jeff. Feel free to make it one commit. That is the proper way IMO. But I didn't know what SOP was for changes that hit different maintainers but are prerequisites of each other... > Also, is it cool with everyone to require a new FW, even for users who > might not be using (or even building) the RDMA driver? I'm not sure > what a good solution would be really, so maybe the pain of forcing > everyone to update FW is the least bad thing to do. > - R. I was asked to package the firmware version change along with my rdma changes by Divy since they didn't have any other cxgb3 changes right now. I believe Chelsio wants folks on this new firmware asap. Steve. From Koen.SEGERS at VRT.BE Fri Apr 27 06:17:19 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Fri, 27 Apr 2007 15:17:19 +0200 Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> Message-ID: Can you give more information on what this recv_poll module does? We are not interested in latency, but in throughput. Do we need to change this module for this setup? Regards, Koen ________________________________ Van: general-bounces at lists.openfabrics.org namens Michael S. Tsirkin Verzonden: vr 27/04/2007 6:23 Aan: Rick Jones CC: general at lists.openfabrics.org Onderwerp: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? > Quoting Rick Jones : > Subject: why is CPU util/service demand so much higher with SDP than TCP? 
> > So, while playing around with my new netperf SDP_RR test I've noticed that > a single-byte _RR test over SDP has a much higher transactions per second > (ie lower latency) than over TCP over the same HCA, but the CPU utilization > is _very_ much higher and the service demand (cpu per transaction) as well. > CPU util being higher makes sense with a higher transaction rate, but not > the increased service demand - well at least not to my experience thusfar. That's expected. SDP by default uses polling aggressively to trade off service demand for latency. You can play with recv_poll module parameter to tune that. -- MST _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Apr 27 06:28:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2007 09:28:45 -0400 Subject: [PATCH] Remove all uses of "/tmp" from perl diag (Was Re: [ofa-general] Re: [RFC] IB management changes proposal) In-Reply-To: <20070426205203.6d90b759.weiny2@llnl.gov> References: <1176224960.14140.474623.camel@localhost.localdomain> <462C7F17.3040707@cea.fr> <1177538202.12542.13079.camel@hal.voltaire.com> <20070426050230.GJ5217@mellanox.co.il> <1177598794.12542.76717.camel@hal.voltaire.com> <20070426160825.GD15540@mellanox.co.il> <20070426182019.311c8b84.weiny2@llnl.gov> <20070426205203.6d90b759.weiny2@llnl.gov> Message-ID: <1177680524.12542.159975.camel@hal.voltaire.com> On Thu, 2007-04-26 at 23:52, Ira Weiny wrote: > On Thu, 26 Apr 2007 19:47:04 -0700 > Roland Dreier wrote: > > > > > I'm sorry, I'm not familiar with the code. 
> > > > I was just saying that using /tmp/ibnetdiscover.topology is clearly > > > > a security risk since /tmp is world-writeable. Isn't it? > > > > > > However, I think the risk is pretty low. The scripts only use this information > > > to report other information about the subnet. The only damage would be if an > > > admin misinterpreted this information and did something bad to the net. > > > > You're not being devious enough. Look up "symlink attack" to see one > > idea of something evil that an attacker could do. > > Ok, you scared me. ;-) How about the following patch? Would an autoconf > option be better? > > Ira > > > > From 4f3c4c69bf7920284ea9894246abc540b4d99cfb Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Thu, 26 Apr 2007 20:40:50 -0700 > Subject: [PATCH] Remove all uses of "/tmp" from perl diags > > Remove all the uses of /tmp for cached application data. Replace with a > global defined to /var/cache/infiniband-diags. > > Signed-off-by: Ira K. Weiny Thanks. Applied (to both master and ofed_1_2). -- Hal From mst at dev.mellanox.co.il Fri Apr 27 06:32:24 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Apr 2007 16:32:24 +0300 Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? In-Reply-To: References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> Message-ID: <20070427133224.GL15540@mellanox.co.il> This will poll CQ before going to sleep. You can try tuning this module option if you like - but from what I saw the defaults are generally OK. Quoting SEGERS Koen : Subject: RE: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? Can you give more information on what this recv_poll module does? We are not interested in latency, but in throughput. Do we need to change this module for this setup? Regards, Koen
________________________________ From: general-bounces at lists.openfabrics.org on behalf of Michael S. Tsirkin Sent: Fri 27/04/2007 6:23 To: Rick Jones CC: general at lists.openfabrics.org Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? > Quoting Rick Jones : > Subject: why is CPU util/service demand so much higher with SDP than TCP? > > So, while playing around with my new netperf SDP_RR test I've noticed that > a single-byte _RR test over SDP has a much higher transactions per second > (i.e. lower latency) than over TCP over the same HCA, but the CPU utilization > is _very_ much higher and the service demand (cpu per transaction) as well. > CPU util being higher makes sense with a higher transaction rate, but not > the increased service demand - well at least not to my experience thus far. That's expected. SDP by default uses polling aggressively to trade off service demand for latency. You can play with the recv_poll module parameter to tune that. -- MST -- MST From mst at dev.mellanox.co.il Fri Apr 27 08:27:46 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Apr 2007 18:27:46 +0300 Subject: [ofa-general] [PATCH RFC-repost] ipoib/cm: use common cq for ipoib TX Message-ID: <20070427152746.GA1709@mellanox.co.il> Use common CQ for all TX QPs: keep a per-device counter of outstanding tx WRs, and stop the interface when this counter reaches the send queue size, to avoid CQ overruns. This should help reduce the number of interrupts for bi-directional traffic (such as TCP). Signed-off-by: Michael S. Tsirkin --- > > Quoting Michael S.
Tsirkin : > > Subject: [PATCH RFC] use common cq for ipoib cm send side > > > > The following untested patch moves all TX processing in IPoIB CM to common CQ. > > This should help reduce the number of interrupts for bi-directional traffic > > (such as TCP). Is this a good idea? What do others think? > > > > Signed-off-by: Michael S. Tsirkin > > FYI, this was just thinking aloud. The version below works fine here but the > performance gain seems to be very small (about 1%). The gain with NAPI might > be bigger but this is yet to be tested. I'll continue looking into this. > > Feedback welcome. > > ipoib.h | 10 +++++-- > ipoib_cm.c | 78 +++++++++++++++-------------------------------------------- > ipoib_ib.c | 28 ++++++++++++--------- > ipoib_main.c | 2 - > 4 files changed, 45 insertions(+), 73 deletions(-) I am reposting this patch because I think it will be needed on top of the NAPI patch, to help fix "driver is hogging interrupts" errors reported for the IPoIB send side. See https://bugs.openfabrics.org/show_bug.cgi?id=508 diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index eb885ee..ef703c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -99,9 +99,9 @@ enum { #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_OP_CM (1ul << 30) #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_OP_CM (0) #endif /* structs */ @@ -144,7 +144,6 @@ struct ipoib_cm_rx { struct ipoib_cm_tx { struct ib_cm_id *id; - struct ib_cq *cq; struct ib_qp *qp; struct list_head list; struct net_device *dev; @@ -233,6 +232,7 @@ struct ipoib_dev_priv { unsigned tx_tail; struct ib_sge tx_sge; struct ib_send_wr tx_wr; + unsigned tx_outstanding; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -439,6 +439,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, unsigned int mtu); void
ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc); +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc); #else struct ipoib_cm_tx; @@ -527,6 +528,9 @@ static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *w { } +static inline void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) +{ +} #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..af36562 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -85,7 +85,7 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; @@ -346,7 +346,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + unsigned int wr_id = wc->wr_id & ~(IPOIB_OP_CM | IPOIB_OP_RECV); struct sk_buff *skb; struct ipoib_cm_rx *p; unsigned long flags; @@ -433,7 +433,7 @@ static inline int post_send(struct ipoib_dev_priv *priv, priv->tx_sge.addr = addr; priv->tx_sge.length = len; - priv->tx_wr.wr_id = wr_id; + priv->tx_wr.wr_id = wr_id | IPOIB_OP_CM; return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); } @@ -484,20 +484,19 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ dev->trans_start = jiffies; ++tx->tx_head; - if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) { + if (++priv->tx_outstanding == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n", tx->qp->qp_num); netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, 
&tx->flags); } } } -static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx, - struct ib_wc *wc) +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id; + struct ipoib_cm_tx *tx = wc->qp->qp_context; + unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM; struct ipoib_tx_buf *tx_req; unsigned long flags; @@ -522,11 +521,10 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx spin_lock_irqsave(&priv->tx_lock, flags); ++tx->tx_tail; - if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags)) && - tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1) { - clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags); + if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) && + netif_queue_stopped(dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) netif_wake_queue(dev); - } if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR) { @@ -549,11 +547,6 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx tx->neigh = NULL; } - /* queue would be re-started anyway when TX is destroyed, - * but it makes sense to do it ASAP here. 
*/ - if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags)) - netif_wake_queue(dev); - if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { list_move(&tx->list, &priv->cm.reap_list); queue_work(ipoib_workqueue, &priv->cm.reap_task); @@ -567,19 +560,6 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx spin_unlock_irqrestore(&priv->tx_lock, flags); } -static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) -{ - struct ipoib_cm_tx *tx = tx_ptr; - int n, i; - - ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); - do { - n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc); - for (i = 0; i < n; ++i) - ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i); - } while (n == IPOIB_NUM_WC); -} - int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -699,17 +679,18 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even return 0; } -static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq) +static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_cm_tx *tx) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = {}; attr.recv_cq = priv->cq; + attr.send_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; attr.cap.max_send_sge = 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; - attr.send_cq = cq; + attr.qp_context = tx; return ib_create_qp(priv->pd, &attr); } @@ -789,21 +770,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, goto err_tx; } - p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, - ipoib_sendq_size + 1); - if (IS_ERR(p->cq)) { - ret = PTR_ERR(p->cq); - ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret); - goto err_cq; - } - - ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP); - if (ret) { - ipoib_warn(priv, "failed to request completion notification: %d\n", ret); - goto err_req_notify; - } - - p->qp = 
ipoib_cm_create_tx_qp(p->dev, p->cq); + p->qp = ipoib_cm_create_tx_qp(p->dev, p); if (IS_ERR(p->qp)) { ret = PTR_ERR(p->qp); ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret); @@ -840,12 +807,8 @@ err_modify: err_id: p->id = NULL; ib_destroy_qp(p->qp); -err_req_notify: err_qp: p->qp = NULL; - ib_destroy_cq(p->cq); -err_cq: - p->cq = NULL; err_tx: return ret; } @@ -854,6 +817,7 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) { struct ipoib_dev_priv *priv = netdev_priv(p->dev); struct ipoib_tx_buf *tx_req; + unsigned long flags; ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n", p->qp ? p->qp->qp_num : 0, p->tx_head, p->tx_tail); @@ -864,12 +828,6 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) if (p->qp) ib_destroy_qp(p->qp); - if (p->cq) - ib_destroy_cq(p->cq); - - if (test_bit(IPOIB_FLAG_NETIF_STOPPED, &p->flags)) - netif_wake_queue(p->dev); - if (p->tx_ring) { while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; @@ -877,6 +835,12 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; + spin_lock_irqsave(&priv->tx_lock, flags); + if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) && + netif_queue_stopped(p->dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + netif_wake_queue(p->dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); } kfree(p->tx_ring); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f2aa923..19a3d3e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -266,11 +266,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; - if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { - clear_bit(IPOIB_FLAG_NETIF_STOPPED, 
&priv->flags); + if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) && + netif_queue_stopped(dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) netif_wake_queue(dev); - } spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && @@ -282,12 +281,17 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc) { - if (wc->wr_id & IPOIB_CM_OP_SRQ) - ipoib_cm_handle_rx_wc(dev, wc); - else if (wc->wr_id & IPOIB_OP_RECV) - ipoib_ib_handle_rx_wc(dev, wc); - else - ipoib_ib_handle_tx_wc(dev, wc); + if (wc->wr_id & IPOIB_OP_CM) { + if (wc->wr_id & IPOIB_OP_RECV) + ipoib_cm_handle_rx_wc(dev, wc); + else + ipoib_cm_handle_tx_wc(dev, wc); + } else { + if (wc->wr_id & IPOIB_OP_RECV) + ipoib_ib_handle_rx_wc(dev, wc); + else + ipoib_ib_handle_tx_wc(dev, wc); + } } void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) @@ -370,10 +374,9 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, address->last_send = priv->tx_head; ++priv->tx_head; - if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { + if (++priv->tx_outstanding == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); } } } @@ -549,6 +552,7 @@ int ipoib_ib_dev_stop(struct net_device *dev) DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; + --priv->tx_outstanding; } for (i = 0; i < ipoib_recvq_size; ++i) { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 19e82db..7c7b136 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -900,7 +900,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) goto out_rx_ring_cleanup; } - /* priv->tx_head & tx_tail are already 0 */ + /* priv->tx_head, tx_tail & tx_outstanding are already 0 */ if 
(ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; -- MST From mst at dev.mellanox.co.il Fri Apr 27 08:30:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Apr 2007 18:30:16 +0300 Subject: [ofa-general] Re: What's in infiniband.git for 2.6.22 In-Reply-To: References: Message-ID: <20070427153016.GB1709@mellanox.co.il> > Quoting Roland Dreier : > Subject: What's in infiniband.git for 2.6.22 > > Here's a short summary of what my plans for 2.6.22 are. For > reference, everything is in my git tree: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git What about the mthca patch to use separate HW queues for kernel RC/UD/userspace RC? -- MST From suri at baymicrosystems.com Fri Apr 27 08:31:05 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 27 Apr 2007 11:31:05 -0400 Subject: [ofa-general] Problems with building OFED 1.1. on SuSE SLES10 In-Reply-To: <20070427125016.F26A4E60803@openfabrics.org> References: <20070427125016.F26A4E60803@openfabrics.org> Message-ID: <02e401c788e1$157f1e60$1914a8c0@surioffice> This is the same type of issue that I am battling with RHEL5 (while installing OFED1.2 though). When the error says “configure: error: C compiler cannot create executables”, all it means is that it failed to compile the test program (conftest.c or something like that). Please go into config.log and look for the error. It will tell you what options it was using to compile the test program. In all probability if you write a small test program and try to compile with the same options as in the config.log it will fail. By process of elimination you can figure out which option is giving you the problem, and once you do that someone on the list might be able to help you. In my case it was the -m32 option which caused grief and I had to install the 32bit gcc-devel to get around the issue.
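The elimination approach described above can be tried by hand with a stand-in for the configure test program. The following is a minimal sketch, not the conftest.c that autoconf actually generates: the file name probe.c and the functions target_pointer_bits() and abi_matches() are made up for illustration. It reports the pointer width of the ABI the compiler really targeted, which may help tell a working -m32 build from one that silently stayed 64-bit.

```c
/* probe.c - hypothetical stand-in for configure's test program.
 * The names below are illustrative only; they do not come from autoconf. */
#include <limits.h>
#include <stddef.h>

/* Width in bits of a pointer in the ABI this file was compiled for.
 * With gcc on x86_64 this is expected to be 32 under -m32 and 64 under -m64. */
unsigned int target_pointer_bits(void)
{
    return (unsigned int)(sizeof(void *) * CHAR_BIT);
}

/* Returns 1 if the compiled ABI has the wanted pointer width, else 0. */
int abi_matches(unsigned int wanted_bits)
{
    return target_pointer_bits() == wanted_bits;
}
```

One possible way to use it, assuming gcc: add a small main() that calls abi_matches(32), build it with the options copied from config.log (for example gcc -m32 -g -O2 probe.c), then drop the options one at a time until the compile and link succeed. If even a trivial program fails to link with -m32, missing 32-bit development libraries are a likely cause.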
Sorry for the long answer, and hopefully your issue is simpler. Thanks, Suri _____ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jürgen Kabelitz Sent: Friday, April 27, 2007 8:43 AM To: general at lists.openfabrics.org Subject: [ofa-general] Problems with building OFED 1.1. on SuSE SLES10 Hello, I have some problems with building the OFED 1.1 software on an Opteron system with SLES 10: uname -a output: Linux mserv0001 2.6.16.27-0.9-smp_lustre #2 SMP Fri Apr 27 11:49:24 CEST 2007 x86_64 x86_64 x86_64 GNU/Linux When I ran the install.sh script with the inputs 2 and then 3, some minutes later I got the following output: ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.16.27-0.9-smp_lustre' --define 'KSRC /lib/modules/2.6.16.27-0.9-smp_lustre/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /usr/src/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.419.log The log file shows: /bin/rm -f
/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs Running: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib CPPFL AGS="-I../libibverbs/include" configure: creating cache /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu checking for style of include used by make... GNU checking for gcc... gcc checking for C compiler default output file name... configure: error: C compiler cannot create executables See `config.log' for more details. Failed to execute: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed /lib CPPFLAGS="-I../libibverbs/include" error: Bad exit status from /var/tmp/rpm-tmp.51589 (%install) When I go into the directory and the configure line typed in, no errors occurred and I can give the make command. Has anybody a hint for me? J. Kabelitz sysGen GmbH Support und Technik Clustersysteme Am Hallacker 48 28327 Bremen Tel (0421) 40966 -28 Fax (0421) 40966 -66 mailto:jkabelitz at sysgen.de www.sysgen.de Geschäftsführerin Gabriele Nikisch Eingetragen beim Amtsgericht Walsrode HRB 121943
From sweitzen at cisco.com Fri Apr 27 09:26:18 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 27 Apr 2007 09:26:18 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <4630D776.5060704@hp.com> References: <462E8257.9090103@hp.com> <462F8E07.4050000@hp.com><462FC6C6.9050700@hp.com> <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> Message-ID: > > Please note that you should *only* ever stick the SDP family value > > in the socket(3) call. All addresses for connect, bind etc > > are AF_INET, since SDP uses IP addresses for everything. > > Sounds like something trying to be just a little bit pregnant. > > Thankfully, I'm only munging the getaddrinfo() data for the > local endpoint. See bug https://bugs.openfabrics.org//show_bug.cgi?id=294, I agree connect() and bind() should allow AF_INET_SDP. About the "direct" SDP tests, instead of copy/pasting the TCP code, how about if you just had a command-line argument that specified SDP, like you do with neterver -6 to specify IPv6 instead of IPv4? Speaking of IPv6, does netperf work with IPv6 on Linux? Scott From divy at chelsio.com Fri Apr 27 09:52:55 2007 From: divy at chelsio.com (Divy Le Ray) Date: Fri, 27 Apr 2007 09:52:55 -0700 Subject: [ofa-general] Re: [PATCH 2.6.22 5/5] iw_cxgb3: Update required firmware revision to 4.0.0. In-Reply-To: References: <20070426202057.24234.56383.stgit@dell3.ogc.int> <20070426202126.24234.71523.stgit@dell3.ogc.int> Message-ID: <46322A67.5070608@chelsio.com> Roland Dreier wrote: > > Update required firmware revision to 4.0.0. > > Hmm... should we fold this into the earlier patch, which actually > needs this new FW? Or at least merge this patch first?
> > Also, is it cool with everyone to require a new FW, even for users who > might not be using (or even building) the RDMA driver? I'm not sure > what a good solution would be really, so maybe the pain of forcing > everyone to update FW is the least bad thing to do. > > Hi Roland, The FW update required code changes in the RDMA driver, so Steve took care of submitting the update patch. The new FW is required for the NIC driver too. Cheers, Divy From suri at baymicrosystems.com Fri Apr 27 10:04:55 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 27 Apr 2007 13:04:55 -0400 Subject: [ofa-general] RE: error installing ofed_1.2-rc2 on RHEL5 In-Reply-To: <029901c7884c$9c14cf50$1914a8c0@surioffice> References: <20070419125722.GD918@mellanox.co.il><20070419203705.GA613@mellanox.co.il><6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com><6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com><20070426133442.GJ32513@mellanox.co.il><6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com><20070426134331.GL32513@mellanox.co.il><6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com><20070426152613.GA15540@mellanox.co.il> <029901c7884c$9c14cf50$1914a8c0@surioffice> Message-ID: <02ef01c788ee$31aca320$1914a8c0@surioffice> So, after installing the 32bit version of gcc-devel as per Doug's suggestion I managed to install OFED-1.2 on RHEL5. Now the bigger question is, can we get release notes on OFED_1.2 to say it requires 32 bit version of gcc-devel, as RHEL5 for X86_64 does not come with it? 
Thanks Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Suresh Shelvapille > Sent: Thursday, April 26, 2007 5:48 PM > To: general at lists.openfabrics.org; 'Doug Ledford'; vlad at dev.mellanox.co.il; > etta at systemfabricworks.com > Subject: [ofa-general] RE: error installing ofed_1.2-rc2 on RHEL5 > > After some digging around on the net I figured out that the "error" actually meant > no C program would compile when supplied with the same arguments as found on > the error line in "config.log". > > In my case it happens to be: > gcc -m32 -g -O2 -L/usr/lib -I../libibverbs/include -L. conftest.c >&5 > > the -m32 instead of -m64 seems to be the culprit! > > Looking at the configure file and running /usr/bin/file on the xxx.o seems to yield > 64_bit though.... > > Any ideas..... From ralphc at pathscale.com Fri Apr 27 11:08:40 2007 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 27 Apr 2007 11:08:40 -0700 Subject: [ofa-general] [PATCH] IB/ipath - Don't put QP in timeout queue if waiting to send Message-ID: <1177697321.3407.10.camel@brick.pathscale.com> This patch fixes a problem which causes too many RC timeouts and retransmits. Hopefully, this can be included in OFED 1.2 as well as going upstream. Signed-off-by: Ralph Campbell diff -r 9a0d437dd1a1 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Wed Apr 18 17:46:23 2007 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Wed Apr 18 17:47:12 2007 -0700 @@ -228,18 +228,13 @@ int ipath_make_rc_req(struct ipath_qp *q goto done; if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || - qp->s_rnr_timeout) + qp->s_rnr_timeout || qp->s_wait_credit) goto bail; /* Limit the number of packets sent without an ACK.
*/ if (ipath_cmp24(qp->s_psn, qp->s_last_psn + IPATH_PSN_CREDIT) > 0) { qp->s_wait_credit = 1; dev->n_rc_stalls++; - spin_lock(&dev->pending_lock); - if (list_empty(&qp->timerwait)) - list_add_tail(&qp->timerwait, - &dev->pending[dev->pending_index]); - spin_unlock(&dev->pending_lock); goto bail; } diff -r 9a0d437dd1a1 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Apr 18 17:46:23 2007 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Apr 18 17:47:12 2007 -0700 @@ -423,7 +423,7 @@ struct ipath_qp { #define IPATH_S_RDMAR_PENDING 0x04 #define IPATH_S_ACK_PENDING 0x08 -#define IPATH_PSN_CREDIT 2048 +#define IPATH_PSN_CREDIT 512 /* * Since struct ipath_swqe is not a fixed size, we can't simply index into From ralph.campbell at qlogic.com Fri Apr 27 11:11:11 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 27 Apr 2007 11:11:11 -0700 Subject: [ofa-general] [PATCH] IB/ipath - Don't call spin_lock_irq() from interrupt context Message-ID: <1177697471.3407.14.camel@brick.pathscale.com> This patch fixes the problem reported by Bernd Schubert with kernel debug options enabled. BUG: at kernel/lockdep.c:1860 trace_hardirqs_on() Hopefully, this can be included in OFED 1.2 as well as going upstream. Signed-off-by: Ralph Campbell diff -r 97262e873c51 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Fri Apr 20 14:39:31 2007 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Fri Apr 20 14:45:03 2007 -0700 @@ -582,6 +582,7 @@ static void send_rc_ack(struct ipath_qp u32 hwords; struct ipath_ib_header hdr; struct ipath_other_headers *ohdr; + unsigned long flags; /* Don't send ACK or NAK if a RDMA read or atomic is pending. 
*/ if (qp->r_head_ack_queue != qp->s_tail_ack_queue || @@ -637,11 +638,11 @@ static void send_rc_ack(struct ipath_qp dev->n_rc_qacks++; queue_ack: - spin_lock_irq(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); qp->s_flags |= IPATH_S_ACK_PENDING; qp->s_nak_state = qp->r_nak_state; qp->s_ack_psn = qp->r_ack_psn; - spin_unlock_irq(&qp->s_lock); + spin_unlock_irqrestore(&qp->s_lock, flags); /* Call ipath_do_rc_send() in another thread. */ tasklet_hi_schedule(&qp->s_task); @@ -1292,6 +1293,7 @@ static inline int ipath_rc_rcv_error(str struct ipath_ack_entry *e; u8 i, prev; int old_req; + unsigned long flags; if (diff > 0) { /* @@ -1325,7 +1327,7 @@ static inline int ipath_rc_rcv_error(str psn &= IPATH_PSN_MASK; e = NULL; old_req = 1; - spin_lock_irq(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); for (i = qp->r_head_ack_queue; ; i = prev) { if (i == qp->s_tail_ack_queue) old_req = 0; @@ -1423,7 +1425,7 @@ static inline int ipath_rc_rcv_error(str * after all the previous RDMA reads and atomics. 
*/ if (i == qp->r_head_ack_queue) { - spin_unlock_irq(&qp->s_lock); + spin_unlock_irqrestore(&qp->s_lock, flags); qp->r_nak_state = 0; qp->r_ack_psn = qp->r_psn - 1; goto send_ack; @@ -1440,7 +1442,7 @@ static inline int ipath_rc_rcv_error(str tasklet_hi_schedule(&qp->s_task); unlock_done: - spin_unlock_irq(&qp->s_lock); + spin_unlock_irqrestore(&qp->s_lock, flags); done: return 1; @@ -1450,10 +1452,12 @@ send_ack: static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { - spin_lock_irq(&qp->s_lock); + unsigned long flags; + + spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; ipath_error_qp(qp, err); - spin_unlock_irq(&qp->s_lock); + spin_unlock_irqrestore(&qp->s_lock, flags); } /** From sean.hefty at intel.com Fri Apr 27 11:15:31 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 27 Apr 2007 11:15:31 -0700 Subject: [ofa-general] [PULL] ofed_1_2: branches for libibcm and librdmacm Message-ID: <000101c788f8$0aca5e50$8698070a@amr.corp.intel.com> Please pull: git://git.openfabrics.org/~shefty/libibcm.git ofed_1_2 and git://git.openfabrics.org/~shefty/librdmacm.git ofed_1_2 into OFED 1.2. This will pick up: * librdmacm: set source port after calling rdma_bind_addr. 
* rping: Transfer rkey/addr/len information in network byte order
* librdmacm: update datagram tests to abort if msg size > MTU
* libibcm: release 1.0
* librdmacm: release 1.0

- Sean

From lawver1 at llnl.gov Fri Apr 27 12:05:14 2007
From: lawver1 at llnl.gov (Bryan Lawver)
Date: Fri, 27 Apr 2007 12:05:14 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <20070426180618.GJ15540@mellanox.co.il>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il>
Message-ID: <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>

You're right about the ipoib module not combining packets (I believed you without checking) but I checked nevertheless. The ipoib_start_xmit routine is definitely handed a "double packet" which means that the IP NIC driver or the kernel is combining two packets into a single super jumbo packet. This issue is irrespective of the IP MTU setting because I have set all interfaces to 9000 yet ipoib accepts and forwards this 17964 packet to the next IB node and onto the TCP stack where it is never acknowledged. This may not have come up in prior testing because I am using some of the fastest IP NICs which have no trouble keeping up with or exceeding the bandwidth of the IB side. This issue arises exactly every 8 packets...(ring buffer overrun??)

I will be at Sonoma for the next few days as many on this list will be.

bryan

At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote:
> > Quoting Bryan Lawver :
> > Subject: Re: IPoIB forwarding
> >
> > Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it appears
> > that two payloads are queued at ipoib which combines them into a single
> > 17920 payload with assumingly correct IP header (40) and IB header
> > (4).
The application or TCP stack does not acknowledge this double packet > > ie. it does not ACK until each of the 8960 packets are resent > > individually. Being an IB newbie, I am guessing this combining is > > allowable but may violate TCP protocol. > >IPoIB does nothing like this - it's just a network device so >it sends all packets out as is. > >-- >MST From rick.jones2 at hp.com Fri Apr 27 12:36:26 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 12:36:26 -0700 Subject: [ofa-general] Re: why is CPU util/service demand so much higher with SDP than TCP? In-Reply-To: <20070427042328.GK15540@mellanox.co.il> References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> Message-ID: <463250BA.9060606@hp.com> Michael S. Tsirkin wrote: >>Quoting Rick Jones : >>Subject: why is CPU util/service demand so much higher with SDP than TCP? >> >>So, while playing around with my new netperf SDP_RR test I've noticed that >>a single-byte _RR test over SDP has a much higher transactions per second >>(ie lower latency) than over TCP over the same HCA, but the CPU utilization >>is _very_ much higher and the service demand (cpu per transaction) as well. >>CPU util being higher makes sense with a higher transaction rate, but not >>the increased service demand - well at least not to my experience thusfar. > > > That's expected. > SDP by default uses polling aggressively to trade off service demand for latency. > You can play with recv_poll module parameter to tune that. Ah, so it is doing a sit and spin waiting for traffic. I'll see about the recv_poll module parm - is it a binary or other? rick jones From sweitzen at cisco.com Fri Apr 27 12:38:22 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 27 Apr 2007 12:38:22 -0700 Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? 
In-Reply-To: <463250BA.9060606@hp.com> References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> <463250BA.9060606@hp.com> Message-ID: # modinfo ib_sdp filename: /lib/modules/2.6.16.21-0.8-smp/updates/kernel/drivers/infiniband /ulp/sdp/ib_sdp.ko author: Michael S. Tsirkin description: InfiniBand SDP module license: Dual BSD/GPL vermagic: 2.6.16.21-0.8-smp SMP gcc-4.1 depends: ib_core,rdma_cm srcversion: 91793E4825DEBC7A2DA9366 parm: top_mem_usage:Top system wide sdp memory usage for recv (in MB). (int) parm: rcvbuf_scale:Receive buffer size scale factor. (int) parm: send_poll_thresh:Send message size thresh hold over which to sta rt polling. (int) parm: recv_poll:How many times to poll recv. (int) parm: send_poll:How many times to poll send. (int) parm: recv_poll_miss:How many times recv poll missed. (int) parm: recv_poll_hit:How many times recv poll helped. (int) parm: send_poll_miss:How many times send poll missed. (int) parm: send_poll_hit:How many times send poll helped. (int) parm: data_debug_level:Enable data path debug tracing if > 0. (int) parm: debug_level:Enable debug tracing if > 0. (int) Scott > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones > Sent: Friday, April 27, 2007 12:36 PM > To: Michael S. Tsirkin > Cc: general at lists.openfabrics.org > Subject: [ofa-general] Re: why is CPU util/service demand so > much higher withSDP than TCP? > > Michael S. Tsirkin wrote: > >>Quoting Rick Jones : > >>Subject: why is CPU util/service demand so much higher with > SDP than TCP? > >> > >>So, while playing around with my new netperf SDP_RR test > I've noticed that > >>a single-byte _RR test over SDP has a much higher > transactions per second > >>(ie lower latency) than over TCP over the same HCA, but the > CPU utilization > >>is _very_ much higher and the service demand (cpu per > transaction) as well. 
> >>CPU util being higher makes sense with a higher transaction > rate, but not > >>the increased service demand - well at least not to my > experience thusfar. > > > > > > That's expected. > > SDP by default uses polling aggressively to trade off > service demand for latency. > > You can play with recv_poll module parameter to tune that. > > Ah, so it is doing a sit and spin waiting for traffic. I'll > see about the > recv_poll module parm - is it a binary or other? > > rick jones > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rick.jones2 at hp.com Fri Apr 27 13:21:12 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 13:21:12 -0700 Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? In-Reply-To: References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> <463250BA.9060606@hp.com> Message-ID: <46325B38.9010209@hp.com> Scott Weitzenkamp (sweitzen) wrote: > # modinfo ib_sdp > filename: > /lib/modules/2.6.16.21-0.8-smp/updates/kernel/drivers/infiniband > /ulp/sdp/ib_sdp.ko > author: Michael S. Tsirkin > description: InfiniBand SDP module > license: Dual BSD/GPL > vermagic: 2.6.16.21-0.8-smp SMP gcc-4.1 > depends: ib_core,rdma_cm > srcversion: 91793E4825DEBC7A2DA9366 > parm: top_mem_usage:Top system wide sdp memory usage for recv > (in MB). > (int) > parm: rcvbuf_scale:Receive buffer size scale factor. (int) > parm: send_poll_thresh:Send message size thresh hold over > which to sta > rt polling. (int) > parm: recv_poll:How many times to poll recv. (int) > parm: send_poll:How many times to poll send. (int) > parm: recv_poll_miss:How many times recv poll missed. (int) > parm: recv_poll_hit:How many times recv poll helped. 
(int)
> parm: send_poll_miss:How many times send poll missed. (int)
> parm: send_poll_hit:How many times send poll helped. (int)
> parm: data_debug_level:Enable data path debug tracing if > 0.
> (int)
> parm: debug_level:Enable debug tracing if > 0. (int)
>

I've learned a new command today :) And via other channels how to actually see the current values of those things under /sys/module/ib_sdp/parameters.

I think I may have also figured out why my TCP_RR tests over IPoIB have been consistently unable to hit confidence intervals for CPU utilization. I am guessing that for IPoIB that path is effectively "recv_poll=0 and/or send_poll=0" yes? When I set those values for ib_sdp and run the SDP_RR test netperf is unable to be confident that the CPU util it measures is any closer than +/- 20% of the "real" CPU utilization... (still the OFED bits in RHEL5 rather than 1.2)

With the defaults, the netperf SDP_RR test hit the confidence intervals.

rick jones

From rick.jones2 at hp.com Fri Apr 27 13:26:39 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 27 Apr 2007 13:26:39 -0700
Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf
In-Reply-To:
References: <462E8257.9090103@hp.com> <462F8E07.4050000@hp.com><462FC6C6.9050700@hp.com> <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com>
Message-ID: <46325C7F.7040901@hp.com>

Scott Weitzenkamp (sweitzen) wrote:
>>>Please note that you should *only* ever stick the SDP family value
>>>in the socket(3) call. All addresses for connect, bind etc
>>>are AF_INET, since SDP uses IP addresses for everything.
>>
>>Sounds like something trying to be just a little bit pregnant.
>>
>>Thankfully, I'm only munging the getaddrinfo() data for the
>>local endpoint.
>
>
> See bug https://bugs.openfabrics.org//show_bug.cgi?id=294, I agree
> connect() and bind() should allow AF_INET_SDP.
I was poking around - it would be nice if they could take AF_INET_SDP - I have to wonder if IPPROTO_SDP is actually better, but seeing there has been some discussion there (but not having read all of it) I'm just going to go with the flow... > About the "direct" SDP tests, instead of copy/pasting the TCP code, how > about if you just had a command-line argument that specified SDP, like > you do with neterver -6 to specify IPv6 instead of IPv4? Well, that could then require I start adding some backflips in "common" code such as where I call getaddrinfo(). Besides, I've already finished the first set of cut and paste :) > Speaking of IPv6, does netperf work with IPv6 on Linux? Yes, although "Linux" seems to have some issue with link-scope addresses. rick jones > > Scott From rick.jones2 at hp.com Fri Apr 27 13:32:51 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 13:32:51 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> Message-ID: <46325DF3.2050203@hp.com> Bryan Lawver wrote: > Your right about the ipoib module not combining packets (I believed you > without checking) but I did never the less. The ipoib_start_xmit > routine is definitely handed a "double packet" which means that the IP > NIC driver or the kernel is combining two packets into a single super > jumbo packet. This issue is irrespective of the IP MTU setting because > I have set all interfaces to 9000k yet ipoib accepts and forwards this > 17964 packet to the next IB node and onto the TCP stack where it is > never acknowledged. 
This may not have come up in prior testing because > I am using some of the fastest IP NICs which have no trouble keeping up > with or exceeding the bandwidth of the IB side. This issue arises > exactly every 8 packets...(ring buffer overrun??) > > I will be at Sonoma for the next few days as many on this list will be. Some NICs (esp 10G) support large receive offload - they coalesce TCP segments from the wire/fiber into larger ones they pass up the stack. Perhaps that is happening here? I'm going to go out a bit on a limb, cross the streams, and include netdev, because I suspect that if a system is acting as an IP router, one doesn't want large receive offload enabled. That may need some discussion in netdev - it may then require some changes to default settings or some documentation enhancements. That or I'll learn that the stack is already dealing with the issue... rick jones > bryan > > > > At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote: > >> > Quoting Bryan Lawver : >> > Subject: Re: IPoIB forwarding >> > >> > Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it >> appears >> > that two payloads are queued at ipoib which combines them into a single >> > 17920 payload with assumingly correct IP header (40) and IB header >> > (4). The application or TCP stack does not acknowledge this double >> packet >> > ie. it does not ACK until each of the 8960 packets are resent >> > individually. Being an IB newbie, I am guessing this combining is >> > allowable but may violate TCP protocol. >> >> IPoIB does nothing like this - it's just a network device so >> it sends all packets out as is. 
>> >> -- >> MST > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Apr 27 14:21:06 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2007 17:21:06 -0400 Subject: [ofa-general] [PATCH] OpenSM: Rename ib_inform_info_get_node_type to ib_inform_info_get_prod_type Message-ID: <1177708854.12542.188662.camel@hal.voltaire.com> OpenSM: Rename ib_inform_info_get_node_type to ib_inform_info_get_prod_type 13.4.8.3 InformInfo does not have a node type but rather a producer type (This is for master only) Signed-off-by: Ira K. Weiny Signed-off-by: Hal Rosenstock diff --git a/diags/src/saquery.c b/diags/src/saquery.c index af26d13..93f47d6 100644 --- a/diags/src/saquery.c +++ b/diags/src/saquery.c @@ -516,7 +516,7 @@ print_inform_info_record(ib_inform_info_ cl_ntoh16( p_iir->inform_info.g_or_v.generic.trap_num ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( &p_iir->inform_info )) + cl_ntoh32(ib_inform_info_get_prod_type( &p_iir->inform_info )) ); } else { printf("InformInfoRecord dump:\n" @@ -549,7 +549,7 @@ print_inform_info_record(ib_inform_info_ cl_ntoh16( p_iir->inform_info.g_or_v.vend.dev_id ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( &p_iir->inform_info )) + cl_ntoh32(ib_inform_info_get_prod_type( &p_iir->inform_info )) ); } } diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index 7245e84..b3937cb 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -7266,17 +7266,18 @@ ib_inform_info_set_qpn( * ib_inform_info_t *********/ -/****f* IBA Base: Types/ib_inform_info_get_node_type +/****f* IBA Base: Types/ib_inform_info_get_prod_type * NAME -* ib_inform_info_get_node_type +* ib_inform_info_get_prod_type * * DESCRIPTION -* Get 
Node Type of the Inform Info +* Get Producer Type of the Inform Info +* 13.4.8.3 InformInfo * * SYNOPSIS */ static inline ib_net32_t OSM_API -ib_inform_info_get_node_type( +ib_inform_info_get_prod_type( IN const ib_inform_info_t *p_inf) { uint32_t nt; @@ -7291,7 +7292,7 @@ ib_inform_info_get_node_type( * [in] pointer to an inform info * * RETURN VALUES -* The node type +* The producer type * * NOTES * diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c index a1a2e93..523dbee 100644 --- a/osm/opensm/osm_helper.c +++ b/osm/opensm/osm_helper.c @@ -1405,7 +1405,7 @@ osm_dump_inform_info( cl_ntoh16( p_ii->g_or_v.generic.trap_num ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( p_ii )) + cl_ntoh32(ib_inform_info_get_prod_type( p_ii )) ); } else @@ -1433,7 +1433,7 @@ osm_dump_inform_info( cl_ntoh16( p_ii->g_or_v.vend.dev_id ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( p_ii )) + cl_ntoh32(ib_inform_info_get_prod_type( p_ii )) ); } } @@ -1489,7 +1489,7 @@ osm_dump_inform_info_record( cl_ntoh16( p_iir->inform_info.g_or_v.generic.trap_num ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( &p_iir->inform_info )) + cl_ntoh32(ib_inform_info_get_prod_type( &p_iir->inform_info )) ); } else @@ -1525,7 +1525,7 @@ osm_dump_inform_info_record( cl_ntoh16( p_iir->inform_info.g_or_v.vend.dev_id ), cl_ntoh32( qpn ), resp_time_val, - cl_ntoh32(ib_inform_info_get_node_type( &p_iir->inform_info )) + cl_ntoh32(ib_inform_info_get_prod_type( &p_iir->inform_info )) ); } } diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c index 5bac67f..e66c259 100644 --- a/osm/opensm/osm_inform.c +++ b/osm/opensm/osm_inform.c @@ -561,14 +561,14 @@ __match_notice_to_inf_rec( } /* ProducerType ProducerType match or 0xFFFFFF */ - if ( (cl_ntoh32(ib_inform_info_get_node_type(p_ii)) != 0xFFFFFF) && - (ib_inform_info_get_node_type(p_ii) != ib_notice_get_prod_type(p_ntc)) ) + if ( 
(cl_ntoh32(ib_inform_info_get_prod_type(p_ii)) != 0xFFFFFF) && + (ib_inform_info_get_prod_type(p_ii) != ib_notice_get_prod_type(p_ntc)) ) { osm_log( p_log, OSM_LOG_DEBUG, "__match_notice_to_inf_rec: " "Mismatch by Node Type: II=0x%06X (%s) Trap=0x%06X (%s)\n", - cl_ntoh32(ib_inform_info_get_node_type(p_ii)), - ib_get_producer_type_str(ib_inform_info_get_node_type(p_ii)), + cl_ntoh32(ib_inform_info_get_prod_type(p_ii)), + ib_get_producer_type_str(ib_inform_info_get_prod_type(p_ii)), cl_ntoh32(ib_notice_get_prod_type(p_ntc)), ib_get_producer_type_str(ib_notice_get_prod_type(p_ntc)) ); diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c index 6d68ed2..4c0fbc3 100644 --- a/osm/opensm/osm_sa.c +++ b/osm/opensm/osm_sa.c @@ -647,7 +647,7 @@ sa_dump_one_inform(cl_list_item_t *p_lis cl_ntoh16(p_iir->inform_info.trap_type), cl_ntoh16(p_iir->inform_info.g_or_v.generic.trap_num), cl_ntoh32(p_iir->inform_info.g_or_v.generic.qpn_resp_time_val), - cl_ntoh32(ib_inform_info_get_node_type(&p_iir->inform_info)), + cl_ntoh32(ib_inform_info_get_prod_type(&p_iir->inform_info)), cl_ntoh16(p_infr->report_addr.dest_lid), p_infr->report_addr.path_bits, p_infr->report_addr.static_rate, From lawver1 at llnl.gov Fri Apr 27 15:26:23 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Fri, 27 Apr 2007 15:26:23 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <46325DF3.2050203@hp.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> Message-ID: <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> I hit the IP NIC over the head with a hammer and turned off all offload features and I no longer get the super jumbo packet and I have symmetric performance. 
This NIC supported "ethtool -K ethx tso/tx/rx/sg on/off" and I am not sure at this time which one I needed to whack but all off solved the problem.

Thanks for listening and reinforcing my search process.

bryan

At 01:32 PM 4/27/2007, Rick Jones wrote:
>Bryan Lawver wrote:
>>Your right about the ipoib module not combining packets (I believed you
>>without checking) but I did never the less. The ipoib_start_xmit routine
>>is definitely handed a "double packet" which means that the IP NIC
>>driver or the kernel is combining two packets into a single super jumbo
>>packet. This issue is irrespective of the IP MTU setting because I have
>>set all interfaces to 9000k yet ipoib accepts and forwards this 17964
>>packet to the next IB node and onto the TCP stack where it is never
>>acknowledged. This may not have come up in prior testing because I am
>>using some of the fastest IP NICs which have no trouble keeping up with
>>or exceeding the bandwidth of the IB side. This issue arises exactly
>>every 8 packets...(ring buffer overrun??)
>>I will be at Sonoma for the next few days as many on this list will be.
>
>
>Some NICs (esp 10G) support large receive offload - they coalesce TCP
>segments from the wire/fiber into larger ones they pass up the
>stack. Perhaps that is happening here?
>
>I'm going to go out a bit on a limb, cross the streams, and include
>netdev, because I suspect that if a system is acting as an IP router, one
>doesn't want large receive offload enabled. That may need some discussion
>in netdev - it may then require some changes to default settings or some
>documentation enhancements. That or I'll learn that the stack is already
>dealing with the issue...
>
>rick jones
>
>>bryan
>>
>>At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote:
>>
>>> > Quoting Bryan Lawver :
>>> > Subject: Re: IPoIB forwarding
>>> >
>>> > Here's a tcpdump of the same sequence.
The TCP MSS is 8960 and it >>> appears >>> > that two payloads are queued at ipoib which combines them into a single >>> > 17920 payload with assumingly correct IP header (40) and IB header >>> > (4). The application or TCP stack does not acknowledge this double >>> packet >>> > ie. it does not ACK until each of the 8960 packets are resent >>> > individually. Being an IB newbie, I am guessing this combining is >>> > allowable but may violate TCP protocol. >>> >>>IPoIB does nothing like this - it's just a network device so >>>it sends all packets out as is. >>> >>>-- >>>MST >> >>_______________________________________________ >>general mailing list >>general at lists.openfabrics.org >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general From rick.jones2 at hp.com Fri Apr 27 15:32:39 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 15:32:39 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> Message-ID: <46327A07.1000404@hp.com> Bryan Lawver wrote: > I hit the IP NIC over the head with a hammer and turned off all offload > features and I no longer get the super jumbo packet and I have symmetric > performance. This NIC supported "ethtool -K ethx tso/tx/rx/sg on/off" > and I am not sure at this time which one I needed to whack but all off > solved the problem. Yeah, that does seem like a rather broad remedy, but I guess if it works... 
:) And I suppose most of those offloads don't matter for a NIC being used in a router. Only problem is we don't know if it worked because it slowed-down the 10G side or because it had LRO disabling as a side-effect. If I were to guess, of those things listed, I'd guess that receive cko would have that as a side effect. Just what sort of 10G NIC was this anyway? With that knowledge we could probably narrow things down to a more specific modprobe setting, or maybe even an ethtool command, for some suitable revision of ethtool. rick jones > > Thanks for listening and re enforcing my search process. > > bryan > > At 01:32 PM 4/27/2007, Rick Jones wrote: > >> Bryan Lawver wrote: >> >>> Your right about the ipoib module not combining packets (I believed >>> you without checking) but I did never the less. The ipoib_start_xmit >>> routine is definitely handed a "double packet" which means that the >>> IP NIC driver or the kernel is combining two packets into a single >>> super jumbo packet. This issue is irrespective of the IP MTU setting >>> because I have set all interfaces to 9000k yet ipoib accepts and >>> forwards this 17964 packet to the next IB node and onto the TCP stack >>> where it is never acknowledged. This may not have come up in prior >>> testing because I am using some of the fastest IP NICs which have no >>> trouble keeping up with or exceeding the bandwidth of the IB side. >>> This issue arises exactly every 8 packets...(ring buffer overrun??) >>> I will be at Sonoma for the next few days as many on this list will be. >> >> >> >> Some NICs (esp 10G) support large receive offload - they coalesce TCP >> segments from the wire/fiber into larger ones they pass up the stack. >> Perhaps that is happening here? >> >> I'm going to go out a bit on a limb, cross the streams, and include >> netdev, because I suspect that if a system is acting as an IP router, >> one doesn't want large receive offload enabled. 
That may need some >> discussion in netdev - it may then require some changes to default >> settings or some documentation enhancements. That or I'll learn that >> the stack is already dealing with the issue... >> >> rick jones >> >>> bryan >>> >>> At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote: >>> >>>> > Quoting Bryan Lawver : >>>> > Subject: Re: IPoIB forwarding >>>> > >>>> > Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it >>>> appears >>>> > that two payloads are queued at ipoib which combines them into a >>>> single >>>> > 17920 payload with assumingly correct IP header (40) and IB header >>>> > (4). The application or TCP stack does not acknowledge this >>>> double packet >>>> > ie. it does not ACK until each of the 8960 packets are resent >>>> > individually. Being an IB newbie, I am guessing this combining is >>>> > allowable but may violate TCP protocol. >>>> >>>> IPoIB does nothing like this - it's just a network device so >>>> it sends all packets out as is. 
>>>> >>>> -- >>>> MST >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general From lawver1 at llnl.gov Fri Apr 27 15:43:54 2007 From: lawver1 at llnl.gov (Bryan Lawver) Date: Fri, 27 Apr 2007 15:43:54 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <46327A07.1000404@hp.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> Message-ID: <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> I had so much debugging turned on that it was not the "slowing of the traffic" but the "non-coelescencing" that was the remedy. The NIC is a MyriCom NIC and these are easy options to set. At 03:32 PM 4/27/2007, Rick Jones wrote: >Bryan Lawver wrote: >>I hit the IP NIC over the head with a hammer and turned off all offload >>features and I no longer get the super jumbo packet and I have symmetric >>performance. This NIC supported "ethtool -K ethx tso/tx/rx/sg on/off" >>and I am not sure at this time which one I needed to whack but all off >>solved the problem. > >Yeah, that does seem like a rather broad remedy, but I guess if it >works... :) And I suppose most of those offloads don't matter for a NIC >being used in a router. > >Only problem is we don't know if it worked because it slowed-down the 10G >side or because it had LRO disabling as a side-effect. If I were to guess, >of those things listed, I'd guess that receive cko would have that as a >side effect. 
> >Just what sort of 10G NIC was this anyway? With that knowledge we could >probably narrow things down to a more specific modprobe setting, or maybe >even an ethtool command, for some suitable revision of ethtool. > >rick jones > >>Thanks for listening and re enforcing my search process. >>bryan >>At 01:32 PM 4/27/2007, Rick Jones wrote: >> >>>Bryan Lawver wrote: >>> >>>>Your right about the ipoib module not combining packets (I believed you >>>>without checking) but I did never the less. The ipoib_start_xmit >>>>routine is definitely handed a "double packet" which means that the IP >>>>NIC driver or the kernel is combining two packets into a single super >>>>jumbo packet. This issue is irrespective of the IP MTU setting because >>>>I have set all interfaces to 9000k yet ipoib accepts and forwards this >>>>17964 packet to the next IB node and onto the TCP stack where it is >>>>never acknowledged. This may not have come up in prior testing because >>>>I am using some of the fastest IP NICs which have no trouble keeping up >>>>with or exceeding the bandwidth of the IB side. >>>>This issue arises exactly every 8 packets...(ring buffer overrun??) >>>>I will be at Sonoma for the next few days as many on this list will be. >>> >>> >>> >>>Some NICs (esp 10G) support large receive offload - they coalesce TCP >>>segments from the wire/fiber into larger ones they pass up the stack. >>>Perhaps that is happening here? >>> >>>I'm going to go out a bit on a limb, cross the streams, and include >>>netdev, because I suspect that if a system is acting as an IP router, >>>one doesn't want large receive offload enabled. That may need some >>>discussion in netdev - it may then require some changes to default >>>settings or some documentation enhancements. That or I'll learn that >>>the stack is already dealing with the issue... >>> >>>rick jones >>> >>>>bryan >>>> >>>>At 11:06 AM 4/26/2007, Michael S. 
Tsirkin wrote: >>>> >>>>> > Quoting Bryan Lawver : >>>>> > Subject: Re: IPoIB forwarding >>>>> > >>>>> > Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it >>>>> appears >>>>> > that two payloads are queued at ipoib which combines them into a single >>>>> > 17920 payload with assumingly correct IP header (40) and IB header >>>>> > (4). The application or TCP stack does not acknowledge this double >>>>> packet >>>>> > ie. it does not ACK until each of the 8960 packets are resent >>>>> > individually. Being an IB newbie, I am guessing this combining is >>>>> > allowable but may violate TCP protocol. >>>>> >>>>>IPoIB does nothing like this - it's just a network device so >>>>>it sends all packets out as is. >>>>> >>>>>-- >>>>>MST >>>> >>>> >>>>_______________________________________________ >>>>general mailing list >>>>general at lists.openfabrics.org >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>To unsubscribe, please visit >>>>http://openib.org/mailman/listinfo/openib-general From rick.jones2 at hp.com Fri Apr 27 16:37:49 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 16:37:49 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> Message-ID: <4632894D.40705@hp.com> Bryan Lawver wrote: > I had so much debugging turned on that it was not the "slowing of the > traffic" but the "non-coelescencing" that was the remedy. The NIC is a > MyriCom NIC and these are easy options to set. 
As chance would have it, I've played with some Myricom myri10ge NICs recently, and even disabled large receive offload during some netperf tests :) It is a modprobe option. Going back now to the driver source and the README I see :-) Troubleshooting =============== Large Receive Offload (LRO) is enabled by default. This will interfere with forwarding TCP traffic. If you plan to forward TCP traffic (using the host with the Myri10GE NIC as a router or bridge), you must disable LRO. To disable LRO, load the myri10ge driver with myri10ge_lro set to 0: # modprobe myri10ge myri10ge_lro=0 Alternatively, you can disable LRO at runtime by disabling receive checksum offloading via ethtool: # ethtool -K eth2 rx off rick jones From arlin.r.davis at intel.com Fri Apr 27 16:38:01 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 27 Apr 2007 16:38:01 -0700 Subject: [ofa-general] [PATCH] uDAPL OFED 1.2 RC2 build issue on ia64 and RHEL5 Message-ID: <000001c78925$18867a10$ff0da8c0@amr.corp.intel.com> Fixes build problems with ia64 and RHEL5 with atomic operations. Patch was tested on ia64 RHEL4 and RHEL5 using dtest/dapltest. James, can you review this before I push. 
Thanks, -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com diff --git a/Makefile.am b/Makefile.am index 70ef6ef..d02f7c1 100644 --- a/Makefile.am +++ b/Makefile.am @@ -2,10 +2,13 @@ OSFLAGS = -DOS_RELEASE=$(shell expr `uname -r | cut -f1 -d.` \* 65536 + `uname -r | cut -f2 -d.`) # Check for RedHat, needed for ia64 udapl atomic operations (IA64_FETCHADD syntax) -if OS_RHEL +# and built-in atomics for RedHat EL5 +if OS_RHEL4 OSFLAGS += -DREDHAT_EL4 -else -OSFLAGS += +endif + +if OS_RHEL5 +OSFLAGS += -DREDHAT_EL5 endif if DEBUG diff --git a/configure.in b/configure.in index 324bfa1..e11fa73 100644 --- a/configure.in +++ b/configure.in @@ -50,15 +50,25 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) -dnl Check for Redhat EL release -AC_CACHE_CHECK(whether this is an RHEL system, ac_cv_rhel, +dnl Check for Redhat EL release 4 +AC_CACHE_CHECK(Check for RHEL4 system, ac_cv_rhel4, if test -f /etc/redhat-release && - test -n "`grep -v Fedora /etc/redhat-release`"; then - ac_cv_rhel=yes + test -n "`grep -e "release 4" /etc/redhat-release`"; then + ac_cv_rhel4=yes else - ac_cv_rhel=no + ac_cv_rhel4=no fi) -AM_CONDITIONAL(OS_RHEL, test "$ac_cv_rhel" = "yes") +AM_CONDITIONAL(OS_RHEL4, test "$ac_cv_rhel4" = "yes") + +dnl Check for Redhat EL release 5 +AC_CACHE_CHECK(Check for RHEL5 system, ac_cv_rhel5, + if test -f /etc/redhat-release && + test -n "`grep -e "release 5" /etc/redhat-release`"; then + ac_cv_rhel5=yes + else + ac_cv_rhel5=no + fi) +AM_CONDITIONAL(OS_RHEL5, test "$ac_cv_rhel5" = "yes") AC_CONFIG_FILES([Makefile test/dtest/Makefile test/dapltest/Makefile libdat.spec]) diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index efa967d..cfc85d1 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -78,7 +78,7 @@ #include #include -#if defined(__ia64__) || defined(__PPC64__) +#if !defined(REDHAT_EL5) && (defined(__ia64__) || defined(__PPC64__)) #include #endif #if 
defined(__PPC64__) @@ -155,14 +155,14 @@ dapl_os_atomic_inc ( INOUT DAPL_ATOMIC *v) { #ifdef __ia64__ - DAT_COUNT old_value; - -# if !defined(REDHAT_EL4) && (OS_RELEASE >= LINUX_VERSION(2,6)) + DAT_COUNT old_value; +#if defined(REDHAT_EL5) + old_value = __sync_fetch_and_add(v, 1); +#elif !defined(REDHAT_EL4) && (OS_RELEASE >= LINUX_VERSION(2,6)) IA64_FETCHADD(old_value,v,1,4,rel); -# else +#else IA64_FETCHADD(old_value,v,1,4); -# endif - +#endif #elif defined(__PPC64__) atomic_inc((atomic_t *) v); #else /* !__ia64__ */ @@ -185,14 +185,14 @@ dapl_os_atomic_dec ( INOUT DAPL_ATOMIC *v) { #ifdef __ia64__ - DAT_COUNT old_value; - -# if !defined(REDHAT_EL4) && (OS_RELEASE >= LINUX_VERSION(2,6)) + DAT_COUNT old_value; +#if defined(REDHAT_EL5) + old_value = __sync_fetch_and_sub(v, 1); +#elif !defined(REDHAT_EL4) && (OS_RELEASE >= LINUX_VERSION(2,6)) IA64_FETCHADD(old_value,v,-1,4,rel); -# else +#else IA64_FETCHADD(old_value,v,-1,4); -# endif - +#endif #elif defined (__PPC64__) atomic_dec((atomic_t *)v); @@ -233,7 +233,9 @@ dapl_os_atomic_assign ( */ #ifdef __ia64__ -#ifdef REDHAT_EL4 +#if defined(REDHAT_EL5) + current_value = __sync_val_compare_and_swap(v,match_value,new_value); +#elif defined(REDHAT_EL4) current_value = ia64_cmpxchg("acq",v,match_value,new_value,4); #else current_value = ia64_cmpxchg(acq,v,match_value,new_value,4); From davem at davemloft.net Fri Apr 27 16:39:34 2007 From: davem at davemloft.net (David Miller) Date: Fri, 27 Apr 2007 16:39:34 -0700 (PDT) Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <4632894D.40705@hp.com> References: <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> Message-ID: <20070427.163934.106263903.davem@davemloft.net> From: Rick Jones Date: Fri, 27 Apr 2007 16:37:49 -0700 > Large Receive Offload (LRO) is enabled by default. This will > interfere with forwarding TCP traffic. 
If you plan to forward TCP > traffic (using the host with the Myri10GE NIC as a router or bridge), > you must disable LRO. To disable LRO, load the myri10ge driver > with myri10ge_lro set to 0: LRO should be disabled by default if the driver does this. This is a major and unacceptable bug. Thanks for pointing this out Rick. From rick.jones2 at hp.com Fri Apr 27 16:48:00 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Apr 2007 16:48:00 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070427.163934.106263903.davem@davemloft.net> References: <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070427.163934.106263903.davem@davemloft.net> Message-ID: <46328BB0.9030501@hp.com> David Miller wrote: > From: Rick Jones > Date: Fri, 27 Apr 2007 16:37:49 -0700 > > >>Large Receive Offload (LRO) is enabled by default. This will >>interfere with forwarding TCP traffic. If you plan to forward TCP >>traffic (using the host with the Myri10GE NIC as a router or bridge), >>you must disable LRO. To disable LRO, load the myri10ge driver >>with myri10ge_lro set to 0: > > > LRO should be disabled by default if the driver does this. This is a > major and unacceptable bug. > > Thanks for pointing this out Rick. No problem - just to play whatif/devil's advocate for a bit though... is there any way to tie that in with the setting of net.ipv4.ip_forward (and/or its IPv6 counterpart)? rick jones From davem at davemloft.net Fri Apr 27 16:52:32 2007 From: davem at davemloft.net (David Miller) Date: Fri, 27 Apr 2007 16:52:32 -0700 (PDT) Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <46328BB0.9030501@hp.com> References: <4632894D.40705@hp.com> <20070427.163934.106263903.davem@davemloft.net> <46328BB0.9030501@hp.com> Message-ID: <20070427.165232.74560572.davem@davemloft.net> From: Rick Jones Date: Fri, 27 Apr 2007 16:48:00 -0700 > No problem - just to play whatif/devil's advocate for a bit > though... 
is there any way to tie that in with the setting of > net.ipv4.ip_forward (and/or its IPv6 counterpart)? Even ignoring that, consider the potential issues this kind of problem could be causing netfilter. From pradeep at us.ibm.com Fri Apr 27 17:51:14 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 27 Apr 2007 18:51:14 -0600 Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review Message-ID: Here is a third version of the IPOIB_CM_NOSRQ patch for review. This patch will benefit adapters that do not support shared receive queues. This patch incorporates the following review comments from v2: there should be no line wrap issues now; the code is restructured to separate the SRQ and non-SRQ paths in several places. This patch has been tested with linux-2.6.21-rc5 and rc7 (derived from Roland's for-2.6.22 git tree on 04/25/2007) with Topspin and IBM HCAs on ppc64 machines. I have run netperf between two IBM HCAs as well as between an IBM and a Topspin HCA. Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while performing interoperability tests. As discussed on this mailing list, that may be a CM bug, or something the various HCAs need to address. Hence I would like to separate that issue from this patch. Once the issue is resolved I can provide another patch to change the retry_count values back to 0 if need be. Note 2: The "Modify Port" patch submitted by Joachim Fenkes is needed for the ehca driver to work on the IBM HCAs. I have not tested with this patch as yet.
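For readers following the wr_id handling in the patch below: the NOSRQ path appears to pack two values into the 64-bit work request id, with the receive-ring slot in the upper 32 bits and the rx_index_ring[] index in the low 10 bits, plus the IPOIB_CM_OP_NOSRQ flag at bit 29. What follows is a minimal userspace sketch of that layout, not part of the patch. The helper names (pack_nosrq_wr_id, nosrq_slot, nosrq_index, is_nosrq) are hypothetical, and uint64_t constants are assumed in place of the kernel's 1ul so the sketch behaves the same on 32-bit and 64-bit hosts.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror the #defines added to ipoib.h by the patch. */
#define IPOIB_CM_OP_NOSRQ     (UINT64_C(1) << 29)
#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      UINT64_C(0x3ff)

/* Hypothetical helper: build a wr_id from a ring slot and a QP index. */
static uint64_t pack_nosrq_wr_id(uint32_t slot, uint32_t index)
{
    /* The index must fit in the 10-bit mask, i.e. below the ring size. */
    assert(index < NOSRQ_INDEX_RING_SIZE);
    return ((uint64_t)slot << 32) | index | IPOIB_CM_OP_NOSRQ;
}

/* Hypothetical helper: recover the receive-ring slot (upper 32 bits). */
static uint32_t nosrq_slot(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> 32);
}

/* Hypothetical helper: recover the rx_index_ring[] index (low 10 bits). */
static uint32_t nosrq_index(uint64_t wr_id)
{
    return (uint32_t)(wr_id & ~IPOIB_CM_OP_NOSRQ & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: true if the completion belongs to the NOSRQ path. */
static int is_nosrq(uint64_t wr_id)
{
    return (wr_id & IPOIB_CM_OP_NOSRQ) != 0;
}
```

Because the mask covers only bits 0-9, the index cannot collide with the flag bits at 29, 30 and 31 (IPOIB_CM_OP_NOSRQ, IPOIB_CM_OP_SRQ, IPOIB_OP_RECV), which is presumably why the ring size and mask "go hand in hand" in the patch.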
Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-24 18:10:17.000000000 -0700 +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-25 10:11:34.000000000 -0700 @@ -99,6 +99,12 @@ enum { #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_NOSRQ (1ul << 29) + +/* These two go hand in hand */ +#define NOSRQ_INDEX_RING_SIZE 1024 +#define NOSRQ_INDEX_MASK 0x00000000000003ff + #else #define IPOIB_CM_OP_SRQ (0) #endif @@ -136,9 +142,11 @@ struct ipoib_cm_data { struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; struct list_head list; struct net_device *dev; unsigned long jiffies; + u32 index; }; struct ipoib_cm_tx { @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_ring; }; /* --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-24 18:10:17.000000000 -0700 +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-27 14:03:40.000000000 -0700 @@ -76,7 +76,7 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; @@ -85,13 +85,14 @@ static int ipoib_cm_post_receive(struct priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; for (i = 0; i < IPOIB_CM_RX_SG; ++i) - priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; + priv->cm.rx_sge[i].addr = + priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, 
- priv->cm.srq_ring[id].mapping); + priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); priv->cm.srq_ring[id].skb = NULL; } @@ -99,12 +100,69 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u64 wr_id; + struct ipoib_cm_rx *rx_ptr; + unsigned long flags; + + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + + /* There is a slender chance of a race between the stale_task + * running after a period of inactivity and the receipt of + * a packet being processed at about the same instant. + * Hence the lock */ + + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static int ipoib_cm_post_receive(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + if (priv->cm.srq) + ret = post_receive_srq(dev, id); + else + ret = post_receive_nosrq(dev, id); + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; + unsigned long flags; skb = 
dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -123,7 +181,7 @@ static struct sk_buff *ipoib_cm_alloc_rx return NULL; } - for (i = 0; i < frags; i++) { + for (i = 0; i < frags; i++) { struct page *page = alloc_page(GFP_ATOMIC); if (!page) @@ -136,7 +194,17 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -157,13 +225,20 @@ static struct ib_qp *ipoib_cm_create_rx_ struct ib_qp_init_attr attr = { .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, - .srq = priv->cm.srq, .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_sge = IPOIB_CM_RX_SG, /* Is this correct? 
*/ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + + if (priv->cm.srq) + attr.srq = priv->cm.srq; + else + attr.srq = NULL; + return ib_create_qp(priv->pd, &attr); } @@ -198,6 +273,7 @@ static int ipoib_cm_modify_rx_qp(struct ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); return ret; } + return 0; } @@ -217,12 +293,87 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + + if (priv->cm.srq) + rep.srq = 1; + else + rep.srq = 0; return ib_send_cm_rep(cm_id, &rep); } +int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 qp_num, index; + u64 i; + + qp_num = p->qp->qp_num; + /* Allocate space for the rx_ring here */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, + GFP_KERNEL); + if (p->rx_ring == NULL) + return -ENOMEM; + + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + + /* Find an empty rx_index_ring[] entry */ + for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++) + if (priv->cm.rx_index_ring[index] == NULL) + break; + + if ( index == NOSRQ_INDEX_RING_SIZE) { + spin_unlock_irq(&priv->lock); + printk(KERN_WARNING "NOSRQ supports a max of %d RC " + "QPs. 
That limit has now been reached\n", + NOSRQ_INDEX_RING_SIZE); + return -EINVAL; + } + + /* Store the pointer to retrieve it later using the index */ + priv->cm.rx_index_ring[index] = p; + spin_unlock_irq(&priv->lock); + p->index = index; + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + goto err_modify_nosrq; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + /* Free rx_ring previously allocated */ + kfree(p->rx_ring); + return -ENOMEM; + } + + /* Can we call the nosrq version? */ + if (ipoib_cm_post_receive(dev, i << 32 | index)) { + ipoib_warn(priv, "ipoib_ib_post_receive " + "failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } + } /* end for */ + + return 0; + +err_modify_nosrq: + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -243,10 +394,17 @@ static int ipoib_cm_req_handler(struct i goto err_qp; } - psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (priv->cm.srq == NULL) { /* NOSRQ */ + psn = random32() & 0xffffff; + if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn)) + goto err_modify; + } else { /* SRQ */ + p->rx_ring = NULL; /* This is used only by NOSRQ */ + psn = random32() & 0xffffff; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -254,11 +412,13 @@ static int ipoib_cm_req_handler(struct i goto err_rep; } - cm_id->context = p; - p->jiffies = jiffies; - spin_lock_irq(&priv->lock); - list_add(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + 
if (priv->cm.srq) { + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + } queue_delayed_work(ipoib_workqueue, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); return 0; @@ -339,23 +499,40 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. */ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + } +} +static int handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id; struct ipoib_cm_rx *p; unsigned long flags; - u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; + + wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); - return; + ipoib_warn(priv, "cm recv completion event with wrid %d " + "(> %d)\n", wr_id, ipoib_recvq_size); + return 1; } skb = priv->cm.srq_ring[wr_id].skb; @@ -365,22 +542,12 @@ void ipoib_cm_handle_rx_wc(struct net_de "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if 
(time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - } + timer_check(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -388,22 +555,119 @@ void ipoib_cm_handle_rx_wc(struct net_de newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); if (unlikely(!newskb)) { - /* - * If we can't allocate a new RX buffer, dump - * this packet and reuse the old buffer. - */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + + netif_rx_ni(skb); + +repost_srq: + ret = ipoib_cm_post_receive(dev, wr_id); + + if (unlikely(ret)) { + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", + wr_id); + return 1; + } + + return 0; + +} + +static int handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = 
netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id; + u32 index; + struct ipoib_cm_rx *p, *rx_ptr; + unsigned long flags; + int frags, ret; + + + wr_id = wc->wr_id >> 32; + + ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", + wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %d " + "(> %d)\n", wr_id, ipoib_recvq_size); + return 1; + } + + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ; + spin_lock_irqsave(&priv->lock, flags); + rx_ptr = priv->cm.rx_index_ring[index]; + spin_unlock_irqrestore(&priv->lock, flags); + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. */ + p = rx_ptr; + timer_check(priv, p); + } + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb->mac.raw = skb->data; @@ -416,12 +680,34 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = ipoib_cm_post_receive(dev, wr_id << 32 | index); + + if (unlikely(ret)) { + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", + wr_id); + return 1; + } + + return 0; +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + + if (priv->cm.srq) + ret = handle_rx_wc_srq(dev, wc); + else + ret = handle_rx_wc_nosrq(dev, wc); + + if (unlikely(ret)) + ipoib_warn(priv, "Error processing rx wc\n"); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -606,6 +892,22 @@ int ipoib_cm_dev_open(struct net_device return 0; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for(i = 0; i < ipoib_recvq_size; ++i) + if(p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -618,6 +920,8 @@ void 
ipoib_cm_dev_stop(struct net_device spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + if (priv->cm.srq == NULL) + free_resources_nosrq(priv, p); list_del_init(&p->list); spin_unlock_irq(&priv->lock); ib_destroy_cm_id(p->id); @@ -703,9 +1007,14 @@ static struct ib_qp *ipoib_cm_create_tx_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = {}; attr.recv_cq = priv->cq; - attr.srq = priv->cm.srq; + if (priv->cm.srq) + attr.srq = priv->cm.srq; + else + attr.srq = NULL; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; /* Not in MST code */ attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; /* Not in MST code */ attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -742,10 +1051,13 @@ static int ipoib_cm_send_req(struct net_ req.responder_resources = 4; req.remote_cm_response_timeout = 20; req.local_cm_response_timeout = 20; - req.retry_count = 0; /* RFC draft warns against retries */ - req.rnr_retry_count = 0; /* RFC draft warns against retries */ + req.retry_count = 6; /* RFC draft warns against retries */ + req.rnr_retry_count = 6;/* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + if (priv->cm.srq) + req.srq = 1; + else + req.srq = 0; return ib_send_cm_req(id, &req); } @@ -1089,6 +1401,10 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; + if (priv->cm.srq == NULL) { /* NOSRQ */ + free_resources_nosrq(priv, p); + priv->cm.rx_index_ring[p->index] = NULL; + } list_del_init(&p->list); spin_unlock_irq(&priv->lock); ib_destroy_cm_id(p->id); @@ -1143,16 +1459,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct 
ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; - int ret, i; + int ret, i, supports_srq; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1164,21 +1504,26 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + if (ret = ib_query_device(priv->ca, &attr)) return ret; + if (attr.max_srq) + supports_srq = 1; /* This device supports SRQ */ + else { + supports_srq = 0; } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } + if (supports_srq) { + if (ret = create_srq(dev, priv)) + return ret; + + priv->cm.rx_index_ring = NULL; /* Not needed for SRQ */ + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE * + sizeof *priv->cm.rx_index_ring, + GFP_KERNEL); + } for (i = 0; i < IPOIB_CM_RX_SG; 
++i) priv->cm.rx_sge[i].lkey = priv->mr->lkey; @@ -1190,19 +1535,25 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (ipoib_cm_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } - } + } /* if SRQ */ priv->dev->dev_addr[0] = IPOIB_FLAGS_RC; return 0; --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-24 18:10:17.000000000 -0700 +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-25 10:11:34.000000000 -0700 @@ -282,7 +282,7 @@ static void ipoib_ib_handle_tx_wc(struct static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc) { - if (wc->wr_id & IPOIB_CM_OP_SRQ) + if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ)) ipoib_cm_handle_rx_wc(dev, wc); else if (wc->wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, wc); Pradeep pradeep at us.ibm.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ipoib_cm.nosrq.patch.v3 Type: application/octet-stream Size: 23445 bytes Desc: not available URL: From parks at lanl.gov Fri Apr 27 19:35:52 2007 From: parks at lanl.gov (parks) Date: Fri, 27 Apr 2007 20:35:52 -0600 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> Message-ID: <7.0.1.0.2.20070427203406.0336b9f8@lanl.gov> If you are using the node as a router and using the Myrinet NIC, then there is something we had to turn off. It was causing panics on Roadrunner. It is spelled out explicitly in the Myrinet README: it combines packets. I can tell you more on Monday. At 04:26 PM 4/27/2007, Bryan Lawver wrote: >I hit the IP NIC over the head with a hammer and turned off all >offload features and I no longer get the super jumbo packet and I >have symmetric performance. This NIC supported "ethtool -K ethx >tso/tx/rx/sg on/off" and I am not sure at this time which one I >needed to whack but all off solved the problem. > >Thanks for listening and re enforcing my search process. > >bryan > >At 01:32 PM 4/27/2007, Rick Jones wrote: >>Bryan Lawver wrote: >>>Your right about the ipoib module not combining packets (I >>>believed you without checking) but I did never the less. The >>>ipoib_start_xmit routine is definitely handed a "double >>>packet" which means that the IP NIC driver or the kernel is >>>combining two packets into a single super jumbo packet.
This >>>issue is irrespective of the IP MTU setting because I have set all >>>interfaces to 9000k yet ipoib accepts and forwards this 17964 >>>packet to the next IB node and onto the TCP stack where it is >>>never acknowledged. This may not have come up in prior testing >>>because I am using some of the fastest IP NICs which have no >>>trouble keeping up with or exceeding the bandwidth of the IB >>>side. This issue arises exactly every 8 packets...(ring buffer overrun??) >>>I will be at Sonoma for the next few days as many on this list will be. >> >> >>Some NICs (esp 10G) support large receive offload - they coalesce >>TCP segments from the wire/fiber into larger ones they pass up the >>stack. Perhaps that is happening here? >> >>I'm going to go out a bit on a limb, cross the streams, and include >>netdev, because I suspect that if a system is acting as an IP >>router, one doesn't want large receive offload enabled. That may >>need some discussion in netdev - it may then require some changes >>to default settings or some documentation enhancements. That or >>I'll learn that the stack is already dealing with the issue... >> >>rick jones >> >>>bryan >>> >>>At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote: >>> >>>> > Quoting Bryan Lawver : >>>> > Subject: Re: IPoIB forwarding >>>> > >>>> > Here's a tcpdump of the same sequence. The TCP MSS is 8960 >>>> and it appears >>>> > that two payloads are queued at ipoib which combines them into a single >>>> > 17920 payload with assumingly correct IP header (40) and IB header >>>> > (4). The application or TCP stack does not acknowledge this >>>> double packet >>>> > ie. it does not ACK until each of the 8960 packets are resent >>>> > individually. Being an IB newbie, I am guessing this combining is >>>> > allowable but may violate TCP protocol. >>>> >>>>IPoIB does nothing like this - it's just a network device so >>>>it sends all packets out as is. 
>>>> >>>>-- >>>>MST >>> >>>_______________________________________________ >>>general mailing list >>>general at lists.openfabrics.org >>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>To unsubscribe, please visit >>>http://openib.org/mailman/listinfo/openib-general > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Fri Apr 27 20:55:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Apr 2007 20:55:47 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipath - Don't call spin_lock_irq() from interrupt context In-Reply-To: <1177697471.3407.14.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 27 Apr 2007 11:11:11 -0700") References: <1177697471.3407.14.camel@brick.pathscale.com> Message-ID: Thanks, I applied both ipath patches. Any chance of getting a final version of the pending mmap list fix, or should I just take what Robert posted a while ago? - r. From rdreier at cisco.com Fri Apr 27 20:56:34 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Apr 2007 20:56:34 -0700 Subject: [ofa-general] Re: What's in infiniband.git for 2.6.22 In-Reply-To: <20070427153016.GB1709@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 27 Apr 2007 18:30:16 +0300") References: <20070427153016.GB1709@mellanox.co.il> Message-ID: > What about the mthca patch to use separate HW queues for kernel RC/UD/userspace RC? right, I'll queue that up too. BTW is there something analogous we could do for mlx4, or is FW not quite ready? - R. 
From rjwalsh at pathscale.com Fri Apr 27 21:07:23 2007 From: rjwalsh at pathscale.com (Robert Walsh) Date: Fri, 27 Apr 2007 21:07:23 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipath - Don't call spin_lock_irq() from interrupt context In-Reply-To: References: <1177697471.3407.14.camel@brick.pathscale.com> Message-ID: <4632C87B.10502@pathscale.com> Roland Dreier wrote: > Thanks, I applied both ipath patches. > > Any chance of getting a final version of the pending mmap list fix, or > should I just take what Robert posted a while ago? Here's the latest version we have. Don't remember if it's the same as the last one I sent you. Regards, Robert. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pending-mmaps-cleanup.patch URL: From panda at cse.ohio-state.edu Fri Apr 27 21:18:21 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sat, 28 Apr 2007 00:18:21 -0400 (EDT) Subject: [ofa-general] Announcing the release of MVAPICH 0.9.9 Message-ID: <200704280418.l3S4ILFP021861@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the availability of MVAPICH 0.9.9 with the following NEW features: - Message coalescing support to enable reduction of per Queue-pair send queues for reduction in memory requirement on large scale clusters. This design also increases the small message messaging rate significantly. Reduction in memory requirement on large scale clusters can be found here: http://mvapich.cse.ohio-state.edu/performance/mvapich/mem_0_9_9.shtml - Designs for avoiding hot-spots in networks of large-scale clusters - Multi-pathing support leveraging LMC mechanism - Multi-port/Multi-HCA support for enabling user processes to bind to different IB ports for balanced communication performance on multi-core platforms - Multi-core optimized scalable shared memory design - Memory Hook support provided by integration with ptmalloc2 library. 
This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that frequently use malloc and free operations. - Optimized, high-performance shared memory aware collective operations for multi-core platforms Performance benefits for sample collective operations using the optimized schemes can be found here: http://mvapich.cse.ohio-state.edu/performance/collective.shtml - Shared-Memory only channel (This interface support is useful for running MPI jobs on multi-processor systems without using any high-performance network. For example, multi-core servers, desktops, and laptops; and clusters with serial nodes.) A new "Multiple-pair Bandwidth and Message Rate" test is also available as a part of OSU Benchmarks. A newly designed MVAPICH web site (http://mvapich.cse.ohio-state.edu/) is now available for MVAPICH/MVAPICH2 users for easier navigation. More details on all features and supported platforms can be obtained by visiting the following URL: http://mvapich.cse.ohio-state.edu/overview/mvapich/features.shtml MVAPICH 0.9.9 continues to deliver excellent performance. Sample performance numbers include: - OpenFabrics/Gen2 on EM64T quad-core with PCIe and ConnectX-DDR: (These numbers are preliminary and optimizations are on-going.) - 1.39 microsec one-way latency (1 byte) - 1419 MB/sec unidirectional bandwidth - 2769 MB/sec bidirectional bandwidth More detailed performance numbers for MVAPICH 0.9.9 on ConnectX are available from the following URL: http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml - OpenFabrics/Gen2 on EM64T with PCI-Ex and IBA-DDR: - 2.93 microsec one-way latency (4 bytes) - 1405 MB/sec unidirectional bandwidth - 2702 MB/sec bidirectional bandwidth Performance numbers for all other platforms, system configurations and operations can be viewed by visiting `Performance' section of the project's web page. 
To download the MVAPICH 0.9.9 package and access the anonymous SVN, please visit the following URL: http://mvapich.cse.ohio-state.edu/ All feedback, including bug reports, hints for performance tuning, patches and enhancements, is welcome. Please post it to the mvapich-discuss mailing list. Thanks, The MVAPICH Team ====================================================================== MVAPICH/MVAPICH2 project is currently supported with funding from U.S. National Science Foundation, U.S. DOE Office of Science, Mellanox, Intel, Cisco Systems, QLogic, Sun Microsystems and Linux Networx; and with equipment support from Advanced Clustering, AMD, Apple, Appro, Chelsio, Dell, Fujitsu, Fulcrum, IBM, Intel, Mellanox, Microway, NetEffect, QLogic and Sun Microsystems. Other technology partners include Etnus. ====================================================================== From billfink at mindspring.com Fri Apr 27 23:51:17 2007 From: billfink at mindspring.com (Bill Fink) Date: Sat, 28 Apr 2007 02:51:17 -0400 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <4632894D.40705@hp.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> Message-ID: <20070428025117.a3b1200a.billfink@mindspring.com> On Fri, 27 Apr 2007, Rick Jones wrote: > Bryan Lawver wrote: > > I had so much debugging turned on that it was not the "slowing of the > > traffic" but the "non-coalescing" that was the remedy. The NIC is a > > MyriCom NIC and these are easy options to set.
> > As chance would have it, I've played with some Myricom myri10ge NICs recently, > and even disabled large receive offload during some netperf tests :) It is a > modprobe option. Going back now to the driver source and the README I see :-) > > > > Troubleshooting > =============== > > Large Receive Offload (LRO) is enabled by default. This will > interfere with forwarding TCP traffic. If you plan to forward TCP > traffic (using the host with the Myri10GE NIC as a router or bridge), > you must disable LRO. To disable LRO, load the myri10ge driver > with myri10ge_lro set to 0: > > # modprobe myri10ge myri10ge_lro=0 > > Alternatively, you can disable LRO at runtime by disabling > receive checksum offloading via ethtool: > > # ethtool -K eth2 rx off > > > > rick jones What version of the myri10ge driver is this? With the 1.2.0 version that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module parameter. [root at lang2 ~]# modinfo myri10ge | grep -i lro [root at lang2 ~]# And I've been testing IP forwarding using two Myricom 10-GigE NICs without setting any special modprobe parameters. 
-Bill From vlad at lists.openfabrics.org Sat Apr 28 02:37:18 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 28 Apr 2007 02:37:18 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070428-0200 daily build status Message-ID: <20070428093718.E9E78E60826@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 
with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Failed: From abhinav.vishnu at gmail.com Sat Apr 28 07:11:41 2007 From: abhinav.vishnu at gmail.com (Abhinav Vishnu) Date: Sat, 28 Apr 2007 10:11:41 -0400 Subject: [ofa-general] Re: [ewg] APM Example In-Reply-To: References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> Message-ID: <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com> Hi Roland, On 4/26/07, Roland Dreier wrote: > > Abhinav> However, VAPI has an event which specifies the successful > Abhinav> transition of MIGRATED -> ARMED (I know very well, that > Abhinav> it is done through modify_qp). But just the success of > Abhinav> modify_qp does not explicitly tell the time at which the > Abhinav> transition successfully occured, does it? > > You don't know the time that the transition occurred, except that it > is between when you called modify QP and when it returned. But an > asynchronous event doesn't really help, does it? It does help. APM is not only defined for network fault tolerance, it can also be used for load-balancing. With this event, one can know when the path is loaded and it is safe to call modify_qp. Also, do you have any script which can potentially bring a port down from Active state, without actually unplugging the cable? Please let me know. Thanks, :- Abhinav All an event would > tell you is that the transition occurred some time before the event was > generated, which is some time before when the event was delivered to you. > > Abhinav> Specifically: > > Abhinav> VAPI_PATH_MIG_ARMED would make my day. 
I believe that > Abhinav> VAPI_QP_PATH_MIGRATED is similar to > Abhinav> IB_EVENT_PATH_MIG. Please correct me if i am wrong. > > I see... VAPI_PATH_MIG_ARMED is a new event that was added only in > VAPI 4.1.0, which was why I didn't know about it. Only Mellanox HCAs > support it, it is not specified by the InfiniBand architecture, and I > don't really see the point of it (as I tried to explain above). > > - R. > -- Abhinav Vishnu Graduate Student Computer Science and Engineering The Ohio State University -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Sat Apr 28 07:15:34 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Apr 2007 10:15:34 -0400 Subject: [ofa-general] Re: [ewg] APM Example In-Reply-To: <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com> References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com> Message-ID: <1177769732.12542.250381.camel@hal.voltaire.com> On Sat, 2007-04-28 at 10:11, Abhinav Vishnu wrote: > Hi Roland, > > On 4/26/07, Roland Dreier wrote: > Abhinav> However, VAPI has an event which specifies the > successful > Abhinav> transition of MIGRATED -> ARMED (I know very > well, that > Abhinav> it is done through modify_qp). But just the > success of > Abhinav> modify_qp does not explicitly tell the time at > which the > Abhinav> transition successfully occured, does it? > > You don't know the time that the transition occurred, except > that it > is between when you called modify QP and when it > returned. But an > asynchronous event doesn't really help, does it? > > It does help. APM is not only defined for network fault tolerance, it > can > also be used for load-balancing. With this event, one can know when > the path is loaded and it is safe to call modify_qp. 
> > Also, do you have any script which can potentially bring a port down > from Active state, without actually unplugging the cable? Please let > me know. ibportstate -- Hal > Thanks, > > :- Abhinav > > All an event would > tell you is that the transition occurred some time before the > event was > generated, which is some time before when the event was > delivered to you. > > Abhinav> Specifically: > > Abhinav> VAPI_PATH_MIG_ARMED would make my day. I believe > that > Abhinav> VAPI_QP_PATH_MIGRATED is similar to > Abhinav> IB_EVENT_PATH_MIG. Please correct me if i am > wrong. > > I see... VAPI_PATH_MIG_ARMED is a new event that was added > only in > VAPI 4.1.0, which was why I didn't know about it. Only > Mellanox HCAs > support it, it is not specified by the InfiniBand > architecture, and I > don't really see the point of it (as I tried to explain > above). > > - R. > > > > -- > Abhinav Vishnu > Graduate Student > Computer Science and Engineering > The Ohio State University > > ______________________________________________________________________ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Sat Apr 28 10:53:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 28 Apr 2007 20:53:12 +0300 Subject: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? In-Reply-To: <46325B38.9010209@hp.com> References: <46315032.9060903@hp.com> <20070427042328.GK15540@mellanox.co.il> <463250BA.9060606@hp.com> <46325B38.9010209@hp.com> Message-ID: <20070428175312.GA13106@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: [ofa-general] Re: why is CPU util/service demand so much higher withSDP than TCP? 
> > Scott Weitzenkamp (sweitzen) wrote: > ># modinfo ib_sdp > >filename: > >/lib/modules/2.6.16.21-0.8-smp/updates/kernel/drivers/infiniband > >/ulp/sdp/ib_sdp.ko > >author: Michael S. Tsirkin > >description: InfiniBand SDP module > >license: Dual BSD/GPL > >vermagic: 2.6.16.21-0.8-smp SMP gcc-4.1 > >depends: ib_core,rdma_cm > >srcversion: 91793E4825DEBC7A2DA9366 > >parm: top_mem_usage:Top system wide sdp memory usage for recv > >(in MB). > > (int) > >parm: rcvbuf_scale:Receive buffer size scale factor. (int) > >parm: send_poll_thresh:Send message size thresh hold over > >which to sta > >rt polling. (int) > >parm: recv_poll:How many times to poll recv. (int) > >parm: send_poll:How many times to poll send. (int) > >parm: recv_poll_miss:How many times recv poll missed. (int) > >parm: recv_poll_hit:How many times recv poll helped. (int) > >parm: send_poll_miss:How many times send poll missed. (int) > >parm: send_poll_hit:How many times send poll helped. (int) > >parm: data_debug_level:Enable data path debug tracing if > 0. > >(int) > >parm: debug_level:Enable debug tracing if > 0. (int) > > > > I've learned a new command today :) And via other channels how to actually > see the current values of those things under /sys/modules/ib_sdp/parameters. > > I think I may have also figured-out why my TCP_RR tests over IPoIB have > been consistently unable to hit confidence intervals for CPU utilization. > I am guessing that for IPoIB that path is effectively "recv_poll=0 and/or > send_poll=0" yes? Right. > When I set those values for ib_sdp and run the SDP_RR > test netperf is unable to be confident that the CPU util it measures is any > closer than +/- 20% of the "real" CPU utilization... (still the OFED bits > in RHEL5 rather than 1.2) With the defaults, the netperf SDP_RR test hit > the confidence intervals. rick jones -- MST From mst at dev.mellanox.co.il Sat Apr 28 10:54:55 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sat, 28 Apr 2007 20:54:55 +0300 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <46325C7F.7040901@hp.com> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> Message-ID: <20070428175455.GB13106@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: [ofa-general] Re: initial set of "direct" SDP tests in netperf > > Scott Weitzenkamp (sweitzen) wrote: > >>>Please note that you should *only* ever stick the SDP family value > >>>in the socket(3) call. All addresses for connect, bind etc > >>>are AF_INET, since SDP uses IP addresses for everything. > >> > >>Sounds like something trying to be just a little bit pregnant. > >> > >>Thankfully, I'm only munging the getaddrinfo() data for the > >>local endpoint. > > > > > >See bug https://bugs.openfabrics.org//show_bug.cgi?id=294, I agree > >connect() and bind() should allow AF_INET_SDP. > > I was poking around - it would be nice if they could take AF_INET_SDP - I > have to wonder if IPPROTO_SDP is actually better, but seeing there has been > some discussion there (but not having read all of it) I'm just going to go > with the flow... Basically everyone said "it does not matter". Do you think IPPROTO_SDP is better? > >About the "direct" SDP tests, instead of copy/pasting the TCP code, how > >about if you just had a command-line argument that specified SDP, like > >you do with netserver -6 to specify IPv6 instead of IPv4? > > Well, that could then require I start adding some backflips in "common" > code such as where I call getaddrinfo(). Besides, I've already finished > the first set of cut and paste :) > > >Speaking of IPv6, does netperf work with IPv6 on Linux? > > Yes, although "Linux" seems to have some issue with link-scope addresses. -- MST From mst at dev.mellanox.co.il Sat Apr 28 10:55:59 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Sat, 28 Apr 2007 20:55:59 +0300 Subject: [ofa-general] Re: What's in infiniband.git for 2.6.22 In-Reply-To: References: <20070427153016.GB1709@mellanox.co.il> Message-ID: <20070428175559.GC13106@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: What's in infiniband.git for 2.6.22 > > > What about the mthca patch to use separate HW queues for kernel RC/UD/userspace RC? > > right, I'll queue that up too. > BTW is there something analogous we could do for mlx4, or is FW not > quite ready? I am assured the issue this is solving is not present in mlx4. -- MST From swise at opengridcomputing.com Sat Apr 28 13:16:26 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 28 Apr 2007 15:16:26 -0500 Subject: [ofa-general] OpenMPI and RDMA-CM Message-ID: <1177791386.4615.8.camel@stevo-laptop> Is anyone working on adding RDMA-CM support to OpenMPI? Thanks, Steve. From jsquyres at cisco.com Sat Apr 28 13:20:20 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Sat, 28 Apr 2007 16:20:20 -0400 Subject: [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1177791386.4615.8.camel@stevo-laptop> References: <1177791386.4615.8.camel@stevo-laptop> Message-ID: <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> You'd probably be better off asking this question on the Open MPI mailing lists, not here. :-) FWIW, yes, adding RDMA CM support has actually been on my to-do list for a while, but it keeps getting bumped by higher priority items. It would be *much* better if some iWARP companies got involved in Open MPI... On Apr 28, 2007, at 4:16 PM, Steve Wise wrote: > Is anyone working on adding RDMA-CM support to OpenMPI? > > Thanks, > > Steve.
-- Jeff Squyres Cisco Systems From vlad at dev.mellanox.co.il Sat Apr 28 23:54:44 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 29 Apr 2007 09:54:44 +0300 Subject: [ofa-general] Re: error installing ofed_1.2-rc2 on RHEL5 In-Reply-To: <46312ACF.5010403@redhat.com> References: <20070419125722.GD918@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com> <20070426152613.GA15540@mellanox.co.il> <029901c7884c$9c14cf50$1914a8c0@surioffice> <46312ACF.5010403@redhat.com> Message-ID: <1177829684.6947.4.camel@vladsk-laptop> You need to install glibc-devel-32bit. Starting with OFED-1.2-20070427-0600 you will get a warning that glibc-devel-32bit is missing, and compilation of 32-bit libraries will be skipped. Regards, Vladimir On Thu, 2007-04-26 at 18:42 -0400, Doug Ledford wrote: > Suresh Shelvapille wrote: > > After some digging around on the net I figured out that the "error" actually meant > > no "c" program would compile when supplied with the same arguments as found on > > the error line in "config.log". > > > > In my case it happens to be: > > gcc -m32 -g -O2 -L/usr/lib -I../libibverbs/include -L. conftest.c >&5 > > > > the -m32 instead of -m64 seems to be the culprit! > > > > Looking at the configure file and running /usr/bin/file on the xxx.o seems to yield > > 64_bit though.... > > > > Any ideas.....
> > This means you haven't installed the 32 bit version of gcc, yet > ./configure is attempting to use it. Install the 32 bit gcc (and other > associated rpms necessary to support 32 bit devel on your 64 bit > machine) and it should work fine then. > > > Many thanks in advance, > > Suri > > > >> -----Original Message----- > >> From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] > >> Sent: Thursday, April 26, 2007 11:41 AM > >> To: 'general at lists.openfabrics.org' > >> Cc: 'Doug Ledford' > >> Subject: error installing ofed_1.2-rc2 on RHEL5 > >> > >> Folks: > >> > >> I just upgraded my system to RHEL5 and tried to install ofed_1.2-rc2.tgz > >> (dated 18-April) and am getting errors. I picked the basic install+defaults for > >> All selections. > >> > >> uname -a prints: > >> Linux ib-interop1host 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 > >> x86_64 x86_64 GNU/Linux > >> > >> > >> Here is a partial output from the log file: > >> ----------------------------------------------------------- > >> cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs > >> Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes > >> ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes > >> ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache- > >> file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir > >> /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include" > >> configure: creating cache /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache > >> checking for a BSD-compatible install... /usr/bin/install -c > >> checking whether build environment is sane... yes > >> checking for gawk... gawk > >> checking whether make sets $(MAKE)... yes > >> checking build system type... x86_64-unknown-linux-gnu > >> checking host system type... 
x86_64-unknown-linux-gnu > >> checking for style of include used by make... GNU > >> checking for gcc... gcc > >> checking for C compiler default output file name... configure: error: C compiler cannot create > >> executables > >> See `config.log' for more details. > >> Failed to execute: cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs && env > >> ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes > >> ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes > >> ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache- > >> file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir > >> /usr/lib --mandir=/usr/share/man --sysconfdir=/etc CPPFLAGS="-I../libibverbs/include" > >> error: Bad exit status from /var/tmp/rpm-tmp.58894 (%install) > >> > >> > >> RPM build errors: > >> user vlad does not exist - using root > >> group vlad does not exist - using root > >> user vlad does not exist - using root > >> group vlad does not exist - using root > >> Bad exit status from /var/tmp/rpm-tmp.58894 (%install) > >> ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix > >> /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-libcxgb3 --with-libibcm > >> --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm --with-mstflint --with- > >> perftest --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-libcxgb3 -- > >> with-libibcm --with-libibverbs --with-libipathverbs --with-libmthca --with-librdmacm -- > >> sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' > >> /root/ofed_1.2/OFED-1.2-rc2/SRPMS/ofa_user-1.2-rc2.src.rpm" > >> ------------------------------------------------------------- > >> > >> > >> Is the "unknown-linux-gnu" against host linux the 
problem? > >> > >> Many thanks, > >> Suri > > > > From mst at dev.mellanox.co.il Sun Apr 29 00:51:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 29 Apr 2007 10:51:12 +0300 Subject: [ofa-general] Re: What's in infiniband.git for 2.6.22 In-Reply-To: References: <20070427153016.GB1709@mellanox.co.il> Message-ID: <20070429075112.GA7869@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: What's in infiniband.git for 2.6.22 > > > What about the mthca patch to use separate HW queues for kernel RC/UD/userspace RC? > > right, I'll queue that up too. I think you want to queue the following obvious bugfix up as well: http://www.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=blob;f=kernel_patches/fixes/ipoib_crash_on_error.patch;hb=HEAD -- MST From mst at dev.mellanox.co.il Sun Apr 29 02:00:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 29 Apr 2007 12:00:02 +0300 Subject: [ofa-general] Re: failure to create an FMR mapping 1K pages on memfree In-Reply-To: References: <15ddcffd0702261104x6df977b6g9e4ca0071c8489ad@mail.gmail.com> <15ddcffd0702261105s377ad165h7bfe258f69ede152@mail.gmail.com> Message-ID: <20070429090002.GF7869@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: failure to create an FMR mapping 1K pages on memfree > > > I have got a report of a failure to create an FMR mapping 1K pages (that is > > 4MB) on memfree. > > > > I don't have the exact details (i.e. whether Arbel/Sinai / what FW / etc) > > nor which exact check fails in > > mthca_fmr_alloc, but what's clear is that the latter function returns > > -ENOMEM when attr.max_pages is 1024 and it works fine when > > attr.max_pages is 256. > > > > Is this failure clear to you? If yes, is a HW or FW limit being > > hit, or is it a driver design issue? > > Is it really returning -ENOMEM? It seems much more likely that you > are hitting the code > > /* For Arbel, all MTTs must fit in the same page.
*/ > if (mthca_is_memfree(dev) && > mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE) > return -EINVAL; > > I guess you could call this limit a driver design issue. Actually, I see this in fmr_pool.c: fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), GFP_KERNEL); Therefore, for max_pages_per_fmr = 1K, this attempts to allocate more than 8K (1024 * 8 bytes plus the header) of physically contiguous memory, which could explain the failure. One way to fix this would be to use vmalloc to allocate this buffer. Opinions? -- MST From mst at dev.mellanox.co.il Sun Apr 29 02:02:44 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 29 Apr 2007 12:02:44 +0300 Subject: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts In-Reply-To: <20070411062036.E2E8EE6081C@openfabrics.org> References: <20070411062036.E2E8EE6081C@openfabrics.org> Message-ID: <20070429090244.GG7869@mellanox.co.il> The IB spec requires that, after a request for notification, we drain the CQ of completions that might have arrived there before or during the request for notification. But the number of these is limited by CQ size, so we can, and should, avoid polling indefinitely (and starving other CQs). Signed-off-by: Michael S. Tsirkin --- I think this trick I just came up with is a simple way to prevent IPoIB TX from hogging interrupts, even without NAPI. And it might be a better way to solve the problem for IPoIB CM TX than using a common cq as my previous patch did. This seems to hurt top bandwidth a bit in my testing, so this needs some more work. Meanwhile, Scott, could you please check whether the following patch helps in your test-case? Roland, I think something similar is a good idea for SRP, too. What do you think?
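[Editor's sketch, not part of the original message.] The bounded-drain idea described above can be tried out in user space. The following is a minimal simulation under stated assumptions: `sim_cq`, `sim_poll_cq`, `drain_bounded`, and `NUM_WC` are hypothetical stand-ins invented for illustration and are not kernel or verbs APIs. Only the shape of the loop (poll in batches, count what was handled, stop once the count reaches a budget) follows the patch in this message.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WC 4 /* batch size per poll; stands in for IPOIB_NUM_WC (assumed value) */

/* Hypothetical completion queue: only tracks how many completions are pending. */
struct sim_cq {
	int pending;
};

/* Take up to max completions off the queue and return how many were taken. */
static int sim_poll_cq(struct sim_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/*
 * Drain the queue in batches of NUM_WC. Stop when a poll returns a short
 * batch (queue empty) or when at least `budget` completions have been
 * handled. Returns the number of completions handled.
 */
static int drain_bounded(struct sim_cq *cq, int budget)
{
	int n, cnt = 0;

	assert(cq != NULL);
	do {
		n = sim_poll_cq(cq, NUM_WC);
		cnt += n;
		/* a real handler would process the n completions here */
	} while (n == NUM_WC && cnt < budget);

	return cnt;
}
```

With 100 pending completions and a budget of 10, `drain_bounded` handles 12 and leaves 88, so the bound can be overshot by up to one batch minus one entry. With 6 pending and the same budget it handles all 6 and stops on the short batch. Whether it is safe to leave completions behind depends on the argument made in the message: anything still queued after a CQ's worth of polling should have arrived after the notification request and so should raise a new event.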
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 2b242a4..3ed1536 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -573,14 +573,15 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx
 static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 {
 	struct ipoib_cm_tx *tx = tx_ptr;
-	int n, i;
+	int n, i, cnt = 0;

 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	do {
 		n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
+		cnt += n;
 		for (i = 0; i < n; ++i)
 			ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
-	} while (n == IPOIB_NUM_WC);
+	} while (n == IPOIB_NUM_WC && cnt < ipoib_sendq_size);
 }

 int ipoib_cm_dev_open(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index ba0ee5c..3701cd7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -294,14 +294,15 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
 {
 	struct net_device *dev = (struct net_device *) dev_ptr;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int n, i;
+	int n, i, cnt = 0;

 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	do {
 		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
+		cnt += n;
 		for (i = 0; i < n; ++i)
 			ipoib_ib_handle_wc(dev, priv->ibwc + i);
-	} while (n == IPOIB_NUM_WC);
+	} while (n == IPOIB_NUM_WC && cnt < 2 * ipoib_recvq_size + ipoib_sendq_size);
 }

 static inline int post_send(struct ipoib_dev_priv *priv,

--
MST

From vlad at lists.openfabrics.org Sun Apr 29 02:37:52 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 29 Apr 2007 02:37:52 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070429-0200 daily build status
Message-ID: <20070429093752.D6A1CE60832@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on
x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:

From dotanb at dev.mellanox.co.il Sun Apr 29 07:34:41 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 29 Apr 2007 17:34:41 +0300
Subject: [ofa-general] OFED 1.2 RC2 <-> WinIB 1.3
In-Reply-To: <4631D27B.10301@holografika.com>
References: <4631D27B.10301@holografika.com>
Message-ID: <4634AD01.5010909@dev.mellanox.co.il>

Hi Peter.

Kovacs Peter Tamas wrote:
> Dear all,
>
> I've tried to do some speed tests between a Linux and a Windows box
> using InfiniBand.
> I've installed OFED 1.2 RC2 to a Fedora Core 6 x64 box, and connected
> it to a Windows XP x64 box with WinIB 1.3, both machines having a
> Mellanox MHES-14XTC.

As far as I know, the performance tests in Windows and in OFED cannot
work together (even if they have the same name).

thanks
Dotan

From loic at myri.com Sun Apr 29 12:40:15 2007
From: loic at myri.com (Loic Prylli)
Date: Sun, 29 Apr 2007 15:40:15 -0400
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <20070428025117.a3b1200a.billfink@mindspring.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com>
Message-ID: <4634F49F.9030408@myri.com>

On 4/28/2007 2:51 AM, Bill Fink wrote:
> On Fri, 27 Apr 2007, Rick Jones wrote:
>
>
>> Bryan Lawver wrote:
>>
>>> I had so much debugging turned on that it was not the "slowing of the
>>> traffic" but the "non-coalescing" that was the remedy.
The NIC is a
>>> MyriCom NIC and these are easy options to set.
>>>
>> As chance would have it, I've played with some Myricom myri10ge NICs recently,
>> and even disabled large receive offload during some netperf tests :)  It is a
>> modprobe option.  Going back now to the driver source and the README I see :-)
>>
>>
>> [..]
>>
>> rick jones
>>
>
> What version of the myri10ge driver is this?  With the 1.2.0 version
> that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module
> parameter.
>
>

The myri10ge_lro parameter does not exist in the kernel tree. The option
and the corresponding LRO code are available only in the externally
distributed version of myri10ge. That code was submitted to the netdev list,
but wasn't taken into the kernel tree because of the reasonable concern that
the driver might not be the right place for that code (if nobody else
proposes something equivalent in the meantime, we might at some point
resubmit it as a driver-independent add-on, but it might not be that soon
for manpower reasons).

Only the 1.2.0 version of the external driver makes LRO incompatible
with forwarding. The problem should be fixed in version 1.3.0, released a
few weeks ago (forwarding with myri10ge_lro enabled should then work);
let us know otherwise. Anyway, following David Miller's remark about
netfilter, for the next version we might ask the user to explicitly
enable LRO rather than making it the default.

Sorry for the inconvenience.

Loic

From vlad at lists.openfabrics.org Mon Apr 30 02:37:16 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 30 Apr 2007 02:37:16 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070430-0200 daily build status
Message-ID: <20070430093717.16A72E60832@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64
with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:

From jengelh at linux01.gwdg.de Mon Apr 30 04:32:54 2007
From: jengelh at linux01.gwdg.de (Jan Engelhardt)
Date: Mon, 30 Apr 2007 13:32:54 +0200 (MEST)
Subject: [ofa-general] [PATCH 10/36] Use menuconfig objects II - Infiniband
In-Reply-To:
References:
Message-ID:

Change Kconfig objects from "menu, config" into "menuconfig" so
that the user can disable the whole feature without having to
enter the menu first.

Signed-off-by: Jan Engelhardt

---
 drivers/infiniband/Kconfig             |   14 ++++++--------
 drivers/infiniband/hw/amso1100/Kconfig |    2 +-
 drivers/infiniband/hw/cxgb3/Kconfig    |    2 +-
 drivers/infiniband/hw/ehca/Kconfig     |    2 +-
 drivers/infiniband/hw/ipath/Kconfig    |    2 +-
 drivers/infiniband/hw/mlx4/Kconfig     |    1 -
 drivers/infiniband/hw/mthca/Kconfig    |    2 +-
 drivers/infiniband/ulp/ipoib/Kconfig   |    2 +-
 drivers/infiniband/ulp/iser/Kconfig    |    2 +-
 drivers/infiniband/ulp/srp/Kconfig     |    2 +-
 10 files changed, 14 insertions(+), 17 deletions(-)

--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/Kconfig
@@ -1,16 +1,15 @@
-menu "InfiniBand support"
-
-config INFINIBAND
-	depends on PCI || BROKEN
+menuconfig INFINIBAND
 	tristate "InfiniBand support"
+	depends on PCI || BROKEN
 	---help---
 	  Core support for InfiniBand (IB).  Make sure to also select
 	  any protocols you wish to use as well as drivers for your
 	  InfiniBand hardware.

+if INFINIBAND
+
 config INFINIBAND_USER_MAD
 	tristate "InfiniBand userspace MAD support"
-	depends on INFINIBAND
 	---help---
 	  Userspace InfiniBand Management Datagram (MAD) support.  This
 	  is the kernel side of the userspace MAD support, which allows
@@ -19,7 +18,6 @@ config INFINIBAND_USER_MAD

 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	depends on INFINIBAND
 	---help---
 	  Userspace InfiniBand access support.  This enables the
 	  kernel side of userspace verbs and the userspace
@@ -36,7 +34,7 @@ config INFINIBAND_USER_MEM

 config INFINIBAND_ADDR_TRANS
 	bool
-	depends on INFINIBAND && INET
+	depends on INET
 	default y

 source "drivers/infiniband/hw/mthca/Kconfig"
@@ -53,4 +51,4 @@ source "drivers/infiniband/ulp/srp/Kconf

 source "drivers/infiniband/ulp/iser/Kconfig"

-endmenu
+endif # INFINIBAND
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/amso1100/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/amso1100/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_AMSO1100
 	tristate "Ammasso 1100 HCA support"
-	depends on PCI && INET && INFINIBAND
+	depends on PCI && INET
 	---help---
 	  This is a low-level driver for the Ammasso 1100 host
 	  channel adapter (HCA).
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/cxgb3/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/cxgb3/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_CXGB3
 	tristate "Chelsio RDMA Driver"
-	depends on CHELSIO_T3 && INFINIBAND && INET
+	depends on CHELSIO_T3 && INET
 	select GENERIC_ALLOCATOR
 	---help---
 	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/ehca/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/ehca/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_EHCA
 	tristate "eHCA support"
-	depends on IBMEBUS && INFINIBAND
+	depends on IBMEBUS
 	---help---
 	This driver supports the IBM pSeries eHCA InfiniBand adapter.

--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/ipath/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/ipath/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_IPATH
 	tristate "QLogic InfiniPath Driver"
-	depends on (PCI_MSI || HT_IRQ) && 64BIT && INFINIBAND && NET
+	depends on (PCI_MSI || HT_IRQ) && 64BIT && NET
 	---help---
 	This is a driver for QLogic InfiniPath host channel adapters,
 	including InfiniBand verbs support.  This driver allows these
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/mlx4/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/mlx4/Kconfig
@@ -1,6 +1,5 @@
 config MLX4_INFINIBAND
 	tristate "Mellanox ConnectX HCA support"
-	depends on INFINIBAND
 	select MLX4_CORE
 	---help---
 	  This driver provides low-level InfiniBand support for
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/hw/mthca/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/hw/mthca/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_MTHCA
 	tristate "Mellanox HCA support"
-	depends on PCI && INFINIBAND
+	depends on PCI
 	---help---
 	  This is a low-level driver for Mellanox InfiniHost host
 	  channel adapters (HCAs), including the MT23108 PCI-X HCA
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/ulp/ipoib/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/ulp/ipoib/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_IPOIB
 	tristate "IP-over-InfiniBand"
-	depends on INFINIBAND && NETDEVICES && INET && (IPV6 || IPV6=n)
+	depends on NETDEVICES && INET && (IPV6 || IPV6=n)
 	---help---
 	  Support for the IP-over-InfiniBand protocol (IPoIB). This
 	  transports IP packets over InfiniBand so you can use your IB
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/ulp/iser/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/ulp/iser/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_ISER
 	tristate "iSCSI Extensions for RDMA (iSER)"
-	depends on INFINIBAND && SCSI && INET
+	depends on SCSI && INET
 	select SCSI_ISCSI_ATTRS
 	---help---
 	  Support for the iSCSI Extensions for RDMA (iSER) Protocol
--- linux-2.6.21-mm_20070428.orig/drivers/infiniband/ulp/srp/Kconfig
+++ linux-2.6.21-mm_20070428/drivers/infiniband/ulp/srp/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRP
 	tristate "InfiniBand SCSI RDMA Protocol"
-	depends on INFINIBAND && SCSI
+	depends on SCSI
 	---help---
 	  Support for the SCSI RDMA Protocol over InfiniBand.  This
 	  allows you to access storage devices that speak SRP over

From kaf at sgi.com Mon Apr 30 07:49:38 2007
From: kaf at sgi.com (Karl Feind)
Date: Mon, 30 Apr 2007 09:49:38 -0500
Subject: [ofa-general] Re: on the coexistance of uDAPLs
In-Reply-To:
References: <4623D067.1030005@ichips.intel.com>
Message-ID: <20070430144938.GA4212@sgi.com>

I don't think I understand what the "signature differences" are.
My feeling is that libdat and the DAPL infrastructure should be
as independent of the individual DAPL layers as possible.
If some DAPL layers with protocol version 1.2 are installed on the same
system as other DAPL layers with protocol 2.0, the best case would be
for the DAPL infrastructure (libdat and /etc/dat.conf) to allow this.

Karl Feind
SGI

On Mon, Apr 23, 2007 at 01:59:07PM -0400, Kanevsky, Arkady wrote:
> There are some signature differences between versions.
> Since redirection exposes signatures it is not trivial
> for a single redirection to support different signatures.
> Is this really needed?
> Thanks,
>
> Arkady Kanevsky             email: arkady at netapp.com
> Network Appliance Inc.      phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
> Waltham, MA 02451           central phone: 781-768-5300
>
>
> > -----Original Message-----
> > From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> > Sent: Monday, April 16, 2007 3:37 PM
> > To: Karl Feind
> > Cc: Brian Forbes; Edward Mascarenhas; Jeff Hanson;
> > general at lists.openfabrics.org
> > Subject: [ofa-general] Re: on the coexistance of uDAPLs
> >
> > Karl Feind wrote:
> >
> > >>comments? other suggestions?
> > >>
> > >>-arlin
> > >>
> > >>
> > >
> > >I'd really like to see a separate RPM (called something like
> > >dapl-infra) that installs:
> > >
> > >	1) /etc/dat.conf (empty)
> > >	2) a script that adds a provider to /etc/dat.conf
> > >	3) a script that removes a provider from /etc/dat.conf
> > >	4) libdat.so
> > >
> > >Any DAPL layer depends on this RPM, and invokes the scripts
> > (2) and (3)
> > >in the preinstall and postuninstall step.
> > >
> > >This decouples the DAPL infrastructure from the DAPL instantiations.
> > >
> > >Just an idea.
> > >
> > >
> >
> > Do you see the need for different versions to co-exist (1.1,
> > 1.2, 2.0)?
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >

From Arkady.Kanevsky at netapp.com Mon Apr 30 08:30:50 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Mon, 30 Apr 2007 11:30:50 -0400
Subject: [ofa-general] Re: on the coexistance of uDAPLs
In-Reply-To: <20070430144938.GA4212@sgi.com>
References: <4623D067.1030005@ichips.intel.com> <20070430144938.GA4212@sgi.com>
Message-ID:

Karl,
The signature differences I mentioned are API changes
for some functions, like dapl_lmr_create and dapl_rmr_bind.
Thanks,

Arkady Kanevsky             email: arkady at netapp.com
Network Appliance Inc.      phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451           central phone: 781-768-5300

> -----Original Message-----
> From: Karl Feind [mailto:kaf at sgi.com]
> Sent: Monday, April 30, 2007 10:50 AM
> To: Kanevsky, Arkady
> Cc: Arlin Davis; Karl Feind; Brian Forbes; Edward
> Mascarenhas; Jeff Hanson; general at lists.openfabrics.org
> Subject: Re: [ofa-general] Re: on the coexistance of uDAPLs
>
>
> I don't think I understand what the "signature differences" are.
> My feeling is that libdat and the DAPL infrastructure should
> be as independent of the individual DAPL layers as possible.
> If some DAPL layers with protocol version 1.2 are installed
> on the same system as other DAPL layers with protocol 2.0,
> the best case would be for the DAPL infrastructure (libdat
> and /etc/dat.conf) to allow this.
>
> Karl Feind
> SGI
>
> On Mon, Apr 23, 2007 at 01:59:07PM -0400, Kanevsky, Arkady wrote:
> > There are some signature differences between versions.
> > Since redirection exposes signatures it is not trivial for a single
> > redirection to support different signatures.
> > Is this really needed?
> > Thanks,
> >
> > Arkady Kanevsky             email: arkady at netapp.com
> > Network Appliance Inc.      phone: 781-768-5395
> > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
> > Waltham, MA 02451           central phone: 781-768-5300
> >
> >
> > > -----Original Message-----
> > > From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> > > Sent: Monday, April 16, 2007 3:37 PM
> > > To: Karl Feind
> > > Cc: Brian Forbes; Edward Mascarenhas; Jeff Hanson;
> > > general at lists.openfabrics.org
> > > Subject: [ofa-general] Re: on the coexistance of uDAPLs
> > >
> > > Karl Feind wrote:
> > >
> > > >>comments? other suggestions?
> > > >>
> > > >>-arlin
> > > >>
> > > >>
> > > >
> > > >I'd really like to see a separate RPM (called something like
> > > >dapl-infra) that installs:
> > > >
> > > >	1) /etc/dat.conf (empty)
> > > >	2) a script that adds a provider to /etc/dat.conf
> > > >	3) a script that removes a provider from /etc/dat.conf
> > > >	4) libdat.so
> > > >
> > > >Any DAPL layer depends on this RPM, and invokes the scripts
> > > (2) and (3)
> > > >in the preinstall and postuninstall step.
> > > >
> > > >This decouples the DAPL infrastructure from the DAPL
> instantiations.
> > > >
> > > >Just an idea.
> > > >
> > > >
> > >
> > > Do you see the need for different versions to co-exist (1.1, 1.2,
> > > 2.0)?
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >
> > > To unsubscribe, please visit
> > > http://openib.org/mailman/listinfo/openib-general
> > >
>

From rdreier at cisco.com Mon Apr 30 09:29:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Apr 2007 09:29:56 -0700
Subject: [ofa-general] Re: failure to create an FMR mapping 1K pages on memfree
In-Reply-To: <20070429090002.GF7869@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 29 Apr 2007 12:00:02 +0300")
References: <15ddcffd0702261104x6df977b6g9e4ca0071c8489ad@mail.gmail.com> <15ddcffd0702261105s377ad165h7bfe258f69ede152@mail.gmail.com> <20070429090002.GF7869@mellanox.co.il>
Message-ID:

> > Is it really returning -ENOMEM?  It seems much more likely that you
> > are hitting the code
> >
> > 	/* For Arbel, all MTTs must fit in the same page. */
> > 	if (mthca_is_memfree(dev) &&
> > 	    mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE)
> > 		return -EINVAL;
> >
> > I guess you could call this limit a driver design issue.
>
> Actually, I see this in fmr_pool.c:
>
> 	fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64),
> 		      GFP_KERNEL);
>
> Therefore, for max_pages_per_fmr = 1K, this attempts to allocate 8K
> of physically contiguous memory, which could explain the failure.
>
> One way to fix this would be to use vmalloc to allocate this buffer.
> Opinions?

I don't think an order 1 allocation will fail under normal conditions.
Larger allocations might fail, but I don't think vmalloc is the right
solution... maybe just disable the caching in fmr_pool for larger
FMRs?

Anyway the issue here is definitely that mthca does not handle finding
more than one page in the MTT table for memfree HCAs...

From rdreier at cisco.com Mon Apr 30 09:30:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Apr 2007 09:30:16 -0700
Subject: [ofa-general] Re: What's in infiniband.git for 2.6.22
In-Reply-To: <20070429075112.GA7869@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 29 Apr 2007 10:51:12 +0300")
References: <20070427153016.GB1709@mellanox.co.il> <20070429075112.GA7869@mellanox.co.il>
Message-ID:

> I think you want to queue the following obvious bugfix up as well:
> http://www.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=blob;f=kernel_patches/fixes/ipoib_crash_on_error.patch;hb=HEAD

Yes, I'll get that one too.

From rdreier at cisco.com Mon Apr 30 09:34:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Apr 2007 09:34:18 -0700
Subject: [ofa-general] Re: [Bug 508] IPoIB CM multicast is hogging interrupts
In-Reply-To: <20070429090244.GG7869@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 29 Apr 2007 12:02:44 +0300")
References: <20070411062036.E2E8EE6081C@openfabrics.org> <20070429090244.GG7869@mellanox.co.il>
Message-ID:

> This seems to hurt top bandwidth a bit in my testing, so this needs some more
> work. Meanwhile, Scott, could you please check whether the following
> patch helps in your test-case?

The common CQ helped bandwidth, right?  What's the advantage of this approach?

> Roland, I think something similar is a good idea for SRP, too.
> What do you think?

Yes, it can't hurt really.

And after we merge NAPI, then I guess I might as well use the missed
event stuff to avoid extra unnecessary interrupts too.

From rick.jones2 at hp.com Mon Apr 30 09:49:25 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 30 Apr 2007 09:49:25 -0700
Subject: [ofa-general] OFED 1.2 RC2 <-> WinIB 1.3
In-Reply-To: <4634AD01.5010909@dev.mellanox.co.il>
References: <4631D27B.10301@holografika.com> <4634AD01.5010909@dev.mellanox.co.il>
Message-ID: <46361E15.1050006@hp.com>

Dotan Barak wrote:
> Hi Peter.
>
> Kovacs Peter Tamas wrote:
>
>> Dear all,
>>
>> I've tried to do some speed tests between a Linux and a Windows box
>> using InfiniBand.
>> I've installed OFED 1.2 RC2 to a Fedora Core 6 x64 box, and connected
>> it to a Windows XP x64 box with WinIB 1.3, both machines having a
>> Mellanox MHES-14XTC.
>
>
> As far as I know, the performance tests in Windows and in OFED cannot
> work together (even if they have the same name).

I wonder if the SDP_mumble tests in netperf top of trunk would work?

rick jones

From xma at us.ibm.com Mon Apr 30 09:49:24 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 30 Apr 2007 09:49:24 -0700
Subject: [ofa-general] Re: [Bug 508] IPoIB CM multicast is hogging interrupts
In-Reply-To:
Message-ID:

> > Roland, I think something similar is a good idea for SRP, too.
> > What do you think?
>
> Yes, it can't hurt really.
>
> And after we merge NAPI, then I guess I might as well use the missed
> event stuff to avoid extra unnecessary interrupts too.

It would be nice to have a common API and documentation on using the
missed event stuff for any ULPs.

Thanks
Shirley Ma
IBM Linux Technology Center

From mst at dev.mellanox.co.il Mon Apr 30 09:50:07 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Apr 2007 19:50:07 +0300
Subject: [ofa-general] Re: failure to create an FMR mapping 1K pages on memfree
In-Reply-To:
References: <15ddcffd0702261104x6df977b6g9e4ca0071c8489ad@mail.gmail.com> <15ddcffd0702261105s377ad165h7bfe258f69ede152@mail.gmail.com> <20070429090002.GF7869@mellanox.co.il>
Message-ID: <20070430165007.GQ2509@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: failure to create an FMR mapping 1K pages on memfree
>
> > > Is it really returning -ENOMEM?  It seems much more likely that you
> > > are hitting the code
> > >
> > > 	/* For Arbel, all MTTs must fit in the same page. */
> > > 	if (mthca_is_memfree(dev) &&
> > > 	    mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE)
> > > 		return -EINVAL;
> > >
> > > I guess you could call this limit a driver design issue.
> >
> > Actually, I see this in fmr_pool.c:
> >
> > 	fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64),
> > 		      GFP_KERNEL);
> >
> > Therefore, for max_pages_per_fmr = 1K, this attempts to allocate 8K
> > of physically contiguous memory, which could explain the failure.
> >
> > One way to fix this would be to use vmalloc to allocate this buffer.
> > Opinions?
>
> I don't think an order 1 allocation will fail under normal conditions.
> Larger allocations might fail, but I don't think vmalloc is the right
> solution... maybe just disable the caching in fmr_pool for larger
> FMRs?

Or just avoid FMR pool altogether.

> Anyway the issue here is definitely that mthca does not handle finding
> more than one page in the MTT table for memfree HCAs...

True. On 64 bit systems, we could relatively easily solve this using vmap.
On 32 bit systems, it's a bigger change.

--
MST

From rdreier at cisco.com Mon Apr 30 09:50:41 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Apr 2007 09:50:41 -0700
Subject: [ofa-general] Re: [ewg] APM Example
In-Reply-To: <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com> (Abhinav Vishnu's message of "Sat, 28 Apr 2007 10:11:41 -0400")
References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com>
Message-ID:

> > You don't know the time that the transition occurred, except that it
> > is between when you called modify QP and when it returned.  But an
> > asynchronous event doesn't really help, does it?
> It does help. APM is not only defined for network fault tolerance, it can
> also be used for load-balancing.
With this event, one can know when > the path is loaded and it is safe to call modify_qp. I guess I don't really understand how you're using this event. What advantage is there in getting an async event at some unknown time (maybe before the modify QP operation returns, maybe after)? What does it let you do that you can't do with the verbs architecture as defined strictly by the verbs? (I never believe in strictly following specs when they don't make sense, but I also don't believe in adding non-standard features without a compelling reason) - R. From mst at dev.mellanox.co.il Mon Apr 30 10:00:46 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Apr 2007 20:00:46 +0300 Subject: [ofa-general] Re: [Bug 508] IPoIB CM multicast is hogging interrupts In-Reply-To: References: <20070411062036.E2E8EE6081C@openfabrics.org> <20070429090244.GG7869@mellanox.co.il> Message-ID: <20070430170046.GS2509@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [Bug 508] IPoIB CM multicast is hogging interrupts > > > This seems to hurt top bandwidth a bit in my testing, so this needs some more > > work. Meanwhile, Scott, could you please check whether the following > > patch helps in your test-case? > > The common CQ helped bandwidth, right? What's the advantage of this approach? Unfortunately, further testing demonstrated a problem with the common CQ approach: when an error is discovered on some TX QP, I can't just stop using that TX and immediately re-enable the interface as I currently do: I have to wait until I get all completions with error or until the QP is reset to get rid of outstanding completions, and this might (and does) take a long time since this all involves slow path operations. 
-- MST From rick.jones2 at hp.com Mon Apr 30 10:07:16 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 30 Apr 2007 10:07:16 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070428025117.a3b1200a.billfink@mindspring.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> Message-ID: <46362244.9030406@hp.com> > What version of the myri10ge driver is this? With the 1.2.0 version > that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module > parameter. > > [root at lang2 ~]# modinfo myri10ge | grep -i lro > [root at lang2 ~]# > > And I've been testing IP forwarding using two Myricom 10-GigE NICs > without setting any special modprobe parameters. Ethtool -i on the interface reports 1.2.0 as the driver version. From rick.jones2 at hp.com Mon Apr 30 10:15:19 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 30 Apr 2007 10:15:19 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <20070428175455.GB13106@mellanox.co.il> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> Message-ID: <46362427.8040905@hp.com> >>I was poking around - it would be nice if they could take AF_INET_SDP - I >>have to wonder if IPPROTO_SDP is actually better, but seeing there has been >>some discussion there (but not having read all of it) I'm just going to go >>with the flow... > > > Basically everyone said "it does not matter". 
> Do you think IPPROTO_SDP is better? To the extent that I have no idea what is really happening under the covers with SDP I would say yes. My understanding is that the only difference is that "SDP" is used rather than "TCP." That being the case, then I would think it would/should be like using say UDP vs TCP vs SCTP (ignoring the obvoius protocol differences). Each are "INET" sockets using "INET" addressing, the difference is the layer-four (transport) protocol being used, which is selected via IPPROTO_TCP vs IPPROTO_UDP vs IPPROTO_SCTP. And when/if IPv6 is supported, then there shouldn't (?) be any need to have an "extra" AF_INET6_SDP - one would use AF_INET and AF_INET6 with IPPROTO_SDP. Also, an application making use of getaddrinfo() (as all well-written apps are supposed to be these days :) wouldn't have to worry about name to IP resolution in the general case (where a protocol is not provided with the hints) when wanting to use SDP directly - it still calls getaddrinfo() with AF_INET, AF_INET6 or AF_UNSPEC as before, no need to worry that AF_INET_SDP is not groked by getaddrinfo(). rick jones from the SDP/IB peanut gallery From rick.jones2 at hp.com Mon Apr 30 10:16:37 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 30 Apr 2007 10:16:37 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070427.165232.74560572.davem@davemloft.net> References: <4632894D.40705@hp.com> <20070427.163934.106263903.davem@davemloft.net> <46328BB0.9030501@hp.com> <20070427.165232.74560572.davem@davemloft.net> Message-ID: <46362475.3080605@hp.com> David Miller wrote: > From: Rick Jones > Date: Fri, 27 Apr 2007 16:48:00 -0700 > > >>No problem - just to play whatif/devil's advocate for a bit >>though... is there any way to tie that in with the setting of >>net.ipv4.ip_forward (and/or its IPv6 counterpart)? > > > Even ignoring that, consider the potential issues this > kind of problem could be causing netfilter. 
OK, I'll show my ignorance and bite - what sort of issues with netfilter? Is it tied to link-local MTUs? rick jones From mst at dev.mellanox.co.il Mon Apr 30 10:28:46 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Apr 2007 20:28:46 +0300 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <46362427.8040905@hp.com> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> <46362427.8040905@hp.com> Message-ID: <20070430172846.GT2509@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: [ofa-general] Re: initial set of "direct" SDP tests in netperf > > >>I was poking around - it would be nice if they could take AF_INET_SDP - I > >>have to wonder if IPPROTO_SDP is actually better, but seeing there has > >>been some discussion there (but not having read all of it) I'm just going > >>to go with the flow... > > > > > >Basically everyone said "it does not matter". > >Do you think IPPROTO_SDP is better? > > To the extent that I have no idea what is really happening under the > covers with SDP I would say yes. > > My understanding is that the only difference is that "SDP" is used > rather than "TCP." That being the case, then I would think it > would/should be like using say UDP vs TCP vs SCTP (ignoring the obvoius > protocol differences). > > Each are "INET" sockets using "INET" addressing, the difference is the > layer-four (transport) protocol being used, which is selected via > IPPROTO_TCP vs IPPROTO_UDP vs IPPROTO_SCTP. > > And when/if IPv6 is supported, then there shouldn't (?) be any need to > have an "extra" AF_INET6_SDP - one would use AF_INET and AF_INET6 with > IPPROTO_SDP. 
> > Also, an application making use of getaddrinfo() (as all well-written > apps are supposed to be these days :) wouldn't have to worry about name > to IP resolution in the general case (where a protocol is not provided > with the hints) when wanting to use SDP directly - it still calls > getaddrinfo() with AF_INET, AF_INET6 or AF_UNSPEC as before, no need to > worry that AF_INET_SDP is not groked by getaddrinfo(). Surely too late for OFED 1.2, but we can reopen this afterwards. Are there disadvantages to using protocol instead of family? Would this change present a problem for people using e.g. getprotobyname to get the protocol number? -- MST From arlin.r.davis at intel.com Mon Apr 30 10:37:40 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 30 Apr 2007 10:37:40 -0700 Subject: [ofa-general] [PULL] ofed_1_2 branch for dapl Message-ID: <000001c78b4e$40b481f0$ff0da8c0@amr.corp.intel.com> Please pull into OFED 1.2: git://git.openfabrics.org/~ardavis/dapl.git ofed_1_2 1. update for dapltest manpage and README 2. fix for atomic build issue on ia64 RHEL5 Thanks, - arlin From pradeep at us.ibm.com Mon Apr 30 10:48:30 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 30 Apr 2007 10:48:30 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: Message-ID: Pradeep Satyanarayana/Beaverton/IBM wrote on 04/27/2007 05:51:14 PM: > Here is a third version of the IPOIB_CM_NOSRQ patch for review. This > patch will benefit adapters that do not support shared receive queues. > > This patch incorporates the following review comments from v2: > There should be no line wrap issues now > Code restructured to seperate the SRQ/non-SRQ in several places > > This patch has been tested with linux-2.6.21-rc5 and rc7 (derived > from Roland's for 2.6.22 git tree on 04/25/2007) > with Topspin and IBM HCAs on ppc64 machines. I have run netperf > between two IBM HCAs and as well > as between IBM and Topspin HCA. 
> > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while > performing interoperability tests > As discussed in this mailing list that may be a CM bug or have the > various HCA address it. Hence I would > like to seperate out that issue from this patch. At a future point > when the issue gets resolved I can provide > another patch to change the retry_count values back to 0 if need be. > > Note 2: "Modify Port" patch submitted by Joachim Fenkes is needed > for the ehca driver to work on the IBM HCAs. > Have not tested with this patch as yet. > I have now tested with the above mentioned "Modify Port" patch and it works as expected. Pradeep pradeep at us.ibm.com From rick.jones2 at hp.com Mon Apr 30 11:11:46 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 30 Apr 2007 11:11:46 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <20070430172846.GT2509@mellanox.co.il> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> <46362427.8040905@hp.com> <20070430172846.GT2509@mellanox.co.il> Message-ID: <46363162.20000@hp.com> > Surely too late for OFED 1.2, but we can reopen this afterwards. > > Are there disadvantages to using protocol instead of family? > Would this change present a problem for people using e.g. getprotobyname > to get the protocol number? True, at that level it would be trading one problem - AF_INET_SDP not in getaddrinfo()) into a different problem - IPPROTO_SDP not in getprotobyname(). However, I think that is the "better" problem - SDP is a "protocol" not an address family. (Again, based on what little I understand about SDP). Having said that, at this point, is the die not already cast? Probably worth asking for opinions in netdev. I'm sure there would be at least a couple available there. 
rick jones

From jwong at datallegro.com Mon Apr 30 11:17:17 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 14:17:17 -0400
Subject: [ofa-general] Trouble installing OFED1.2 with kernel 2.6.18-8.el5
Message-ID:

Hello,

I'm having trouble installing OFED1.2 on my CentOS 2.6.18-8.el5 linux machine.

I am running the install.sh script and getting the following errors from the log file.

-include include/linux/autoconf.h \
-include /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include/linux/autoconf.h \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing
-fno-common -Os -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe
-fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -mno-sse -mno-mmx
-mno-sse2 -mno-3dnow -fomit-frame-pointer -fasynchronous-unwind-tables
-fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/include
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/debug
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/net/cxgb3
-I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/net/rds
-DIPATH_IDSTR='"QLogic kernel.org driver"' -DIPATH_KERN_TYPE=0 -DMODULE
-D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipath_fs)"
-D"KBUILD_MODNAME=KBUILD_STR(ib_ipath)" -c -o
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.o
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'ipathfs_mknod':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:66: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'atomic_counters_read':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:121: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'atomic_node_info_read':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:141: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'atomic_port_info_read':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:180: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'flash_read':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:327: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c: In function 'flash_write':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.c:380: error: 'struct inode' has no member named 'i_private'
make[4]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath/ipath_fs.o] Error 1
make[3]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband] Error 2
make[1]: *** [_module_/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2] Error 2
make[1]: Leaving directory `/usr/src/redhat/SOURCES'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.16809 (%install)

Thanks in advance.
Jeff

Jeff Wong, Linux Software Engineer
(949) 680-3066 - Office
(949) 680-3001 - Fax
jwong at datallegro.com
www.datallegro.com
85Enterprise, 2nd Floor, Aliso Viejo, CA 92656

From jwong at datallegro.com Mon Apr 30 11:36:09 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 14:36:09 -0400
Subject: [ofa-general] RE: error installing ofed_1.2-rc2 on RHEL5
Message-ID:

Suri,

You can also modify the build_env.sh script. Look where build_32bit is set and at the end of the case statement, add

build_32bit=0

That will make it skip over the 32 bit version and only build the 64 bit version. This is what I did to get around this problem.

Jeff

From mst at dev.mellanox.co.il Mon Apr 30 11:36:41 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 30 Apr 2007 21:36:41 +0300 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <46363162.20000@hp.com> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> <46362427.8040905@hp.com> <20070430172846.GT2509@mellanox.co.il> <46363162.20000@hp.com> Message-ID: <20070430183641.GB13293@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: [ofa-general] Re: initial set of "direct" SDP tests in netperf > > >Surely too late for OFED 1.2, but we can reopen this afterwards. > > > >Are there disadvantages to using protocol instead of family? > >Would this change present a problem for people using e.g. getprotobyname > >to get the protocol number? > > True, at that level it would be trading one problem - AF_INET_SDP not in > getaddrinfo()) into a different problem - IPPROTO_SDP not in > getprotobyname(). > > However, I think that is the "better" problem - SDP is a "protocol" not > an address family. (Again, based on what little I understand about > SDP). I see why this makes sens for you, but in what sense is it a "better" problem? > Having said that, at this point, is the die not already cast? For OFED 1.2, it is. But hopefully it's not the last one. > Probably worth asking for opinions in netdev. I'm sure there would be > at least a couple available there. As I said, I did that already. 
-- MST From rick.jones2 at hp.com Mon Apr 30 11:41:26 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 30 Apr 2007 11:41:26 -0700 Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf In-Reply-To: <20070430183641.GB13293@mellanox.co.il> References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> <46362427.8040905@hp.com> <20070430172846.GT2509@mellanox.co.il> <46363162.20000@hp.com> <20070430183641.GB13293@mellanox.co.il> Message-ID: <46363856.7090102@hp.com> >>However, I think that is the "better" problem - SDP is a "protocol" not >>an address family. (Again, based on what little I understand about >>SDP). > > > I see why this makes sens for you, but in what sense is it a "better" problem? Because it isn't trying to describe a change in protocol as a change in addressing. It makes getting to SDP look like getting to any other transport-layer protocol - eg UDP, TCP, or SCTP. My intuitive guessing suggests that fewer folks use getprotobyname() than getaddrinfo(). rick jones From pradeep at us.ibm.com Mon Apr 30 11:45:17 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 30 Apr 2007 11:45:17 -0700 Subject: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts In-Reply-To: <20070429090244.GG7869@mellanox.co.il> Message-ID: > I think this trick I just came up with is a simpe way to prevent > IPoIB TX from hogging interrupts, even without NAPI. And it might be a better > way to solve the problem for IPoIB CM TX than using a common cq > as my previous patch did. > > This seems to hurt top bandwidth a bit in my testing, so this needs some more > work. Meanwhile, Scott, could you please check whether the following > patch helps in your test-case? > > Roland, I think something similiar is a good idea for SRP, too. > What do you think? 
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 2b242a4..3ed1536 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -573,14 +573,15 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx
>  static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
>  {
>      struct ipoib_cm_tx *tx = tx_ptr;
> -    int n, i;
> +    int n, i, cnt = 0;
>
>      ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>      do {
>          n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
> +        cnt += n;
>          for (i = 0; i < n; ++i)
>              ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
> -    } while (n == IPOIB_NUM_WC);
> +    } while (n == IPOIB_NUM_WC && cnt < ipoib_sendq_size);
>  }

This change might exit tx_completion sooner - how does that prevent hogging interrupts (without NAPI)? I am not clear about that.

When NAPI is merged, would this be equivalent of running out of budget and treated accordingly? Otherwise, one might continue to be in polling mode for a longer period and maybe starving other interfaces.

Pradeep
pradeep at us.ibm.com

From mst at dev.mellanox.co.il Mon Apr 30 11:47:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Apr 2007 21:47:54 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel 2.6.18-8.el5
In-Reply-To:
References:
Message-ID: <20070430184754.GE13293@mellanox.co.il>

> Quoting Jeffrey Wong :
> Subject: Trouble installing OFED1.2 with kernel 2.6.18-8.el5
>
> Hello,
>
> I'm having trouble installing OFED1.2 on my CentOS 2.6.18-8.el5 linux machine.
>
> I am running the install.sh script and getting the following errors from the
> log file.

Try turning off ipath - that's what's failing for you.
-- MST From vlad at mellanox.co.il Mon Apr 30 11:49:23 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 30 Apr 2007 21:49:23 +0300 Subject: [ofa-general] RE: [PULL] ofed_1_2 branch for dapl In-Reply-To: <000001c78b4e$40b481f0$ff0da8c0@amr.corp.intel.com> References: <000001c78b4e$40b481f0$ff0da8c0@amr.corp.intel.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9015FDFF7@mtlexch01.mtl.com> Done, Regards, Vladimir > -----Original Message----- > From: Arlin Davis [mailto:arlin.r.davis at intel.com] > Sent: Monday, April 30, 2007 8:38 PM > To: Vladimir Sokolovsky > Cc: openib-general at lists.openfabrics.org; 'James Lentini' > Subject: [PULL] ofed_1_2 branch for dapl > > > Please pull into OFED 1.2: > > git://git.openfabrics.org/~ardavis/dapl.git ofed_1_2 > > 1. update for dapltest manpage and README 2. fix for atomic > build issue on ia64 RHEL5 > > Thanks, > > - arlin > > > From mst at dev.mellanox.co.il Mon Apr 30 11:51:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Apr 2007 21:51:18 +0300 Subject: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts In-Reply-To: References: <20070429090244.GG7869@mellanox.co.il> Message-ID: <20070430185118.GF13293@mellanox.co.il> Quoting Pradeep Satyanarayana : Subject: Re: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts > I think this trick I just came up with is a simpe way to prevent > IPoIB TX from hogging interrupts, even without NAPI. And it might be a better > way to solve the problem for IPoIB CM TX than using a common cq > as my previous patch did. 
> >  static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
> >  {
> >      struct ipoib_cm_tx *tx = tx_ptr;
> > -    int n, i;
> > +    int n, i, cnt = 0;
> >
> >      ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> >      do {
> >          n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
> > +        cnt += n;
> >          for (i = 0; i < n; ++i)
> >              ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
> > -    } while (n == IPOIB_NUM_WC);
> > +    } while (n == IPOIB_NUM_WC && cnt < ipoib_sendq_size);
> >  }
>
> This change might exit tx_completion sooner - how does that prevent
> hogging interrupts (without NAPI)? I am not clear about that.

By returning from interrupt handler after a finite number of completions, rather than polling CQ potentially indefinitely.

-- 
MST

From jwong at datallegro.com Mon Apr 30 12:12:15 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 15:12:15 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
Message-ID:

Don't I need ipath in order to create the ibverbs library?

If I don't then how do I turn off ipath?

If I do need ipath for the ibverbs library, then how do I get around this problem.

When I do the install I am selecting 2 to Install OFED Software and 1 for Basic OFED modules and basic user level libraries.

Thanks,
Jeff

> Quoting Jeffrey Wong :
> Subject: Trouble installing OFED1.2 with kernel 2.6.18-8.el5
>
> Hello,
>
> I'm having trouble installing OFED1.2 on my CentOS 2.6.18-8.el5 linux machine.
>
> I am running the install.sh script and getting the following errors from the
> log file.

Try turning off ipath - that's what's failing for you.

-- 
MST
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at dev.mellanox.co.il Mon Apr 30 12:16:29 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Mon, 30 Apr 2007 22:16:29 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To:
References:
Message-ID: <20070430191629.GH13293@mellanox.co.il>

> Quoting Jeffrey Wong :
> Subject: Re: Trouble installing OFED1.2 with kernel
>
>
>
> Don't I need ipath in order to create the ibverbs library?
>

No, this is a kernel driver for ipath hardware, ibverbs is a
hardware-agnostic userspace library.

>
> If I don't then how do I turn off ipath?
>
>
>
> If I do need ipath for the ibverbs library, then how do I get around this
> problem.
>
>
>
> When I do the install I am selecting 2 to Install OFED Software and 1 for Basic
> OFED modules and basic user level libraries.

Dunno, there should be something in the menu to customise it further.
Vlad?

--
MST

From tziporet at dev.mellanox.co.il Mon Apr 30 12:19:13 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 30 Apr 2007 12:19:13 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2 April 16 meeting summary
In-Reply-To: <462E4EAA.5090606@cse.ohio-state.edu>
References: <46231441.6050507@mellanox.co.il> <46238FC0.40906@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C9A0E2B6@mtlexch01.mtl.com> <462E4EAA.5090606@cse.ohio-state.edu>
Message-ID: <46364131.20301@mellanox.co.il>

Shaun Rowland wrote:
> Hi Tziporet. I am not sure who handles the release notes, but we need to
> add release notes for MVAPICH2. I've attached an
> mvapich2_release_notes.txt file that is similar to the files in the
> docs/ subdirectory of the OFED RC2 release. Can this be added?
>

I will add them once I am back from Sonoma

Tziporet

From jwong at datallegro.com Mon Apr 30 12:24:56 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 15:24:56 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
Message-ID:

So to disable ipath, I should go into build_env.sh and comment out the
following lines, correct?

#ib_ipath)
#    case ${ARCH} in
#        x86_64|ppc64)
#            case ${K_VER} in
#                2.6.5*|2.6.9-22*|2.6.9-34*|2.6.9-42*|2.6.16.*-*-*|2.6.1[7-9]*|2.6.20*)
#                    OFA_KERNEL_PACKAGES=$(echo "$OFA_KERNEL_PACKAGES ib_verbs ib_ipath" | tr -s ' ' '\n' | sort -n | uniq)
#                    OFA_PACKAGES=$(echo "$OFA_PACKAGES kernel-ib" | tr -s ' ' '\n' | sort -n | uniq)
#                    ll_driver=${ll_driver:-"ib_ipath"}
#                ;;
#                *)
#                    if [ "$prog" == "build.sh" ]; then
#                        warn_echo IPATH is not supported by this kernel
#                    fi
#                ;;
#            esac
#        ;;
#        *)
#            if [ "$prog" == "build.sh" ]; then
#                warn_echo PathScale InfiniPath host channel adapters low level driver supports only a x86_64 architecture
#            fi
#        ;;
#    esac
#;;

Thanks,

Jeff

> Quoting Jeffrey Wong :
> Subject: Re: Trouble installing OFED1.2 with kernel
>
>
>
> Don't I need ipath in order to create the ibverbs library?
>

No, this is a kernel driver for ipath hardware, ibverbs is a
hardware-agnostic userspace library.

>
> If I don't then how do I turn off ipath?
>
>
>
> If I do need ipath for the ibverbs library, then how do I get around this
> problem.
>
>
>
> When I do the install I am selecting 2 to Install OFED Software and 1 for Basic
> OFED modules and basic user level libraries.

Dunno, there should be something in the menu to customise it further.
Vlad?

From jwong at datallegro.com Mon Apr 30 12:28:10 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 15:28:10 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
Message-ID:

I guess I can change the /root/OFED-1.2-rc1/ofed.conf file from

ib_ipath=y to ib_ipath=n

Is this correct?

Jeff

> Quoting Jeffrey Wong :
> Subject: Re: Trouble installing OFED1.2 with kernel
>
>
>
> Don't I need ipath in order to create the ibverbs library?
>

No, this is a kernel driver for ipath hardware, ibverbs is a
hardware-agnostic userspace library.

>
> If I don't then how do I turn off ipath?
>
>
>
> If I do need ipath for the ibverbs library, then how do I get around this
> problem.
>
>
>
> When I do the install I am selecting 2 to Install OFED Software and 1 for Basic
> OFED modules and basic user level libraries.

Dunno, there should be something in the menu to customise it further.
Vlad?

From mst at dev.mellanox.co.il Mon Apr 30 12:28:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Apr 2007 22:28:57 +0300
Subject: [ofa-general] Re: initial set of "direct" SDP tests in netperf
In-Reply-To: <46363856.7090102@hp.com>
References: <20070426045855.GH5217@mellanox.co.il> <4630D776.5060704@hp.com> <46325C7F.7040901@hp.com> <20070428175455.GB13106@mellanox.co.il> <46362427.8040905@hp.com> <20070430172846.GT2509@mellanox.co.il> <46363162.20000@hp.com> <20070430183641.GB13293@mellanox.co.il> <46363856.7090102@hp.com>
Message-ID: <20070430192857.GI13293@mellanox.co.il>

> Quoting Rick Jones :
> Subject: Re: [ofa-general] Re: initial set of "direct" SDP tests in netperf
>
> >>However, I think that is the "better" problem - SDP is a "protocol" not
> >>an address family. (Again, based on what little I understand about
> >>SDP).

True but it's not an IP-based protocol. So IPPROTO_SDP is kind of weird:
the comment in netinet/in.h on my system says:
/* Standard well-defined IP protocols. */

> >I see why this makes sense for you, but in what sense is it a "better"
> >problem?
>
> Because it isn't trying to describe a change in protocol as a change in
> addressing. It makes getting to SDP look like getting to any other
> transport-layer protocol - eg UDP, TCP, or SCTP. My intuitive guessing
> suggests that fewer folks use getprotobyname() than getaddrinfo().

However, for people that do - protocol numbers are assigned by IANA,
while AF numbers are basically free-running numbers.
Thus using AF rather than IPPROTO_ could have been a way to bypass the need for standardization.

--
MST

From sweitzen at cisco.com Mon Apr 30 12:29:28 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 30 Apr 2007 12:29:28 -0700
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To:
References:
Message-ID:

Yes, this is the easier way to do it.

________________________________

From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong
Sent: Monday, April 30, 2007 12:28 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel

I guess I can change the /root/OFED-1.2-rc1/ofed.conf file from

ib_ipath=y to ib_ipath=n

Is this correct?

Jeff

> Quoting Jeffrey Wong :
> Subject: Re: Trouble installing OFED1.2 with kernel
>
>
>
> Don't I need ipath in order to create the ibverbs library?
>

No, this is a kernel driver for ipath hardware, ibverbs is a
hardware-agnostic userspace library.

>
> If I don't then how do I turn off ipath?
>
>
>
> If I do need ipath for the ibverbs library, then how do I get around this
> problem.
>
>
>
> When I do the install I am selecting 2 to Install OFED Software and 1 for Basic
> OFED modules and basic user level libraries.

Dunno, there should be something in the menu to customise it further.
Vlad?

From boris at mellanox.com Mon Apr 30 12:30:17 2007
From: boris at mellanox.com (Boris Shpolyansky)
Date: Mon, 30 Apr 2007 12:30:17 -0700
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To:
Message-ID: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com>

Correct.
Make sure to use unattended install - to take advantage of the ofed.conf
file:

install.sh -c ofed.conf

Boris

________________________________

From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong
Sent: Monday, April 30, 2007 12:28 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel

I guess I can change the /root/OFED-1.2-rc1/ofed.conf file from

ib_ipath=y to ib_ipath=n

Is this correct?

Jeff

> Quoting Jeffrey Wong :
> Subject: Re: Trouble installing OFED1.2 with kernel
>
>
>
> Don't I need ipath in order to create the ibverbs library?
>

No, this is a kernel driver for ipath hardware, ibverbs is a
hardware-agnostic userspace library.

>
> If I don't then how do I turn off ipath?
>
>
>
> If I do need ipath for the ibverbs library, then how do I get around this
> problem.
>
>
>
> When I do the install I am selecting 2 to Install OFED Software and 1 for Basic
> OFED modules and basic user level libraries.

Dunno, there should be something in the menu to customise it further.
Vlad?

From pradeep at us.ibm.com Mon Apr 30 12:33:42 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Mon, 30 Apr 2007 12:33:42 -0700
Subject: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts
In-Reply-To: <20070430185118.GF13293@mellanox.co.il>
Message-ID:

"Michael S. Tsirkin" wrote on 04/30/2007 11:51:18 AM:

> Quoting Pradeep Satyanarayana :
> Subject: Re: [ofa-general] [Bug 508] IPoIB CM multicast is hogging interrupts
>
> > I think this trick I just came up with is a simple way to prevent
> > IPoIB TX from hogging interrupts, even without NAPI. And it might be a better
> > way to solve the problem for IPoIB CM TX than using a common cq
> > as my previous patch did.
> > > static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
> > > {
> > >         struct ipoib_cm_tx *tx = tx_ptr;
> > > -       int n, i;
> > > +       int n, i, cnt = 0;
> > >
> > >         ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> > >         do {
> > >                 n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
> > > +               cnt += n;
> > >                 for (i = 0; i < n; ++i)
> > >                         ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
> > > -       } while (n == IPOIB_NUM_WC);
> > > +       } while (n == IPOIB_NUM_WC && cnt < ipoib_sendq_size);
> > > }
> >
> > This change might exit tx_completion sooner - how does that prevent
> > hogging interrupts (without NAPI)? I am not clear about that.
>
> By returning from interrupt handler after a finite number of
> completions, rather than polling CQ potentially indefinitely.
>

Ok, this was intended to mean hogging CPU while in interrupt context. This
will surely help that case. In fact, systems that support round robin
distribution of interrupts across CPUs will likely see a good distribution
of CPU load.

Additionally, this might also solve the soft lockup problem that was
discussed here previously (except for the issue with the lo device).

Pradeep
pradeep at us.ibm.com

From mhagen at iol.unh.edu Mon Apr 30 13:07:06 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Mon, 30 Apr 2007 16:07:06 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add support for invalidate stag
Message-ID: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu>

Patch to add support for the iWARP verbs SEND with INV and SEND with SE and INV.

--- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 15:35:02.677618096 -0400
+++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 15:29:16.200290656 -0400
@@ -611,7 +611,8 @@ enum ib_send_flags {
 	IB_SEND_FENCE = 1,
 	IB_SEND_SIGNALED = (1<<1),
 	IB_SEND_SOLICITED = (1<<2),
-	IB_SEND_INLINE = (1<<3)
+	IB_SEND_INLINE = (1<<3),
+	IB_SEND_INVALIDATE = (1<<4)
 };
 
 struct ib_sge {
@@ -646,6 +647,9 @@ struct ib_send_wr {
 			u16 pkey_index; /* valid for GSI only */
 			u8 port_num; /* valid for DR SMPs on switch only */
 		} ud;
+		struct {
+			u32 rkey;
+		} invalidate;
 	} wr;
 };
 
--
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701
iWARP:1-603-862-5083
Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824

From HNGUYEN at de.ibm.com Mon Apr 30 13:11:03 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 30 Apr 2007 22:11:03 +0200
Subject: [ofa-general] [PATCH][RFC] IB: Return "maybe missed event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Hi Roland!
As far as this concerns ehca this looks great.
Thanks
Nam

general-bounces at lists.openfabrics.org wrote on 27.04.2007 00:43:19:

> > - "IB: Return "maybe missed event" hint from ib_req_notify_cq()"
> > This extends the API in a way that lets us implement NAPI, but may
> > be useful for other things too. It touches all the drivers, and I
> > still need to finish updating cxgb3 to work correctly. I haven't
> > heard anything negative about this, so I'll fix it up, post it one
> > more time for review, and plan on merging it.
>
> As promised, here is that patch for review, with a cxgb3
> implementation included.
>
> ---
>
> The semantics defined by the InfiniBand specification say that
> completion events are only generated when a completion is added to a
> completion queue (CQ) after completion notification is requested.
> In other words, this means that the following race is possible:
>
>     while (CQ is not empty)
>         ib_poll_cq(CQ);
>     // new completion is added after while loop is exited
>     ib_req_notify_cq(CQ);
>     // no event is generated for the existing completion
>
> To close this race, the IB spec recommends doing another poll of the
> CQ after requesting notification.
>
> However, it is not always possible to arrange code this way (for
> example, we have found that NAPI for IPoIB cannot poll after
> requesting notification). Also, some hardware (eg Mellanox HCAs)
> actually will generate an event for completions added before the call
> to ib_req_notify_cq() -- which is allowed by the spec, since there's
> no way for any upper-layer consumer to know exactly when a completion
> was really added -- so the extra poll of the CQ is just a waste.
>
> Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
> ib_req_notify_cq() so that it can return a hint about whether a
> completion may have been added before the request for notification.
> The return value of ib_req_notify_cq() is extended so:
>
>  < 0  means an error occurred while requesting notification
>  == 0 means notification was requested successfully, and if
>       IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
>       events were missed and it is safe to wait for another
>       event.
>  > 0  is only returned if IB_CQ_REPORT_MISSED_EVENTS was
>       passed in. It means that the consumer must poll the
>       CQ again to make sure it is empty to avoid the race
>       described above.
>
> We add a flag to enable this behavior rather than turning it on
> unconditionally, because checking for missed events may incur
> significant overhead for some low-level drivers, and consumers that
> don't care about the results of this test shouldn't be forced to pay
> for the test.
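
The return-value contract described above can be modelled in user space.
What follows is a minimal sketch, not driver code: toy_cq, toy_poll_cq(),
toy_req_notify_cq() and toy_drain() are hypothetical stand-ins for
struct ib_cq, ib_poll_cq() and ib_req_notify_cq(), and the "late" field is
an assumed way of simulating a completion that slips in between the last
poll and the request for notification. Only the flag values are taken from
the ib_cq_notify_flags enum in the patch. The point is the shape of the
consumer loop: keep polling for as long as arming returns a positive hint.

```c
/* Flag values mirror the ib_cq_notify_flags enum proposed in the patch. */
enum toy_cq_notify_flags {
    TOY_CQ_SOLICITED            = 1 << 0,
    TOY_CQ_NEXT_COMP            = 1 << 1,
    TOY_CQ_SOLICITED_MASK       = TOY_CQ_SOLICITED | TOY_CQ_NEXT_COMP,
    TOY_CQ_REPORT_MISSED_EVENTS = 1 << 2,
};

/* Hypothetical stand-in for a CQ: only counters, no real work completions. */
struct toy_cq {
    int pending; /* completions waiting to be polled */
    int armed;   /* nonzero once notification has been requested */
    int late;    /* completions that will arrive just before arming (the race) */
};

/* Take up to max completions off the queue; returns how many were taken. */
static int toy_poll_cq(struct toy_cq *cq, int max)
{
    int n = cq->pending < max ? cq->pending : max;

    cq->pending -= n;
    return n;
}

/*
 * Request notification. Returns < 0 on bad flags, > 0 if the caller asked
 * for the missed-event hint and completions are already queued, else 0.
 */
static int toy_req_notify_cq(struct toy_cq *cq, int flags)
{
    int type = flags & TOY_CQ_SOLICITED_MASK;

    /* Exactly one of SOLICITED or NEXT_COMP must be set. */
    if (type != TOY_CQ_SOLICITED && type != TOY_CQ_NEXT_COMP)
        return -1;

    /* Simulate the race: completions land after the last poll. */
    cq->pending += cq->late;
    cq->late = 0;
    cq->armed = 1;

    if ((flags & TOY_CQ_REPORT_MISSED_EVENTS) && cq->pending > 0)
        return 1;
    return 0;
}

/* Poll until empty, arm, and poll again while arming reports a possible miss. */
static int toy_drain(struct toy_cq *cq)
{
    int handled = 0;
    int n;

    do {
        while ((n = toy_poll_cq(cq, 4)) > 0)
            handled += n;
    } while (toy_req_notify_cq(cq, TOY_CQ_NEXT_COMP |
                                   TOY_CQ_REPORT_MISSED_EVENTS) > 0);

    return handled;
}
```

With five completions queued and two more arriving just before the CQ is
armed, toy_drain() handles all seven: the first arming call returns 1, which
sends the loop back for one more pass. Without the
TOY_CQ_REPORT_MISSED_EVENTS flag the loop would exit after the first pass and
the two late completions would sit in the queue with no event to announce
them.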
> > Signed-off-by: Roland Dreier > --- > drivers/infiniband/hw/amso1100/c2.h | 2 +- > drivers/infiniband/hw/amso1100/c2_cq.c | 16 ++++++++--- > drivers/infiniband/hw/cxgb3/cxio_hal.c | 3 ++ > drivers/infiniband/hw/cxgb3/iwch_provider.c | 8 +++-- > drivers/infiniband/hw/ehca/ehca_iverbs.h | 2 +- > drivers/infiniband/hw/ehca/ehca_reqs.c | 14 +++++++-- > drivers/infiniband/hw/ehca/ipz_pt_fn.h | 8 +++++ > drivers/infiniband/hw/ipath/ipath_cq.c | 15 +++++++--- > drivers/infiniband/hw/ipath/ipath_verbs.h | 2 +- > drivers/infiniband/hw/mthca/mthca_cq.c | 12 +++++--- > drivers/infiniband/hw/mthca/mthca_dev.h | 4 +- > include/rdma/ib_verbs.h | 40 > +++++++++++++++++++++------ > 12 files changed, 93 insertions(+), 33 deletions(-) > > diff --git a/drivers/infiniband/hw/amso1100/c2.h > b/drivers/infiniband/hw/amso1100/c2.h > index 04a9db5..fa58200 100644 > --- a/drivers/infiniband/hw/amso1100/c2.h > +++ b/drivers/infiniband/hw/amso1100/c2.h > @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2dev, > struct c2_cq *cq); > extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); > extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32mq_index); > extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct > ib_wc *entry); > -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); > +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags); > > /* CM */ > extern int c2_llp_connect(struct iw_cm_id *cm_id, > diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c > b/drivers/infiniband/hw/amso1100/c2_cq.c > index 5175c99..d2b3366 100644 > --- a/drivers/infiniband/hw/amso1100/c2_cq.c > +++ b/drivers/infiniband/hw/amso1100/c2_cq.c > @@ -217,17 +217,19 @@ int c2_poll_cq(struct ib_cq *ibcq, int > num_entries, struct ib_wc *entry) > return npolled; > } > > -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) > { > struct c2_mq_shared __iomem *shared; > 
struct c2_cq *cq; > + unsigned long flags; > + int ret = 0; > > cq = to_c2cq(ibcq); > shared = cq->mq.peer; > > - if (notify == IB_CQ_NEXT_COMP) > + if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_NEXT_COMP) > writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); > - else if (notify == IB_CQ_SOLICITED) > + else if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) > writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, &shared->notification_type); > else > return -EINVAL; > @@ -241,7 +243,13 @@ int c2_arm_cq(struct ib_cq *ibcq, enum > ib_cq_notify notify) > */ > readb(&shared->armed); > > - return 0; > + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { > + spin_lock_irqsave(&cq->lock, flags); > + ret = !c2_mq_empty(&cq->mq); > + spin_unlock_irqrestore(&cq->lock, flags); > + } > + > + return ret; > } > > static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) > diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c > b/drivers/infiniband/hw/cxgb3/cxio_hal.c > index f5e9aee..76049af 100644 > --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c > +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c > @@ -114,7 +114,10 @@ int cxio_hal_cq_op(struct cxio_rdev *rdev_p, > struct t3_cq *cq, > return -EIO; > } > } > + > + return 1; > } > + > return 0; > } > > diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c > b/drivers/infiniband/hw/cxgb3/iwch_provider.c > index 24e0df0..e89957f 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c > +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c > @@ -292,7 +292,7 @@ static int iwch_resize_cq(struct ib_cq *cq, int > cqe, struct ib_udata *udata) > #endif > } > > -static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) > { > struct iwch_dev *rhp; > struct iwch_cq *chp; > @@ -303,7 +303,7 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum > ib_cq_notify notify) > > chp = to_iwch_cq(ibcq); > rhp = chp->rhp; > - if (notify == 
IB_CQ_SOLICITED) > + if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) > cq_op = CQ_ARM_SE; > else > cq_op = CQ_ARM_AN; > @@ -317,9 +317,11 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum > ib_cq_notify notify) > PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); > err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); > spin_unlock_irqrestore(&chp->lock, flag); > - if (err) > + if (err < 0) > printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, > chp->cq.cqid); > + if (err > 0 && !(flags & IB_CQ_REPORT_MISSED_EVENTS)) > + err = 0; > return err; > } > > diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h > b/drivers/infiniband/hw/ehca/ehca_iverbs.h > index 95fd59f..9e5460d 100644 > --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h > +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h > @@ -135,7 +135,7 @@ int ehca_poll_cq(struct ib_cq *cq, int > num_entries, struct ib_wc *wc); > > int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); > > -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); > +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags > notify_flags); > > struct ib_qp *ehca_create_qp(struct ib_pd *pd, > struct ib_qp_init_attr *init_attr, > diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c > b/drivers/infiniband/hw/ehca/ehca_reqs.c > index 08d3f89..caec9de 100644 > --- a/drivers/infiniband/hw/ehca/ehca_reqs.c > +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c > @@ -634,11 +634,13 @@ poll_cq_exit0: > return ret; > } > > -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) > +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags > notify_flags) > { > struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); > + unsigned long spl_flags; > + int ret = 0; > > - switch (cq_notify) { > + switch (notify_flags & IB_CQ_SOLICITED_MASK) { > case IB_CQ_SOLICITED: > hipz_set_cqx_n0(my_cq, 1); > break; > @@ -649,5 +651,11 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum > ib_cq_notify cq_notify) > return 
-EINVAL; > } > > - return 0; > + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { > + spin_lock_irqsave(&my_cq->spinlock, spl_flags); > + ret = ipz_qeit_is_valid(&my_cq->ipz_queue); > + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); > + } > + > + return ret; > } > diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h > b/drivers/infiniband/hw/ehca/ipz_pt_fn.h > index 8199c45..57f141a 100644 > --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h > +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h > @@ -140,6 +140,14 @@ static inline void > *ipz_qeit_get_inc_valid(struct ipz_queue *queue) > return cqe; > } > > +static inline int ipz_qeit_is_valid(struct ipz_queue *queue) > +{ > + struct ehca_cqe *cqe = ipz_qeit_get(queue); > + u32 cqe_flags = cqe->cqe_flags; > + > + return cqe_flags >> 7 == (queue->toggle_state & 1); > +} > + > /* > * returns and resets Queue Entry iterator > * returns address (kv) of first Queue Entry > diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c > b/drivers/infiniband/hw/ipath/ipath_cq.c > index 87462e0..9582145 100644 > --- a/drivers/infiniband/hw/ipath/ipath_cq.c > +++ b/drivers/infiniband/hw/ipath/ipath_cq.c > @@ -306,17 +306,18 @@ int ipath_destroy_cq(struct ib_cq *ibcq) > /** > * ipath_req_notify_cq - change the notification type for a completion queue > * @ibcq: the completion queue > - * @notify: the type of notification to request > + * @notify_flags: the type of notification to request > * > * Returns 0 for success. > * > * This may be called from interrupt context. Also called by > * ib_req_notify_cq() in the generic verbs code. 
> */ > -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags > notify_flags) > { > struct ipath_cq *cq = to_icq(ibcq); > unsigned long flags; > + int ret = 0; > > spin_lock_irqsave(&cq->lock, flags); > /* > @@ -324,9 +325,15 @@ int ipath_req_notify_cq(struct ib_cq *ibcq, > enum ib_cq_notify notify) > * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). > */ > if (cq->notify != IB_CQ_NEXT_COMP) > - cq->notify = notify; > + cq->notify = notify_flags & IB_CQ_SOLICITED_MASK; > + > + if ((notify_flags & IB_CQ_REPORT_MISSED_EVENTS) && > + cq->queue->head != cq->queue->tail) > + ret = 1; > + > spin_unlock_irqrestore(&cq->lock, flags); > - return 0; > + > + return ret; > } > > /** > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h > b/drivers/infiniband/hw/ipath/ipath_verbs.h > index c0c8d5b..6b3b770 100644 > --- a/drivers/infiniband/hw/ipath/ipath_verbs.h > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h > @@ -716,7 +716,7 @@ struct ib_cq *ipath_create_cq(struct ib_device > *ibdev, int entries, > > int ipath_destroy_cq(struct ib_cq *ibcq); > > -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); > +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags > notify_flags); > > int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); > > diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c > b/drivers/infiniband/hw/mthca/mthca_cq.c > index efd79ef..cf0868f 100644 > --- a/drivers/infiniband/hw/mthca/mthca_cq.c > +++ b/drivers/infiniband/hw/mthca/mthca_cq.c > @@ -726,11 +726,12 @@ repoll: > return err == 0 || err == -EAGAIN ? npolled : err; > } > > -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) > +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) > { > __be32 doorbell[2]; > > - doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ? 
> + doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) == > + IB_CQ_SOLICITED ? > MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : > MTHCA_TAVOR_CQ_DB_REQ_NOT) | > to_mcq(cq)->cqn); > @@ -743,7 +744,7 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum > ib_cq_notify notify) > return 0; > } > > -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) > { > struct mthca_cq *cq = to_mcq(ibcq); > __be32 doorbell[2]; > @@ -755,7 +756,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum > ib_cq_notify notify) > > doorbell[0] = ci; > doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | > - (notify == IB_CQ_SOLICITED ? 1 : 2)); > + ((flags & IB_CQ_SOLICITED_MASK) == > + IB_CQ_SOLICITED ? 1 : 2)); > > mthca_write_db_rec(doorbell, cq->arm_db); > > @@ -766,7 +768,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum > ib_cq_notify notify) > wmb(); > > doorbell[0] = cpu_to_be32((sn << 28) | > - (notify == IB_CQ_SOLICITED ? > + ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? 
> MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : > MTHCA_ARBEL_CQ_DB_REQ_NOT) | > cq->cqn); > diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h > b/drivers/infiniband/hw/mthca/mthca_dev.h > index b7e42ef..9bae3cc 100644 > --- a/drivers/infiniband/hw/mthca/mthca_dev.h > +++ b/drivers/infiniband/hw/mthca/mthca_dev.h > @@ -495,8 +495,8 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev); > > int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, > struct ib_wc *entry); > -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); > -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); > +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); > +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); > int mthca_init_cq(struct mthca_dev *dev, int nent, > struct mthca_ucontext *ctx, u32 pdn, > struct mthca_cq *cq); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 765589f..529a69d 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -431,9 +431,11 @@ struct ib_wc { > u8 port_num; /* valid only for DR SMPs on switches */ > }; > > -enum ib_cq_notify { > - IB_CQ_SOLICITED, > - IB_CQ_NEXT_COMP > +enum ib_cq_notify_flags { > + IB_CQ_SOLICITED = 1 << 0, > + IB_CQ_NEXT_COMP = 1 << 1, > + IB_CQ_SOLICITED_MASK = IB_CQ_SOLICITED | IB_CQ_NEXT_COMP, > + IB_CQ_REPORT_MISSED_EVENTS = 1 << 2, > }; > > enum ib_srq_attr_mask { > @@ -987,7 +989,7 @@ struct ib_device { > struct ib_wc *wc); > int (*peek_cq)(struct ib_cq *cq, int wc_cnt); > int (*req_notify_cq)(struct ib_cq *cq, > - enum ib_cq_notify cq_notify); > + enum ib_cq_notify_flags flags); > int (*req_ncomp_notif)(struct ib_cq *cq, > int wc_cnt); > struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, > @@ -1414,14 +1416,34 @@ int ib_peek_cq(struct ib_cq *cq, int wc_cnt); > /** > * ib_req_notify_cq - Request completion notification on a CQ. > * @cq: The CQ to generate an event for. 
> - * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will > - * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, > - * notification will occur on the next completion. > + * @flags: > + * Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP > + * to request an event on the next solicited event or next work > + * completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS > + * may also be |ed in to request a hint about missed events, as > + * described below. > + * > + * Return Value: > + * < 0 means an error occurred while requesting notification > + * == 0 means notification was requested successfully, and if > + * IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events > + * were missed and it is safe to wait for another event. In > + * this case is it guaranteed that any work completions added > + * to the CQ since the last CQ poll will trigger a completion > + * notification event. > + * > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed > + * in. It means that the consumer must poll the CQ again to > + * make sure it is empty to avoid missing an event because of a > + * race between requesting notification and an entry being > + * added to the CQ. This return value means it is possible > + * (but not guaranteed) that a work completion has been added > + * to the CQ since the last poll without triggering a > + * completion notification event. 
>   */
>  static inline int ib_req_notify_cq(struct ib_cq *cq,
> -                                   enum ib_cq_notify cq_notify)
> +                                   enum ib_cq_notify_flags flags)
>  {
> -        return cq->device->req_notify_cq(cq, cq_notify);
> +        return cq->device->req_notify_cq(cq, flags);
>  }
>
>  /**
> --
> 1.5.1.2
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rick.jones2 at hp.com Mon Apr 30 14:12:52 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 30 Apr 2007 14:12:52 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <4634F49F.9030408@myri.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> <4634F49F.9030408@myri.com>
Message-ID: <46365BD4.5060607@hp.com>

> Only the 1.2.0 version of the external driver makes LRO incompatible
> with forwarding. The problem should be fixed in version 1.3.0 released a
> few weeks ago (forwarding with myri10ge_lro enabled should then work),
> let us know otherwise.
>
> Anyway, following David Miller remark about netfilter, for the next
> version we might ask the user to explicitly enable LRO rather than
> making the default.

Speaking of defaults, it would seem that the external 1.2.0 driver comes
with 9000 bytes as the default MTU? At least I think that is what I am
seeing now that I've started looking more closely.

rick jones

From jwong at datallegro.com Mon Apr 30 14:43:05 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Mon, 30 Apr 2007 17:43:05 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
Message-ID:

I was having the same type of error for ulp/ipoib. What is this module
used for? Do I need this for the ibverbs library?

After disabling this option in the ofed.conf file, everything else
compiled and installed correctly.

Thanks,

Jeff

-include include/linux/autoconf.h \
-include /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include/linux/autoconf.h \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fomit-frame-pointer -fasynchronous-unwind-tables -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/include -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/debug -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/net/cxgb3 -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_fs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.o /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c: In function 'ipoib_mcg_open':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c:144: error: 'struct inode' has no member named 'i_private'
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c: In function 'ipoib_path_open': /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c:250: error: 'struct inode' has no member named 'i_private' make[4]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.o] Error 1 make[3]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib] Error 2 make[2]: *** [/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband] Error 2 make[1]: *** [_module_/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2] Error 2 make[1]: Leaving directory `/usr/src/redhat/SOURCES' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.92132 (%install) From rdreier at cisco.com Mon Apr 30 16:59:11 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Apr 2007 16:59:11 -0700 Subject: [ofa-general] Re: [PATCH 2.6.22 5/5] iw_cxgb3: Update required firmware revision to 4.0.0. In-Reply-To: <20070426202126.24234.71523.stgit@dell3.ogc.int> (Steve Wise's message of "Thu, 26 Apr 2007 15:21:26 -0500") References: <20070426202057.24234.56383.stgit@dell3.ogc.int> <20070426202126.24234.71523.stgit@dell3.ogc.int> Message-ID: Thanks, I applied all 5, with the FW version change rolled up into #4/5. From jwong at datallegro.com Mon Apr 30 17:15:43 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Mon, 30 Apr 2007 20:15:43 -0400 Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel Message-ID: Is there a workaround for the i_private member of the inode structure either in the kernel or in the OFED 1.2 software? I want to be able to compile the ipoib drivers and I cannot with the error i_private not being a member of inode struct. What does the ulp/ipoib do? I want to be able to test out the ipverbs library and ipoib library to compare performance. Thanks. 
Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Mon Apr 30 17:51:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 1 May 2007 03:51:01 +0300 Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf() In-Reply-To: <462E80A3.5060503@dev.mellanox.co.il> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> Message-ID: <20070501005101.GA26019@sashak.voltaire.com> On 01:11 Wed 25 Apr , Yevgeny Kliteynik wrote: > Michael S. Tsirkin wrote: > >Since you seem to do a strcat which does an anyway, how about, for example: > > > >- sprintf( buf_line1,"%s 0x%01x |", > >- buf_line1, p_vla_tbl->vl_entry[i].vl); > >+ sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |", > >+ p_vla_tbl->vl_entry[i].vl); > > > >and so on in all the other places? > > Agree. > I'll send a new patch later. Or like this: + int n = 0; ... - sprintf( buf_line1,"%s 0x%01x |", - buf_line1, p_vla_tbl->vl_entry[i].vl); + n += sprintf( buf_line1 + n," 0x%01x |", + p_vla_tbl->vl_entry[i].vl); , so strlen() rerunning in loop is not needed anymore. Sasha From rdreier at cisco.com Mon Apr 30 17:41:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Apr 2007 17:41:46 -0700 Subject: [ofa-general] Re: [PATCH] SRP: add orig_dgid to sysfs In-Reply-To: <4607F947.2010205@dev.mellanox.co.il> (ishai@dev.mellanox.co.il's message of "Mon, 26 Mar 2007 18:48:07 +0200") References: <4607F947.2010205@dev.mellanox.co.il> Message-ID: Sorry for the super-slow review. Anyway, how does this updated patch look? I made orig_dgid an array of __be16 to avoid some casting, and I initialize it once when the target is created. With your patch it seems that orig_dgid will not be set unless a redirect happens, and if multiple redirects are used, then orig_dgid will just be the second-to-last dgid, not the real original dgid. 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 5e8ac57..df526cb 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1468,6 +1468,25 @@ static ssize_t show_dgid(struct class_device *cdev, char *buf) be16_to_cpu(((__be16 *) target->path.dgid.raw)[7])); } +static ssize_t show_orig_dgid(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + + if (target->state == SRP_TARGET_DEAD || + target->state == SRP_TARGET_REMOVED) + return -ENODEV; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(target->orig_dgid[0]), + be16_to_cpu(target->orig_dgid[1]), + be16_to_cpu(target->orig_dgid[2]), + be16_to_cpu(target->orig_dgid[3]), + be16_to_cpu(target->orig_dgid[4]), + be16_to_cpu(target->orig_dgid[5]), + be16_to_cpu(target->orig_dgid[6]), + be16_to_cpu(target->orig_dgid[7])); +} + static ssize_t show_zero_req_lim(struct class_device *cdev, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(cdev)); @@ -1498,6 +1517,7 @@ static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); +static CLASS_DEVICE_ATTR(orig_dgid, S_IRUGO, show_orig_dgid, NULL); static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); static CLASS_DEVICE_ATTR(local_ib_port, S_IRUGO, show_local_ib_port, NULL); static CLASS_DEVICE_ATTR(local_ib_device, S_IRUGO, show_local_ib_device, NULL); @@ -1508,6 +1528,7 @@ static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_service_id, &class_device_attr_pkey, &class_device_attr_dgid, + &class_device_attr_orig_dgid, &class_device_attr_zero_req_lim, &class_device_attr_local_ib_port, &class_device_attr_local_ib_device, @@ -1662,6 +1683,7 @@ 
static int srp_parse_options(const char *buf, struct srp_target_port *target) target->path.dgid.raw[i] = simple_strtoul(dgid, NULL, 16); } kfree(p); + memcpy(target->orig_dgid, target->path.dgid.raw, 16); break; case SRP_OPT_PKEY: diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 2f3319c..1d53c7b 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -129,6 +129,7 @@ struct srp_target_port { unsigned int scsi_id; struct ib_sa_path_rec path; + __be16 orig_dgid[8]; struct ib_sa_query *path_query; int path_query_id; From swise at opengridcomputing.com Mon Apr 30 18:50:16 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 30 Apr 2007 20:50:16 -0500 Subject: [ofa-general] [PATCH] infiniband: add support for invalidate stag In-Reply-To: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> Message-ID: <1177984216.4619.4.camel@stevo-laptop> Mike, This is a good start! I suggest: - posting the libibverbs patch needed to support user mode send/inv - posting the ammasso patch as an example of how iwarp providers support this. Thanks for stepping forward and helping out with the iWARP work! We need it. Steve. On Mon, 2007-04-30 at 16:07 -0400, mhagen at iol.unh.edu wrote: > Patch to add support for the iWARP verbs SEND with INV and SEND with SE > and INV. 
> > --- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > 15:35:02.677618096 -0400 > +++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > 15:29:16.200290656 -0400 > @@ -611,7 +611,8 @@ enum ib_send_flags { > IB_SEND_FENCE = 1, > IB_SEND_SIGNALED = (1<<1), > IB_SEND_SOLICITED = (1<<2), > - IB_SEND_INLINE = (1<<3) > + IB_SEND_INLINE = (1<<3), > + IB_SEND_INVALIDATE = (1<<4) > }; > > struct ib_sge { > @@ -646,6 +647,9 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u32 rkey; > + } invalidate; > } wr; > }; > From mst at dev.mellanox.co.il Mon Apr 30 20:57:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 May 2007 06:57:08 +0300 Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate stag In-Reply-To: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> Message-ID: <20070501035708.GJ13293@mellanox.co.il> > Quoting mhagen at iol.unh.edu : > Subject: [PATCH] infiniband: add support for invalidate stag > > Patch to add support for the iWARP verbs SEND with INV and SEND with SE > and INV. > > --- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > 15:35:02.677618096 -0400 > +++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > 15:29:16.200290656 -0400 > @@ -611,7 +611,8 @@ enum ib_send_flags { > IB_SEND_FENCE = 1, > IB_SEND_SIGNALED = (1<<1), > IB_SEND_SOLICITED = (1<<2), > - IB_SEND_INLINE = (1<<3) > + IB_SEND_INLINE = (1<<3), > + IB_SEND_INVALIDATE = (1<<4) > }; > > struct ib_sge { > @@ -646,6 +647,9 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u32 rkey; > + } invalidate; > } wr; > }; Shouldn't this rather be part of rc wr? -- MST From mst at dev.mellanox.co.il Mon Apr 30 21:03:32 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 1 May 2007 07:03:32 +0300 Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel In-Reply-To: References: Message-ID: <20070501040329.GK13293@mellanox.co.il> > Quoting Jeffrey Wong : > Subject: Re: Trouble installing OFED1.2 with kernel > > Is there a workaround for the i_private member of the inode structure either in > the kernel or in the OFED 1.2 software? > > I want to be able to compile the ipoib drivers and I cannot with the error > i_private not being a member of inode struct. > > What does the ulp/ipoib do? > > I want to be able to test out the ipverbs library and ipoib library to compare > performance. > > > > Thanks. > > > > Jeff OFED 1.2 supports the RHEL5 kernel. Shouldn't the Centos kernel be identical? -- MST From sshaw at sgi.com Mon Apr 30 21:32:03 2007 From: sshaw at sgi.com (Scott Shaw) Date: Mon, 30 Apr 2007 21:32:03 -0700 Subject: [ofa-general] Release version of Ofed v1.2 In-Reply-To: <20070501035708.GJ13293@mellanox.co.il> Message-ID: <9BEB932202A05B488722B05D2374A1DA03C7F87A@mtv-amer001e--3.americas.sgi.com> When will the general release of ofed v1.2 be available? Also is the OS requirement going to be SUSE10 SP1? Will ofed v1.2 work with SUSE10 without service packs? 
Thanks, Scott From billfink at mindspring.com Mon Apr 30 22:57:31 2007 From: billfink at mindspring.com (Bill Fink) Date: Tue, 1 May 2007 01:57:31 -0400 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <46362244.9030406@hp.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> <46362244.9030406@hp.com> Message-ID: <20070501015731.3568d28b.billfink@mindspring.com> On Mon, 30 Apr 2007, Rick Jones wrote: > > What version of the myri10ge driver is this? With the 1.2.0 version > > that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module > > parameter. > > > > [root at lang2 ~]# modinfo myri10ge | grep -i lro > > [root at lang2 ~]# > > > > And I've been testing IP forwarding using two Myricom 10-GigE NICs > > without setting any special modprobe parameters. > > > Ethtool -i on the interface reports 1.2.0 as the driver version. Perhaps it would be useful to have different version strings for the in-kernel Linux version and the Myricom externally provided version. Just a thought. -Bill From jwong at datallegro.com Mon Apr 30 23:25:37 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 02:25:37 -0400 Subject: [ofa-general] RE: Trouble installing OFED1.2 with kernel References: <20070501040329.GK13293@mellanox.co.il> Message-ID: Well, when I try to compile, I get an error message saying i_private is not a member of the inode structure when trying to compile the ulp/ipoib and the ib_ipath modules. 
I'm using the 2.6.18-8 kernel src from kernel.org. Any reason why I would be getting this error message? Thanks, Jeff -----Original Message----- From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] Sent: Tue 5/1/2007 12:03 AM To: Jeffrey Wong Cc: general at lists.openfabrics.org Subject: Re: Trouble installing OFED1.2 with kernel > Quoting Jeffrey Wong : > Subject: Re: Trouble installing OFED1.2 with kernel > > Is there a workaround for the i_private member of the inode structure either in > the kernel or in the OFED 1.2 software? > > I want to be able to compile the ipoib drivers and I cannot with the error > i_private not being a member of inode struct. > > What does the ulp/ipoib do? > > I want to be able to test out the ipverbs library and ipoib library to compare > performance. > > > > Thanks. > > > > Jeff OFED 1.2 supports the RHEL5 kernel. Shouldn't the Centos kernel be identical? -- MST From mst at dev.mellanox.co.il Mon Apr 30 23:48:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 May 2007 09:48:02 +0300 Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel In-Reply-To: References: <20070501040329.GK13293@mellanox.co.il> Message-ID: <20070501064802.GM13293@mellanox.co.il> I don't think you are actually using the kernel from kernel.org: we test-build these nightly. Quoting Jeffrey Wong : Subject: RE: Trouble installing OFED1.2 with kernel Well, when I try to compile, I get an error message saying i_private is not a member of the inode structure when trying to compile the ulp/ipoib and the ib_ipath modules. I'm using the 2.6.18-8 kernel src from kernel.org. Any reason why I would be getting this error message? Thanks, Jeff -----Original Message----- From: Michael S. 
Tsirkin [mailto:mst at dev.mellanox.co.il] Sent: Tue 5/1/2007 12:03 AM To: Jeffrey Wong Cc: general at lists.openfabrics.org Subject: Re: Trouble installing OFED1.2 with kernel > Quoting Jeffrey Wong : > Subject: Re: Trouble installing OFED1.2 with kernel > > Is there a workaround for the i_private member of the inode structure either in > the kernel or in the OFED 1.2 software? > > I want to be able to compile the ipoib drivers and I cannot with the error > i_private not being a member of inode struct. > > What does the ulp/ipoib do? > > I want to be able to test out the ipverbs library and ipoib library to compare > performance. > > > > Thanks. > > > > Jeff OFED 1.2 supports the RHEL5 kernel. Shouldn't the Centos kernel be identical? -- MST -- MST From moshek at voltaire.com Mon Apr 30 23:51:38 2007 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 1 May 2007 09:51:38 +0300 Subject: [ofa-general] OFED-1.2 rc2 ibutils 32 bits link error on js21 sles9 sp3 ppc64 In-Reply-To: <462FB486.9090403@sgi.com> Message-ID: <39C75744D164D948A170E9792AF8E7CA0CFEA8@exil.voltaire.com> Has anyone faced this 32-bit compile/link error (see attached log)? Any hints? Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Partridge Sent: Wednesday, April 25, 2007 11:05 PM To: Hal Rosenstock Cc: general at lists.openfabrics.org Subject: [ofa-general] opensmd init.d script question Hi Hal, I am working on a new SGI product that will have two separate InfiniBand fabrics. Each of these fabrics may have a different topology and could be running one of a number of routing engines (i.e. lash or up/dn). The Subnet Management for both fabrics will run on one host (leader node). 
Out of the box OFED-1.2 does not have a good way to achieve managing this. Ideally I would like to have the flexibility of chkconfig controlling each fabric (i.e., ib0 ib1), but, I have found that the insserv mechanism has severe limitations. BTW we are running SuSE Sles10. I just wonder if you had come across this kind of config and if you have any ideas about how you see this working. It looks like I need to have more than one opensmd (one for each fabric) but that is looking like it will not work either because of the insserv limitations. Any help or advice you have would be appreciated. Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.js21.ppc64.sles9sp3.ibutils.build.error.log Type: application/octet-stream Size: 5384130 bytes Desc: OFED.js21.ppc64.sles9sp3.ibutils.build.error.log URL: From kliteyn at dev.mellanox.co.il Mon Apr 30 23:56:39 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 01 May 2007 09:56:39 +0300 Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf() In-Reply-To: <20070501005101.GA26019@sashak.voltaire.com> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> Message-ID: <4636E4A7.7060108@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 01:11 Wed 25 Apr , Yevgeny Kliteynik wrote: >> Michael S. 
Tsirkin wrote: >>> Since you seem to do a strcat which does an anyway, how about, for example: >>> >>> - sprintf( buf_line1,"%s 0x%01x |", >>> - buf_line1, p_vla_tbl->vl_entry[i].vl); >>> + sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |", >>> + p_vla_tbl->vl_entry[i].vl); >>> >>> and so on in all the other places? >> Agree. >> I'll send a new patch later. > > Or like this: > > + int n = 0; > ... > - sprintf( buf_line1,"%s 0x%01x |", > - buf_line1, p_vla_tbl->vl_entry[i].vl); > + n += sprintf( buf_line1 + n," 0x%01x |", > + p_vla_tbl->vl_entry[i].vl); > > , so strlen() rerunning in loop is not needed anymore. Right, it does look better. -- Yevgeny > Sasha >
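For readers following the sprintf() thread above, here is a minimal sketch of the running-offset pattern Sasha suggests. It is not the OpenSM code: the helper name format_vl_line, its parameters, and the use of snprintf() with a size bound are assumptions made for illustration. The point it shows is that each entry is appended at buf + n, so the buffer is never passed as both destination and "%s" source, and strlen() is not re-run on every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of OpenSM): format `count` VL values into
 * `buf`, appending at a running offset instead of passing `buf` to sprintf()
 * as both destination and "%s" argument, which is undefined behavior.
 * Returns the number of characters written, excluding the terminating NUL.
 */
static size_t format_vl_line(char *buf, size_t size,
                             const unsigned char *vl, size_t count)
{
    size_t n = 0;
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    for (i = 0; i < count; i++) {
        int rc = snprintf(buf + n, size - n, " 0x%01x |",
                          (unsigned int)vl[i]);

        /* Stop on an output error or when the entry would not fit. */
        if (rc < 0 || (size_t)rc >= size - n) {
            buf[n] = '\0'; /* drop the partially written entry */
            break;
        }
        n += (size_t)rc; /* advance the offset; no strlen() needed */
    }
    return n;
}
```

Called with the values 0, 1 and 15 and a large enough buffer, this would be expected to produce " 0x0 | 0x1 | 0xf |". The bounds check is an addition beyond what the thread discusses; the original patch used plain sprintf() with an int offset.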