From oferg at mellanox.co.il Sun Jan 1 00:14:52 2006 From: oferg at mellanox.co.il (Ofer Gigi) Date: Sun, 1 Jan 2006 10:14:52 +0200 Subject: [openib-general] RE: [PATCH] osm: support for trivial PKey manager Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA32B6@mtlexch01.mtl.com> Hi Hal, 1. About the osm_indent - you are correct - it should have been in another patch. 2. Extra spaces - please remove - thanks. 3. > + /* signal = osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); */ Why add this commented out line ? My mistake, I commented the original code and forgot to remove - please remove it. 4. > # -i3 Substitute indent with 3 spaces > # -npcs No space after procedure calls > # -prs Space after parenthesis > -# -nsai No space after if keyword > -# -nsaw No space after while keyword > +# -nsai No space after if keyword - removed > +# -nsaw No space after while keyword - removed Should these comments just be removed ? No, please leave them, so people will know what they mean. Thanks ! Ofer -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Friday, December 30, 2005 4:52 PM To: Ofer Gigi Cc: OPENIB Subject: Re: [PATCH] osm: support for trivial PKey manager Hi again Ofer, On Thu, 2005-12-29 at 05:20, Ofer Gigi wrote: > Hi Hal, > > My name is Ofer Gigi, and I am a new software engineer in Mellanox > working on OpenSM. > This patch provides a new manager that solves the following problem: > > OpenSM is not currently compliant to the spec statement: > C14.62.1.1 Table 183 p870 l34: > "However, the SM shall ensure that one of the P_KeyTable entries in every > node contains either the value 0xFFFF (the default P_Key, full membership) > or the value 0x7FFF (the default P_Key, partial membership)." > > Luckily, all IB devices comes up from reset with preconfigured 0xffff key. > This was discovered during last plugfest. 
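As an illustration of the rule quoted above (C14.62.1.1), here is a minimal sketch in C of the check such a manager would need: does a node's P_Key table already hold one of the two default values? The helper name, the macros, and the flat-array table are assumptions made for this example; they are not taken from the OpenSM patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Default P_Key values named in the quoted spec text. */
#define DEFAULT_PKEY_FULL    0xFFFFu  /* default P_Key, full membership */
#define DEFAULT_PKEY_PARTIAL 0x7FFFu  /* default P_Key, partial membership */

/*
 * Hypothetical helper, not an OpenSM function: returns true if the
 * table already holds one of the two default P_Key values.
 */
static bool pkey_table_has_default(const uint16_t *table, size_t n_entries)
{
	for (size_t i = 0; i < n_entries; i++) {
		if (table[i] == DEFAULT_PKEY_FULL ||
		    table[i] == DEFAULT_PKEY_PARTIAL)
			return true;
	}
	return false;
}
```

A manager along the lines described would presumably add 0xffff to any port for which this check fails; how the table is read from and written to the node is outside this sketch.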
> > To overcome this limitation I implemented a simple elementary PKey manager > that will enforce the above rule (currently adds 0xffff if missing). > > This additional manager would be used for a full PKey policy manager > in the future. > > We have tested this patch > > Thanks Thanks. Applied. Some mechanical comments below (and also embedded). The general rule is one thought per patch. osm_indent is separate from this. Please try to ensure there is no extra whitespace at the end of the lines. There were several places where it was present. -- Hal > Ofer G. > > Signed-off-by: Ofer Gigi [snip...] > Index: opensm/osm_state_mgr.c > =================================================================== > --- opensm/osm_state_mgr.c (revision 4651) > +++ opensm/osm_state_mgr.c (working copy) > @@ -2216,9 +2219,11 @@ osm_state_mgr_process( > } > } > } > + > /* Need to continue with lid assigning */ > osm_drop_mgr_process( p_mgr->p_drop_mgr ); > - p_mgr->state = OSM_SM_STATE_SET_SM_UCAST_LID; > + > + p_mgr->state = OSM_SM_STATE_SET_PKEY; > > /* > * If we are not MASTER already - this means that we are > @@ -2229,6 +2234,62 @@ osm_state_mgr_process( > osm_sm_state_mgr_process( p_mgr->p_sm_state_mgr, > OSM_SM_SIGNAL_DISCOVERY_COMPLETED ); > > + /* signal = osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); */ Why add this commented out line ? [I think this was also in one other place as well.] > + /* the returned signal might be DONE or DONE_PENDING */ > + signal = osm_pkey_mgr_process( p_mgr->p_pkey_mgr ); > + break; > + > + default: > + __osm_state_mgr_signal_error( p_mgr, signal ); > + signal = OSM_SIGNAL_NONE; > + break; > + } > + break; > + [snip...] 
> Index: opensm/osm_indent > =================================================================== > --- opensm/osm_indent (revision 4651) > +++ opensm/osm_indent (working copy) > @@ -63,8 +63,8 @@ > # -i3 Substitute indent with 3 spaces > # -npcs No space after procedure calls > # -prs Space after parenthesis > -# -nsai No space after if keyword > -# -nsaw No space after while keyword > +# -nsai No space after if keyword - removed > +# -nsaw No space after while keyword - removed Should these comments just be removed ? > # -sc Put * at left of comments in a block comment style > # -nsob Don't swallow unnecessary blank lines > # -ts3 Tab size is 3 > @@ -81,7 +81,7 @@ for sourcefile in $*; do > perl -piW -e's/\x0D//' "$sourcefile" > echo Processing $sourcefile > indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \ > - -hnl -i3 -npcs -prs -nsai -nsaf -nsaw -sc -nsob -ts3 -psl -bfda -nut $sourcefile > + -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile > > rm ${sourcefile}W > From yipeeyipeeyipeeyipee at yahoo.com Sun Jan 1 01:49:24 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Sun, 1 Jan 2006 09:49:24 +0000 (UTC) Subject: [openib-general] Re: understanding mthca_alloc_db() References: Message-ID: Roland Dreier cisco.com> writes: > [snip] Thanks for the answers Roland. > yipee> The functions defines two types/groups of doorbells. Why > yipee> are these doorbells allocated differently (one group starts > yipee> at the begining of the array and the other at the end)? > > This is the way the hardware works. Where can I read about it? I have the documentation file: InfiniHost_Programmers_Reference_Manual_1_16.pdf but see no mention of this. Can you give any pointer to this info? > yipee> Another thing I noticed is that doorbells are different > yipee> between Tavor and Arbel HCA's (e.g. see > yipee> update_cons_index(). Is it correct that Arbel doorbells are > yipee> only 32 bits wide? > > Sort of. 
It is definitely true that Tavor-mode doorbells work
> differently from Arbel/mem-free-mode doorbells.

Again, where can I read about this?

I also have another question:
We would like to do RDMA to memory-mapped I/O so that a remote node can write, for example, directly to a device. We mapped a virtual address to the physical address of the device. When we tried to register this virtual memory area, the IB driver failed (get_user_pages() failed). Can you please advise how to make this work?

Thanks,
y

From eitan at mellanox.co.il Sun Jan 1 02:12:40 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 1 Jan 2006 12:12:40 +0200
Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B3EF@mtlexch01.mtl.com>

Hi Hal,

Do you expect the system administrator to manually fill in the discover.map, including ALL nodes in the fabric, their GUIDs and "name"? For a large cluster that number is quite large.

In the past I proposed using a "system IB connectivity model"* (IBNL) to provide a similar but superior capability. With IBNL describing each system type (provided by the system vendor, or extracted once for each system type), the administrator can avoid the need to fill in the data (GUID and name) for every node in the cluster. The administrator can select one of two** options:

1. Write a "system-level-topology"*** file to describe the expected topology, instantiating systems only (not devices). This topology file is then compared against the discovered topology and used (the names from the file as well as link width and speed) by all diagnostic tools for reporting errors.
2. Write an "annotation" file (a la discover.map syntax) that includes as few as one device per system, such that the extracted node-level topology can be matched against that spec and mapped dynamically.

* IBNL describes the IB connectivity inside a system in a hierarchical manner.
It enables specifying link width and speed inside the box and on the system interface. These properties are automatically propagated to the created topology, which enables their validation on the extracted topology. The created topology holds both the system-to-system connectivity layer and the flattened IB node and link layer (the latter is similar to the discover.topo). As IBNL describes the systems, a common naming scheme for the devices in each such system is provided by the system vendor and not freely annotated by the system administrator, so that any error reported (like a bad internal link or device) can be easily understood by the vendor too. Furthermore, when several devices misbehave, the code can correlate them to a specific board in the system and report the problem once for that entire board (this is demonstrated today by code under the ibdm tree - see below).

** Having a "spec topology" has great advantages over an extracted one. Several utilities let you:
+ Analyze your topology, even before one cable is laid out, for credit loops, num hops, asymmetrical routing patterns, etc.
+ Find routing errors that may very well happen on a large cluster due to the human process of connecting thousands of cables.
+ Find links that did not start up at the right speed or width due to bad cables or their connections.

*** By "system-level-topology" I mean a file that is made of the list of systems and not the list of IB nodes (embedded within these systems). For a large cluster using 288-port switch systems, the number of elements in the file is reduced 32 times...

The code to allow option 1 is available under:
https://openib.org/svn/gen2/utils/src/linux-user/ibdm

To support option 2 this code could be easily enhanced with a new "annotation" algorithm.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> Sent: Saturday, December 31, 2005 8:07 PM
> To: openib-general at openib.org
> Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics
>
> Hi,
>
> The OpenIB diagnostics
> (https://openib.org/svn/gen2/trunk/src/userspace/management/diags) have
> been updated as follows:
>
> 1. discover.pl diagnostic tool added
> discover.pl uses a topology file created by ibnetdiscover and a discover.map
> file which the network administrator creates which indicates the nodes
> to be expected and a discover.topo file which is the expected connectivity
> and produces a new connectivity file (discover.topo.new) and outputs
> the changes to stdout. The network administrator can choose to replace
> the "old" topo file with the new one or certain changes in.
>
> The syntax of the discover.map file is:
> |port|"Text for node"|
> e.g.
> 8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5
> 8f10403960558|2|"HCA 1"|# MT23108 InfiniHost Mellanox Technologies
>
> The syntax of the old and new topo files (discover.topo and discover.topo.new)
> is:
> |||
> e.g.
> 10|5442ba00003080|1|8f10400410015
>
> These topo files are produced by the discover.pl tool.
>
> 2. ibportstate diagnostic tool added to query, disable, and enable
> switch ports
>
> 3. Added error only mode to diagnostic scripts so less data to weed
> through on a large fabric (also verbose mode to see everything)
>
> 4. Tree structure collapsed so all tools in same directory as opposed to
> individual ones and build simplified
>
> Let me know about any comments or issues. Thanks.
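To make the discover.map example lines above concrete, here is a minimal sketch in C that splits one such line into its fields. It assumes the first field is a hexadecimal node GUID (the field name is missing from the syntax line as archived, but the example values and the mention of GUIDs in this thread suggest it), followed by a decimal port and a double-quoted node name. The struct and function names are hypothetical and are not part of discover.pl.

```c
#include <stdio.h>

/* Hypothetical record for one discover.map line. */
struct map_entry {
	unsigned long long guid;  /* first field, assumed to be a hex node GUID */
	unsigned int port;        /* second field */
	char name[64];            /* text between the double quotes */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int parse_map_line(const char *line, struct map_entry *out)
{
	/* Anything after the closing quote (e.g. "|# comment") is ignored. */
	int n = sscanf(line, "%llx|%u|\"%63[^\"]\"",
		       &out->guid, &out->port, out->name);

	return n == 3 ? 0 : -1;
}
```

For the first example line, this would yield GUID 0x8f10400410015, port 8 and the name "ISR 6000". Names longer than 63 characters are truncated in this sketch.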
> > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Sun Jan 1 02:34:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 1 Jan 2006 12:34:14 +0200 Subject: [openib-general] [PATCH] mthca: max_inline_data tweaks Message-ID: <20060101103414.GU4907@mellanox.co.il> Input parameter max_inline_data checks in qp creation routines arent exactly right. --- Fix a case where copying max_inline_data from a successful create_qp capabilities output to create_qp input could cause EINVAL error: mthca_set_qp_size must check max_inline_data directly against max_desc_sz: checking qp->sq.max_gs is wrong since max_inline_data depends on the qp type and does not involve max_sg. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c @@ -882,18 +882,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, return err; } -static void mthca_adjust_qp_caps(struct mthca_dev *dev, - struct mthca_pd *pd, - struct mthca_qp *qp) +static int mthca_max_data_size(struct mthca_dev *dev, struct mthca_qp *qp, int desc_sz) { - int max_data_size; - /* * Calculate the maximum size of WQE s/g segments, excluding * the next segment and other non-data segments. 
*/ - max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) - - sizeof (struct mthca_next_seg); + int max_data_size = desc_sz - sizeof (struct mthca_next_seg); switch (qp->transport) { case MLX: @@ -912,11 +907,23 @@ static void mthca_adjust_qp_caps(struct break; } + return max_data_size; +} + +static inline int mthca_max_inline_data(struct mthca_pd *pd, int max_data_size) +{ /* We don't support inline data for kernel QPs (yet). */ - if (!pd->ibpd.uobject) - qp->max_inline_data = 0; - else - qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE; + return pd->ibpd.uobject ? max_data_size - MTHCA_INLINE_HEADER_SIZE : 0; +} + +static void mthca_adjust_qp_caps(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int max_data_size = mthca_max_data_size(dev, qp, min(dev->limits.max_desc_sz, + 1 << qp->sq.wqe_shift)); + + qp->max_inline_data = mthca_max_inline_data(pd, max_data_size); qp->sq.max_gs = min_t(int, dev->limits.max_sg, max_data_size / sizeof (struct mthca_data_seg)); @@ -1183,13 +1190,22 @@ static int mthca_alloc_qp_common(struct } static int mthca_set_qp_size(struct mthca_dev *dev, struct ib_qp_cap *cap, - struct mthca_qp *qp) + struct mthca_pd *pd, struct mthca_qp *qp) { /* Sanity check QP size before proceeding */ if (cap->max_send_wr > dev->limits.max_wqes || cap->max_recv_wr > dev->limits.max_wqes || cap->max_send_sge > dev->limits.max_sg || - cap->max_recv_sge > dev->limits.max_sg) + cap->max_recv_sge > dev->limits.max_sg || + cap->max_inline_data > + mthca_max_inline_data(pd, mthca_max_data_size(dev, qp, dev->limits.max_desc_sz))) + return -EINVAL; + + /* + * For MLX transport we need 2 extra S/G entries: + * one for the header and one for the checksum at the end + */ + if (qp->transport == MLX && cap->max_recv_sge + 2 > dev->limits.max_sg) return -EINVAL; if (mthca_is_memfree(dev)) { @@ -1208,14 +1224,6 @@ static int mthca_set_qp_size(struct mthc MTHCA_INLINE_CHUNK_SIZE) / sizeof (struct mthca_data_seg)); - /* 
- * For MLX transport we need 2 extra S/G entries: - * one for the header and one for the checksum at the end - */ - if ((qp->transport == MLX && qp->sq.max_gs + 2 > dev->limits.max_sg) || - qp->sq.max_gs > dev->limits.max_sg || qp->rq.max_gs > dev->limits.max_sg) - return -EINVAL; - return 0; } @@ -1230,7 +1238,7 @@ int mthca_alloc_qp(struct mthca_dev *dev { int err; - err = mthca_set_qp_size(dev, cap, qp); + err = mthca_set_qp_size(dev, cap, pd, qp); if (err) return err; @@ -1273,7 +1281,7 @@ int mthca_alloc_sqp(struct mthca_dev *de u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; int err; - err = mthca_set_qp_size(dev, cap, &sqp->qp); + err = mthca_set_qp_size(dev, cap, pd, &sqp->qp); if (err) return err; -- MST From jackm at mellanox.co.il Sun Jan 1 02:49:52 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 1 Jan 2006 12:49:52 +0200 Subject: [openib-general] [PATCH] mthca: check port validity in modify_qp Message-ID: <20060101104952.GA3082@mellanox.co.il> Modify_qp should check that the physical port number provided is a legal value. 
Signed-off-by: Jack Morgenstein

Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (revision 4666)
+++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -619,6 +619,12 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		return -EINVAL;
 	}
 
+	if ((attr_mask & IB_QP_PORT) &&
+	    (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) {
+		mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num);
+		return -EINVAL;
+	}
+
 	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
 	    attr->max_rd_atomic > dev->limits.max_qp_init_rdma) {
 		mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n",

From rdreier at cisco.com Sun Jan 1 12:02:42 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 01 Jan 2006 12:02:42 -0800
Subject: [openib-general] Re: Userspace testing results (2.6.15-rc7-git2 with modules)
In-Reply-To: <20051230004313.GA8111@us.ibm.com> (Nishanth Aravamudan's message of "Thu, 29 Dec 2005 16:43:13 -0800")
References: <20051230004313.GA8111@us.ibm.com>
Message-ID:

Nish, I haven't had time to look at the issues you saw yet, but let me
say right off the bat that you're awesome ;)

 - R.

From bboas at llnl.gov Sun Jan 1 16:15:37 2006
From: bboas at llnl.gov (Bill Boas)
Date: Sun, 01 Jan 2006 16:15:37 -0800
Subject: [openib-general] Offer to give NFS/RDMA presentation at Sonoma Workshop Feb 5-8
In-Reply-To:
References:
Message-ID: <6.2.3.4.2.20060101160534.03453c60@mail-lc.llnl.gov>

Yes, James, we would like that. I'm forwarding your offer to the openib-general and promoters lists so those working on the agenda will see it.

Thank you also for registering already for the Workshop - hope Sonoma county has dried out by then, the next storm is just starting!

Happy New Year.

Bill.

At 07:24 PM 12/30/2005, you wrote:

>If you would like, I can present a talk on NFS/RDMA at the OpenIB
>workshop.
> >james > >-- >James Lentini | Network Appliance | 781-768-5359 | jlentini at netapp.com Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From nacc at us.ibm.com Sun Jan 1 18:55:34 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 1 Jan 2006 18:55:34 -0800 Subject: [openib-general] Userspace testing results (2.6.15-rc7-git2 with modules) In-Reply-To: <20060101024000.GG14100@narn.hozed.org> References: <20051230004313.GA8111@us.ibm.com> <20060101024000.GG14100@narn.hozed.org> Message-ID: <20060102025534.GA22562@us.ibm.com> On 31.12.2005 [20:40:00 -0600], Troy Benjegerdes wrote: > > Currently, I am running netpipe, iperf and netperf (these three tests > > are giving horrible results but we are pretty sure that it is a local > > issue, as both eth1 and ib0 based tests lead to poor performance) and > > also netpipe with a patch from Shirley Ma to run over native IB [1]. > > Additionally, I am running the 4 pingpong tests (rc, srq, uc, ud) and > > the two perftest tests: rdma_lat and rdma_bw. There are some issues with > > some size combinations; or, at least, that is how it seems to me. > > > > I assume this is the same patch I have here.. > http://source.scl.ameslab.gov/hg/netpipe3-dev Yup, it looks effectively identical Thanks, Nish From nacc at us.ibm.com Sun Jan 1 19:57:49 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 1 Jan 2006 19:57:49 -0800 Subject: [openib-general] Re: Userspace testing results (2.6.15-rc7-git2 with modules) In-Reply-To: References: <20051230004313.GA8111@us.ibm.com> Message-ID: <20060102035749.GB22562@us.ibm.com> On 01.01.2006 [12:02:42 -0800], Roland Dreier wrote: > Nish, I haven't had time to look at the issues you saw yet, but let me > say right off the bat that you're awesome ;) Heh, I'm glad I was actually able to get some runs in before New Year's. 
I actually have results for svn rev 4662 and 4663 (4670 is pending right now) but some of them didn't complete correctly (machine problems, I think, not OpenIB issues). So I would rather wait to see what happens with 4670 and post a complete set of numbers if I can. Thanks, Nish From jackm at mellanox.co.il Mon Jan 2 01:42:03 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 2 Jan 2006 11:42:03 +0200 Subject: [openib-general] [PATCH] mthca: fix for SQEr-to-RTS transition in modify-qp Message-ID: <20060102094203.GA5607@mellanox.co.il> Fixes to SQEr->RTS transition in modify_qp: 1. The flag IB_QP_ACCESS_FLAGS is optional for UC qps 2. The SQEr state is not supported for RC qps Signed-off-by: Jack Morgenstein Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c @@ -474,9 +474,8 @@ static const struct { .opt_param = { [UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [UC] = IB_QP_CUR_STATE, - [RC] = (IB_QP_CUR_STATE | - IB_QP_MIN_RNR_TIMER), + [UC] = IB_QP_CUR_STATE | + IB_QP_ACCESS_FLAGS, [MLX] = (IB_QP_CUR_STATE | IB_QP_QKEY), } From jackm at mellanox.co.il Mon Jan 2 01:43:40 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 2 Jan 2006 11:43:40 +0200 Subject: [openib-general] [PATCH] mthca: fix for RTR-to-RTS transition in modify-qp Message-ID: <20060102094340.GB5607@mellanox.co.il> PKEY_INDEX is not a legal parameter in the RTR->RTS transition Signed-off-by: Jack Morgenstein Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c @@ -381,12 +381,10 @@ static const struct { [UC] = (IB_QP_CUR_STATE | IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | IB_QP_PATH_MIG_STATE), [RC] = (IB_QP_CUR_STATE | 
IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), [MLX] = (IB_QP_CUR_STATE | From yipeeyipeeyipeeyipee at yahoo.com Mon Jan 2 04:28:20 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 2 Jan 2006 12:28:20 +0000 (UTC) Subject: [openib-general] Re: understanding mthca_alloc_db() References: Message-ID: yipee yahoo.com> writes: > > Roland Dreier cisco.com> writes: > > > > [snip] Ok I've found the correct PRM (InfiniHost_III_Programmers_Reference_Manual_0_86. pdf). Any comment on this next question? > We would like to do rdma to memory mapped I/O so a remote node can write, > for example, directly to a device. We mapped some virtual address to > the physical address of the device. When we tried to register this > virtual memory area the ib driver failed (get_user_pages() failed). > Can you please advice about how to make this work? y From halr at voltaire.com Mon Jan 2 06:40:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Jan 2006 09:40:39 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION Message-ID: <1136212605.4331.33056.camel@hal.voltaire.com> OpenSM: Separate out OSM_VERSION so when changing only needed files are recompiled rather than everything Signed-off-by: Hal Rosenstock Index: osm/include/opensm/osm_version.h =================================================================== --- osm/include/opensm/osm_version.h (revision 0) +++ osm/include/opensm/osm_version.h (revision 0) @@ -0,0 +1,65 @@ +/* + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + + +#ifndef _OSM_VERSION_H_ +#define _OSM_VERSION_H_ + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****s* OpenSM: Base/OSM_VERSION +* NAME +* OSM_VERSION +* +* DESCRIPTION +* The version string for OpenSM +* +* SYNOPSIS +*/ +#define OSM_VERSION "OpenSM Rev:openib-1.1.0" +/********/ + +END_C_DECLS + +#endif /* _OSM_VERSION_H_ */ Property changes on: osm/include/opensm/osm_version.h ___________________________________________________________________ Name: svn:keywords + Id Index: osm/include/opensm/osm_base.h =================================================================== --- osm/include/opensm/osm_base.h (revision 4686) +++ osm/include/opensm/osm_base.h (working copy) @@ -89,18 +89,6 @@ BEGIN_C_DECLS * Steve King, Intel * *********/ -/****s* OpenSM: Base/OSM_VERSION -* NAME -* OSM_VERSION -* -* DESCRIPTION -* The version string for OpenSM -* -* SYNOPSIS -*/ -#define OSM_VERSION "OpenSM Rev:openib-1.1.0" -/********/ - /****s* OpenSM: Base/OSM_DEFAULT_M_KEY * NAME * OSM_DEFAULT_M_KEY Index: osm/opensm/osm_opensm.c =================================================================== --- osm/opensm/osm_opensm.c (revision 4686) +++ osm/opensm/osm_opensm.c (working copy) @@ -58,6 +58,7 @@ #include #include #include +#include #include #include #include Index: osm/opensm/main.c =================================================================== --- osm/opensm/main.c (revision 4686) +++ osm/opensm/main.c (working copy) @@ -56,6 +56,7 @@ #include #include #include +#include #include #include #include From vonbrand at inf.utfsm.cl Mon Jan 2 08:05:43 2006 From: vonbrand at inf.utfsm.cl (Horst von Brand) Date: Mon, 02 Jan 2006 13:05:43 -0300 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: Message from Lee Revell of "Thu, 29 Dec 2005 14:26:24 CDT." 
<1135884385.6804.0.camel@mindpipe>
Message-ID: <200601021605.k02G5iN9010252@laptop11.inf.utfsm.cl>

Lee Revell wrote:
> On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote:
> > > - Someone asked for the kernel's i2c infrastructure to be used, but
> > > our i2c usage is very specialised, and it would be more of a mess
> > > to use the kernel's
> > Problem with that is that if everybody and Aunt Tillie does the same,
> > the kernel as a whole gets to be a mess.
> ALSA does the exact same thing for the exact same reason. Maybe an
> indication that the kernel's i2c layer is too heavy?

That would mean that the respective teams should put their heads together
and (re)design it to their needs...
--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica              Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria       +56 32 654239
Casilla 110-V, Valparaiso, Chile         Fax:  +56 32 797513

From hch at infradead.org Mon Jan 2 08:22:29 2006
From: hch at infradead.org (Christoph Hellwig)
Date: Mon, 2 Jan 2006 16:22:29 +0000
Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver
In-Reply-To: <200601021605.k02G5iN9010252@laptop11.inf.utfsm.cl>
References: <1135884385.6804.0.camel@mindpipe> <200601021605.k02G5iN9010252@laptop11.inf.utfsm.cl>
Message-ID: <20060102162229.GB13904@infradead.org>

On Mon, Jan 02, 2006 at 01:05:43PM -0300, Horst von Brand wrote:
> > > Problem with that is that if everybody and Aunt Tillie does the same,
> > > the kernel as a whole gets to be a mess.
>
> > ALSA does the exact same thing for the exact same reason. Maybe an
> > indication that the kernel's i2c layer is too heavy?
>
> That would mean that the respective teams should put their heads together
> and (re)design it to their needs...

Exactly. We got quite a few developers to help adjust the i2c stack for their needs and improve it. The i2c stack started out being used only for hardware monitoring chips and then later multimedia devices.
Help to make it more useful for other users is always appreciated. From ebiederm at xmission.com Mon Jan 2 12:35:07 2006 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 02 Jan 2006 13:35:07 -0700 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <1135993250.13318.94.camel@serpentine.pathscale.com> (Bryan O'Sullivan's message of "Fri, 30 Dec 2005 17:40:50 -0800") References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> Message-ID: "Bryan O'Sullivan" writes: > On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote: > >> But we (the kernel community), don't really accept that as a valid >> reason to accept this kind of code, sorry. > > Fair enough. I'd like some guidance in that case. Some of our ioctls > access the hardware more or less directly, while others do things like > read or reset counters. As a general rule a driver should push as much functionality to libraries and the infrastructure code as possible. > Which of these kinds of operations are appropriate to retain as ioctls, > in your eyes, and which are best converted to sysfs or configfs > alternatives? > > As an example, take a look at ipath_sma_ioctl. It seems to me that > receiving or sending subnet management packets ought to remain as > ioctls, while getting port or node data could be turned into sysfs > attributes. Lane identification could live in configfs. If you think > otherwise, please let me know what's more appropriate. I haven't looked closely enough at the state of the openib tree but you should not need an additional interface to send/receive standard IB subnet management packets. That is something that should be provided the same way by all infiniband drivers. The only case I can think of where this might not already exist is the code that responds to the subnet manager. 
If the current interfaces are not sufficient then the infiniband layer needs more work. > The less blind I am in doing these conversions, the fewer rounds we'll > have to go in reviewing humongous driver submission patches :-) Given Linus's comments and looking at where you are getting stuck I would recommend you split out support for the nonstandard ipath protocol from the rest of the driver. If the standard infiniband interfaces for kernel bypass are not sufficient for flinging packets then we need to re-examine them. Eric From bos at pathscale.com Mon Jan 2 14:22:38 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 02 Jan 2006 14:22:38 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> Message-ID: <1136240558.20330.57.camel@serpentine.pathscale.com> On Mon, 2006-01-02 at 13:35 -0700, Eric W. Biederman wrote: > I haven't looked closely enough at the state of the openib tree but > you should not need an additional interface to send/receive standard > IB subnet management packets. That is something that should be provided > the same way by all infiniband drivers. We provide the standard OpenIB mechanisms for doing that, of course. However, our driver is layered. The OpenIB layer uses facilities provided by the main driver (via ipath_layer.c). The main driver can stand alone, without the OpenIB code compiled into the kernel or available as a module at all. In that case, a userland subnet management agent must still be able to send and receive management packets. > Given Linus's comments and looking at where you are getting stuck I > would recommend you split out support for the nonstandard ipath > protocol from the rest of the driver. 
While we can split the main driver source file up along those lines, we
are not planning to make the ipath protocol optional. We are planning
to submit another non-OpenIB network driver that depends on the ipath
protocol support.

Message-ID:

> Where can I read about it? I have the documentation file:
> InfiniHost_Programmers_Reference_Manual_1_16.pdf but see no
> mention of this. Can you give any pointer to this info?

You need the InfiniHost III PRM -- the manual you have now is for
non-mem-free HCAs.

> We would like to do rdma to memory mapped I/O so a remote node
> can write, for example, directly to a device. We mapped some
> virtual address to the physical address of the device. When we
> tried to register this virtual memory area the ib driver failed
> (get_user_pages() failed). Can you please advise about how to
> make this work?

It will be a little tricky to do this. You need to register a memory
region that is mapped to the bus addresses of your MMIO region. You
could create a new kernel interface to do this (hacky but not too
hard) or enhance get_user_pages() to handle MMIO regions (cleaner but
more difficult).

 - R.

From rdreier at cisco.com Tue Jan 3 00:02:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 03 Jan 2006 00:02:15 -0800
Subject: [openib-general] Kickstart over OpenIB?
References: <1135955154.5243.34.camel@brilong-lnx>
Message-ID:

Brian> I would like to know how difficult it would be to modify
Brian> kickstart such that it would work over Infiniband. I've
Brian> asked the Red Hat anaconda developers about this and, as
Brian> you can see in the attached email, they believe IB only
Brian> accepts netlink and no ioctls for network setup. Is this
Brian> true?

No, I think RH support is a little confused. A few ioctls having to do
with hardware addresses (eg SIOCGIFHWADDR) don't work for IB, because
IPoIB hardware addresses are 20 bytes while the ioctl ABI can only
handle 14 bytes.
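
The size mismatch described here can be checked with a few lines of C.
This is only a minimal sketch, assuming the usual Linux/glibc layout in
which SIOCGIFHWADDR returns the address in a struct sockaddr whose
sa_data array holds 14 bytes. The EXAMPLE_* constants and the helper
names are made up for illustration and are not taken from any driver
or library.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Illustrative constants (assumptions, not library definitions):
 * the 20-byte IPoIB hardware address length mentioned in the thread,
 * and the familiar 6-byte Ethernet MAC length for comparison. */
#define EXAMPLE_IPOIB_HWADDR_LEN 20
#define EXAMPLE_ETHER_HWADDR_LEN 6

/* Number of address bytes a struct sockaddr can carry in sa_data. */
size_t hwaddr_capacity(void)
{
	return sizeof(((struct sockaddr *)0)->sa_data);
}

/* Returns 1 if an address of addr_len bytes fits in sa_data,
 * 0 if it would have to be truncated. */
int hwaddr_fits(size_t addr_len)
{
	return addr_len <= hwaddr_capacity();
}

/* Sanity check for the layout this sketch relies on. */
void check_sockaddr_layout(void)
{
	assert(hwaddr_capacity() == 14);
}
```

Under that assumption a 6-byte Ethernet address fits in sa_data while
a 20-byte IPoIB address does not.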
The practical impact is that ifconfig will not show the correct HW
address for an IPoIB interface. Other than that, any strictly
IP-related ioctls will work fine -- for example, something like
"ifconfig ib0 192.168.1.2" works.

With that said I don't have much experience with kickstart, so I have
no idea if there are any other glitches that you might run into.

 - R.

From ogerlitz at voltaire.com Tue Jan 3 03:43:04 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 3 Jan 2006 13:43:04 +0200 (IST)
Subject: [openib-general] [PATCH] iser: simplify handling of iscsi unsolicited data
Message-ID:

The patch below eliminates the special handling of memory to be used for
iscsi unsolicited data write, instead all the command data is registered
for rdma.

The descriptors holding the rdma registration info were much simplified,
the fields rdma_read/write_dto and send/recv_buff_list were removed from
struct iscsi_iser_cmd_task and are now replaced with rdma_regd.

Signed-off-by: Alex Nezhinsky
Signed-off-by: Or Gerlitz

Index: ulp/iser/iser_memory.h
===================================================================
--- ulp/iser/iser_memory.h (revision 4622)
+++ ulp/iser/iser_memory.h (working copy)
@@ -54,13 +54,6 @@ void iser_reg_single(struct iser_adaptor
                      struct iser_regd_buf *p_regd_buf,
                      enum dma_data_direction direction);
 
-void iser_reg_single_task(struct iser_adaptor *p_iser_adaptor,
-                          struct iser_regd_buf *p_regd_buf,
-                          void *virt_addr,
-                          dma_addr_t dma_addr,
-                          unsigned long data_size,
-                          enum dma_data_direction direction);
-
 /* scatterlist */
 int iser_sg_size(struct iser_data_buf *p_mem);
 
@@ -70,11 +63,6 @@ void iser_start_rdma_unaligned_sg(struct
 void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task);
 
 /* iser_data_buf */
-unsigned int iser_data_buf_contig_len(struct iser_data_buf *p_data,
-                                      int skip,
-                                      dma_addr_t *chunk_dma_addr,
-                                      int *chink_sz);
-
 unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data,
                                        int skip);
 
Index:
ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 4622) +++ ulp/iser/iscsi_iser.h (working copy) @@ -277,16 +277,8 @@ struct iscsi_iser_cmd_task { unsigned int post_send_count; /* posted send buffers pending completion */ - /* buffers, to release when the task is complete */ - struct list_head send_buff_list; - struct list_head rcv_buff_list; - struct iser_dto rdma_read_dto; - struct iser_dto rdma_write_dto; - - struct list_head conn_list; /* Tasks list of the conn */ - struct list_head hash_list; /* Hash table bucket entry */ - int dir[ISER_DIRS_NUM]; /* set if direction used */ + struct iser_regd_buf *rdma_regd[ISER_DIRS_NUM]; /* regd rdma buffer */ unsigned long data_len[ISER_DIRS_NUM]; /* total data length */ struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data descriptor */ struct iser_data_buf data_copy[ISER_DIRS_NUM]; /* contig. copy */ Index: ulp/iser/iser.h =================================================================== --- ulp/iser/iser.h (revision 4622) +++ ulp/iser/iser.h (working copy) @@ -63,16 +63,9 @@ #define ISER_TOTAL_HEADERS_LEN \ (ISER_HDR_LEN + ISER_PDU_BHS_LENGTH) -/* Hash tables */ -#define HASH_TABLE_SIZE 256 - /* Various size limits */ #define ISER_LOGIN_PHASE_PDU_DATA_LEN (8*1024) /* 8K */ -struct hash_table { - struct list_head bucket_head[HASH_TABLE_SIZE]; - spinlock_t lock; -}; struct iser_page_vec { u64 *pages; @@ -99,9 +92,6 @@ struct iser_regd_buf { enum dma_data_direction direction; /* direction for dma_unmap */ unsigned int data_size; - - /* To be chained here, if freeing upon completion is signaled */ - struct list_head free_upon_comp_list; /* Reference count, memory freed when decremented to 0 */ atomic_t ref_count; }; @@ -149,8 +139,6 @@ struct iser_global { kmem_cache_t *login_cache; kmem_cache_t *header_cache; - - struct hash_table task_hash; /* hash table for tasks */ }; /* iser_global */ extern struct iser_global ig; Index: ulp/iser/iser_dto.c 
=================================================================== --- ulp/iser/iser_dto.c (revision 4622) +++ ulp/iser/iser_dto.c (working copy) @@ -79,152 +79,6 @@ int iser_dto_add_regd_buff(struct iser_d } /** - * iser_dto_clone_regd_buffs - creates a dto (dst) which refers to a subrange - * of the memory referenced by another dto (src). - */ -void iser_dto_clone_regd_buffs(struct iser_dto *p_dst, - struct iser_dto *p_src, - unsigned long offset, - unsigned long size) -{ - unsigned long remaining_offset = offset; - unsigned long remaining_size = size; - unsigned long regd_buf_size; - unsigned long used_size; - int i; - - for (i = 0; i < p_src->regd_vector_len; i++) { - regd_buf_size = p_src->used_sz[i] > 0 ? - p_src->used_sz[i] : - p_src->regd[i]->reg.len; - - if (remaining_offset < regd_buf_size) { - used_size = min(remaining_size, - regd_buf_size - remaining_offset); - iser_dto_add_regd_buff(p_dst, - p_src->regd[i], - USE_OFFSET(p_src-> - offset[i] + - remaining_offset), - USE_SIZE(used_size)); - remaining_size -= used_size; - if (remaining_size == 0) - break; - else - remaining_offset = 0; - } else - remaining_offset -= regd_buf_size; - } - if (remaining_size > 0) - iser_bug("size to clone:%ld exceeds by %ld the total size of " - "src DTO:0x%p; dst DTO:0x%p, task:0x%p\n", - size, remaining_size, p_src, p_dst, p_dst->p_task); -} - -/** - * iser_dto_add_local_single - - */ -void iser_dto_add_local_single(struct iser_adaptor *p_iser_adaptor, - struct iser_dto *p_dto, - void *virt_addr, - dma_addr_t dma_addr, - unsigned long data_size, - enum dma_data_direction direction) -{ - struct iser_regd_buf *p_regd_buf; - - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); - - iser_reg_single_task(p_iser_adaptor, p_regd_buf, - virt_addr, dma_addr, data_size, direction); - - iser_dto_add_regd_buff(p_dto, p_regd_buf, - USE_NO_OFFSET, USE_ENTIRE_SIZE); -} - -/** - * iser_dto_add_local_sg - adds a scatterlist to a dto intended for local - * operations only; tries to use 
registration keys from all-memory - * registration whenever possible. - */ -int iser_dto_add_local_sg(struct iser_dto *p_dto, - struct iser_data_buf *p_mem, - enum dma_data_direction direction) -{ - struct iser_adaptor *p_iser_adaptor = p_dto->p_conn->ib_conn->p_adaptor; - struct iser_regd_buf *p_regd_buf; - int cur_buf = 0; - int err = 0; - int num_sg; - - do { - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); - if (p_regd_buf == NULL) { - iser_err("Failed to alloc regd_buf\n"); - err = -ENOMEM; - goto dto_add_local_sg_exit; - } - /* if enough place in IOV for all sg entries, use all-memory - * registration, otherwise register memory */ - /* DMA_MAP: by now the sg must have been mapped, get the dma addr properly & pass it */ - if (p_mem->dma_nents - cur_buf < - MAX_REGD_BUF_VECTOR_LEN - p_dto->regd_vector_len) { - dma_addr_t chunk_dma_addr; - int chunk_sz; - void *chunk_vaddr; - num_sg = iser_data_buf_contig_len(p_mem, - cur_buf, /* skip */ - &chunk_dma_addr, - &chunk_sz); - /* DMA_MAP: vaddr not needed for this regd_buf */ - chunk_vaddr = 0; - iser_reg_single_task(p_iser_adaptor, p_regd_buf, - chunk_vaddr, chunk_dma_addr, - chunk_sz, direction); - } else { - struct iser_page_vec *page_vec; - num_sg = iser_data_buf_aligned_len(p_mem,cur_buf); - page_vec = iser_page_vec_alloc(p_mem,cur_buf,num_sg); - if (page_vec == NULL) { - iser_err("Failed to alloc page_vec\n"); - iser_regd_buff_release(p_regd_buf); - err = -ENOMEM; - goto dto_add_local_sg_exit; - } - iser_page_vec_build(p_mem,page_vec,cur_buf,num_sg); - - err = iser_reg_phys_mem(p_iser_adaptor, - page_vec, - IB_ACCESS_LOCAL_WRITE | - IB_ACCESS_REMOTE_WRITE | - IB_ACCESS_REMOTE_READ , - &p_regd_buf->reg); - iser_page_vec_free(page_vec); - if (err) { - iser_err("Failed to register %d sg entries " - "starting from %d\n",num_sg,cur_buf); - iser_regd_buff_release(p_regd_buf); - goto dto_add_local_sg_exit; - } - - iser_dto_add_regd_buff(p_dto, - p_regd_buf, - USE_NO_OFFSET, - USE_ENTIRE_SIZE); - } - 
iser_dto_add_regd_buff(p_dto, p_regd_buf, - USE_NO_OFFSET, USE_ENTIRE_SIZE); - iser_dbg("Added regd.buf:0x%p to DTO:0x%p now %d regd.bufs\n", - p_regd_buf, p_dto, p_dto->regd_vector_len); - - cur_buf += num_sg; - } while (cur_buf < p_mem->size); - - dto_add_local_sg_exit: - return err; -} - -/** * iser_dto_buffs_release - free all registered buffers */ void iser_dto_buffs_release(struct iser_dto *p_dto) Index: ulp/iser/iser_dto.h =================================================================== --- ulp/iser/iser_dto.h (revision 4622) +++ ulp/iser/iser_dto.h (working copy) @@ -47,28 +47,11 @@ int iser_dto_add_regd_buff(struct iser_d struct iser_regd_buf *p_regd_buf, unsigned long use_offset, unsigned long use_size); -void -iser_dto_clone_regd_buffs(struct iser_dto *p_dst_dto, - struct iser_dto *p_src_dto, - unsigned long offset, - unsigned long size); -void iser_dto_buffs_release(struct iser_dto *p_dto); void iser_dto_free(struct iser_dto *p_dto); int iser_dto_completion_error(struct iser_dto *p_dto); -void iser_dto_add_local_single(struct iser_adaptor *p_iser_adaptor, - struct iser_dto *p_dto, - void *virt_addr, - dma_addr_t dma_addr, - unsigned long data_size, - enum dma_data_direction direction); - -int iser_dto_add_local_sg(struct iser_dto *p_dto, - struct iser_data_buf *p_mem, - enum dma_data_direction direction); - void iser_dto_get_rx_pdu_data(struct iser_dto *p_dto, unsigned long dto_xfer_len, struct iscsi_hdr **p_rx_hdr, Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 4622) +++ ulp/iser/iser_initiator.c (working copy) @@ -46,11 +46,6 @@ #include "iser_verbs.h" #include "iser_memory.h" -#define ISCSI_AHSL_MASK 0xFF000000 -#define ISCSI_DSL_MASK 0x00FFFFFF -#define ISCSI_INVALID_ITT 0xFFFFFFFF - - static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task); /** @@ -60,43 +55,27 @@ static void iser_dma_unmap_task_data(str * returns 0 on success, -1 
on failure */ static int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *p_iser_task, - enum iser_data_dir cmd_dir, - struct iser_data_buf *p_mem, - struct iser_regd_buf **regd_buf) + enum iser_data_dir cmd_dir) { struct iser_adaptor *p_iser_adaptor = p_iser_task->conn->ib_conn->p_adaptor; - struct list_head *p_task_buff_list = NULL; struct iser_page_vec *page_vec = NULL; struct iser_regd_buf *p_regd_buf = NULL; - struct iser_dto *p_dto = NULL; - enum ib_access_flags priv_flags = 0; + enum ib_access_flags priv_flags = IB_ACCESS_LOCAL_WRITE; + struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; unsigned int page_vec_len = 0; - struct iser_data_buf *mem_to_reg; - int cnt_to_reg; + int cnt_to_reg = 0; int err = 0; - if (cmd_dir == ISER_DIR_IN) { - iser_dbg("cmd_dir == ISER_DIR_IN\n"); - p_dto = &p_iser_task->rdma_write_dto; - p_task_buff_list = &p_iser_task->rcv_buff_list; - priv_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; - } else if (cmd_dir == ISER_DIR_OUT) { - iser_dbg("cmd_dir == ISER_DIR_OUT\n"); - p_dto = &p_iser_task->rdma_read_dto; - p_task_buff_list = &p_iser_task->send_buff_list; - priv_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ; - } else - iser_bug("Unexpected cmd dir:%d, task:0x%p\n", - cmd_dir, p_iser_task); - *regd_buf = NULL; + if (cmd_dir == ISER_DIR_IN) + priv_flags |= IB_ACCESS_REMOTE_WRITE; + else + priv_flags |= IB_ACCESS_REMOTE_READ; + p_iser_task->rdma_regd[cmd_dir] = NULL; p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); if (p_regd_buf == NULL) return -ENOMEM; - cnt_to_reg = 0; - mem_to_reg = p_mem; - iser_dbg("p_mem %p p_mem->type %d\n", p_mem,p_mem->type); if (p_mem->type != ISER_BUF_TYPE_SINGLE) { @@ -114,19 +93,19 @@ static int iser_reg_rdma_mem(struct iscs /* unaligned scatterlist, anyway dma map the copy */ iser_start_rdma_unaligned_sg(p_iser_task, cmd_dir); p_regd_buf->virt_addr = p_iser_task->data_copy[cmd_dir].p_buf; - mem_to_reg = &p_iser_task->data_copy[cmd_dir]; + p_mem = 
&p_iser_task->data_copy[cmd_dir]; } } else { iser_dbg("converting single to page_vec\n"); p_regd_buf->virt_addr = p_mem->p_buf; } - page_vec = iser_page_vec_alloc(mem_to_reg,0,cnt_to_reg); + page_vec = iser_page_vec_alloc(p_mem,0,cnt_to_reg); if (page_vec == NULL) { iser_regd_buff_release(p_regd_buf); return -ENOMEM; } - page_vec_len = iser_page_vec_build(mem_to_reg,page_vec, 0, cnt_to_reg); + page_vec_len = iser_page_vec_build(p_mem, page_vec, 0, cnt_to_reg); err = iser_reg_phys_mem(p_iser_adaptor, page_vec, priv_flags, &p_regd_buf->reg); iser_page_vec_free(page_vec); @@ -135,57 +114,10 @@ static int iser_reg_rdma_mem(struct iscs iser_regd_buff_release(p_regd_buf); return -EINVAL; } - *regd_buf = p_regd_buf; - - spin_lock_bh(&p_iser_task->task_lock); - -/*FIXME p_dto->p_task = p_iser_task; */ -/*FIXME p_dto->p_conn = p_iser_task->p_conn; */ - p_dto->regd_vector_len = 0; - iser_dto_add_regd_buff(p_dto, p_regd_buf, - USE_NO_OFFSET, USE_ENTIRE_SIZE); - /* to be released when the task completes */ - list_add(&p_regd_buf->free_upon_comp_list, p_task_buff_list); - - spin_unlock_bh(&p_iser_task->task_lock); - return 0; -} - -/** - * Registers memory - * intended for sending as unsolicited data - * - * returns 0 on success, -1 on failure - */ -static int iser_reg_unsol(struct iscsi_iser_cmd_task *p_iser_task) -{ - struct iser_adaptor *p_iser_adaptor = p_iser_task->conn->ib_conn->p_adaptor; - struct iser_dto *p_dto = &p_iser_task->rdma_read_dto; - struct iser_data_buf *p_mem = &p_iser_task->data[ISER_DIR_OUT]; - int err = 0; - int i; - - if (p_mem->type == ISER_BUF_TYPE_SINGLE) { - /* DMA_MAP: should pass the task? single address has been mapped already!!! */ - iser_dto_add_local_single(p_iser_adaptor, p_dto, - p_mem->p_buf, - p_mem->dma_addr, p_mem->size, - DMA_TO_DEVICE); - } - else { - /* DMA_MAP: should pass copied and mapped sg instead? 
*/ - err = iser_dto_add_local_sg(p_dto, p_mem, DMA_TO_DEVICE); - if (err) { - iser_err("iser_dto_add_local_sg failed\n"); - iser_dto_buffs_release(p_dto); - return err; - } - } - - /* all registered buffers have been referenced, - but this dto is not used in any IO */ - for (i = 0; i < p_dto->regd_vector_len; i++) - iser_regd_buff_deref(p_dto->regd[i]); + /* take a reference on this regd buf such that it will not be released * + * (eg in send dto completion) before we get the scsi response */ + iser_regd_buff_ref(p_regd_buf); + p_iser_task->rdma_regd[cmd_dir] = p_regd_buf; return 0; } @@ -239,12 +171,12 @@ static int iser_prepare_read_cmd(struct memcpy(&p_iser_task->data[ISER_DIR_IN], buf_in, sizeof(struct iser_data_buf)); - err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_IN, - &p_iser_task->data[ISER_DIR_IN],&p_regd_buf); + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_IN); if (err) { iser_err("Failed to set up Data-IN RDMA\n"); return err; } + p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_IN]; ISER_HDR_SET_BITS(p_iser_header, RSV, 1); ISER_HDR_R_VADDR(p_iser_header) = cpu_to_be64(p_regd_buf->reg.va); ISER_HDR_R_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); @@ -311,17 +243,18 @@ iser_prepare_write_cmd(struct iscsi_iser memcpy(&p_iser_task->data[ISER_DIR_OUT], buf_out, sizeof(struct iser_data_buf)); - if (unsol_sz < edtl) { - err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_OUT, - &p_iser_task->data[ISER_DIR_OUT], - &p_regd_buf); - if (err != 0) { - iser_err("Failed to register write cmd RDMA mem\n"); - return err; - } + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_OUT); + if (err != 0) { + iser_err("Failed to register write cmd RDMA mem\n"); + return err; + } + + p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_OUT]; + + if(unsol_sz < edtl) { ISER_HDR_SET_BITS(p_iser_header, WSV, 1); ISER_HDR_W_VADDR(p_iser_header) = cpu_to_be64( - p_regd_buf->reg.va + unsol_sz); + p_regd_buf->reg.va + unsol_sz); ISER_HDR_W_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); iser_dbg("Cmd 
itt:%d, WRITE tags, RKEY:0x%08X " @@ -329,24 +262,17 @@ iser_prepare_write_cmd(struct iscsi_iser p_iser_task->itt, p_regd_buf->reg.rkey, (unsigned long)p_regd_buf->reg.va, unsol_sz); - } else { - err = iser_reg_unsol(p_iser_task); /* DMA_MAP: buf_out is already in task->data[DIR_OUT] */ - if (err != 0){ - iser_err("Failed to register write cmd RDMA mem\n"); - return err; - } } - /* If there is immediate data, add its register - buffer reference to the send dto descriptor */ if (imm_sz > 0) { iser_dbg("Cmd itt:%d, WRITE, adding imm.data sz: %d\n", p_iser_task->itt, imm_sz); - - iser_dto_clone_regd_buffs(p_send_dto, /* dst */ - &p_iser_task->rdma_read_dto, - 0, imm_sz); + iser_dto_add_regd_buff(p_send_dto, + p_regd_buf, + USE_NO_OFFSET, + USE_SIZE(imm_sz)); } + return 0; } @@ -469,7 +395,8 @@ int iser_send_data_out(struct iscsi_iser data_seg_len = ntoh24(hdr->dlength); buf_offset = ntohl(hdr->offset); - iser_dbg("%s itt %d dseg_len %d offset %d\n",__func__,(int)itt,(int)data_seg_len,(int)buf_offset); + iser_dbg("%s itt %d dseg_len %d offset %d\n", + __func__,(int)itt,(int)data_seg_len,(int)buf_offset); /* Allocate send DTO descriptor, headers buf and add it to the DTO */ p_send_dto = iser_dto_send_create(p_iser_conn, @@ -486,10 +413,11 @@ int iser_send_data_out(struct iscsi_iser p_send_dto->p_task = p_ctask; - /* Set-up the registered buffer entries for the data segment */ - iser_dto_clone_regd_buffs(p_send_dto, /* dst */ - &p_ctask->rdma_read_dto, - buf_offset, data_seg_len); + /* all data was registered for RDMA, we can use the lkey */ + iser_dto_add_regd_buff(p_send_dto, + p_ctask->rdma_regd[ISER_DIR_OUT], + USE_OFFSET(buf_offset), + USE_SIZE(data_seg_len)); if (buf_offset + data_seg_len > p_ctask->data_len[ISER_DIR_OUT]) { iser_err("Offset:%ld & DSL:%ld in Data-Out " Index: ulp/iser/iser_task.c =================================================================== --- ulp/iser/iser_task.c (revision 4622) +++ ulp/iser/iser_task.c (working copy) @@ -46,90 +46,16 @@ 
void iser_task_init_lowpart(struct iscsi { spin_lock_init(&p_iser_task->task_lock); p_iser_task->status = ISER_TASK_STATUS_INIT; - - INIT_LIST_HEAD(&p_iser_task->send_buff_list); - INIT_LIST_HEAD(&p_iser_task->rcv_buff_list); - p_iser_task->post_send_count = 0; - + p_iser_task->dir[ISER_DIR_IN] = 0; p_iser_task->dir[ISER_DIR_OUT] = 0; - + p_iser_task->data_len[ISER_DIR_IN] = 0; p_iser_task->data_len[ISER_DIR_OUT] = 0; - - iser_dto_init(&p_iser_task->rdma_read_dto); - p_iser_task->rdma_read_dto.p_conn = p_iser_task->conn; - p_iser_task->rdma_read_dto.p_task = p_iser_task; - - iser_dto_init(&p_iser_task->rdma_write_dto); - p_iser_task->rdma_write_dto.p_conn = p_iser_task->conn; - p_iser_task->rdma_write_dto.p_task = p_iser_task; -} - -/** - * iser_task_release_send_buffers - Frees all sent buffers of a - * task (upon completion) - */ -void iser_task_release_send_buffers(struct iscsi_iser_cmd_task *p_iser_task) -{ - struct iser_regd_buf *p_regd_buf; - int tries = 0; - - iser_dbg( "Releasing send buffs for iSER task: 0x%p\n", - p_iser_task); - - /* Free all sent buffers from the list */ - spin_lock_bh(&p_iser_task->task_lock); - while (!list_empty(&p_iser_task->send_buff_list)) { - /* Get the next send buffer & remove it from the list */ - p_regd_buf = - list_entry(p_iser_task->send_buff_list.next, - struct iser_regd_buf, free_upon_comp_list); - list_del(&p_regd_buf->free_upon_comp_list); - spin_unlock_bh(&p_iser_task->task_lock); - - if (iser_regd_buff_release(p_regd_buf) != 0) { - iser_err("Failed to release send buffer after " - "task complete, task: 0x%p, itt: %d -" - " references remain\n", - p_iser_task, p_iser_task->itt); - - tries++; /* FIXME: calling schedule */ - schedule(); - } - - spin_lock_bh(&p_iser_task->task_lock); - } - spin_unlock_bh(&p_iser_task->task_lock); - if (tries) - iser_err("Released send buff after %d tries\n", tries); -} - -/** - * iser_task_release_recv_buffers - Frees all receive buffers of - * a task (upon completion) - */ -void 
iser_task_release_recv_buffers(struct iscsi_iser_cmd_task *p_iser_task) -{ - struct iser_regd_buf *p_regd_buf; - - spin_lock_bh(&p_iser_task->task_lock); - while (!list_empty(&p_iser_task->rcv_buff_list)) { - p_regd_buf = list_entry(p_iser_task->rcv_buff_list.next, - struct iser_regd_buf, - free_upon_comp_list); - list_del(&p_regd_buf->free_upon_comp_list); - spin_unlock_bh(&p_iser_task->task_lock); - - if (iser_regd_buff_release(p_regd_buf) != 0) - iser_bug("task:0x%p complete, failed to release " - "recv buf:0x%p, itt:%d - refs remain\n", - p_iser_task, p_regd_buf, p_iser_task->itt); - - spin_lock_bh(&p_iser_task->task_lock); - } - spin_unlock_bh(&p_iser_task->task_lock); + + p_iser_task->rdma_regd[ISER_DIR_IN] = NULL; + p_iser_task->rdma_regd[ISER_DIR_OUT] = NULL; } /** @@ -184,9 +110,22 @@ iser_task_set_status(struct iscsi_iser_c */ void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *p_iser_task) { + int deferred; + if (p_iser_task == NULL) iser_bug("NULL task descriptor\n"); - iser_task_release_send_buffers(p_iser_task); - iser_task_release_recv_buffers(p_iser_task); + spin_lock_bh(&p_iser_task->task_lock); + if (p_iser_task->dir[ISER_DIR_IN]) { + deferred = iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_IN]); + if (deferred) + iser_bug("References remain for BUF-IN rdma reg\n"); + } + if (p_iser_task->dir[ISER_DIR_OUT] && + p_iser_task->rdma_regd[ISER_DIR_OUT] != NULL) { + deferred = iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_OUT]); + if (deferred) + iser_bug("References remain for BUF-OUT rdma reg\n"); + } + spin_unlock_bh(&p_iser_task->task_lock); } Index: ulp/iser/iser_conn.h =================================================================== --- ulp/iser/iser_conn.h (revision 4622) +++ ulp/iser/iser_conn.h (working copy) @@ -40,9 +40,6 @@ /* adaptor-related */ int iser_adaptor_init(struct iser_adaptor *p_iser_adaptor); int iser_adaptor_release(struct iser_adaptor *p_iser_adaptor); -struct iser_conn *iser_adaptor_find_conn( - 
struct iser_adaptor *p_iser_adaptor, void *ep_handle); - /* internal connection handling */ void iser_conn_init(struct iser_conn *p_iser_conn); Index: ulp/iser/iser_task.h =================================================================== --- ulp/iser/iser_task.h (revision 4622) +++ ulp/iser/iser_task.h (working copy) @@ -37,13 +37,12 @@ #include "iser.h" -void iser_task_hash_init(struct hash_table *hash_table); -struct iscsi_iser_cmd_task *iser_task_find(struct iscsi_iser_conn *p_iser_conn, u32 itt); void iser_task_init_lowpart(struct iscsi_iser_cmd_task *p_iser_task); +void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *iser_task); + void iser_task_post_send_count_inc(struct iscsi_iser_cmd_task *p_iser_task); int iser_task_post_send_count_dec_and_test(struct iscsi_iser_cmd_task *p_iser_task); void iser_task_set_status(struct iscsi_iser_cmd_task *p_iser_task, enum iser_task_status status); -void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *iser_task); #endif /* __ISER_TASK_H__ */ Index: ulp/iser/iser_memory.c =================================================================== --- ulp/iser/iser_memory.c (revision 4622) +++ ulp/iser/iser_memory.c (working copy) @@ -206,24 +206,6 @@ void iser_reg_single(struct iser_adaptor p_regd_buf->direction = direction; } -void iser_reg_single_task(struct iser_adaptor *p_iser_adaptor, - struct iser_regd_buf *p_regd_buf, - void *virt_addr, - dma_addr_t dma_addr, - unsigned long data_size, - enum dma_data_direction direction) -{ - p_regd_buf->reg.lkey = p_iser_adaptor->mr->lkey; - p_regd_buf->reg.rkey = 0; /* indicate there's no need to unreg */ - p_regd_buf->reg.len = data_size; - p_regd_buf->reg.va = dma_addr; - - p_regd_buf->dma_addr = 0; - p_regd_buf->virt_addr = virt_addr; - p_regd_buf->data_size = data_size; - p_regd_buf->direction = direction; -} - /** * iser_sg_size - returns the total data length in sg list */ @@ -523,42 +505,6 @@ unsigned int iser_data_buf_aligned_len(s return ret_len; } -/* - * 
determine the maximal contiguous sub-list of a scatter-gather list - */ -unsigned int iser_data_buf_contig_len(struct iser_data_buf *p_data, int skip, - dma_addr_t *chunk_dma_addr, int *chunk_size) -{ - unsigned int ret_len = 0; - - if (p_data->type == ISER_BUF_TYPE_SINGLE) - iser_bug("p_data must be sg\n"); - else { - struct scatterlist *p_sg = p_data->p_buf; - int cnt, i; - - *chunk_dma_addr = sg_dma_address(&p_sg[skip]); - *chunk_size = 0; - - for (cnt = 0, i = skip; i < p_data->dma_nents; i++, cnt++){ - if ((cnt > 0) && sg_dma_address(&p_sg[i]) != - (sg_dma_address(&p_sg[i-1]) + sg_dma_len(&p_sg[i-1]))) { - ret_len = cnt; - break; - } - *chunk_size += sg_dma_len(&p_sg[i]); - } - if (i == p_data->dma_nents) - ret_len = cnt; - - iser_dbg("Found %d contiguous entries out of %d in sg:0x%p, " - "start dma addr:%ld size:%d\n", - ret_len, p_data->dma_nents-skip, p_data, - (long)*chunk_dma_addr, *chunk_size); - } - return ret_len; -} - /** * iser_data_buf_memcpy - Copies arbitrary data buffer to a * contiguous memory region From eitan at mellanox.co.il Tue Jan 3 03:55:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 3 Jan 2006 13:55:37 +0200 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B413@mtlexch01.mtl.com> Hi Hal, This patch is fine with me and Yael, and Ofer. We will use some scripts to automatically update the version info with the "build name" and SVN version. We plan to do that for the osm_version.h as well as the configure.in files. This will happen when building a distribution of OpenSM code as part of the OpenIB distribution or standalone OpenSM drop. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, January 02, 2006 4:41 PM
> To: Yael Kalka; Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
>
> OpenSM: Separate out OSM_VERSION so when changing only needed files are
> recompiled rather than everything
>
> Signed-off-by: Hal Rosenstock
>
> Index: osm/include/opensm/osm_version.h
> ===================================================================
> --- osm/include/opensm/osm_version.h (revision 0)
> +++ osm/include/opensm/osm_version.h (revision 0)
> @@ -0,0 +1,65 @@
> +/*
> + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * $Id$
> + */
> +
> +
> +#ifndef _OSM_VERSION_H_
> +#define _OSM_VERSION_H_
> +
> +#ifdef __cplusplus
> +# define BEGIN_C_DECLS extern "C" {
> +# define END_C_DECLS }
> +#else /* !__cplusplus */
> +# define BEGIN_C_DECLS
> +# define END_C_DECLS
> +#endif /* __cplusplus */
> +
> +BEGIN_C_DECLS
> +
> +/****s* OpenSM: Base/OSM_VERSION
> +* NAME
> +*	OSM_VERSION
> +*
> +* DESCRIPTION
> +*	The version string for OpenSM
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_VERSION "OpenSM Rev:openib-1.1.0"
> +/********/
> +
> +END_C_DECLS
> +
> +#endif /* _OSM_VERSION_H_ */
>
> Property changes on: osm/include/opensm/osm_version.h
> ___________________________________________________________________
> Name: svn:keywords
>    + Id
>
> Index: osm/include/opensm/osm_base.h
> ===================================================================
> --- osm/include/opensm/osm_base.h (revision 4686)
> +++ osm/include/opensm/osm_base.h (working copy)
> @@ -89,18 +89,6 @@ BEGIN_C_DECLS
>  *	Steve King, Intel
>  *
>  *********/
> -/****s* OpenSM: Base/OSM_VERSION
> -* NAME
> -*	OSM_VERSION
> -*
> -* DESCRIPTION
> -*	The version string for OpenSM
> -*
> -* SYNOPSIS
> -*/
> -#define OSM_VERSION "OpenSM Rev:openib-1.1.0"
> -/********/
> -
>  /****s* OpenSM: Base/OSM_DEFAULT_M_KEY
>  * NAME
>  *	OSM_DEFAULT_M_KEY
> Index: osm/opensm/osm_opensm.c
> ===================================================================
> --- osm/opensm/osm_opensm.c (revision 4686)
> +++ osm/opensm/osm_opensm.c (working copy)
> @@ -58,6 +58,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> Index: osm/opensm/main.c
> ===================================================================
> ---
osm/opensm/main.c (revision 4686)
> +++ osm/opensm/main.c (working copy)
> @@ -56,6 +56,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
>

From halr at voltaire.com Tue Jan 3 04:24:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jan 2006 07:24:57 -0500
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B413@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B413@mtlexch01.mtl.com>
Message-ID: <1136291096.4331.44229.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2006-01-03 at 06:55, Eitan Zahavi wrote:
> Hi Hal,
>
> This patch is fine with me and Yael, and Ofer.

Thanks.

> We will use some scripts to automatically update the version info with
> the "build name" and SVN version. We plan to do that for the
> osm_version.h as well as the configure.in files. This will happen when
> building a distribution of OpenSM code as part of the OpenIB
> distribution or standalone OpenSM drop.

I will shortly have a patch along these lines which I will send to the
list. It creates a separate osm_svn_revision.h if
userspace/management/osm/.svn/entries is present.

-- Hal

> EZ
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O.
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, January 02, 2006 4:41 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > OpenSM: Separate out OSM_VERSION so when changing only needed files > are > > recompiled rather than everything > > > > Signed-off-by: Hal Rosenstock > > > > Index: osm/include/opensm/osm_version.h > > =================================================================== > > --- osm/include/opensm/osm_version.h (revision 0) > > +++ osm/include/opensm/osm_version.h (revision 0) > > @@ -0,0 +1,65 @@ > > +/* > > + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the > GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the > following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. 
> > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY > > KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE > > WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT > > HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER > > IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR > > IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN > > THE > > + * SOFTWARE. > > + * > > + * $Id$ > > + */ > > + > > + > > +#ifndef _OSM_VERSION_H_ > > +#define _OSM_VERSION_H_ > > + > > +#ifdef __cplusplus > > +# define BEGIN_C_DECLS extern "C" { > > +# define END_C_DECLS } > > +#else /* !__cplusplus */ > > +# define BEGIN_C_DECLS > > +# define END_C_DECLS > > +#endif /* __cplusplus */ > > + > > +BEGIN_C_DECLS > > + > > +/****s* OpenSM: Base/OSM_VERSION > > +* NAME > > +* OSM_VERSION > > +* > > +* DESCRIPTION > > +* The version string for OpenSM > > +* > > +* SYNOPSIS > > +*/ > > +#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > +/********/ > > + > > +END_C_DECLS > > + > > +#endif /* _OSM_VERSION_H_ */ > > > > Property changes on: osm/include/opensm/osm_version.h > > ___________________________________________________________________ > > Name: svn:keywords > > + Id > > > > Index: osm/include/opensm/osm_base.h > > =================================================================== > > --- osm/include/opensm/osm_base.h (revision 4686) > > +++ osm/include/opensm/osm_base.h (working copy) > > @@ -89,18 +89,6 @@ BEGIN_C_DECLS > > * Steve King, Intel > > * > > *********/ > > -/****s* OpenSM: Base/OSM_VERSION > > -* NAME > > -* OSM_VERSION > > -* > > -* DESCRIPTION > > -* The version string for OpenSM > > -* > > -* SYNOPSIS > > -*/ > > -#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > -/********/ > > - > > /****s* OpenSM: Base/OSM_DEFAULT_M_KEY > > * NAME > > * OSM_DEFAULT_M_KEY > > Index: 
osm/opensm/osm_opensm.c > > =================================================================== > > --- osm/opensm/osm_opensm.c (revision 4686) > > +++ osm/opensm/osm_opensm.c (working copy) > > @@ -58,6 +58,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > Index: osm/opensm/main.c > > =================================================================== > > --- osm/opensm/main.c (revision 4686) > > +++ osm/opensm/main.c (working copy) > > @@ -56,6 +56,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > > > > From eitan at mellanox.co.il Tue Jan 3 06:42:25 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 3 Jan 2006 16:42:25 +0200 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B417@mtlexch01.mtl.com> Thanks. Can you elaborate for how that file " osm_svn_revision.h" will be updated? Is it going to be updated by the "autogen.sh" ? or by a checkin trigger?
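For readers following the thread, here is a minimal sketch of how the two headers under discussion could fit together at the point of use. It assumes the generated osm_svn_revision.h would define a string macro. The macro name OSM_SVN_REVISION, the fallback value "unknown", and the helper osm_version_banner() are hypothetical; none of them appears in the posted patch. Only OSM_VERSION and its value are taken from osm_version.h as posted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stands in for #include of osm_version.h; value copied from the patch. */
#define OSM_VERSION "OpenSM Rev:openib-1.1.0"

/*
 * Hypothetical: a build-generated osm_svn_revision.h might define this.
 * When the header is not generated (for example, no .svn/entries in the
 * tree), fall back to a placeholder so the build still succeeds.
 */
#ifndef OSM_SVN_REVISION
#define OSM_SVN_REVISION "unknown"
#endif

/* Format "<version> (svn <revision>)" into buf and return buf. */
static const char *osm_version_banner(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/* snprintf always NUL-terminates when len > 0, truncating if needed. */
	snprintf(buf, len, "%s (svn %s)", OSM_VERSION, OSM_SVN_REVISION);
	return buf;
}
```

A caller would pass its own buffer, for example a 64-byte char array, and log the returned string at startup. Keeping the revision in its own small header follows the same reasoning given for osm_version.h: a change to the string should force a rebuild only of the files that include it.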
> -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, January 03, 2006 2:25 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > Hi Eitan, > > On Tue, 2006-01-03 at 06:55, Eitan Zahavi wrote: > > Hi Hal, > > > > This patch is fine with me and Yael, and Ofer. > > Thanks. > > > We will use some scripts to automatically update the version info with > > the "build name" and SVN version. We plan to do that for the > > osm_version.h as well as the configure.in files. This will happen when > > building a distribution of OpenSM code as part of the OpenIB > > distribution or standalone OpenSM drop. > > I will shortly have a patch along these lines which I will send to the > list. It creates a separate osm_svn_revision.h if > userspace/management/osm/.svn/entries is present. > > -- Hal > > > EZ > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Monday, January 02, 2006 4:41 PM > > > To: Yael Kalka; Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > > > OpenSM: Separate out OSM_VERSION so when changing only needed files > > are > > > recompiled rather than everything > > > > > > Signed-off-by: Hal Rosenstock > > > > > > Index: osm/include/opensm/osm_version.h > > > =================================================================== > > > --- osm/include/opensm/osm_version.h (revision 0) > > > +++ osm/include/opensm/osm_version.h (revision 0) > > > @@ -0,0 +1,65 @@ > > > +/* > > > + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > > reserved. 
> > > + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > > + * > > > + * This software is available to you under a choice of one of two > > > + * licenses. You may choose to be licensed under the terms of the > > GNU > > > + * General Public License (GPL) Version 2, available from the file > > > + * COPYING in the main directory of this source tree, or the > > > + * OpenIB.org BSD license below: > > > + * > > > + * Redistribution and use in source and binary forms, with or > > > + * without modification, are permitted provided that the > > following > > > + * conditions are met: > > > + * > > > + * - Redistributions of source code must retain the above > > > + * copyright notice, this list of conditions and the following > > > + * disclaimer. > > > + * > > > + * - Redistributions in binary form must reproduce the above > > > + * copyright notice, this list of conditions and the following > > > + * disclaimer in the documentation and/or other materials > > > + * provided with the distribution. > > > + * > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY > > > KIND, > > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE > > > WARRANTIES OF > > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR > COPYRIGHT > > > HOLDERS > > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, > WHETHER > > > IN AN > > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF > OR > > > IN > > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS > IN > > > THE > > > + * SOFTWARE. 
> > > + * > > > + * $Id$ > > > + */ > > > + > > > + > > > +#ifndef _OSM_VERSION_H_ > > > +#define _OSM_VERSION_H_ > > > + > > > +#ifdef __cplusplus > > > +# define BEGIN_C_DECLS extern "C" { > > > +# define END_C_DECLS } > > > +#else /* !__cplusplus */ > > > +# define BEGIN_C_DECLS > > > +# define END_C_DECLS > > > +#endif /* __cplusplus */ > > > + > > > +BEGIN_C_DECLS > > > + > > > +/****s* OpenSM: Base/OSM_VERSION > > > +* NAME > > > +* OSM_VERSION > > > +* > > > +* DESCRIPTION > > > +* The version string for OpenSM > > > +* > > > +* SYNOPSIS > > > +*/ > > > +#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > +/********/ > > > + > > > +END_C_DECLS > > > + > > > +#endif /* _OSM_VERSION_H_ */ > > > > > > Property changes on: osm/include/opensm/osm_version.h > > > ___________________________________________________________________ > > > Name: svn:keywords > > > + Id > > > > > > Index: osm/include/opensm/osm_base.h > > > =================================================================== > > > --- osm/include/opensm/osm_base.h (revision 4686) > > > +++ osm/include/opensm/osm_base.h (working copy) > > > @@ -89,18 +89,6 @@ BEGIN_C_DECLS > > > * Steve King, Intel > > > * > > > *********/ > > > -/****s* OpenSM: Base/OSM_VERSION > > > -* NAME > > > -* OSM_VERSION > > > -* > > > -* DESCRIPTION > > > -* The version string for OpenSM > > > -* > > > -* SYNOPSIS > > > -*/ > > > -#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > -/********/ > > > - > > > /****s* OpenSM: Base/OSM_DEFAULT_M_KEY > > > * NAME > > > * OSM_DEFAULT_M_KEY > > > Index: osm/opensm/osm_opensm.c > > > =================================================================== > > > --- osm/opensm/osm_opensm.c (revision 4686) > > > +++ osm/opensm/osm_opensm.c (working copy) > > > @@ -58,6 +58,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > Index: osm/opensm/main.c > > > =================================================================== > > > 
--- osm/opensm/main.c (revision 4686) > > > +++ osm/opensm/main.c (working copy) > > > @@ -56,6 +56,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > > > > > > From Trevor at Mellanox.com Tue Jan 3 06:52:27 2006 From: Trevor at Mellanox.com (Trevor Caulder) Date: Tue, 3 Jan 2006 06:52:27 -0800 Subject: Out of Office AutoReply: [openib-general] openib-general@openib.o rg Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA036D4D81@MTIEX01> I will be out of the office until Tuesday 1/3. If you have an urgent issues, please contact Todd Wilde (todd at mellanox.com) or Matt Finlay (matt at mellanox.com) Thanks, Trevor From Wayne at Mellanox.com Tue Jan 3 06:52:27 2006 From: Wayne at Mellanox.com (Wayne Augsburger) Date: Tue, 3 Jan 2006 06:52:27 -0800 Subject: Out of Office AutoReply: [openib-general] openib-general@openib.o rg Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA02F4DA45@MTIEX01> I will be out of the office for the Christmas/new years holidays; returning on January 3. During this time I will have limited access to email. If you need immediate assistance, please contact Todd at mellanox.com.
Thanks, Wayne From halr at voltaire.com Tue Jan 3 07:16:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 10:16:55 -0500 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B417@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B417@mtlexch01.mtl.com> Message-ID: <1136301413.4331.45769.camel@hal.voltaire.com> On Tue, 2006-01-03 at 09:42, Eitan Zahavi wrote: > Thanks. Can you elaborate for how that file " osm_svn_revision.h" will > be updated? > Is it going to be updated by the "autogen.sh" ? or by a checkin trigger? Neither; I'm planning to have it updated by the make when needed. -- Hal > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, January 03, 2006 2:25 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > Hi Eitan, > > > > On Tue, 2006-01-03 at 06:55, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > This patch is fine with me and Yael, and Ofer. > > > > Thanks. > > > > > We will use some scripts to automatically update the version info > with > > > the "build name" and SVN version. We plan to do that for the > > > osm_version.h as well as the configure.in files. This will happen > when > > > building a distribution of OpenSM code as part of the OpenIB > > > distribution or standalone OpenSM drop. > > > > I will shortly have a patch along these lines which I will send to the > > list. It creates a separate osm_svn_revision.h if > > userspace/management/osm/.svn/entries is present. > > > > -- Hal > > > > > EZ > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O.
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Monday, January 02, 2006 4:41 PM > > > > To: Yael Kalka; Eitan Zahavi > > > > Cc: openib-general at openib.org > > > > Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > > > > > OpenSM: Separate out OSM_VERSION so when changing only needed > files > > > are > > > > recompiled rather than everything > > > > > > > > Signed-off-by: Hal Rosenstock > > > > > > > > Index: osm/include/opensm/osm_version.h > > > > > =================================================================== > > > > --- osm/include/opensm/osm_version.h (revision 0) > > > > +++ osm/include/opensm/osm_version.h (revision 0) > > > > @@ -0,0 +1,65 @@ > > > > +/* > > > > + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > > > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > > > reserved. > > > > + * Copyright (c) 1996-2003 Intel Corporation. All rights > reserved. > > > > + * > > > > + * This software is available to you under a choice of one of two > > > > + * licenses. You may choose to be licensed under the terms of > the > > > GNU > > > > + * General Public License (GPL) Version 2, available from the > file > > > > + * COPYING in the main directory of this source tree, or the > > > > + * OpenIB.org BSD license below: > > > > + * > > > > + * Redistribution and use in source and binary forms, with or > > > > + * without modification, are permitted provided that the > > > following > > > > + * conditions are met: > > > > + * > > > > + * - Redistributions of source code must retain the above > > > > + * copyright notice, this list of conditions and the > following > > > > + * disclaimer. 
> > > > + * > > > > + * - Redistributions in binary form must reproduce the above > > > > + * copyright notice, this list of conditions and the > following > > > > + * disclaimer in the documentation and/or other materials > > > > + * provided with the distribution. > > > > + * > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY > > > > KIND, > > > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE > > > > WARRANTIES OF > > > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR > > COPYRIGHT > > > > HOLDERS > > > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, > > WHETHER > > > > IN AN > > > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF > > OR > > > > IN > > > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS > > IN > > > > THE > > > > + * SOFTWARE. > > > > + * > > > > + * $Id$ > > > > + */ > > > > + > > > > + > > > > +#ifndef _OSM_VERSION_H_ > > > > +#define _OSM_VERSION_H_ > > > > + > > > > +#ifdef __cplusplus > > > > +# define BEGIN_C_DECLS extern "C" { > > > > +# define END_C_DECLS } > > > > +#else /* !__cplusplus */ > > > > +# define BEGIN_C_DECLS > > > > +# define END_C_DECLS > > > > +#endif /* __cplusplus */ > > > > + > > > > +BEGIN_C_DECLS > > > > + > > > > +/****s* OpenSM: Base/OSM_VERSION > > > > +* NAME > > > > +* OSM_VERSION > > > > +* > > > > +* DESCRIPTION > > > > +* The version string for OpenSM > > > > +* > > > > +* SYNOPSIS > > > > +*/ > > > > +#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > > +/********/ > > > > + > > > > +END_C_DECLS > > > > + > > > > +#endif /* _OSM_VERSION_H_ */ > > > > > > > > Property changes on: osm/include/opensm/osm_version.h > > > > > ___________________________________________________________________ > > > > Name: svn:keywords > > > > + Id > > > > > > > > Index: osm/include/opensm/osm_base.h > > > > > =================================================================== > > > > --- 
osm/include/opensm/osm_base.h (revision 4686) > > > > +++ osm/include/opensm/osm_base.h (working copy) > > > > @@ -89,18 +89,6 @@ BEGIN_C_DECLS > > > > * Steve King, Intel > > > > * > > > > *********/ > > > > -/****s* OpenSM: Base/OSM_VERSION > > > > -* NAME > > > > -* OSM_VERSION > > > > -* > > > > -* DESCRIPTION > > > > -* The version string for OpenSM > > > > -* > > > > -* SYNOPSIS > > > > -*/ > > > > -#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > > -/********/ > > > > - > > > > /****s* OpenSM: Base/OSM_DEFAULT_M_KEY > > > > * NAME > > > > * OSM_DEFAULT_M_KEY > > > > Index: osm/opensm/osm_opensm.c > > > > > =================================================================== > > > > --- osm/opensm/osm_opensm.c (revision 4686) > > > > +++ osm/opensm/osm_opensm.c (working copy) > > > > @@ -58,6 +58,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > #include > > > > #include > > > > #include > > > > Index: osm/opensm/main.c > > > > > =================================================================== > > > > --- osm/opensm/main.c (revision 4686) > > > > +++ osm/opensm/main.c (working copy) > > > > @@ -56,6 +56,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > #include > > > > #include > > > > #include > > > > > > > > > > >

From mst at mellanox.co.il Tue Jan 3 07:29:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jan 2006 17:29:42 +0200
Subject: [openib-general] [PATCH] fix race in mad.c
Message-ID: <20060103152942.GB2790@mellanox.co.il>

After removing the port from port_list, ib_mad_port_close flushes
port_priv->wq before destroying the special QPs. This means that a
completion event could arrive, and queue a new work in this work queue
after flush.

Signed-off-by: Eli Cohen
Signed-off-by: Michael S. Tsirkin

Index: latest/drivers/infiniband/core/mad.c
===================================================================
--- latest.orig/drivers/infiniband/core/mad.c
+++ latest/drivers/infiniband/core/mad.c
@@ -2285,8 +2285,17 @@ static void timeout_sends(void *data)
 static void ib_mad_thread_completion_handler(struct ib_cq *cq, void *arg)
 {
 	struct ib_mad_port_private *port_priv = cq->cq_context;
+	struct ib_mad_port_private *entry;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ib_mad_port_list_lock, flags);
+	list_for_each_entry(entry, &ib_mad_port_list, port_list)
+		if (entry == port_priv) {
+			queue_work(port_priv->wq, &port_priv->work);
+			break;
+		}
 
-	queue_work(port_priv->wq, &port_priv->work);
+	spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
 }
 
 /*

-- 
MST

From eitan at mellanox.co.il Tue Jan 3 07:43:27 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 3 Jan 2006 17:43:27 +0200
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41B@mtlexch01.mtl.com>

Hi Hal,

Sounds good.
I think you should be able to use the .svn/entries to get the last update revision and then use svn diff (or diff) to see if local mods are done on top of it...
So we do not get caught by surprise when something broke due to un-committed mod in the local directory

Thanks

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, January 03, 2006 5:17 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
>
> On Tue, 2006-01-03 at 09:42, Eitan Zahavi wrote:
> > Thanks. Can you elaborate for how that file " osm_svn_revision.h" will
> > be updated?
> > Is it going to be updated by the "autogen.sh" ? or by a checkin trigger?
> > Neither; I'm planning to have it updated by the make when needed. > > -- Hal > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Tuesday, January 03, 2006 2:25 PM > > > To: Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > > > Hi Eitan, > > > > > > On Tue, 2006-01-03 at 06:55, Eitan Zahavi wrote: > > > > Hi Hal, > > > > > > > > This patch is fine with me and Yael, and Ofer. > > > > > > Thanks. > > > > > > > We will use some scripts to automatically update the version info > > with > > > > the "build name" and SVN version. We plan to do that for the > > > > osm_version.h as well as the configure.in files. This will happen > > when > > > > building a distribution of OpenSM code as part of the OpenIB > > > > distribution or standalone OpenSM drop. > > > > > > I will shortly have a patch along these lines which I will send to the > > > list. It creates a separate osm_svn_revision.h if > > > userspace/management/osm/.svn/entries is present. > > > > > > -- Hal > > > > > > > EZ > > > > Eitan Zahavi > > > > Design Technology Director > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > Sent: Monday, January 02, 2006 4:41 PM > > > > > To: Yael Kalka; Eitan Zahavi > > > > > Cc: openib-general at openib.org > > > > > Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > > > > > > > OpenSM: Separate out OSM_VERSION so when changing only needed > > files > > > > are > > > > > recompiled rather than everything > > > > > > > > > > Signed-off-by: Hal Rosenstock > > > > > > > > > > Index: osm/include/opensm/osm_version.h > > > > > > > =================================================================== > > > > > --- osm/include/opensm/osm_version.h (revision 0) > > > > > +++ osm/include/opensm/osm_version.h (revision 0) > > > > > @@ -0,0 +1,65 @@ > > > > > +/* > > > > > + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > > > > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > > > > reserved. > > > > > + * Copyright (c) 1996-2003 Intel Corporation. All rights > > reserved. > > > > > + * > > > > > + * This software is available to you under a choice of one of two > > > > > + * licenses. You may choose to be licensed under the terms of > > the > > > > GNU > > > > > + * General Public License (GPL) Version 2, available from the > > file > > > > > + * COPYING in the main directory of this source tree, or the > > > > > + * OpenIB.org BSD license below: > > > > > + * > > > > > + * Redistribution and use in source and binary forms, with or > > > > > + * without modification, are permitted provided that the > > > > following > > > > > + * conditions are met: > > > > > + * > > > > > + * - Redistributions of source code must retain the above > > > > > + * copyright notice, this list of conditions and the > > following > > > > > + * disclaimer. 
> > > > > + * > > > > > + * - Redistributions in binary form must reproduce the above > > > > > + * copyright notice, this list of conditions and the > > following > > > > > + * disclaimer in the documentation and/or other materials > > > > > + * provided with the distribution. > > > > > + * > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY > > > > > KIND, > > > > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE > > > > > WARRANTIES OF > > > > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > > > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR > > > COPYRIGHT > > > > > HOLDERS > > > > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, > > > WHETHER > > > > > IN AN > > > > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT > OF > > > OR > > > > > IN > > > > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER > DEALINGS > > > IN > > > > > THE > > > > > + * SOFTWARE. > > > > > + * > > > > > + * $Id$ > > > > > + */ > > > > > + > > > > > + > > > > > +#ifndef _OSM_VERSION_H_ > > > > > +#define _OSM_VERSION_H_ > > > > > + > > > > > +#ifdef __cplusplus > > > > > +# define BEGIN_C_DECLS extern "C" { > > > > > +# define END_C_DECLS } > > > > > +#else /* !__cplusplus */ > > > > > +# define BEGIN_C_DECLS > > > > > +# define END_C_DECLS > > > > > +#endif /* __cplusplus */ > > > > > + > > > > > +BEGIN_C_DECLS > > > > > + > > > > > +/****s* OpenSM: Base/OSM_VERSION > > > > > +* NAME > > > > > +* OSM_VERSION > > > > > +* > > > > > +* DESCRIPTION > > > > > +* The version string for OpenSM > > > > > +* > > > > > +* SYNOPSIS > > > > > +*/ > > > > > +#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > > > +/********/ > > > > > + > > > > > +END_C_DECLS > > > > > + > > > > > +#endif /* _OSM_VERSION_H_ */ > > > > > > > > > > Property changes on: osm/include/opensm/osm_version.h > > > > > > > ___________________________________________________________________ > > > > > Name: svn:keywords > > > > > + Id > > > > > 
> > > > > Index: osm/include/opensm/osm_base.h > > > > > > > =================================================================== > > > > > --- osm/include/opensm/osm_base.h (revision 4686) > > > > > +++ osm/include/opensm/osm_base.h (working copy) > > > > > @@ -89,18 +89,6 @@ BEGIN_C_DECLS > > > > > * Steve King, Intel > > > > > * > > > > > *********/ > > > > > -/****s* OpenSM: Base/OSM_VERSION > > > > > -* NAME > > > > > -* OSM_VERSION > > > > > -* > > > > > -* DESCRIPTION > > > > > -* The version string for OpenSM > > > > > -* > > > > > -* SYNOPSIS > > > > > -*/ > > > > > -#define OSM_VERSION "OpenSM Rev:openib-1.1.0" > > > > > -/********/ > > > > > - > > > > > /****s* OpenSM: Base/OSM_DEFAULT_M_KEY > > > > > * NAME > > > > > * OSM_DEFAULT_M_KEY > > > > > Index: osm/opensm/osm_opensm.c > > > > > > > =================================================================== > > > > > --- osm/opensm/osm_opensm.c (revision 4686) > > > > > +++ osm/opensm/osm_opensm.c (working copy) > > > > > @@ -58,6 +58,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > > #include > > > > > #include > > > > > #include > > > > > Index: osm/opensm/main.c > > > > > > > =================================================================== > > > > > --- osm/opensm/main.c (revision 4686) > > > > > +++ osm/opensm/main.c (working copy) > > > > > @@ -56,6 +56,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > > #include > > > > > #include > > > > > #include > > > > > > > > > > > > > > > From halr at voltaire.com Tue Jan 3 08:01:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 11:01:43 -0500 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41B@mtlexch01.mtl.com> Message-ID: <1136304101.4331.46313.camel@hal.voltaire.com> On Tue, 2006-01-03 at 10:43, 
Eitan Zahavi wrote: > Hi Hal, > > Sounds good. > I think you should be able to use the .svn/entries to get the last > update revision and then use svn diff (or diff) to see if local mods are > done on top of it... I'm using .svn/entries at the osm level. > So we do not get caught by surprise when something broke due to > un-committed mod in the local directory > Thanks > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, January 03, 2006 5:17 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > On Tue, 2006-01-03 at 09:42, Eitan Zahavi wrote: > > > Thanks. Can you elaborate for how that file " osm_svn_revision.h" > will > > > be updated? > > > Is it going to be updated by the "autogen.sh" ? or by a checkin > trigger? > > > > Neither; I'm planning to have it updated by the make when needed. > > > > -- Hal > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Tuesday, January 03, 2006 2:25 PM > > > > To: Eitan Zahavi > > > > Cc: openib-general at openib.org > > > > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION > > > > > > > > Hi Eitan, > > > > > > > > On Tue, 2006-01-03 at 06:55, Eitan Zahavi wrote: > > > > > Hi Hal, > > > > > > > > > > This patch is fine with me and Yael, and Ofer. > > > > > > > > Thanks. > > > > > > > > > We will use some scripts to automatically update the version > info > > > with > > > > > the "build name" and SVN version. We plan to do that for the > > > > > osm_version.h as well as the configure.in files. This will > happen > > > when > > > > > building a distribution of OpenSM code as part of the OpenIB > > > > > distribution or standalone OpenSM drop. 
> > > >
> > > > I will shortly have a patch along these lines which I will send to the
> > > > list. It creates a separate osm_svn_revision.h if
> > > > userspace/management/osm/.svn/entries is present.
> > > >
> > > > -- Hal
> > > >
> > > > > EZ
> > > > > Eitan Zahavi
> > > > > Design Technology Director
> > > > > Mellanox Technologies LTD
> > > > > Tel:+972-4-9097208
> > > > > Fax:+972-4-9593245
> > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > > > Sent: Monday, January 02, 2006 4:41 PM
> > > > > > To: Yael Kalka; Eitan Zahavi
> > > > > > Cc: openib-general at openib.org
> > > > > > Subject: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
> > > > > >
> > > > > > OpenSM: Separate out OSM_VERSION so when changing only needed files are
> > > > > > recompiled rather than everything
> > > > > >
> > > > > > Signed-off-by: Hal Rosenstock
> > > > > >
> > > > > > Index: osm/include/opensm/osm_version.h
> > > > > > ===================================================================
> > > > > > --- osm/include/opensm/osm_version.h (revision 0)
> > > > > > +++ osm/include/opensm/osm_version.h (revision 0)
> > > > > > @@ -0,0 +1,65 @@
> > > > > > +/*
> > > > > > + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
> > > > > > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> > > > > > + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > > > > > + *
> > > > > > + * This software is available to you under a choice of one of two
> > > > > > + * licenses. You may choose to be licensed under the terms of the GNU
> > > > > > + * General Public License (GPL) Version 2, available from the file
> > > > > > + * COPYING in the main directory of this source tree, or the
> > > > > > + * OpenIB.org BSD license below:
> > > > > > + *
> > > > > > + *     Redistribution and use in source and binary forms, with or
> > > > > > + *     without modification, are permitted provided that the following
> > > > > > + *     conditions are met:
> > > > > > + *
> > > > > > + *      - Redistributions of source code must retain the above
> > > > > > + *        copyright notice, this list of conditions and the following
> > > > > > + *        disclaimer.
> > > > > > + *
> > > > > > + *      - Redistributions in binary form must reproduce the above
> > > > > > + *        copyright notice, this list of conditions and the following
> > > > > > + *        disclaimer in the documentation and/or other materials
> > > > > > + *        provided with the distribution.
> > > > > > + *
> > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > > > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > > > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > > > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > > > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > > > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > > > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > > > > + * SOFTWARE.
> > > > > > + *
> > > > > > + * $Id$
> > > > > > + */
> > > > > > +
> > > > > > +
> > > > > > +#ifndef _OSM_VERSION_H_
> > > > > > +#define _OSM_VERSION_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +# define BEGIN_C_DECLS extern "C" {
> > > > > > +# define END_C_DECLS }
> > > > > > +#else /* !__cplusplus */
> > > > > > +# define BEGIN_C_DECLS
> > > > > > +# define END_C_DECLS
> > > > > > +#endif /* __cplusplus */
> > > > > > +
> > > > > > +BEGIN_C_DECLS
> > > > > > +
> > > > > > +/****s* OpenSM: Base/OSM_VERSION
> > > > > > +* NAME
> > > > > > +* OSM_VERSION
> > > > > > +*
> > > > > > +* DESCRIPTION
> > > > > > +* The version string for OpenSM
> > > > > > +*
> > > > > > +* SYNOPSIS
> > > > > > +*/
> > > > > > +#define OSM_VERSION "OpenSM Rev:openib-1.1.0"
> > > > > > +/********/
> > > > > > +
> > > > > > +END_C_DECLS
> > > > > > +
> > > > > > +#endif /* _OSM_VERSION_H_ */
> > > > > >
> > > > > > Property changes on: osm/include/opensm/osm_version.h
> > > > > > ___________________________________________________________________
> > > > > > Name: svn:keywords
> > > > > > + Id
> > > > > >
> > > > > > Index: osm/include/opensm/osm_base.h
> > > > > > ===================================================================
> > > > > > --- osm/include/opensm/osm_base.h (revision 4686)
> > > > > > +++ osm/include/opensm/osm_base.h (working copy)
> > > > > > @@ -89,18 +89,6 @@ BEGIN_C_DECLS
> > > > > > * Steve King, Intel
> > > > > > *
> > > > > > *********/
> > > > > > -/****s* OpenSM: Base/OSM_VERSION
> > > > > > -* NAME
> > > > > > -* OSM_VERSION
> > > > > > -*
> > > > > > -* DESCRIPTION
> > > > > > -* The version string for OpenSM
> > > > > > -*
> > > > > > -* SYNOPSIS
> > > > > > -*/
> > > > > > -#define OSM_VERSION "OpenSM Rev:openib-1.1.0"
> > > > > > -/********/
> > > > > > -
> > > > > > /****s* OpenSM: Base/OSM_DEFAULT_M_KEY
> > > > > > * NAME
> > > > > > * OSM_DEFAULT_M_KEY
> > > > > > Index: osm/opensm/osm_opensm.c
> > > > > > ===================================================================
> > > > > > --- osm/opensm/osm_opensm.c (revision 4686)
> > > > > > +++ osm/opensm/osm_opensm.c (working copy)
> > > > > > @@ -58,6 +58,7 @@
> > > > > > #include
> > > > > > #include
> > > > > > #include
> > > > > > +#include
> > > > > > #include
> > > > > > #include
> > > > > > #include
> > > > > > Index: osm/opensm/main.c
> > > > > > ===================================================================
> > > > > > --- osm/opensm/main.c (revision 4686)
> > > > > > +++ osm/opensm/main.c (working copy)
> > > > > > @@ -56,6 +56,7 @@
> > > > > > #include
> > > > > > #include
> > > > > > #include
> > > > > > +#include
> > > > > > #include
> > > > > > #include
> > > > > > #include
> > > > > >
> > > > >
> > > >
> > >
> >

From halr at voltaire.com Tue Jan 3 08:15:06 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jan 2006 11:15:06 -0500
Subject: [openib-general] RE: [PATCH] osm: support for trivial PKey manager
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA32B6@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA32B6@mtlexch01.mtl.com>
Message-ID: <1136304905.4331.46475.camel@hal.voltaire.com>

On Sun, 2006-01-01 at 03:14, Ofer Gigi wrote:
> Hi Hal,
>
> 1. About the osm_indent - you are correct - it should have been in
> another patch.

Thanks. Applied.

> 2. Extra spaces - please remove - thanks.

Done.

> 3. > + /* signal = osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); */
> > Why add this commented out line ?
>
> My mistake, I commented the original code and forgot to remove - please
> remove it.
> 4. > # -i3 Substitute indent with 3 spaces
> > # -npcs No space after procedure calls
> > # -prs Space after parenthesis
> > -# -nsai No space after if keyword
> > -# -nsaw No space after while keyword
> > +# -nsai No space after if keyword - removed
> > +# -nsaw No space after while keyword - removed
> > Should these comments just be removed ?
> No, please leave them, so people will know what they mean.

Thanks. Applied.

-- Hal

From eitan at mellanox.co.il Tue Jan 3 09:00:49 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 3 Jan 2006 19:00:49 +0200
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41C@mtlexch01.mtl.com>

>
> On Tue, 2006-01-03 at 10:43, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > Sounds good.
> > I think you should be able to use the .svn/entries to get the last
> > update revision and then use svn diff (or diff) to see if local mods are
> > done on top of it...
>
> I'm using .svn/entries at the osm level.
>
[EZ] Do you agree flagging local modifications (that happened after the svn up) is important?

> > So we do not get caught by surprise when something broke due to
> > un-committed mod in the local directory
> > Thanks
> >
> > Eitan Zahavi
> > Design Technology Director
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> > > -----Original Message-----
> > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > Sent: Tuesday, January 03, 2006 5:17 PM
> > > To: Eitan Zahavi
> > > Cc: openib-general at openib.org
> > > Subject: RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
> > >
> > > On Tue, 2006-01-03 at 09:42, Eitan Zahavi wrote:
> > > > Thanks. Can you elaborate for how that file " osm_svn_revision.h" will
> > > > be updated?
> > > > Is it going to be updated by the "autogen.sh" ? or by a checkin trigger?
> > >
> > > Neither; I'm planning to have it updated by the make when needed.
> > >
> > > -- Hal

[snip...]

From RAISCH at de.ibm.com Tue Jan 3 08:55:43 2006
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Tue, 3 Jan 2006 17:55:43 +0100
Subject: [openib-general] strange behaviour in svn 4706 libibverbs/init.c with two adapters
Message-ID:

Hi Roland,

could it be that
libibverbs/src/init.c:271 *list[num_devices++]
actually should be a (*list)[num_devices++] for num_devices>1 ?

I personally can't follow anymore when I see a ***list, but my system here
thinks that

&(*list[num_devices++]) for num_devices=0 is 0x10017c10
&(*list[num_devices++]) for num_devices=1 is (nil)

&((*list)[num_devices++]) for num_devices=0 is 0x10017c10
&((*list)[num_devices++]) for num_devices=1 is 0x10017c18

To me the second one looks better...
If you just have one IB adapter you won't see a difference...

Gruss / Regards . . . Christoph R.

From mst at mellanox.co.il Tue Jan 3 09:11:58 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 3 Jan 2006 19:11:58 +0200
Subject: [openib-general] Re: strange behaviour in svn 4706 libibverbs/init.c with two adapters
In-Reply-To:
References:
Message-ID: <20060103171158.GE2790@mellanox.co.il>

Quoting r. Christoph Raisch :
> libibverbs/src/init.c:271 *list[num_devices++]

wc -l libibverbs/src/init.c
260 libibverbs/src/init.c

Hmm?

-- MST

From halr at voltaire.com Tue Jan 3 09:00:34 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jan 2006 12:00:34 -0500
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41C@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41C@mtlexch01.mtl.com>
Message-ID: <1136307632.4331.46990.camel@hal.voltaire.com>

On Tue, 2006-01-03 at 12:00, Eitan Zahavi wrote:
> >
> > On Tue, 2006-01-03 at 10:43, Eitan Zahavi wrote:
> > > Hi Hal,
> > >
> > > Sounds good.
> > > I think you should be able to use the .svn/entries to get the last
> > > update revision and then use svn diff (or diff) to see if local mods are
> > > done on top of it...
> >
> > I'm using .svn/entries at the osm level.
> >
> [EZ] Do you agree flagging local modifications (that happened after the
> svn up) is important?

Yes, but I'm not sure what you are proposing here.

> > > So we do not get caught by surprise when something broke due to
> > > un-committed mod in the local directory
> > > Thanks

[snip...]

From sean.hefty at intel.com Tue Jan 3 09:47:28 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 3 Jan 2006 09:47:28 -0800
Subject: [openib-general] RE: [PATCH] fix race in mad.c
In-Reply-To: <20060103152942.GB2790@mellanox.co.il>
Message-ID:

>After removing the port from port_list, ib_mad_port_close flushes port_priv->wq
>before destroying the special QPs. This means that a completion event could
>arrive, and queue a new work in this work queue after flush.
>
>Signed-off-by: Eli Cohen
>Signed-off-by: Michael S.
Tsirkin
>
>Index: latest/drivers/infiniband/core/mad.c
>===================================================================
>--- latest.orig/drivers/infiniband/core/mad.c
>+++ latest/drivers/infiniband/core/mad.c
>@@ -2285,8 +2285,17 @@ static void timeout_sends(void *data)
> static void ib_mad_thread_completion_handler(struct ib_cq *cq, void *arg)
> {
>     struct ib_mad_port_private *port_priv = cq->cq_context;
>+    struct ib_mad_port_private *entry;
>+    unsigned long flags;
>+
>+    spin_lock_irqsave(&ib_mad_port_list_lock, flags);
>+    list_for_each_entry(entry, &ib_mad_port_list, port_list)
>+        if (entry == port_priv) {
>+            queue_work(port_priv->wq, &port_priv->work);
>+            break;
>+        }
>
>-    queue_work(port_priv->wq, &port_priv->work);
>+    spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
> }

There should be some way to fix this that doesn't involve walking a list on
every completion. Can't the cleanup be changed? Either move destroying the QP
after the workqueue flush or transition it to the error state before flushing.

- Sean

From jlentini at netapp.com Tue Jan 3 09:51:06 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 3 Jan 2006 12:51:06 -0500 (EST)
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1142245@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1142245@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID:

On Fri, 30 Dec 2005, Caitlin Bestler wrote:

> openib-general-bounces at openib.org wrote:
> > sean> James Lentini wrote:
> > sean> > Why is the ib_sge's addr a u64 and not a dma_addr_t?
> > sean> >
> > sean> It's the same address that the user can transfer to the remote
> > sean> side.
> >
> > It can be the same address, but does it have to be?
> >
> > A user can directly map local addresses to InfiniBand I/O
> > virtual addresses, but I don't think it is a requirement.
In > > other words, I thought that user could register address x and > > request an InfiniBand I/O virtual address of y, x != y, for > > the mapping. > > > > I understand why the ib_send_wr's rdma.remote_addr needs to > > be a u64, since it ultimately winds up on the wire. > > > > In the case of the ib_sge's addr, I didn't think these values > > left the local node. My assumption (based on looking at the > > mthca driver) is that they are supposed to contain "local" > > I/O addresses (bus addresses). Therefore, my confusion over > > why dma_addr_t wasn't used. > > A privileged user, such as an NFS Daemon or iSER iSCSI Target, can > and will create Memory Regions that are not part of its own address > space out of page buffers. Even running on a 32-bit kernel it might > create a memory region larger than 2**32. > > Admittedly, that isn't very likely unless it is the *only* daemon > running on the machine. But it is legal. How does the size of a region relate to my original question on the type of an ib_sge.addr? From halr at voltaire.com Tue Jan 3 09:43:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 12:43:33 -0500 Subject: [openib-general] [PATCH] OpenSM: include OpenIB svn version when OpenIB build Message-ID: <1136310211.4331.47477.camel@hal.voltaire.com> OpenSM: include OpenIB svn version when OpenIB build Signed-off-by: Hal Rosenstock Index: osm_opensm.c =================================================================== --- osm_opensm.c (revision 4716) +++ osm_opensm.c (working copy) @@ -59,6 +59,9 @@ #include #include #include +#ifdef OSM_VENDOR_INTF_OPENIB +#include +#endif #include #include #include @@ -206,12 +209,33 @@ osm_opensm_init( if( status != IB_SUCCESS ) return ( status ); +#ifndef OSM_VENDOR_INTF_OPENIB /* If there is a log level defined - add the OSM_VERSION to it. 
*/ osm_log( &p_osm->log, osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", OSM_VERSION ); /* Write the OSM_VERSION to the SYS_LOG */ osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ +#else + if (strlen(OSM_SVN_REVISION)) + { + /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */ + osm_log( &p_osm->log, + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n", + OSM_VERSION, OSM_SVN_REVISION ); + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ + } + else + { + /* If there is a log level defined - add the OSM_VERSION to it. */ + osm_log( &p_osm->log, + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", + OSM_VERSION ); + /* Write the OSM_VERSION to the SYS_LOG */ + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ + } +#endif osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */ Index: main.c =================================================================== --- main.c (revision 4716) +++ main.c (working copy) @@ -57,6 +57,9 @@ #include #include #include +#ifdef OSM_VENDOR_INTF_OPENIB +#include +#endif #include #include #include @@ -522,6 +525,10 @@ main( printf("-------------------------------------------------\n"); printf("%s\n", OSM_VERSION); +#if defined ( OSM_VENDOR_INTF_OPENIB ) + if (strlen(OSM_SVN_REVISION)) + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); +#endif osm_subn_set_default_opt(&opt); osm_subn_parse_conf_file(&opt); Index: Makefile.am =================================================================== --- Makefile.am (revision 4716) +++ Makefile.am (working copy) @@ -9,6 +9,22 @@ else DBGFLAGS = -g -O2 endif +if OSMV_OPENIB +$(srcdir)/../include/opensm/osm_svn_revision.h: + if test -f $(srcdir)/../.svn/entries; then \ + grep revision 
$(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' >$(srcdir)/../include/opensm/osm_svn_revision.h; \
+	else \
+		echo "#define OSM_SVN_REVISION \"\"" >$(srcdir)/../include/opensm/osm_svn_revision.h; \
+	fi
+
+main.c: $(srcdir)/../include/opensm/osm_svn_revision.h
+	if test -f $(srcdir)/../include/opensm/osm_svn_revision.h; then \
+		if test -f $(srcdir)/../.svn/entries; then \
+			grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' >$(srcdir)/../include/opensm/osm_svn_revision.h; \
+		fi \
+	fi
+endif
+
 libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
 
 if HAVE_LD_VERSION_SCRIPT

From greg at kroah.com Tue Jan 3 09:27:32 2006
From: greg at kroah.com (Greg KH)
Date: Tue, 3 Jan 2006 09:27:32 -0800
Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver
In-Reply-To: <1135993250.13318.94.camel@serpentine.pathscale.com>
References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com>
Message-ID: <20060103172732.GA9170@kroah.com>

On Fri, Dec 30, 2005 at 05:40:50PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote:
>
> > But we (the kernel community), don't really accept that as a valid
> > reason to accept this kind of code, sorry.
>
> Fair enough. I'd like some guidance in that case. Some of our ioctls
> access the hardware more or less directly, while others do things like
> read or reset counters.
>
> Which of these kinds of operations are appropriate to retain as ioctls,
> in your eyes, and which are best converted to sysfs or configfs
> alternatives?

Ideally, nothing should be new ioctls. But in the end, it all depends on
exactly what you are trying to do with each different one.

> As an example, take a look at ipath_sma_ioctl.
It seems to me that
> receiving or sending subnet management packets ought to remain as
> ioctls, while getting port or node data could be turned into sysfs
> attributes. Lane identification could live in configfs. If you think
> otherwise, please let me know what's more appropriate.

I really don't know what the subnet management stuff involves, sorry.
But doesn't the open-ib layer handle that all for you already?

thanks,

greg k-h

From caitlinb at broadcom.com Tue Jan 3 09:54:53 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 3 Jan 2006 09:54:53 -0800
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
Message-ID: <54AD0F12E08D1541B826BE97C98F99F114231B@NT-SJCA-0751.brcm.ad.broadcom.com>

James Lentini wrote:
> On Fri, 30 Dec 2005, Caitlin Bestler wrote:
>
>> openib-general-bounces at openib.org wrote:
>>> sean> James Lentini wrote:
>>> sean> > Why is the ib_sge's addr a u64 and not a dma_addr_t?
>>> sean>
>>> sean> It's the same address that the user can transfer to the
>>> sean> remote side.
>>>
>>> It can be the same address, but does it have to be?
>>>
>>> A user can directly map local addresses to InfiniBand I/O virtual
>>> addresses, but I don't think it is a requirement. In other words, I
>>> thought that user could register address x and request an InfiniBand
>>> I/O virtual address of y, x != y, for the mapping.
>>>
>>> I understand why the ib_send_wr's rdma.remote_addr needs to be a
>>> u64, since it ultimately winds up on the wire.
>>>
>>> In the case of the ib_sge's addr, I didn't think these values left
>>> the local node. My assumption (based on looking at the mthca driver)
>>> is that they are supposed to contain "local"
>>> I/O addresses (bus addresses). Therefore, my confusion over why
>>> dma_addr_t wasn't used.
>>
>> A privileged user, such as an NFS Daemon or iSER iSCSI Target, can
>> and will create Memory Regions that are not part of its own address
>> space out of page buffers.
Even running on a 32-bit kernel it might
>> create a memory region larger than 2**32.
>>
>> Admittedly, that isn't very likely unless it is the *only* daemon
>> running on the machine. But it is legal.
>
> How does the size of a region relate to my original question
> on the type of an ib_sge.addr?

The address has to be big enough to enumerate all bytes in the
region. If a region is not constrained to fit within an existing
virtual memory map then it is not constrained by the size of
a virtual address, only by the size of MRs and of physical addresses.

From mst at mellanox.co.il Tue Jan 3 10:05:46 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jan 2006 20:05:46 +0200
Subject: [openib-general] Re: [PATCH] fix race in mad.c
In-Reply-To:
References:
Message-ID: <20060103180546.GF2790@mellanox.co.il>

Quoting Sean Hefty :
> There should be some way to fix this that doesn't involve walking a list on
> every completion. Can't the cleanup be changed?

I guess we could set some kind of flag. Is this better?
And we still have to take a spinlock across the entire operation.

> Either move destroying the QP after the workqueue flush or transition it to
> the error state before flushing.

How would this help? As far as I know neither will flush completion events.
You'd have to destroy the CQ for this.

--
MST

From jlentini at netapp.com Tue Jan 3 10:10:53 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 3 Jan 2006 13:10:53 -0500 (EST)
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
In-Reply-To: <469958e00512301942h29490c39xe82bbe5118371732@mail.gmail.com>
References: <54AD0F12E08D1541B826BE97C98F99F1142243@NT-SJCA-0751.brcm.ad.broadcom.com> <469958e00512301942h29490c39xe82bbe5118371732@mail.gmail.com>
Message-ID:

On Fri, 30 Dec 2005, Caitlin Bestler wrote:

caitlin> On 12/30/05, James Lentini wrote:
caitlin> > caitlin> > caitlin> > caitlin> > One more question on this topic.
caitlin> > caitlin> > caitlin> > caitlin> > Why is the ib_sge's addr a u64 and not a dma_addr_t? caitlin> > caitlin> caitlin> > caitlin> Because the hardware may need for it to be a 64 bit caitlin> > caitlin> IO Address accessible on the system bus. That applies caitlin> > caitlin> to the whole system bus, no matter how many PCI roots caitlin> > caitlin> or virtual OSs there are. caitlin> > caitlin> caitlin> > caitlin> In particular there could be a guest OS that was caitlin> > caitlin> running in 32-bit mode, and the RDMA hardware receiving caitlin> > caitlin> fast path requests will not support different caitlin> > caitlin> work request formats for each guest OS. caitlin> > caitlin> > Let me back up a step and explain the context for this question. caitlin> > caitlin> > As you know, our goal is to use the Linux IB verbs as a caitlin> > hardware/protocol independent RDMA API. I'm reviewing my use of the caitlin> > API to make sure that it does not make any particular assumptions. caitlin> > caitlin> > Roland stated that a scatter/gather list's address value should be a caitlin> > bus address: caitlin> > caitlin> > http://openib.org/pipermail/openib-general/2005-August/009748.html caitlin> > caitlin> caitlin> That depends on whether it is part of a registered memory space, caitlin> or being used to specify a new registered memory space (i.e. it is caitlin> for a memory register operation). That is not an issue. The memory registration functions (ib_reg_phys_mr(), etc.) describe memory regions using a ib_phys_buf structure. caitlin> When *using* an already established memory region, the address caitlin> is interpreted in the context of that memory region. The size of caitlin> address within an RDMA managed memory regions is always caitlin> 64 bits. No matter which transport or what processor. That is caitlin> extremely unlikely to change (in fact I think the R-Key/L-Key/ caitlin> STag size would increase to 64-bits before the address size caitlin> itself changed. 
But I'm expecting that a 96-bit logical address
caitlin> space should be adequate for quite some time).
caitlin>
caitlin> When creating a memory region the "physical address" is
caitlin> really a bus address, which on a strictly local basis could
caitlin> be 32 or 64 bits. If you were trying to generalize that, the
caitlin> "physical address" is a "RDMA Device accessible address",
caitlin> which on anything even vaguely PCI-ish is a bus address.
caitlin>
caitlin> But just as the distinction between "physical address" and
caitlin> "bus address" would not have been anticipated in the past,
caitlin> there may be some other distinction that we are not
caitlin> anticipating yet. So, in that context, the Memory Region
caitlin> defines the translation from logical addresses with the
caitlin> context of a Memory Region (most typically a subset of an
caitlin> existing virtual memory map) to addresses that the RDMA
caitlin> device can use to access the same memory. Whatever that
caitlin> distinction is, I'm sure it will be relevant before another
caitlin> decade goes by.

Are you arguing that an ib_sge addr should be a u64 or a dma_addr_t?

From sean.hefty at intel.com Tue Jan 3 10:13:37 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 3 Jan 2006 10:13:37 -0800
Subject: [openib-general] [PATCH] [ib_addr] generalize address to RDMA device translation
Message-ID:

The following patch changes the ib_addr interface to make it more generic.
The interface now translates IP addresses directly into source and destination
device and broadcast addresses. The CMA is then responsible for interpreting
these addresses as GIDs/PKey or MAC addresses. The intent is that this will
simplify integrating support for other RDMA devices in the CMA.

I'd like to get some feedback from the iWarp community on whether this approach
works for them, or if different/additional changes are needed.
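
[Illustration only, not part of the patch.] The "interpreting these addresses as GIDs/PKey" step relies on the IPoIB hardware address layout: as far as I understand it, the 16-byte GID follows a 4-byte prefix, and the P_Key sits big-endian in bytes 8 and 9 of the broadcast address. The userspace sketch below mirrors the byte arithmetic of the patch's ib_addr_get_pkey()/ib_addr_set_sgid() helpers. Every sketch_* name is hypothetical, and the buffer length of 32 is an assumption standing in for the kernel's MAX_ADDR_LEN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32 stands in for the kernel's MAX_ADDR_LEN; any value >= 20 works here. */
#define SKETCH_MAX_ADDR_LEN 32
#define SKETCH_GID_LEN 16

/* Hypothetical userspace stand-in for the patch's struct rdma_dev_addr (dev_type omitted). */
struct sketch_dev_addr {
	unsigned char src_dev_addr[SKETCH_MAX_ADDR_LEN];
	unsigned char dst_dev_addr[SKETCH_MAX_ADDR_LEN];
	unsigned char broadcast[SKETCH_MAX_ADDR_LEN];
};

/* The P_Key is read big-endian from bytes 8 and 9 of the broadcast address. */
uint16_t sketch_get_pkey(const struct sketch_dev_addr *a)
{
	return (uint16_t)((a->broadcast[8] << 8) | a->broadcast[9]);
}

void sketch_set_pkey(struct sketch_dev_addr *a, uint16_t pkey)
{
	a->broadcast[8] = (unsigned char)(pkey >> 8);
	a->broadcast[9] = (unsigned char)(pkey & 0xff);
}

/* The 16-byte source GID is stored after a 4-byte prefix in the hardware address. */
void sketch_set_sgid(struct sketch_dev_addr *a, const unsigned char *gid)
{
	assert(gid != NULL);
	memcpy(a->src_dev_addr + 4, gid, SKETCH_GID_LEN);
}

const unsigned char *sketch_get_sgid(const struct sketch_dev_addr *a)
{
	return a->src_dev_addr + 4;
}

/* Returns 0 when both values survive a store/load round trip, -1 otherwise. */
int sketch_roundtrip_check(void)
{
	struct sketch_dev_addr a;
	unsigned char gid[SKETCH_GID_LEN];
	int i;

	memset(&a, 0, sizeof a);
	for (i = 0; i < SKETCH_GID_LEN; i++)
		gid[i] = (unsigned char)(0xf0 + i);

	sketch_set_pkey(&a, 0x8001);
	sketch_set_sgid(&a, gid);

	if (sketch_get_pkey(&a) != 0x8001)
		return -1;
	if (memcmp(sketch_get_sgid(&a), gid, SKETCH_GID_LEN) != 0)
		return -1;
	return 0;
}
```

Calling sketch_roundtrip_check() from a small main() is enough to exercise it. The point of keeping raw byte arrays plus accessors is that a non-IB device could interpret the same buffers as MAC addresses without changing the structure.
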
Signed-off-by: Sean Hefty Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 4651) +++ include/rdma/rdma_cm.h (working copy) @@ -68,9 +68,7 @@ struct rdma_addr { struct sockaddr dst_addr; u8 dst_pad[sizeof(struct sockaddr_in6) - sizeof(struct sockaddr)]; - union { - struct ib_addr ibaddr; - } addr; + struct rdma_dev_addr dev_addr; }; struct rdma_route { Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 4654) +++ include/rdma/ib_addr.h (working copy) @@ -32,26 +32,28 @@ #include #include +#include #include #include extern struct workqueue_struct *rdma_wq; -struct ib_addr { - union ib_gid sgid; - union ib_gid dgid; - u16 pkey; +struct rdma_dev_addr { + unsigned char src_dev_addr[MAX_ADDR_LEN]; + unsigned char dst_dev_addr[MAX_ADDR_LEN]; + unsigned char broadcast[MAX_ADDR_LEN]; + enum ib_node_type dev_type; }; /** - * ib_translate_addr - Translate a local IP address to an Infiniband GID and - * PKey. + * rdma_translate_ip - Translate a local IP address to an RDMA hardware + * address. */ -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey); +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr); /** - * ib_resolve_addr - Resolve source and destination IP addresses to - * Infiniband network addresses. + * rdma_resolve_ip - Resolve source and destination IP addresses to + * RDMA hardware addresses. * @src_addr: An optional source address to use in the resolution. If a * source address is not provided, a usable address will be returned via * the callback. @@ -64,13 +66,13 @@ int ib_translate_addr(struct sockaddr *a * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. 
*/ -int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct ib_addr *addr, int timeout_ms, +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, - struct ib_addr *addr, void *context), + struct rdma_dev_addr *addr, void *context), void *context); -void ib_addr_cancel(struct ib_addr *addr); +void rdma_addr_cancel(struct rdma_dev_addr *addr); static inline int ip_addr_size(struct sockaddr *addr) { @@ -78,5 +80,38 @@ static inline int ip_addr_size(struct so sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); } +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) +{ + return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9]; +} + +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey) +{ + dev_addr->broadcast[8] = pkey >> 8; + dev_addr->broadcast[9] = (unsigned char) pkey; +} + +static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->src_dev_addr + 4); +} + +static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); +} + +static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->dst_dev_addr + 4); +} + +static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); +} + #endif /* IB_ADDR_H */ Index: core/addr.c =================================================================== --- core/addr.c (revision 4654) +++ core/addr.c (working copy) @@ -42,10 +42,10 @@ struct addr_req { struct list_head list; struct sockaddr src_addr; struct sockaddr dst_addr; - struct ib_addr *addr; + struct rdma_dev_addr *addr; void *context; void (*callback)(int status, struct sockaddr *src_addr, - 
struct ib_addr *addr, void *context); + struct rdma_dev_addr *addr, void *context); unsigned long timeout; int status; }; @@ -58,26 +58,39 @@ static DECLARE_WORK(work, process_req, N struct workqueue_struct *rdma_wq; EXPORT_SYMBOL(rdma_wq); -static u16 addr_get_pkey(struct net_device *dev) +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + unsigned char *dst_dev_addr) { - return ((u16)dev->broadcast[8] << 8) | (u16)dev->broadcast[9]; + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; } -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey) +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { struct net_device *dev; u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; dev = ip_dev_find(ip); if (!dev) return -EADDRNOTAVAIL; - *gid = *(union ib_gid *) (dev->dev_addr + 4); - *pkey = addr_get_pkey(dev); + ret = copy_addr(dev_addr, dev, NULL); dev_put(dev); - return 0; + return ret; } -EXPORT_SYMBOL(ib_translate_addr); +EXPORT_SYMBOL(rdma_translate_ip); static void set_timeout(unsigned long time) { @@ -127,7 +140,7 @@ static void addr_send_arp(struct sockadd static int addr_resolve_remote(struct sockaddr_in *src_in, struct sockaddr_in *dst_in, - struct ib_addr *addr) + struct rdma_dev_addr *addr) { u32 src_ip = src_in->sin_addr.s_addr; u32 dst_ip = dst_in->sin_addr.s_addr; @@ -159,10 +172,7 @@ static int addr_resolve_remote(struct so src_in->sin_addr.s_addr = rt->rt_src; } - addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4); - addr->dgid = *(union ib_gid *) (neigh->ha + 4); - addr->pkey = addr_get_pkey(neigh->dev); - + ret = copy_addr(addr, neigh->dev, neigh->ha); err2: 
neigh_release(neigh); err1: @@ -212,12 +222,12 @@ static void process_req(void *data) static int addr_resolve_local(struct sockaddr_in *src_in, struct sockaddr_in *dst_in, - struct ib_addr *addr) + struct rdma_dev_addr *addr) { struct net_device *dev; u32 src_ip = src_in->sin_addr.s_addr; u32 dst_ip = dst_in->sin_addr.s_addr; - int ret = 0; + int ret; dev = ip_dev_find(dst_ip); if (!dev) @@ -226,25 +236,21 @@ static int addr_resolve_local(struct soc if (!src_ip) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; - addr->sgid = *(union ib_gid *) (dev->dev_addr + 4); - addr->pkey = addr_get_pkey(dev); + ret = copy_addr(addr, dev, dev->dev_addr); } else { - ret = ib_translate_addr((struct sockaddr *)src_in, - &addr->sgid, &addr->pkey); - if (ret) - goto out; + ret = rdma_translate_ip((struct sockaddr *)src_in, addr); + if (!ret) + memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); } - addr->dgid = *(union ib_gid *) (dev->dev_addr + 4); -out: dev_put(dev); return ret; } -int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct ib_addr *addr, int timeout_ms, +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, - struct ib_addr *addr, void *context), + struct rdma_dev_addr *addr, void *context), void *context) { struct sockaddr_in *src_in, *dst_in; @@ -287,9 +293,9 @@ int ib_resolve_addr(struct sockaddr *src } return ret; } -EXPORT_SYMBOL(ib_resolve_addr); +EXPORT_SYMBOL(rdma_resolve_ip); -void ib_addr_cancel(struct ib_addr *addr) +void rdma_addr_cancel(struct rdma_dev_addr *addr) { struct addr_req *req, *temp_req; @@ -306,7 +312,7 @@ void ib_addr_cancel(struct ib_addr *addr } up(&mutex); } -EXPORT_SYMBOL(ib_addr_cancel); +EXPORT_SYMBOL(rdma_addr_cancel); static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pkt, struct net_device *orig_dev) Index: 
core/cma.c =================================================================== --- core/cma.c (revision 4655) +++ core/cma.c (working copy) @@ -213,12 +213,14 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv, - union ib_gid *gid) +static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; + union ib_gid *gid; int ret = -ENODEV; + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + down(&mutex); list_for_each_entry(cma_dev, &dev_list, list) { ret = ib_find_cached_gid(cma_dev->device, gid, @@ -232,6 +234,16 @@ static int cma_acquire_ib_dev(struct rdm return ret; } +static int cma_acquire_dev(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.route.addr.dev_addr.dev_type) { + case IB_NODE_CA: + return cma_acquire_ib_dev(id_priv); + default: + return -ENODEV; + } +} + static void cma_deref_id(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->refcount)) @@ -272,11 +284,13 @@ EXPORT_SYMBOL(rdma_create_id); static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) { struct ib_qp_attr qp_attr; - struct ib_addr *ibaddr = &id_priv->id.route.addr.addr.ibaddr; + struct rdma_dev_addr *dev_addr; int ret; + dev_addr = &id_priv->id.route.addr.dev_addr; ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, - ibaddr->pkey, &qp_attr.pkey_index); + ib_addr_get_pkey(dev_addr), + &qp_attr.pkey_index); if (ret) return ret; @@ -520,7 +534,7 @@ static void cma_cancel_addr(struct rdma_ { switch (id_priv->id.device->node_type) { case IB_NODE_CA: - ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr); + rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); break; default: break; @@ -760,9 +774,10 @@ static struct rdma_id_private* cma_new_i if (rt->num_paths == 2) rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; - rt->addr.addr.ibaddr.sgid = rt->path_rec[0].sgid; - rt->addr.addr.ibaddr.dgid = 
rt->path_rec[0].dgid; - rt->addr.addr.ibaddr.pkey = be16_to_cpu(rt->path_rec[0].pkey); + ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); + ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); + ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); + rt->addr.dev_addr.dev_type = IB_NODE_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -791,7 +806,7 @@ static int cma_req_handler(struct ib_cm_ } atomic_inc(&conn_id->dev_remove); - ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); + ret = cma_acquire_ib_dev(conn_id); if (ret) { ret = -ENODEV; cma_release_remove(conn_id); @@ -1028,13 +1043,13 @@ out: static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { - struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; + struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct ib_sa_path_rec path_rec; memset(&path_rec, 0, sizeof path_rec); - path_rec.sgid = addr->sgid; - path_rec.dgid = addr->dgid; - path_rec.pkey = cpu_to_be16(addr->pkey); + path_rec.sgid = *ib_addr_get_sgid(addr); + path_rec.dgid = *ib_addr_get_dgid(addr); + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, @@ -1079,6 +1094,8 @@ EXPORT_SYMBOL(rdma_resolve_route); static int cma_bind_loopback(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; + union ib_gid *gid; + u16 pkey; int ret; down(&mutex); @@ -1088,16 +1105,16 @@ static int cma_bind_loopback(struct rdma } cma_dev = list_entry(dev_list.next, struct cma_device, list); - ret = ib_get_cached_gid(cma_dev->device, 1, 0, - &id_priv->id.route.addr.addr.ibaddr.sgid); + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + ret = ib_get_cached_gid(cma_dev->device, 1, 0, gid); if (ret) goto out; - ret = ib_get_cached_pkey(cma_dev->device, 1, 0, - &id_priv->id.route.addr.addr.ibaddr.pkey); + ret = 
ib_get_cached_pkey(cma_dev->device, 1, 0, &pkey); if (ret) goto out; + ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); id_priv->id.port_num = 1; cma_attach_to_dev(id_priv, cma_dev); out: @@ -1106,7 +1123,7 @@ out: } static void addr_handler(int status, struct sockaddr *src_addr, - struct ib_addr *ibaddr, void *context) + struct rdma_dev_addr *dev_addr, void *context) { struct rdma_id_private *id_priv = context; enum rdma_cm_event_type event; @@ -1116,7 +1133,7 @@ static void addr_handler(int status, str if (!id_priv->cma_dev) { old_state = CMA_IDLE; if (!status) - status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); + status = cma_acquire_dev(id_priv); } else old_state = CMA_ADDR_BOUND; @@ -1169,6 +1186,7 @@ static int cma_resolve_loopback(struct r struct sockaddr *src_addr, enum cma_state state) { struct work_struct *work; + struct rdma_dev_addr *dev_addr; int ret; work = kmalloc(sizeof *work, GFP_KERNEL); @@ -1179,8 +1197,8 @@ static int cma_resolve_loopback(struct r ret = cma_bind_loopback(id_priv); if (ret) goto err; - id_priv->id.route.addr.addr.ibaddr.dgid = - id_priv->id.route.addr.addr.ibaddr.sgid; + dev_addr = &id_priv->id.route.addr.dev_addr; + ib_addr_set_dgid(dev_addr, ib_addr_get_sgid(dev_addr)); if (!src_addr || cma_any_addr(src_addr)) src_addr = &id_priv->id.route.addr.dst_addr; memcpy(&id_priv->id.route.addr.src_addr, src_addr, @@ -1217,8 +1235,8 @@ int rdma_resolve_addr(struct rdma_cm_id if (cma_loopback_addr(dst_addr)) ret = cma_resolve_loopback(id_priv, src_addr, expected_state); else - ret = ib_resolve_addr(src_addr, dst_addr, - &id->route.addr.addr.ibaddr, + ret = rdma_resolve_ip(src_addr, dst_addr, + &id->route.addr.dev_addr, timeout_ms, addr_handler, id_priv); if (ret) goto err; @@ -1234,7 +1252,7 @@ EXPORT_SYMBOL(rdma_resolve_addr); int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { struct rdma_id_private *id_priv; - struct ib_addr *ibaddr; + struct rdma_dev_addr *dev_addr; int ret; if (addr->sa_family != AF_INET) 
@@ -1249,10 +1267,10 @@ int rdma_bind_addr(struct rdma_cm_id *id } else if (cma_loopback_addr(addr)) { ret = cma_bind_loopback(id_priv); } else { - ibaddr = &id->route.addr.addr.ibaddr; - ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); + dev_addr = &id->route.addr.dev_addr; + ret = rdma_translate_ip(addr, dev_addr); if (!ret) - ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); + ret = cma_acquire_dev(id_priv); } if (ret) Index: core/ucma.c =================================================================== --- core/ucma.c (revision 4651) +++ core/ucma.c (working copy) @@ -419,17 +419,17 @@ static ssize_t ucma_resolve_route(struct static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, struct rdma_route *route) { - struct ib_addr *ibaddr; + struct rdma_dev_addr *dev_addr; resp->num_paths = route->num_paths; switch (route->num_paths) { case 0: - ibaddr = &route->addr.addr.ibaddr; - memcpy(&resp->ib_route[0].dgid, ibaddr->dgid.raw, - sizeof ibaddr->dgid); - memcpy(&resp->ib_route[0].sgid, ibaddr->sgid.raw, - sizeof ibaddr->sgid); - resp->ib_route[0].pkey = cpu_to_be16(ibaddr->pkey); + dev_addr = &route->addr.dev_addr; + memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), + sizeof(union ib_gid)); + memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr), + sizeof(union ib_gid)); + resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); break; case 2: ib_copy_path_rec_to_user(&resp->ib_route[1], From caitlinb at broadcom.com Tue Jan 3 10:46:40 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 3 Jan 2006 10:46:40 -0800 Subject: [openib-general] [PATCH] [ib_addr] generalize address to RDMA device translation Message-ID: <54AD0F12E08D1541B826BE97C98F99F114232D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > The following patch changes the ib_addr interface to make it > more generic. 
> The interface now translates IP addresses directly into
> source and destination device and broadcast addresses. The
> CMA is then responsible for interpreting these addresses as
> GIDs/PKey or MAC addresses. The intent is that this will
> simplify integrating support for other RDMA devices in the CMA.
>

My understanding of these is that they should all translate
pretty much to nops for IP networks. But I'd like to review
that, and have definitions that make that clear. There are
times where fetching from a lower layer, say to fetch an Ethernet
MAC address or a VLAN ID, could violate layering in an iWARP
implementation and might have actual performance hits to fetch
data that was not naturally available to the RDMA layer.

More in-line.

> I'd like to get some feedback from the iWarp community on
> whether this approach works for them, or if
> different/additional changes are needed.
>
> Signed-off-by: Sean Hefty
>
>
> Index: include/rdma/rdma_cm.h
> ===================================================================
> --- include/rdma/rdma_cm.h (revision 4651)
> +++ include/rdma/rdma_cm.h (working copy)
> @@ -68,9 +68,7 @@ struct rdma_addr {
>  	struct sockaddr dst_addr;
>  	u8 dst_pad[sizeof(struct sockaddr_in6) -
>  		sizeof(struct sockaddr)];
> -	union {
> -		struct ib_addr ibaddr;
> -	} addr;
> +	struct rdma_dev_addr dev_addr;
>  };
>

"dev_addr" is any clarifying/lower layer address that
matches the sockaddr *and is required by the RDMA layer*.
If the RDMA layer does not require this data the contents
may be logically void. This interface MUST NOT be used
to query for a deterministic lower layer address.

Rationale: if the RDMA layer is cleanly layered over
IP then the MAC layer address is not accessible to it
without doing additional queries. It is not a natural
by-product and should not be automatically fetched,
at extra cost, when most applications do not care.
> struct rdma_route { > Index: include/rdma/ib_addr.h > =================================================================== > --- include/rdma/ib_addr.h (revision 4654) > +++ include/rdma/ib_addr.h (working copy) > @@ -32,26 +32,28 @@ > > #include > #include > +#include > #include > #include > > extern struct workqueue_struct *rdma_wq; > > -struct ib_addr { > - union ib_gid sgid; > - union ib_gid dgid; > - u16 pkey; > +struct rdma_dev_addr { > + unsigned char src_dev_addr[MAX_ADDR_LEN]; > + unsigned char dst_dev_addr[MAX_ADDR_LEN]; > + unsigned char broadcast[MAX_ADDR_LEN]; > + enum ib_node_type dev_type; > }; > > /** > - * ib_translate_addr - Translate a local IP address to an > Infiniband GID and > - * PKey. > + * rdma_translate_ip - Translate a local IP address to an > RDMA hardware > + * address. > */ > -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 > *pkey); +int rdma_translate_ip(struct sockaddr *addr, struct > rdma_dev_addr +*dev_addr); > > /** > - * ib_resolve_addr - Resolve source and destination IP addresses to > - * Infiniband network addresses. > + * rdma_resolve_ip - Resolve source and destination IP addresses to > + * RDMA hardware addresses. > * @src_addr: An optional source address to use in the > resolution. If a > * source address is not provided, a usable address will > be returned via > * the callback. > @@ -64,13 +66,13 @@ int ib_translate_addr(struct sockaddr *a > * or been canceled. A status of 0 indicates success. > * @context: User-specified context associated with the call. 
*/ > -int ib_resolve_addr(struct sockaddr *src_addr, struct > sockaddr *dst_addr, > - struct ib_addr *addr, int timeout_ms, > +int rdma_resolve_ip(struct sockaddr *src_addr, struct > sockaddr *dst_addr, > + struct rdma_dev_addr *addr, int timeout_ms, > void (*callback)(int status, struct > sockaddr *src_addr, > - struct ib_addr *addr, void > *context), > + struct rdma_dev_addr > *addr, void *context), > void *context); > > -void ib_addr_cancel(struct ib_addr *addr); > +void rdma_addr_cancel(struct rdma_dev_addr *addr); > > static inline int ip_addr_size(struct sockaddr *addr) { @@ > -78,5 +80,38 @@ static inline int ip_addr_size(struct so > sizeof(struct sockaddr_in6) : sizeof(struct > sockaddr_in); } > > +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) { > + return ((u16)dev_addr->broadcast[8] << 8) | > +(u16)dev_addr->broadcast[9]; } > + > +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, > u16 +pkey) { > + dev_addr->broadcast[8] = pkey >> 8; > + dev_addr->broadcast[9] = (unsigned char) pkey; } > + The closest IP equivalent of a pkey is the VLAN. It is dealt with well below the transport layer and is not visible to the RDMA layer at all. These routines should not attempt to define some sort of transport neutral PKEY/VLAN ID concept. That is fabric discovery/configuration, not RDMA networking. Transport neutral discovery/configuration is not something we want to attempt. From tom at opengridcomputing.com Tue Jan 3 11:35:22 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 03 Jan 2006 13:35:22 -0600 Subject: [openib-general] Re: [PATCH] [ib_addr] generalize address to RDMA device translation In-Reply-To: References: Message-ID: <1136316922.6173.65.camel@trinity.austin.ammasso.com> Sean: This is a good start towards generalizing the IP address translation service. 
I think there is quite a bit of "hair" in this area that is "Part 2" of
the iWARP integration -- after the integration of the current stuff into
the trunk. I was going to introduce some of this "hair" at the face to
face in Sonoma, but we should probably get it going now.

There are three areas that need attention:

- ARP resolution
- ROUTE changes
- Path MTU changes

ARP Resolve

The iWARP side needs to be able to resolve an IP address to an Ethernet
address. Today this is not done for iWARP and it works because the
AMSO1100 does this itself in the hardware. Other iWARP devices probably
don't. This means that the logic in ib_at needs to be extended on the
iWARP side to call neigh_event_send (instead of arp_send) to resolve an
IP to an Ethernet address.

The current method of calling arp_send directly and "sniffing" for arp
replies is probably not the best way to go long term. It would be better
to register for neighbor update events (new mechanism) and be notified
when the neighbor entry gets resolved. This is better for two reasons:
1) it doesn't duplicate code already in Linux, and 2) unlike IB,
Ethernet MAC addresses may change for the next hop while the connection
is still active. The provider needs to know this so its hardware ARP
tables can be updated.

ROUTE Changes

Two obvious cases: 1) the next hop changes due to normal network
least-cost routing, and 2) the user changes a route manually. Both events
would require the iWARP provider to be notified (via an event again) and
update its hardware.

PathMTU

The new route to the remote peer has a hop with a smaller MTU than we're
currently using. Ouch! All my packets are going to be dropped until I
reduce my path MTU. The provider can't know unless he is either
filtering all ICMP traffic himself ("evil") or is notified via an event
("nice").
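
[Illustration only, not an existing kernel interface.] The three cases above share one shape: a provider registers a callback once and is told when the neighbor entry, the route, or the path MTU changes. A minimal single-threaded userspace sketch of that shape follows. All sketch_* names, the event kinds, and the fixed-size listener table are assumptions made for the example, and a real implementation would need locking and unregistration.

```c
#include <stddef.h>

/* Hypothetical event kinds, one per area listed above. */
enum sketch_net_event {
	SKETCH_NEIGH_UPDATE,	/* next-hop MAC address resolved or changed */
	SKETCH_ROUTE_CHANGE,	/* the next hop itself changed */
	SKETCH_PMTU_CHANGE	/* path MTU changed */
};

typedef void (*sketch_event_cb)(enum sketch_net_event ev, void *ctx);

#define SKETCH_MAX_LISTENERS 8

struct sketch_listener {
	sketch_event_cb cb;
	void *ctx;
};

static struct sketch_listener sketch_listeners[SKETCH_MAX_LISTENERS];
static size_t sketch_nr_listeners;

/* Register a provider callback; returns 0 on success, -1 if cb is NULL or the table is full. */
int sketch_register_listener(sketch_event_cb cb, void *ctx)
{
	if (!cb || sketch_nr_listeners == SKETCH_MAX_LISTENERS)
		return -1;
	sketch_listeners[sketch_nr_listeners].cb = cb;
	sketch_listeners[sketch_nr_listeners].ctx = ctx;
	sketch_nr_listeners++;
	return 0;
}

/* Deliver one event to every registered listener; returns how many were called. */
size_t sketch_notify(enum sketch_net_event ev)
{
	size_t i;

	for (i = 0; i < sketch_nr_listeners; i++)
		sketch_listeners[i].cb(ev, sketch_listeners[i].ctx);
	return sketch_nr_listeners;
}

/* Example listener: counts path MTU events in the int that ctx points to. */
void sketch_count_pmtu(enum sketch_net_event ev, void *ctx)
{
	if (ev == SKETCH_PMTU_CHANGE)
		++*(int *)ctx;
}
```

A provider in this model would pass its own device state as ctx and update its hardware tables from inside the callback, which is the "nice" alternative to filtering ICMP or sniffing ARP replies itself.
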
So all this said, my little brain had imagined this logic going in and around the ib_at module in a wonderfully crafted bit of algorithmic art -- once I figured out how to do it all ;-) It sounds like you're beating the same bushes. How would you like to proceed? Tom On Tue, 2006-01-03 at 10:13 -0800, Sean Hefty wrote: > The following patch changes the ib_addr interface to make it more generic. > The interface now translates IP addresses directly into source and destination > device and broadcast addresses. The CMA is then responsible for interpreting > these addresses as GIDs/PKey or MAC addresses. The intent is that this will > simplify integrating support for other RDMA devices in the CMA. > > I'd like to get some feedback from the iWarp community on whether this approach > works for them, or if different/additional changes are needed. > > Signed-off-by: Sean Hefty > > > Index: include/rdma/rdma_cm.h > =================================================================== > --- include/rdma/rdma_cm.h (revision 4651) > +++ include/rdma/rdma_cm.h (working copy) > @@ -68,9 +68,7 @@ struct rdma_addr { > struct sockaddr dst_addr; > u8 dst_pad[sizeof(struct sockaddr_in6) - > sizeof(struct sockaddr)]; > - union { > - struct ib_addr ibaddr; > - } addr; > + struct rdma_dev_addr dev_addr; > }; > > struct rdma_route { > Index: include/rdma/ib_addr.h > =================================================================== > --- include/rdma/ib_addr.h (revision 4654) > +++ include/rdma/ib_addr.h (working copy) > @@ -32,26 +32,28 @@ > > #include > #include > +#include > #include > #include > > extern struct workqueue_struct *rdma_wq; > > -struct ib_addr { > - union ib_gid sgid; > - union ib_gid dgid; > - u16 pkey; > +struct rdma_dev_addr { > + unsigned char src_dev_addr[MAX_ADDR_LEN]; > + unsigned char dst_dev_addr[MAX_ADDR_LEN]; > + unsigned char broadcast[MAX_ADDR_LEN]; > + enum ib_node_type dev_type; > }; > > /** > - * ib_translate_addr - Translate a local IP address to an 
Infiniband GID and > - * PKey. > + * rdma_translate_ip - Translate a local IP address to an RDMA hardware > + * address. > */ > -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey); > +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr); > > /** > - * ib_resolve_addr - Resolve source and destination IP addresses to > - * Infiniband network addresses. > + * rdma_resolve_ip - Resolve source and destination IP addresses to > + * RDMA hardware addresses. > * @src_addr: An optional source address to use in the resolution. If a > * source address is not provided, a usable address will be returned via > * the callback. > @@ -64,13 +66,13 @@ int ib_translate_addr(struct sockaddr *a > * or been canceled. A status of 0 indicates success. > * @context: User-specified context associated with the call. > */ > -int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, > - struct ib_addr *addr, int timeout_ms, > +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, > + struct rdma_dev_addr *addr, int timeout_ms, > void (*callback)(int status, struct sockaddr *src_addr, > - struct ib_addr *addr, void *context), > + struct rdma_dev_addr *addr, void *context), > void *context); > > -void ib_addr_cancel(struct ib_addr *addr); > +void rdma_addr_cancel(struct rdma_dev_addr *addr); > > static inline int ip_addr_size(struct sockaddr *addr) > { > @@ -78,5 +80,38 @@ static inline int ip_addr_size(struct so > sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); > } > > +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) > +{ > + return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9]; > +} > + > +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey) > +{ > + dev_addr->broadcast[8] = pkey >> 8; > + dev_addr->broadcast[9] = (unsigned char) pkey; > +} > + > +static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) > +{ > + return 
(union ib_gid *) (dev_addr->src_dev_addr + 4); > +} > + > +static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, > + union ib_gid *gid) > +{ > + memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); > +} > + > +static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) > +{ > + return (union ib_gid *) (dev_addr->dst_dev_addr + 4); > +} > + > +static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, > + union ib_gid *gid) > +{ > + memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); > +} > + > #endif /* IB_ADDR_H */ > > Index: core/addr.c > =================================================================== > --- core/addr.c (revision 4654) > +++ core/addr.c (working copy) > @@ -42,10 +42,10 @@ struct addr_req { > struct list_head list; > struct sockaddr src_addr; > struct sockaddr dst_addr; > - struct ib_addr *addr; > + struct rdma_dev_addr *addr; > void *context; > void (*callback)(int status, struct sockaddr *src_addr, > - struct ib_addr *addr, void *context); > + struct rdma_dev_addr *addr, void *context); > unsigned long timeout; > int status; > }; > @@ -58,26 +58,39 @@ static DECLARE_WORK(work, process_req, N > struct workqueue_struct *rdma_wq; > EXPORT_SYMBOL(rdma_wq); > > -static u16 addr_get_pkey(struct net_device *dev) > +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > + unsigned char *dst_dev_addr) > { > - return ((u16)dev->broadcast[8] << 8) | (u16)dev->broadcast[9]; > + switch (dev->type) { > + case ARPHRD_INFINIBAND: > + dev_addr->dev_type = IB_NODE_CA; > + break; > + default: > + return -EADDRNOTAVAIL; > + } > + > + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); > + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); > + if (dst_dev_addr) > + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); > + return 0; > } > > -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey) > +int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr) > { > struct net_device *dev; > u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; > + int ret; > > dev = ip_dev_find(ip); > if (!dev) > return -EADDRNOTAVAIL; > > - *gid = *(union ib_gid *) (dev->dev_addr + 4); > - *pkey = addr_get_pkey(dev); > + ret = copy_addr(dev_addr, dev, NULL); > dev_put(dev); > - return 0; > + return ret; > } > -EXPORT_SYMBOL(ib_translate_addr); > +EXPORT_SYMBOL(rdma_translate_ip); > > static void set_timeout(unsigned long time) > { > @@ -127,7 +140,7 @@ static void addr_send_arp(struct sockadd > > static int addr_resolve_remote(struct sockaddr_in *src_in, > struct sockaddr_in *dst_in, > - struct ib_addr *addr) > + struct rdma_dev_addr *addr) > { > u32 src_ip = src_in->sin_addr.s_addr; > u32 dst_ip = dst_in->sin_addr.s_addr; > @@ -159,10 +172,7 @@ static int addr_resolve_remote(struct so > src_in->sin_addr.s_addr = rt->rt_src; > } > > - addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4); > - addr->dgid = *(union ib_gid *) (neigh->ha + 4); > - addr->pkey = addr_get_pkey(neigh->dev); > - > + ret = copy_addr(addr, neigh->dev, neigh->ha); > err2: > neigh_release(neigh); > err1: > @@ -212,12 +222,12 @@ static void process_req(void *data) > > static int addr_resolve_local(struct sockaddr_in *src_in, > struct sockaddr_in *dst_in, > - struct ib_addr *addr) > + struct rdma_dev_addr *addr) > { > struct net_device *dev; > u32 src_ip = src_in->sin_addr.s_addr; > u32 dst_ip = dst_in->sin_addr.s_addr; > - int ret = 0; > + int ret; > > dev = ip_dev_find(dst_ip); > if (!dev) > @@ -226,25 +236,21 @@ static int addr_resolve_local(struct soc > if (!src_ip) { > src_in->sin_family = dst_in->sin_family; > src_in->sin_addr.s_addr = dst_ip; > - addr->sgid = *(union ib_gid *) (dev->dev_addr + 4); > - addr->pkey = addr_get_pkey(dev); > + ret = copy_addr(addr, dev, dev->dev_addr); > } else { > - ret = ib_translate_addr((struct sockaddr *)src_in, > - &addr->sgid, &addr->pkey); > - if (ret) > - goto out; > + ret = 
rdma_translate_ip((struct sockaddr *)src_in, addr); > + if (!ret) > + memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); > } > > - addr->dgid = *(union ib_gid *) (dev->dev_addr + 4); > -out: > dev_put(dev); > return ret; > } > > -int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, > - struct ib_addr *addr, int timeout_ms, > +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, > + struct rdma_dev_addr *addr, int timeout_ms, > void (*callback)(int status, struct sockaddr *src_addr, > - struct ib_addr *addr, void *context), > + struct rdma_dev_addr *addr, void *context), > void *context) > { > struct sockaddr_in *src_in, *dst_in; > @@ -287,9 +293,9 @@ int ib_resolve_addr(struct sockaddr *src > } > return ret; > } > -EXPORT_SYMBOL(ib_resolve_addr); > +EXPORT_SYMBOL(rdma_resolve_ip); > > -void ib_addr_cancel(struct ib_addr *addr) > +void rdma_addr_cancel(struct rdma_dev_addr *addr) > { > struct addr_req *req, *temp_req; > > @@ -306,7 +312,7 @@ void ib_addr_cancel(struct ib_addr *addr > } > up(&mutex); > } > -EXPORT_SYMBOL(ib_addr_cancel); > +EXPORT_SYMBOL(rdma_addr_cancel); > > static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, > struct packet_type *pkt, struct net_device *orig_dev) > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 4655) > +++ core/cma.c (working copy) > @@ -213,12 +213,14 @@ static void cma_detach_from_dev(struct r > id_priv->cma_dev = NULL; > } > > -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv, > - union ib_gid *gid) > +static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) > { > struct cma_device *cma_dev; > + union ib_gid *gid; > int ret = -ENODEV; > > + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + > down(&mutex); > list_for_each_entry(cma_dev, &dev_list, list) { > ret = ib_find_cached_gid(cma_dev->device, gid, > @@ -232,6 +234,16 @@ static int cma_acquire_ib_dev(struct rdm 
> return ret; > } > > +static int cma_acquire_dev(struct rdma_id_private *id_priv) > +{ > + switch (id_priv->id.route.addr.dev_addr.dev_type) { > + case IB_NODE_CA: > + return cma_acquire_ib_dev(id_priv); > + default: > + return -ENODEV; > + } > +} > + > static void cma_deref_id(struct rdma_id_private *id_priv) > { > if (atomic_dec_and_test(&id_priv->refcount)) > @@ -272,11 +284,13 @@ EXPORT_SYMBOL(rdma_create_id); > static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) > { > struct ib_qp_attr qp_attr; > - struct ib_addr *ibaddr = &id_priv->id.route.addr.addr.ibaddr; > + struct rdma_dev_addr *dev_addr; > int ret; > > + dev_addr = &id_priv->id.route.addr.dev_addr; > ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, > - ibaddr->pkey, &qp_attr.pkey_index); > + ib_addr_get_pkey(dev_addr), > + &qp_attr.pkey_index); > if (ret) > return ret; > > @@ -520,7 +534,7 @@ static void cma_cancel_addr(struct rdma_ > { > switch (id_priv->id.device->node_type) { > case IB_NODE_CA: > - ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr); > + rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); > break; > default: > break; > @@ -760,9 +774,10 @@ static struct rdma_id_private* cma_new_i > if (rt->num_paths == 2) > rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; > > - rt->addr.addr.ibaddr.sgid = rt->path_rec[0].sgid; > - rt->addr.addr.ibaddr.dgid = rt->path_rec[0].dgid; > - rt->addr.addr.ibaddr.pkey = be16_to_cpu(rt->path_rec[0].pkey); > + ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); > + ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); > + ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); > + rt->addr.dev_addr.dev_type = IB_NODE_CA; > > id_priv = container_of(id, struct rdma_id_private, id); > id_priv->state = CMA_CONNECT; > @@ -791,7 +806,7 @@ static int cma_req_handler(struct ib_cm_ > } > > atomic_inc(&conn_id->dev_remove); > - ret = cma_acquire_ib_dev(conn_id, 
&conn_id->id.route.path_rec[0].sgid); > + ret = cma_acquire_ib_dev(conn_id); > if (ret) { > ret = -ENODEV; > cma_release_remove(conn_id); > @@ -1028,13 +1043,13 @@ out: > > static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) > { > - struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; > + struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > struct ib_sa_path_rec path_rec; > > memset(&path_rec, 0, sizeof path_rec); > - path_rec.sgid = addr->sgid; > - path_rec.dgid = addr->dgid; > - path_rec.pkey = cpu_to_be16(addr->pkey); > + path_rec.sgid = *ib_addr_get_sgid(addr); > + path_rec.dgid = *ib_addr_get_dgid(addr); > + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > @@ -1079,6 +1094,8 @@ EXPORT_SYMBOL(rdma_resolve_route); > static int cma_bind_loopback(struct rdma_id_private *id_priv) > { > struct cma_device *cma_dev; > + union ib_gid *gid; > + u16 pkey; > int ret; > > down(&mutex); > @@ -1088,16 +1105,16 @@ static int cma_bind_loopback(struct rdma > } > > cma_dev = list_entry(dev_list.next, struct cma_device, list); > - ret = ib_get_cached_gid(cma_dev->device, 1, 0, > - &id_priv->id.route.addr.addr.ibaddr.sgid); > + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + ret = ib_get_cached_gid(cma_dev->device, 1, 0, gid); > if (ret) > goto out; > > - ret = ib_get_cached_pkey(cma_dev->device, 1, 0, > - &id_priv->id.route.addr.addr.ibaddr.pkey); > + ret = ib_get_cached_pkey(cma_dev->device, 1, 0, &pkey); > if (ret) > goto out; > > + ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); > id_priv->id.port_num = 1; > cma_attach_to_dev(id_priv, cma_dev); > out: > @@ -1106,7 +1123,7 @@ out: > } > > static void addr_handler(int status, struct sockaddr *src_addr, > - struct ib_addr *ibaddr, void *context) > + struct rdma_dev_addr *dev_addr, void *context) > { > struct rdma_id_private *id_priv = context; > enum 
rdma_cm_event_type event; > @@ -1116,7 +1133,7 @@ static void addr_handler(int status, str > if (!id_priv->cma_dev) { > old_state = CMA_IDLE; > if (!status) > - status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + status = cma_acquire_dev(id_priv); > } else > old_state = CMA_ADDR_BOUND; > > @@ -1169,6 +1186,7 @@ static int cma_resolve_loopback(struct r > struct sockaddr *src_addr, enum cma_state state) > { > struct work_struct *work; > + struct rdma_dev_addr *dev_addr; > int ret; > > work = kmalloc(sizeof *work, GFP_KERNEL); > @@ -1179,8 +1197,8 @@ static int cma_resolve_loopback(struct r > ret = cma_bind_loopback(id_priv); > if (ret) > goto err; > - id_priv->id.route.addr.addr.ibaddr.dgid = > - id_priv->id.route.addr.addr.ibaddr.sgid; > + dev_addr = &id_priv->id.route.addr.dev_addr; > + ib_addr_set_dgid(dev_addr, ib_addr_get_sgid(dev_addr)); > if (!src_addr || cma_any_addr(src_addr)) > src_addr = &id_priv->id.route.addr.dst_addr; > memcpy(&id_priv->id.route.addr.src_addr, src_addr, > @@ -1217,8 +1235,8 @@ int rdma_resolve_addr(struct rdma_cm_id > if (cma_loopback_addr(dst_addr)) > ret = cma_resolve_loopback(id_priv, src_addr, expected_state); > else > - ret = ib_resolve_addr(src_addr, dst_addr, > - &id->route.addr.addr.ibaddr, > + ret = rdma_resolve_ip(src_addr, dst_addr, > + &id->route.addr.dev_addr, > timeout_ms, addr_handler, id_priv); > if (ret) > goto err; > @@ -1234,7 +1252,7 @@ EXPORT_SYMBOL(rdma_resolve_addr); > int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) > { > struct rdma_id_private *id_priv; > - struct ib_addr *ibaddr; > + struct rdma_dev_addr *dev_addr; > int ret; > > if (addr->sa_family != AF_INET) > @@ -1249,10 +1267,10 @@ int rdma_bind_addr(struct rdma_cm_id *id > } else if (cma_loopback_addr(addr)) { > ret = cma_bind_loopback(id_priv); > } else { > - ibaddr = &id->route.addr.addr.ibaddr; > - ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); > + dev_addr = &id->route.addr.dev_addr; > + ret = rdma_translate_ip(addr, 
dev_addr); > if (!ret) > - ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + ret = cma_acquire_dev(id_priv); > } > > if (ret) > Index: core/ucma.c > =================================================================== > --- core/ucma.c (revision 4651) > +++ core/ucma.c (working copy) > @@ -419,17 +419,17 @@ static ssize_t ucma_resolve_route(struct > static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, > struct rdma_route *route) > { > - struct ib_addr *ibaddr; > + struct rdma_dev_addr *dev_addr; > > resp->num_paths = route->num_paths; > switch (route->num_paths) { > case 0: > - ibaddr = &route->addr.addr.ibaddr; > - memcpy(&resp->ib_route[0].dgid, ibaddr->dgid.raw, > - sizeof ibaddr->dgid); > - memcpy(&resp->ib_route[0].sgid, ibaddr->sgid.raw, > - sizeof ibaddr->sgid); > - resp->ib_route[0].pkey = cpu_to_be16(ibaddr->pkey); > + dev_addr = &route->addr.dev_addr; > + memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), > + sizeof(union ib_gid)); > + memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr), > + sizeof(union ib_gid)); > + resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); > break; > case 2: > ib_copy_path_rec_to_user(&resp->ib_route[1], > > From caitlinb at broadcom.com Tue Jan 3 11:39:33 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 3 Jan 2006 11:39:33 -0800 Subject: [openib-general] RE: [PATCH] [ib_addr] generalize address to RDMA device translation Message-ID: <54AD0F12E08D1541B826BE97C98F99F114234E@NT-SJCA-0751.brcm.ad.broadcom.com> Tom Tucker wrote: > Sean: > > This is a good start towards generalizing the IP address translation > service. > > I think there is quite a bit of "hair" in this area that is > "Part 2" of the iWARP integration -- after the integration of > the current stuff into the trunk. I was going to introduce > some of this "hair" at the face to face in Sonoma, but we > should probably get it going now. 
> > There are three areas that need attention: > - ARP resolution > - ROUTE changes > - Path MTU changes > > ARP Resolve > > The iWARP side needs to be able to resolve an IP address to > an Ethernet address. Today this is not done for iWARP and it > works because the AMSO1100 does this itself in the hardware. > Other iWARP devices probably don't. This means that the logic > in ib_at needs to be extended on the iWARP side to call > neigh_event_send (instead of arp_send) to resolve an IP to an > Ethernet address. The current method of calling arp_send > directly and "sniffing" for arp replies is probably not the > best way to go long term. It would be better to register for > neighbor update events (new mechanism) and be notified when > the neighbor entry gets resolved. > This is better for two reasons: 1) it doesn't duplicate code > already in Linux, and 2) unlike IB, Ethernet MAC addresses > may change for the next hop while the connection is still > active. The provider needs to know this so it's hardware ARP tables > can be updated. > Agreed, an iWARP NIC *might* need to receive these notices, so subscribing is an excellent approach. Typically the iWARP RNIC is also a plain Ethernet NIC, where ARP,etc. are being handled by the network stack. It is a major problem if the RNIC and NIC are out of sync on an L3 issue (like what the MAC address of a local peer is, or what the next hop to reach a given IP address is). Sniffing is a workaround. Subscription is the correct solution here for long term stability. Subscription is also the only solution that will correctly respond when the system administrator has manually loaded an ARP entry. This also applies to Routing and ICMP. 
From hozer at hozed.org Tue Jan 3 11:44:42 2006
From: hozer at hozed.org (Troy Benjegerdes)
Date: Tue, 3 Jan 2006 13:44:42 -0600
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
In-Reply-To: <1136304101.4331.46313.camel@hal.voltaire.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41B@mtlexch01.mtl.com>
	<1136304101.4331.46313.camel@hal.voltaire.com>
Message-ID: <20060103194442.GO14100@narn.hozed.org>

On Tue, Jan 03, 2006 at 11:01:43AM -0500, Hal Rosenstock wrote:
> On Tue, 2006-01-03 at 10:43, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > Sounds good.
> > I think you should be able to use the .svn/entries to get the last
> > update revision and then use svn diff (or diff) to see if local mods are
> > done on top of it...
>
> I'm using .svn/entries at the osm level.
>
> > So we do not get caught by surprise when something broke due to
> > un-committed mod in the local directory
>
> Thanks

The 'svnversion' command gives you the version, and then checks for
local mods. No need to reinvent how to do that.

From mshefty at ichips.intel.com Tue Jan 3 12:05:28 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jan 2006 12:05:28 -0800
Subject: [openib-general] Re: [PATCH] [ib_addr] generalize address to RDMA device translation
In-Reply-To: <1136316922.6173.65.camel@trinity.austin.ammasso.com>
References: <1136316922.6173.65.camel@trinity.austin.ammasso.com>
Message-ID: <43BAD908.5030101@ichips.intel.com>

Tom Tucker wrote:
> ARP Resolve
>
> The iWARP side needs to be able to resolve an IP address to an Ethernet
> address. Today this is not done for iWARP and it works because the
> AMSO1100 does this itself in the hardware. Other iWARP devices probably
> don't. This means that the logic in ib_at needs to be extended on the
> iWARP side to call neigh_event_send (instead of arp_send) to resolve an
> IP to an Ethernet address.
The current method of calling arp_send > directly and "sniffing" for arp replies is probably not the best way to > go long term. It would be better to register for neighbor update events > (new mechanism) and be notified when the neighbor entry gets resolved. > This is better for two reasons: 1) it doesn't duplicate code already in > Linux, and 2) unlike IB, Ethernet MAC addresses may change for the next > hop while the connection is still active. The provider needs to know > this so it's hardware ARP tables can be updated. To be clear, the CMA uses ib_addr, and not ib_at, which is a different module. I'm not sure I understand what's wrong with sniffing arp replies. There's very little code (about a dozen lines) in ib_addr to handle arps. It also seems that it's just as unlikely that the mapping from an IP address to a hardware address will change for Ethernet as it does for IB. Are you trying to deal with a destination IP address of a connection that is not on the local subnet? If this is the case, then this seems like a separate issue than address resolution. > ROUTE Changes > > Two obvious cases, 1) the next hop changes due to normal network least- > cost routing, and 2) the user changes a route manually. Both events > would require the iWARP provider to be notified (via an event again) and > update its hardware Maybe this can be included as part of some sort of automatic "failover"? Otherwise, I'm not sure how this functionality maps to IB. It's not a big deal if it doesn't, but it'd be nice to keep similarities where possible. > PathMTU > > The new route to the remote peer has a hop with a smaller MTU than we're > currently using. Ouch! All my packets are going to be dropped until I > reduce my path MTU. The provider can't know unless he is either > filtering all ICMP traffic himself ("evil") or is notified via an event > ("nice"). 
> > So all this said, my little brain had imagined this logic going in and > around the ib_at module in a wonderfully crafted bit of algorithmic art > -- once I figured out how to do it all ;-) > > It sounds like you're beating the same bushes. How would you like to > proceed? I'd like to define a set of changes to ib_addr and the rdma_cm that makes it easier to support multiple RDMA devices, then evolve the codebase from there. My hope is to keep the network addressing ugliness in ib_addr. The changes to the ib_addr interface is based on trying to determine what might help support iWarp after looking at your patch. If the changes appear to be a step in the right direction, then I will commit them. The essence of the change is that ib_addr leaves the interpretation of the addresses up to the caller, which may still be a good thing even if it doesn't directly make supporting iWarp any easier. - Sean From tom at opengridcomputing.com Tue Jan 3 12:24:57 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 03 Jan 2006 14:24:57 -0600 Subject: [openib-general] Re: [PATCH] [ib_addr] generalize address to RDMA device translation In-Reply-To: <43BAD908.5030101@ichips.intel.com> References: <1136316922.6173.65.camel@trinity.austin.ammasso.com> <43BAD908.5030101@ichips.intel.com> Message-ID: <1136319897.6173.89.camel@trinity.austin.ammasso.com> On Tue, 2006-01-03 at 12:05 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > ARP Resolve > > > > The iWARP side needs to be able to resolve an IP address to an Ethernet > > address. Today this is not done for iWARP and it works because the > > AMSO1100 does this itself in the hardware. Other iWARP devices probably > > don't. This means that the logic in ib_at needs to be extended on the > > iWARP side to call neigh_event_send (instead of arp_send) to resolve an > > IP to an Ethernet address. The current method of calling arp_send > > directly and "sniffing" for arp replies is probably not the best way to > > go long term. 
It would be better to register for neighbor update events > > (new mechanism) and be notified when the neighbor entry gets resolved. > > This is better for two reasons: 1) it doesn't duplicate code already in > > Linux, and 2) unlike IB, Ethernet MAC addresses may change for the next > > hop while the connection is still active. The provider needs to know > > this so it's hardware ARP tables can be updated. > > To be clear, the CMA uses ib_addr, and not ib_at, which is a different module. Absolutely. I was dumping a bunch of loosely related concerns... > > I'm not sure I understand what's wrong with sniffing arp replies. There's very > little code (about a dozen lines) in ib_addr to handle arps. It also seems that > it's just as unlikely that the mapping from an IP address to a hardware address > will change for Ethernet as it does for IB. Agreed -- It is unlikely. The more common case is a re-arp when the arp entry times out (typically 15 minutes). > Are you trying to deal with a destination IP address of a connection that is not > on the local subnet? If this is the case, then this seems like a separate issue > than address resolution. Yes, and no. The IP address being resolved is the peer if it is on the same subnet. If it is not, then the IP address being resolved is for the next hop. > > > ROUTE Changes > > > > Two obvious cases, 1) the next hop changes due to normal network least- > > cost routing, and 2) the user changes a route manually. Both events > > would require the iWARP provider to be notified (via an event again) and > > update its hardware > > Maybe this can be included as part of some sort of automatic "failover"? > Otherwise, I'm not sure how this functionality maps to IB. It's not a big deal > if it doesn't, but it'd be nice to keep similarities where possible. > > PathMTU > > > > The new route to the remote peer has a hop with a smaller MTU than we're > > currently using. Ouch! 
All my packets are going to be dropped until I > > reduce my path MTU. The provider can't know unless he is either > > filtering all ICMP traffic himself ("evil") or is notified via an event > > ("nice"). > > > > So all this said, my little brain had imagined this logic going in and > > around the ib_at module in a wonderfully crafted bit of algorithmic art > > -- once I figured out how to do it all ;-) > > > > It sounds like you're beating the same bushes. How would you like to > > proceed? > > I'd like to define a set of changes to ib_addr and the rdma_cm that makes it > easier to support multiple RDMA devices, then evolve the codebase from there. > My hope is to keep the network addressing ugliness in ib_addr. > > The changes to the ib_addr interface is based on trying to determine what might > help support iWarp after looking at your patch. If the changes appear to be a > step in the right direction, then I will commit them. The essence of the change > is that ib_addr leaves the interpretation of the addresses up to the caller, > which may still be a good thing even if it doesn't directly make supporting > iWarp any easier. My 2 cents is that it's a good thing. Sorry to throw 10 lbs of @#^$ in with this bag... I was core dumping. > > - Sean From mshefty at ichips.intel.com Tue Jan 3 12:25:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jan 2006 12:25:38 -0800 Subject: [openib-general] RE: [PATCH] [ib_addr] generalize address to RDMA device translation In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F114234E@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F114234E@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <43BADDC2.90109@ichips.intel.com> Caitlin Bestler wrote: > Sniffing is a workaround. Subscription is the correct > solution here for long term stability. > > Subscription is also the only solution that will correctly > respond when the system administrator has manually loaded > an ARP entry. 
>
> This also applies to Routing and ICMP.

To clarify, ib_addr does not actually "sniff" ARPs. It simply uses them to
schedule its workqueue thread to check for updates. Manually loaded ARP entries
would be handled the same as any previously loaded entries. If there is a
subscription-based mechanism that can be used as well or instead, it should be
straightforward to update ib_addr. I am just not familiar enough with the Linux
network stack to know if such a mechanism exists.

- Sean

From mst at mellanox.co.il Tue Jan 3 12:32:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jan 2006 22:32:05 +0200
Subject: [openib-general] Re: [PATCH] fix race in mad.c
In-Reply-To:
References:
Message-ID: <20060103203205.GA15104@mellanox.co.il>

Quoting r. Sean Hefty :
> There should be some way to fix this that doesn't involve walking a list on
> every completion. Can't the cleanup be changed? Either move destroying the QP
> after the workqueue flush or transition it to the error state before flushing.
>
> - Sean
>

What about resurrecting my idea to have ib_cq_sync()?
Then we could just set a flag to suppress queueing more work.

--
MST

From swise at opengridcomputing.com Tue Jan 3 12:31:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 03 Jan 2006 14:31:36 -0600
Subject: [openib-general] RE: [PATCH] [ib_addr] generalize address to RDMA device translation
In-Reply-To: <43BADDC2.90109@ichips.intel.com>
References: <54AD0F12E08D1541B826BE97C98F99F114234E@NT-SJCA-0751.brcm.ad.broadcom.com>
	<43BADDC2.90109@ichips.intel.com>
Message-ID: <1136320296.10697.26.camel@stevo-desktop>

On Tue, 2006-01-03 at 12:25 -0800, Sean Hefty wrote:
> Caitlin Bestler wrote:
> > Sniffing is a workaround. Subscription is the correct
> > solution here for long term stability.
> >
> > Subscription is also the only solution that will correctly
> > respond when the system administrator has manually loaded
> > an ARP entry.
> >
> > This also applies to Routing and ICMP.
> > To clarify, ib_addr does not actually "sniff" ARPs. It simply uses them to > schedule its workqueue thread to check for updates. Manually loaded ARP entries > would the same as any previously loaded entries. If there is a subscription > based mechanism that can be used as well or instead, it should be > straightforward to update ib_addr. I am just not familiar enough with the Linux > network stack to know if such a mechanism exists. There is a subscription service that is used for notification of netdev events. It uses the base notifier_block services (see include/linux/notifier.h and net/core/dev.c). Our thoughts were to enhance this to allow notifications of neighbour, route, and pmtu events as well. Steve. From mshefty at ichips.intel.com Tue Jan 3 12:47:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jan 2006 12:47:02 -0800 Subject: [openib-general] Re: [PATCH] fix race in mad.c In-Reply-To: <20060103203205.GA15104@mellanox.co.il> References: <20060103203205.GA15104@mellanox.co.il> Message-ID: <43BAE2C6.2050807@ichips.intel.com> Michael S. Tsirkin wrote: >>There should be some way to fix this that doesn't involve walking a list on >>every completion. Can't the cleanup be changed? Either move destroying the QP >>after the workqueue flush or transition it to the error state before flushing. > > What about resurrecting my idea to have ib_cq_sync()? > Then we could just set a flag to suppress queueing more work. I don't remember the details of ib_cq_sync() off the top of my head. I think that we need to add a state to struct ib_mad_port_private that can be checked in ib_mad_thread_completion_handler(). I don't think that a new call is needed though. 
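
As a rough illustration of the state check suggested above, the following userspace model shows a completion handler that stops queueing work once the port is marked as closing. It is a sketch only: struct mad_port, mad_completion_handler(), mad_port_close() and demo_close_suppresses_work() are made-up names standing in for ib_mad_port_private and ib_mad_thread_completion_handler(), and a C11 atomic flag is assumed where the real driver would use whatever synchronization its cleanup path already relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-port private data. */
struct mad_port {
	atomic_bool closing;   /* the proposed extra state */
	unsigned int queued;   /* counts what would be queue_work() calls */
};

/* Models the CQ callback: queue work unless the port is shutting down. */
bool mad_completion_handler(struct mad_port *port)
{
	if (atomic_load(&port->closing))
		return false;
	port->queued++;
	return true;
}

/* Models cleanup: stop new work first, then flush and destroy. */
void mad_port_close(struct mad_port *port)
{
	atomic_store(&port->closing, true);
	/* a real driver would flush its workqueue and destroy the QP/CQ here */
}

/* One completion before close is queued, one after close is ignored. */
unsigned int demo_close_suppresses_work(void)
{
	struct mad_port port;

	atomic_init(&port.closing, false);
	port.queued = 0;

	assert(mad_completion_handler(&port));
	mad_port_close(&port);
	assert(!mad_completion_handler(&port));
	return port.queued;
}
```

The flag alone does not close the window for a handler that has already passed the check when cleanup starts, so the ordering of the flush relative to setting the state still matters in a real driver.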
- Sean From eitan at mellanox.co.il Tue Jan 3 12:57:36 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 3 Jan 2006 22:57:36 +0200 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41E@mtlexch01.mtl.com> Hal > Yes, but I'm not sure what you are proposing here. [EZ] Troy gave us the answer. What I now propose is to use: svnversion . (at the osm dir) From bos at pathscale.com Tue Jan 3 12:54:50 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 03 Jan 2006 12:54:50 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <20060103172732.GA9170@kroah.com> References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> <20060103172732.GA9170@kroah.com> Message-ID: <1136321691.10862.61.camel@localhost.localdomain> On Tue, 2006-01-03 at 09:27 -0800, Greg KH wrote: > Idealy, nothing should be new ioctls. But in the end, it all depends on > exactly what you are trying to do with each different one. Fair enough. > I really don't know what the subnet management stuff involves, sorry. > But doesn't the open-ib layer handle that all for you already? It does when our OpenIB driver is being used. But our lower level driver is independent of OpenIB (and is often used without the infiniband stuff even configured into the kernel), and needs to provide some way for a userspace subnet management agent to send and receive packets. 
References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> <20060103172732.GA9170@kroah.com> <1136321691.10862.61.camel@localhost.localdomain> Message-ID: <1136321851.2869.18.camel@laptopd505.fenrus.org> On Tue, 2006-01-03 at 12:54 -0800, Bryan O'Sullivan wrote: > On Tue, 2006-01-03 at 09:27 -0800, Greg KH wrote: > > > Idealy, nothing should be new ioctls. But in the end, it all depends on > > exactly what you are trying to do with each different one. > > Fair enough. > > > I really don't know what the subnet management stuff involves, sorry. > > But doesn't the open-ib layer handle that all for you already? > > It does when our OpenIB driver is being used. But our lower level > driver is independent of OpenIB (and is often used without the > infiniband stuff even configured into the kernel), and needs to provide > some way for a userspace subnet management agent to send and receive > packets. that sounds like your driver should mimic the openIB userspace ABI for this *exactly* so that you can use the same management tools for either scenario... From mst at mellanox.co.il Tue Jan 3 13:09:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jan 2006 23:09:39 +0200 Subject: [openib-general] Re: [PATCH] fix race in mad.c In-Reply-To: <43BAE2C6.2050807@ichips.intel.com> References: <43BAE2C6.2050807@ichips.intel.com> Message-ID: <20060103210939.GA15430@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: [PATCH] fix race in mad.c > > Michael S. Tsirkin wrote: > >>There should be some way to fix this that doesn't involve walking a list on > >>every completion. Can't the cleanup be changed? Either move destroying the QP > >>after the workqueue flush or transition it to the error state before flushing. > > > > What about resurrecting my idea to have ib_cq_sync()? 
> > Then we could just set a flag to suppress queueing more work.
>
> I don't remember the details of ib_cq_sync() off the top of my head.
>
> I think that we need to add a state to struct ib_mad_port_private that can be
> checked in ib_mad_thread_completion_handler(). I don't think that a new call is
> needed though.
>
> - Sean
>

We also need a way to make sure that ib_mad_thread_completion_handler is not currently running. That's what the proposed ib_cq_sync could do.

-- MST

From mshefty at ichips.intel.com Tue Jan 3 13:15:11 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jan 2006 13:15:11 -0800
Subject: [openib-general] Re: [PATCH] fix race in mad.c
In-Reply-To: <20060103210939.GA15430@mellanox.co.il>
References: <43BAE2C6.2050807@ichips.intel.com> <20060103210939.GA15430@mellanox.co.il>
Message-ID: <43BAE95F.9060905@ichips.intel.com>

Michael S. Tsirkin wrote:
> We also need a way to make sure that ib_mad_thread_completion_handler is
> not currently running.

Why wouldn't this work?

lock - check state - queue work - unlock

It shouldn't matter if ib_mad_thread_completion_handler() is running or not.

- Sean

From sobebike at gmail.com Tue Jan 3 13:30:45 2006
From: sobebike at gmail.com (Jimmy Hill)
Date: Tue, 3 Jan 2006 15:30:45 -0600
Subject: [openib-general] uDAPL disconnect events
Message-ID:

I'm running with the latest OpenIB Gen2 uDAPL code (CMA version) and have encountered a problem with disconnect events. The basic problem is that both sides have to call dat_ep_disconnect in order to break down a connection cleanly. It should be possible for just one side (i.e., client) to call disconnect and the other side wait for and see the disconnect event. That does not appear to be working. It does however work that way in the old reference implementation (as it should). I have code that depends on that functionality and as a result, cannot move it to OpenIB Gen2 uDAPL yet. Is this a known problem?
Changing the flag to DAT_CLOSE_GRACEFUL_FLAG does not change the behavior. The attached copy of dtest.c is still using the default flag value. I have attached a modified version of the "dtest" test program which demonstrates the problem. The "client" will disconnect and exit cleanly. The "server" will hang waiting for the disconnect event. thanks, jimmy Jimmy Hill jimmy.hill at us.ibm.com sobebike at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: dtest.c Type: application/octet-stream Size: 63439 bytes Desc: not available URL: From caitlinb at broadcom.com Tue Jan 3 14:01:30 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 3 Jan 2006 14:01:30 -0800 Subject: [openib-general] Re: [PATCH] [ib_addr] generalize address to RDMA device translation Message-ID: <54AD0F12E08D1541B826BE97C98F99F114237B@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Tue, 2006-01-03 at 12:05 -0800, Sean Hefty wrote: >> Tom Tucker wrote: >>> ARP Resolve >>> >>> The iWARP side needs to be able to resolve an IP address to an >>> Ethernet address. Today this is not done for iWARP and it works >>> because the AMSO1100 does this itself in the hardware. Other iWARP >>> devices probably don't. This means that the logic in ib_at needs to >>> be extended on the iWARP side to call neigh_event_send (instead of >>> arp_send) to resolve an IP to an Ethernet address. The current >>> method of calling arp_send directly and "sniffing" for arp replies >>> is probably not the best way to go long term. It would be better to >>> register for neighbor update events (new mechanism) and > be notified when the neighbor entry gets resolved. >>> This is better for two reasons: 1) it doesn't duplicate code already >>> in Linux, and 2) unlike IB, Ethernet MAC addresses may change for >>> the next hop while the connection is still active. 
The provider >>> needs to know this so it's hardware ARP tables can be updated. >> >> To be clear, the CMA uses ib_addr, and not ib_at, which is a >> different module. > > Absolutely. I was dumping a bunch of loosely related concerns... > >> >> I'm not sure I understand what's wrong with sniffing arp replies. >> There's very little code (about a dozen lines) in ib_addr to handle >> arps. It also seems that it's just as unlikely that the mapping from >> an IP address to a hardware address will change for Ethernet as it >> does for IB. > > Agreed -- It is unlikely. The more common case is a re-arp > when the arp entry times out (typically 15 minutes). > > It is unlikely, but it is also crucial. IP failover within a subnet is dependent on arp updates. Manual entering new ARP translations is even rarer (I think I've done it about six times in nearly 20 years of working with IP networks), but it is legal. And it is something that IP network administrators can do now, and they do not expect RDMA to break it. >> Are you trying to deal with a destination IP address of a connection >> that is not on the local subnet? If this is the case, then this >> seems like a separate issue than address resolution. > > Yes, and no. The IP address being resolved is the peer if it > is on the same subnet. If it is not, then the IP address > being resolved is for the next hop. > >> >>> ROUTE Changes >>> >>> Two obvious cases, 1) the next hop changes due to normal network >>> least- cost routing, and 2) the user changes a route manually. Both >>> events would require the iWARP provider to be notified (via an event >>> again) and update its hardware >> >> Maybe this can be included as part of some sort of automatic >> "failover"? Otherwise, I'm not sure how this functionality maps to >> IB. It's not a big deal if it doesn't, but it'd be nice to keep >> similarities where possible. 
The key point is that the IP layer implemented by the RNIC has to be working from the same data as the IP layer implemented by Linux. Since Linux does not implement the IB transport layer the same issue is not likely to come up. In an IP network, changing routes is supposed to be transparent to established connections, especially if there is no PMTU decrease. So trying to map it to IB APM won't be a fit. > >>> PathMTU >>> >>> The new route to the remote peer has a hop with a smaller MTU than >>> we're currently using. Ouch! All my packets are going to be dropped >>> until I reduce my path MTU. The provider can't know unless he is >>> either filtering all ICMP traffic himself ("evil") or is notified >>> via an event ("nice"). >>> >>> So all this said, my little brain had imagined this logic going in >>> and around the ib_at module in a wonderfully crafted bit of >>> algorithmic art -- once I figured out how to do it all ;-) >>> >>> It sounds like you're beating the same bushes. How would you like to >>> proceed? >> >> I'd like to define a set of changes to ib_addr and the rdma_cm that >> makes it easier to support multiple RDMA devices, then > evolve the codebase from there. >> My hope is to keep the network addressing ugliness in ib_addr. >> >> The changes to the ib_addr interface is based on trying to determine >> what might help support iWarp after looking at your patch. If the >> changes appear to be a step in the right direction, then I will >> commit them. The essence of the change is that ib_addr leaves the >> interpretation of the addresses up to the caller, which may still be >> a good thing even if it doesn't directly make supporting iWarp any >> easier. > > My 2 cents is that it's a good thing. Sorry to throw 10 lbs > of @#^$ in with this bag... I was core dumping. > I agree with Tom's assessment here. 
Leaving the interpretation of the rdma_addr up to the rdma transport device is a necessary and solid first step, but it should be understood that there are some related issues that will also have to be addressed. A major part of the semantics of an IP address will have to include that it has consistent meaning whether you are working through the RDMA interface or the L2/network device interface. I understand that it is a bit trickier with IPoIB, but the principle still stands that there should be some correlation between how packets are handled for an RDMA and SOCK_STREAM connection that both are established to the same remote IP Address.

From the iWARP side we can discuss what integrations we need to properly integrate iWARP with the L2 device to avoid these problems. As these proposals are brought up, we should have some review to double-check that we have properly minimized the impact on netdev and so that the hooks can be defined in a way that will allow native IB, SDP/IB and IPoIB to have similar consistency guarantees.

From ardavis at ichips.intel.com Tue Jan 3 14:18:18 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 03 Jan 2006 14:18:18 -0800
Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
In-Reply-To:
References: <43AC57C2.8050509@ichips.intel.com>
Message-ID: <43BAF82A.9040603@ichips.intel.com>

James Lentini wrote:
>On Fri, 23 Dec 2005, Arlin Davis wrote:
>
>arlin>
>arlin> A single entry point is still there with this patch, I just
>arlin> defined it a little different with a function definition for
>arlin> better DAT API mappings. The idea was to replace the existing
>arlin> pvoid extension definition with this new one. Can you give me
>arlin> an idea of how you would map these extended DAT calls to this
>arlin> pvoid function definition?
> >For uDAPL, the DAT_PROVIDER structure is defined as follows: > >struct dat_provider >{ > const char * device_name; > DAT_PVOID extension; > ... > >You could create a well known extensions API by defining a structure >with several function pointers > >struct dat_atomic_extensions >{ > DAT_RETURN (*cmp_and_swap_func)(IN DAT_EP_HANDLE ep_handle, > IN DAT_UINT64 cmp_value, > IN DAT_UINT64 swap_value, > IN DAT_LMR_TRIPLE *local_iov, > IN DAT_DTO_COOKIE user_cookie, > IN DAT_RMR_TRIPLE *remote_iov, > IN DAT_COMPLETION_FLAGS completion_flags); > ... >} > >and require the dat_provider's extensions member to point to your new >extension struct. > >To make the API easier to use, you could also create macros, similar >to the standard DAT macros, to reach inside an objects provider >structure and call the correct extension function. > >#define dat_ep_post_cmp_and_swap(ep, cmp, swap, local_iov, cookie, >remote_iov, flags) \ > (*DAT_HANDLE_TO_EXTENSION (ep)->cmp_and_swap_func) (\ > (ep), \ > (cmp), \ > (swap), \ > (local_iov), \ > (cookie), \ > (remote_iov), \ > (flags)) > >A drawback to this approach is that adding new extensions requires >synchronizing with the original extension specification document. To >eliminate that issue, you could require that the dat_provider's >extension member point to a typed list of these sorts of extension >structures. > > The other drawback is that the consumer calls directly into a table with no validation of provider extensions nor handles. The method I am proposing uses the existing dat_api layer for handle validation, a provider extension validation during the open, and provider extension operation validation with the extension operation parameter in the new DAT_EXTENSION_FUNC typedef. 
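To make the contrast with a bare function-pointer table concrete, the following is a rough sketch of a single extension entry point that validates the handle and the requested operation before dispatching. It is an assumption-laden illustration, not the proposed patch: ext_call(), struct endpoint, the EXT_* enums and EP_MAGIC are all hypothetical names invented here, the real DAT_EXTENSION_FUNC signature is not reproduced, and the atomics act on a plain local word rather than on remote memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none come from the DAT headers. */
enum ext_op { EXT_OP_CMP_AND_SWAP, EXT_OP_FETCH_AND_ADD, EXT_OP_COUNT };
enum ext_status { EXT_SUCCESS, EXT_INVALID_HANDLE, EXT_NOT_SUPPORTED };

#define EP_MAGIC 0x45504831u    /* marks a live endpoint handle */

struct endpoint {
	uint32_t magic;          /* checked on every call */
	uint32_t supported_ops;  /* bitmask fixed when the provider is opened */
	uint64_t word;           /* stand-in for the target of the atomic */
};

/* Single entry point: validate the handle, validate that the provider
 * advertised the operation, and only then perform it. */
static enum ext_status ext_call(struct endpoint *ep, enum ext_op op,
				uint64_t a, uint64_t b, uint64_t *old)
{
	assert(old != NULL);

	if (ep == NULL || ep->magic != EP_MAGIC)
		return EXT_INVALID_HANDLE;
	if ((unsigned)op >= (unsigned)EXT_OP_COUNT ||
	    !(ep->supported_ops & (1u << op)))
		return EXT_NOT_SUPPORTED;

	*old = ep->word;
	switch (op) {
	case EXT_OP_CMP_AND_SWAP:   /* a = compare value, b = swap value */
		if (ep->word == a)
			ep->word = b;
		break;
	case EXT_OP_FETCH_AND_ADD:  /* a = addend, b unused */
		ep->word += a;
		break;
	default:                    /* unreachable after the check above */
		return EXT_NOT_SUPPORTED;
	}
	return EXT_SUCCESS;
}
```

The trade-off being argued in the thread shows up here: a consumer calling straight through a table of function pointers skips both checks, while a single validated entry point costs one extra dispatch per call.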
From ardavis at ichips.intel.com Tue Jan 3 14:54:55 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 03 Jan 2006 14:54:55 -0800 Subject: [openib-general] uDAPL disconnect events In-Reply-To: References: Message-ID: <43BB00BF.5050702@ichips.intel.com> Jimmy Hill wrote: > I'm running with the latest OpenIB Gen2 uDAPL code (CMA version) and > have encountered a problem with disconnect events. The basic problem > is that both sides have to call dat_ep_disconnect in order to break > down a connection cleanly. It should be possible for just one side > (i.e., client) to call disconnect and the other side wait for and see > the disconnect event. That does not appear to be working. It does > however work that way in the old reference implementation (as it > should). I have code that depends on that functionality and as a > result, can not move it to OpenIB Gen2 uDAPL yet. Is this a known problem? Yes, this is a problem. The uDAPL event should be processed from the uCMA event callback. I will work on a fix. > > Changing the flag to DAT_CLOSE_GRACEFUL_FLAG does not change the > behavior. The attached copy of dtest.c is still using the default > flag value. > > I have attached a modified version of the "dtest" test program which > demonstrates the problem. The "client" will disconnect and exit > cleanly. The "server" will hang waiting for the disconnect event. 
> > thanks, > jimmy > > Jimmy Hill > jimmy.hill at us.ibm.com > sobebike at gmail.com > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From robert.j.woodruff at intel.com Tue Jan 3 16:13:29 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 3 Jan 2006 16:13:29 -0800 Subject: [openib-general] iSer SVN 4714 does not compile on 2.6.15 Message-ID: <1AC79F16F5C5284499BB9591B33D6F000680BB56@orsmsx408> I tried to compile iSer SVN 4714 against the released 2.6.15 kernel and get the following compile errors. CC [M] drivers/infiniband/ulp/iser/iscsi_iser.o drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: (Each undeclared identifier is reported only once drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: for each function it appears in.) 
drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': drivers/infiniband/ulp/iser/iscsi_iser.c:1497: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `af' specified in initializer drivers/infiniband/ulp/iser/iscsi_iser.c:1635: warning: initialization makes pointer from integer without a cast drivers/infiniband/ulp/iser/iscsi_iser.c:1636: error: unknown field `rdma' specified in initializer make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 From halr at voltaire.com Tue Jan 3 16:19:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 19:19:47 -0500 Subject: [openib-general] Re: iSer SVN 4714 does not compile on 2.6.15 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000680BB56@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F000680BB56@orsmsx408> Message-ID: <1136333986.4331.52129.camel@hal.voltaire.com> Hi Woody, On Tue, 2006-01-03 at 19:13, Woodruff, Robert J wrote: > I tried to compile iSer SVN 4714 against the released 2.6.15 kernel > and get the following compile errors. > > CC [M] drivers/infiniband/ulp/iser/iscsi_iser.o > drivers/infiniband/ulp/iser/iscsi_iser.c: In function > `iscsi_iser_conn_set_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: > `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: (Each undeclared > identifier is reported only once > drivers/infiniband/ulp/iser/iscsi_iser.c:1437: error: for each function > it appears in.) 
> drivers/infiniband/ulp/iser/iscsi_iser.c: In function > `iscsi_iser_conn_get_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1497: error: > `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `af' > specified in initializer > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: warning: initialization > makes pointer from integer without a cast > drivers/infiniband/ulp/iser/iscsi_iser.c:1636: error: unknown field > `rdma' specified in initializer > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 Did you apply linux-2.6.15-rc7-iscsi_iser.diff ? Please read the iser wiki page. -- Hal From halr at voltaire.com Tue Jan 3 16:27:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 19:27:33 -0500 Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B3EF@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B3EF@mtlexch01.mtl.com> Message-ID: <1136334076.4331.52142.camel@hal.voltaire.com> Hi Eitan, On Sun, 2006-01-01 at 05:12, Eitan Zahavi wrote: > Hi Hal, > > Do you expect the system administrator to manually fill in the > discover.map including ALL nodes in the fabric, their guids and "name"? Yes, although tools can assist with extracting the GUIDs, the names need annotation. > For a large cluster that number is quite large. > > In the past I was proposing using "system IB connectivity model"* (IBNL) > for providing similar but superior capability. How are the names configured using this approach ? > Using IBNL describing > each system type (should be provided by the system vendor - or extracted > once for each system type) the administrator can avoid the need to fill > in the data (guid and name) for every node in the cluster. The names in the discover tool are not system type. 
They are more like system location (descriptive name) although there is no requirement.

> The administrator can select one of two** options:
> 1. Write a "system-level-topology"*** file to describe the expected
> topology instantiating systems only (not devices). This topology file is
> then compared versus the discovered topology and used (the names from
> the file as well as link width and speed) by all diagnostic tools for
> reporting errors.

How are board swaps within a system reported ?

> 2. Write an "annotation" file (ala discover.map syntax) that includes as
> few as one device per system such that the extracted node level topology
> could be matched against that spec and mapped dynamically.

This seems to me like the analog. The difference is that one device gets a name. That will occur within OpenIB once the system boundary work is implemented (logical to physical mapping).

> * IBNL describes the IB connectivity inside a system in a
> hierarchical manner.

Yes, it needs to be hierarchical.

> It enables specifying link width and speed inside the box and on the
> system interface.
> These properties are automatically propagated to the created topology
> - and enables
> their validation on the extracted topology.
> The topology created holds both the system-to-system connectivity layer
> as well as the
> flattened IB node and link layer (the latter is similar to the
> discover.topo). As IBNL is
> describing the systems a common naming scheme for the devices in each
> such system
> is provided by the system vendor and not freely annotated by the
> system administrator.

Why shouldn't the system admin be allowed to use a name friendly to them ? It should point to a device type which is supplied by the vendor.

-- Hal

> Such that any error reported (like bad internal link or device) can be
> easily understood
> by the vendor too.
Furthermore, when several devices misbehave - the > code can > correlate them to a specific board in the system and report the > problem once for that > entire board (this is demonstrated today by code under the ibdm tree - > see below). > > ** Having a "spec topology" has great advantages over extracted one: > Several utilities let you: > + Analyze your topology even before one cable is laid out for credit > loops, num hops, > asymmetrical routing patterns, etc > + Find routing errors that may very well happen on large cluster due > to the > human process of connecting thousand of cables. > + Find links that did not start up in the right speed or width due to > bad cables or their > connections. > > *** By "system-level-topology" I mean a file that is made of the list of > systems and not the list of IB nodes (embedded within this system). For > large cluster using 288 port switch systems the number of elements in > the file is reduced 32 times... > > The code to allow the option 1 is available under: > https://openib.org/svn/gen2/utils/src/linux-user/ibdm > > To support option 2 this code could be easily enhanced with a new > "annotation" algorithm. > > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Hal Rosenstock > > Sent: Saturday, December 31, 2005 8:07 PM > > To: openib-general at openib.org > > Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics > > > > Hi, > > > > The OpenIB diagnostics > > (https://openib.org/svn/gen2/trunk/src/userspace/management/diags) > have > > been updated as follows: > > > > 1. 
discover.pl diagnostic tool added > > discover.pl uses a topology file create by ibnetdiscover and a > discover.map > > file which the network administrator creates which indicates the nodes > > to be expected and a discover.topo file which is the expected > connectivity > > and produces a new connectivity file (discover.topo.new) and outputs > > the changes to stdout. The network administrator can choose to replace > > the "old" topo file with the new one or certain changes in. > > > > The syntax of the discover.map file is: > > |port|"Text for node"| format> > > e.g. > > 8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5 > > 8f10403960558|2|"HCA 1"|# MT23108 InfiniHost Mellanox Technologies > > > > The syntax of the old and new topo files (discover.topo and > discover.topo.new) > > are: > > ||| > > e.g. > > 10|5442ba00003080|1|8f10400410015 > > > > These topo files are produced by the discover.pl tool. > > > > 2. ibportstate diagnostic tool added to query, disable, and enable > > switch ports > > > > 3. Added error only mode to diagnostic scripts so less data to weed > > through on a large fabric (also verbose mode to see everything) > > > > 4. Tree structure collapsed so all tools in same directory as opposed > to > > individual ones and build simplified > > > > Let me know about any comments or issues. Thanks. 
> > > > -- Hal > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Tue Jan 3 16:42:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jan 2006 16:42:55 -0800 Subject: [openib-general] SA cache design Message-ID: <43BB1A0F.2080305@ichips.intel.com> I've been given the task of trying to come up with an implementation for an SA cache. The intent is to increase the scalability and performance of the openib stack. My current thoughts on the implementation are below. Any feedback is welcome. To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply. The benefits that I see with this approach are: + Clients would only need to send requests to the sa_cache. + The sa_cache can be implemented in stages. Requests that it cannot handle would just be forwarded to the SA. + The sa_cache could be implemented on each host, or a select number of hosts. + The interface to the sa_cache is similar to that used by the SA. + The cache would use virtual memory and could be saved to disk. 
Some drawbacks specific to this method are: - The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system. - Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location. I'm sure that there are other benefits and drawbacks that I'm missing. - Sean From halr at voltaire.com Tue Jan 3 16:49:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jan 2006 19:49:22 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BB1A0F.2080305@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> Message-ID: <1136335761.4331.52415.camel@hal.voltaire.com> Hi Sean, On Tue, 2006-01-03 at 19:42, Sean Hefty wrote: > I've been given the task of trying to come up with an implementation for an SA > cache. The intent is to increase the scalability and performance of the openib > stack. My current thoughts on the implementation are below. Any feedback is > welcome. > > To keep the design as flexible as possible, my plan is to implement the cache in > userspace. The interface to the cache would be via MADs. Would this be another MAD class which mimics the SA class ? > Clients would send > their queries to the sa_cache instead of the SA itself. The format of the MADs > would be essentially identical to those used to query the SA itself. Response > MADs would contain any requested information. If the cache could not satisfy a > request, the sa_cache would query the SA, update its cache, then return a reply. > > The benefits that I see with this approach are: > > + Clients would only need to send requests to the sa_cache. > + The sa_cache can be implemented in stages. Requests that it cannot handle > would just be forwarded to the SA. Another option would be for the SA cache to indicate what requests its handles (some MADs for this) and have the clients only go to the cache for those queries (and direct to the SA for the others). 
> + The sa_cache could be implemented on each host, or a select number of hosts.
> + The interface to the sa_cache is similar to that used by the SA.
> + The cache would use virtual memory and could be saved to disk.
>
> Some drawbacks specific to this method are:
>
> - The MAD interface will result in additional data copies and userspace to
> kernel transitions for clients residing on the local system.
> - Clients require a mechanism to locate the sa_cache, or need to make
> assumptions about its location.

Would SA caching be a service ID or set of IDs ?

Are there also issues around cache invalidation ?

-- Hal

From halr at voltaire.com Tue Jan 3 16:53:53 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jan 2006 19:53:53 -0500
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM: Separate out OSM_VERSION
In-Reply-To: <20060103194442.GO14100@narn.hozed.org>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B41B@mtlexch01.mtl.com> <1136304101.4331.46313.camel@hal.voltaire.com> <20060103194442.GO14100@narn.hozed.org>
Message-ID: <1136336032.4331.52467.camel@hal.voltaire.com>

Hi Troy,

On Tue, 2006-01-03 at 14:44, Troy Benjegerdes wrote:
> On Tue, Jan 03, 2006 at 11:01:43AM -0500, Hal Rosenstock wrote:
> > On Tue, 2006-01-03 at 10:43, Eitan Zahavi wrote:
> > > Hi Hal,
> > >
> > > Sounds good.
> > > I think you should be able to use the .svn/entries to get the last
> > > update revision and then use svn diff (or diff) to see if local mods are
> > > done on top of it...
> >
> > I'm using .svn/entries at the osm level.
> >
> > > So we do not get caught by surprise when something broke due to
> > > un-committed mod in the local directory
> > > Thanks
>
> The 'svnversion' command gives you the version, and then checks for
> local mods. No need to reinvent how to do that.

I think svnversion works off the .svn/entries file anyhow so this is not reinventing the how. It is slightly more flexible to rely on the actual file than the command but I'm not averse to using the command if there is consensus from others on this.

-- Hal

From mshefty at ichips.intel.com Tue Jan 3 17:15:50 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jan 2006 17:15:50 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <1136335761.4331.52415.camel@hal.voltaire.com>
References: <43BB1A0F.2080305@ichips.intel.com> <1136335761.4331.52415.camel@hal.voltaire.com>
Message-ID: <43BB21C6.4050800@ichips.intel.com>

Hal Rosenstock wrote:
>>I've been given the task of trying to come up with an implementation for an SA
>>cache. The intent is to increase the scalability and performance of the openib
>>stack. My current thoughts on the implementation are below. Any feedback is
>>welcome.
>>
>>To keep the design as flexible as possible, my plan is to implement the cache in
>>userspace. The interface to the cache would be via MADs.
>
> Would this be another MAD class which mimics the SA class ?

I hadn't fully figured this out yet. I'm not sure if another MAD class is needed or not. My goal is to implement this as transparent to the application as possible without violating the spec, perhaps appearing as an SA on a different LID.

>> Clients would send
>>their queries to the sa_cache instead of the SA itself. The format of the MADs
>>would be essentially identical to those used to query the SA itself. Response
>>MADs would contain any requested information. If the cache could not satisfy a
>>request, the sa_cache would query the SA, update its cache, then return a reply.
>>
>>The benefits that I see with this approach are:
>>
>>+ Clients would only need to send requests to the sa_cache.
>>+ The sa_cache can be implemented in stages. Requests that it cannot handle
>>would just be forwarded to the SA.
> > Another option would be for the SA cache to indicate what requests it > handles (some MADs for this) and have the clients only go to the cache > for those queries (and direct to the SA for the others). I thought about this, but this puts an additional burden on the clients. Letting the sa_cache forward the request allows it to send the requests to another sa_cache, rather than directly to the SA. There's some additional flexibility that we gain in the long term design by forwarding requests. (I'm thinking of the possibility of having an sa_cache hierarchy.) >>+ The sa_cache could be implemented on each host, or a select number of hosts. >>+ The interface to the sa_cache is similar to that used by the SA. >>+ The cache would use virtual memory and could be saved to disk. >> >>Some drawbacks specific to this method are: >> >>- The MAD interface will result in additional data copies and userspace to >>kernel transitions for clients residing on the local system. >>- Clients require a mechanism to locate the sa_cache, or need to make >>assumptions about its location. > > Would SA caching be a service ID or set of IDs ? I'd like the sa_cache to give the appearance of being a standard SA as much as possible. One effect is that an sa_cache may not be able to run on the same node as the actual SA, but that restriction seems desirable to me. > Are there also issues around cache invalidation ? I didn't list cache synchronization as an issue because I couldn't think of any problems that were specific to this design, versus being a general issue.
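For illustration only, the lookup-or-forward behaviour discussed in this thread (answer from the cache when possible, otherwise query the SA, update the cache, then reply) could be sketched roughly as below. This is a sketch under stated assumptions, not a proposed implementation: `sa_cache_query`, `query_sa`, `struct cache_entry` and the fixed-size table keyed by destination LID are all hypothetical names invented for the example, and the "SA" here is simulated rather than reached via MADs.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 64

/* Hypothetical, heavily simplified record: one attribute keyed by DLID. */
struct cache_entry {
	int            valid;
	unsigned short dlid;  /* lookup key */
	unsigned char  sl;    /* cached attribute */
};

static struct cache_entry cache[CACHE_SLOTS];
static int sa_queries;    /* round trips made to the (simulated) SA */

/* Stand-in for a real SA query; it just derives an answer from the key. */
static int query_sa(unsigned short dlid, unsigned char *sl)
{
	++sa_queries;
	*sl = (unsigned char)(dlid & 0xf);
	return 0;
}

/*
 * Answer from the cache if we can; otherwise forward the request to the
 * SA, remember the result, and return it to the caller.
 */
static int sa_cache_query(unsigned short dlid, unsigned char *sl)
{
	struct cache_entry *e = &cache[dlid % CACHE_SLOTS];
	int ret;

	assert(sl != NULL);

	if (e->valid && e->dlid == dlid) {
		*sl = e->sl;          /* cache hit: no SA traffic */
		return 0;
	}

	ret = query_sa(dlid, sl);     /* cache miss: forward to the SA */
	if (ret)
		return ret;

	e->valid = 1;                 /* update (or evict and replace) */
	e->dlid  = dlid;
	e->sl    = *sl;
	return 0;
}
```

Calling `sa_cache_query()` twice with the same DLID should reach the simulated SA only once. The sketch deliberately ignores invalidation and synchronization, which the thread treats as a general caching concern rather than something specific to this design.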
- Sean From ralphc at pathscale.com Tue Jan 3 18:06:05 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 03 Jan 2006 18:06:05 -0800 Subject: [openib-general] Patch for possible bug in ib_create_ah_from_wc() Message-ID: <1136340365.5081.125.camel@brick.internal.keyresearch.com> It looks like ib_create_ah_from_wc() doesn't create the correct return address (AH) when there is a GRH present (source & dest GIDs need to be swapped). I think the following patch will fix the problem but I haven't been able to test it yet. Index: gen2/trunk/src/linux-kernel/infiniband/core/verbs.c =================================================================== --- verbs.c (revision 4718) +++ verbs.c (working copy) @@ -106,9 +106,9 @@ if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; - ah_attr.grh.dgid = grh->dgid; + ah_attr.grh.dgid = grh->sgid; - ret = ib_find_cached_gid(pd->device, &grh->sgid, &port_num, + ret = ib_find_cached_gid(pd->device, &grh->dgid, &port_num, &gid_index); if (ret) return ERR_PTR(ret); -- Ralph Campbell From dotanb at mellanox.co.il Tue Jan 3 22:38:07 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 4 Jan 2006 08:38:07 +0200 Subject: [openib-general] typo fix in the description of ibv_modify_srq Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7038@mtlexch01.mtl.com> Mask names were fixed. Signed-off-by: Dotan Barak Index: last_stable/src/userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/verbs.h +++ last_stable/src/userspace/libibverbs/include/infiniband/verbs.h @@ -796,8 +796,8 @@ struct ibv_srq *ibv_create_srq(struct ib * @srq_attr_mask: A bit-mask used to specify which attributes of the SRQ * are being modified. 
* - * The mask may contain IB_SRQ_MAX_WR to resize the SRQ and/or - * IB_SRQ_LIMIT to set the SRQ's limit and request notification when + * The mask may contain IBV_SRQ_MAX_WR to resize the SRQ and/or + * IBV_SRQ_LIMIT to set the SRQ's limit and request notification when * the number of receives queued drops below the limit. */ int ibv_modify_srq(struct ibv_srq *srq, Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Jan 3 23:01:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 09:01:41 +0200 Subject: [openib-general] Re: [PATCH] fix race in mad.c In-Reply-To: <43BAE95F.9060905@ichips.intel.com> References: <43BAE95F.9060905@ichips.intel.com> Message-ID: <20060104070141.GA23294@mellanox.co.il> Quoting r. Sean Hefty : > Why wouldn't this work? > > lock - check state - queue work - unlock > > It shouldn't matter if ib_mad_thread_completion_handler() is running or not. Right, I just tried to say you need to lock. -- MST From mst at mellanox.co.il Wed Jan 4 02:12:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 12:12:31 +0200 Subject: [openib-general] Re: Re: strange behaviour in svn 4706 libibverbs/init.cwith two adapters In-Reply-To: References: Message-ID: <20060104101231.GO2790@mellanox.co.il> You are right. Roland, I think you need something like this: Fix ibverbs_init for multiple adapters. Noted by Christoph Raisch. 
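As background on why this only shows up with more than one adapter: in C, `[]` binds more tightly than unary `*`, so `*list[n]` parses as `*(list[n])`, which is equivalent to `(*list)[n]` only when `n` is 0. The standalone sketch below illustrates the corrected form; it is not the libibverbs code, and `struct device` and `append_device` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdlib.h>

struct device { int id; };  /* hypothetical stand-in for a device object */

/*
 * Append dev to the caller's array of device pointers, growing the array
 * by one slot. Returns the new element count, or -1 if allocation fails.
 */
static int append_device(struct device ***list, int num, struct device *dev)
{
	struct device **new_list;

	assert(list != NULL && num >= 0);

	new_list = realloc(*list, (num + 1) * sizeof *new_list);
	if (!new_list)
		return -1;
	*list = new_list;

	/*
	 * Writing "*list[num] = dev" here would mean "*(list[num]) = dev":
	 * it indexes the caller's pointer variable instead of the array it
	 * points to, which is only valid for num == 0.
	 */
	(*list)[num] = dev;

	return num + 1;
}
```

With a single device the two spellings happen to touch the same memory, so the problem stays hidden until a second device is appended.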
Index: openib/src/userspace/libibverbs/src/init.c =================================================================== --- openib/src/userspace/libibverbs/src/init.c (revision 4716) +++ openib/src/userspace/libibverbs/src/init.c (working copy) @@ -251,7 +251,7 @@ HIDDEN int ibverbs_init(struct ibv_devic goto out; *list = new_list; } - *list[num_devices++] = device; + (*list)[num_devices++] = device; } } -- MST From RAISCH at de.ibm.com Wed Jan 4 01:48:57 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Wed, 4 Jan 2006 10:48:57 +0100 Subject: [openib-general] Re: strange behaviour in svn 4706 libibverbs/init.cwith two adapters In-Reply-To: <20060103171158.GE2790@mellanox.co.il> Message-ID: sorry, did still have my additional debug print statements in there and counted these as well... never post sth in the evening... libibverbs/src/init.c:254 should be the right linenumber *list[num_devices++] = device; } } out: return num_devices; } ----end of file------ Gruss / Regards . . . Christoph R openib-general-bounces at openib.org wrote on 03.01.2006 18:11:58: > Quoting r. Christoph Raisch : > > libibverbs/src/init.c:271 *list[num_devices++] > > wc -l libibverbs/src/init.c > 260 libibverbs/src/init.c > > Hmm? > > -- > MST > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Wed Jan 4 03:29:37 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 4 Jan 2006 13:29:37 +0200 Subject: [openib-general] [PATCH] ipoib: send only multicast objects memory leak Message-ID: <20060104112937.GP2790@mellanox.co.il> Send only multicast groups currently stay on multicast_list forever. Clean them at ipoib_mcast_dev_down. Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -808,18 +808,29 @@ void ipoib_mcast_dev_down(struct net_dev { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned long flags; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + list_del_init(&mcast->list); + rb_erase(&mcast->rb_node, &priv->multicast_tree); + list_add(&mcast->list, &remove_list); + } /* Delete broadcast since it will be recreated */ if (priv->broadcast) { ipoib_dbg_mcast(priv, "deleting broadcast group\n"); - - spin_lock_irqsave(&priv->lock, flags); rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_mcast_leave(dev, priv->broadcast); - ipoib_mcast_free(priv->broadcast); + list_add_tail(&priv->broadcast->list, &remove_list); priv->broadcast = NULL; } + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } } void ipoib_mcast_restart_task(void *dev_ptr) -- MST From mst at mellanox.co.il Wed Jan 4 03:37:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 13:37:29 +0200 Subject: [openib-general] [PATCH 1 of 2] repost: ipoib neigh issues Message-ID: <20060104113729.GQ2790@mellanox.co.il> Hi! 
I'm reposting the series since a bug in the second patch has been fixed. Patch 1 of 2 --- Multiple ipoib_neigh structures on mcast->neigh_list may point to the same ah. Handle this in ipoib_multicast.c, in the same way as it is handled in ipoib_main.c for struct ipoib_path. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-28 07:55:03.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-04 13:27:59.000000000 +0200 @@ -95,8 +95,6 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; - LIST_HEAD(ah_list); - struct ipoib_ah *ah, *tah; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -105,8 +103,14 @@ static void ipoib_mcast_free(struct ipoi spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + /* + * It's safe to call ipoib_put_ah() inside priv->lock + * here, because we know that mcast->ah will always + * hold one more reference, so ipoib_put_ah() will + * never do more than decrement the ref count. + */ if (neigh->ah) - list_add_tail(&neigh->ah->list, &ah_list); + ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; neigh->neighbour->ops->destructor = NULL; kfree(neigh); @@ -114,9 +118,6 @@ static void ipoib_mcast_free(struct ipoi spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry_safe(ah, tah, &ah_list, list) - ipoib_put_ah(ah); - if (mcast->ah) ipoib_put_ah(mcast->ah); -- MST From mst at mellanox.co.il Wed Jan 4 03:38:17 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 4 Jan 2006 13:38:17 +0200 Subject: [openib-general] [PATCH 2 of 2] repost: ipoib neigh issues In-Reply-To: <20060104113729.GQ2790@mellanox.co.il> References: <20060104113729.GQ2790@mellanox.co.il> Message-ID: <20060104113817.GR2790@mellanox.co.il> Hi! I'm reposting the series since a bug in the second patch has been fixed. Patch 2 of 2. Applied on top of ipoib_all_neigh_issues_1.patch --- IPoIB uses neighbour ops->destructor to clean up struct ipoib_neigh, but ignores the fact that multiple neighbour objects can share the same ops structure, so setting it to NULL affects multiple neighbours. Fix this, by tracking all ipoib_neigh objects, and only cleaning destructor after no neighbour is going to use it. Note that ops structure isnt per device, so we track them in a global list. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-28 07:55:03.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-04 13:27:52.000000000 +0200 @@ -71,6 +71,9 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; +static spinlock_t ipoib_all_neigh_list_lock; +static LIST_HEAD(ipoib_all_neigh_list); + static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -244,9 +247,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -474,7 +476,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; 
dev_kfree_skb_any(skb); @@ -482,8 +484,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -526,11 +526,8 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); - + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +754,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -767,23 +763,45 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) { + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + /* * Is this kosher? I can't find anybody in the kernel that * sets neigh->destructor, so we should be able to set it here * without trouble. 
*/ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; + spin_lock(&ipoib_all_neigh_list_lock); + list_add_tail(&neigh->all_neigh_list, &ipoib_all_neigh_list); + neigh->neighbour->ops->destructor = ipoib_neigh_destructor; + spin_unlock(&ipoib_all_neigh_list_lock); + return neigh; } -static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +void ipoib_neigh_free(struct ipoib_neigh *neigh) { - parms->neigh_setup = ipoib_neigh_setup; + struct ipoib_neigh *n; - return 0; + spin_lock(&ipoib_all_neigh_list_lock); + list_del(&neigh->all_neigh_list); + + list_for_each_entry(n, &ipoib_all_neigh_list, all_neigh_list) + if (n->neighbour->ops == neigh->neighbour->ops) + goto found; + + neigh->neighbour->ops->destructor = NULL; +found: + spin_unlock(&ipoib_all_neigh_list_lock); + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); } int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) @@ -859,7 +877,6 @@ static void ipoib_setup(struct net_devic dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; dev->watchdog_timeo = HZ; @@ -1142,6 +1159,8 @@ static int __init ipoib_init_module(void goto err_fs; } + spin_lock_init(&ipoib_all_neigh_list_lock); + ret = ib_register_client(&ipoib_client); if (ret) goto err_wq; Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-04 10:39:29.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-04 10:39:30.000000000 +0200 @@ -111,9 +111,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ 
-719,13 +717,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-28 07:55:03.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-04 10:39:30.000000000 +0200 @@ -214,6 +214,7 @@ struct ipoib_neigh { struct neighbour *neighbour; struct list_head list; + struct list_head all_neigh_list; }; static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) @@ -222,6 +223,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From mst at mellanox.co.il Wed Jan 4 04:09:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 14:09:32 +0200 Subject: [openib-general] [PATCH] ipoib_init_qp error handling Message-ID: <20060104120932.GU2790@mellanox.co.il> If qp modify to init failed, hardware is likely wedged, so its safer not to try to modify it to reset. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -122,7 +122,7 @@ int ipoib_init_qp(struct net_device *dev ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); if (ret) { ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); - goto out_fail; + return ret; } qp_attr.qp_state = IB_QPS_RTR; -- MST From mst at mellanox.co.il Wed Jan 4 04:13:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 14:13:03 +0200 Subject: [openib-general] [PATCH] ipoib_ib_post_receives error handling Message-ID: <20060104121303.GV2790@mellanox.co.il> Clean up if ipoib_ib_post_receives returns an error code. Rename ipoib_ib_dev_stop to ipoib_reset_qp and call that if posting receive work requests fails. Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -391,83 +391,6 @@ static void __ipoib_reap_ah(struct net_d } } -void ipoib_reap_ah(void *dev_ptr) -{ - struct net_device *dev = dev_ptr; - struct ipoib_dev_priv *priv = netdev_priv(dev); - - __ipoib_reap_ah(dev); - - if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); -} - -int ipoib_ib_dev_open(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - int ret; - - ret = ipoib_init_qp(dev); - if (ret) { - ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); - return -1; - } - - ret = ipoib_ib_post_receives(dev); - if (ret) { - ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - return -1; - } - - clear_bit(IPOIB_STOP_REAPER, &priv->flags); - queue_delayed_work(ipoib_workqueue, 
&priv->ah_reap_task, HZ); - - return 0; -} - -int ipoib_ib_dev_up(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); - - return ipoib_mcast_start_thread(dev); -} - -int ipoib_ib_dev_down(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - ipoib_dbg(priv, "downing ib_dev\n"); - - clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); - netif_carrier_off(dev); - - /* Shutdown the P_Key thread if still active */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - down(&pkey_sem); - set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); - up(&pkey_sem); - flush_workqueue(ipoib_workqueue); - } - - ipoib_mcast_stop_thread(dev, 1); - - /* - * Flush the multicast groups first so we stop any multicast joins. The - * completion thread may have already died and we may deadlock waiting - * for the completion thread to finish some multicast joins. - */ - ipoib_mcast_dev_flush(dev); - - /* Delete broadcast and local addresses since they will be recreated */ - ipoib_mcast_dev_down(dev); - - ipoib_flush_paths(dev); - - return 0; -} static int recvs_pending(struct net_device *dev) { @@ -482,7 +405,8 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) + +static int ipoib_reset_qp(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -564,6 +488,94 @@ timeout: } return 0; + +} + + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_init_qp(dev); + if (ret) { + ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); + 
return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + goto error; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +error: + ipoib_reset_qp(dev); + return ret; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev, 1); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + return ipoib_reset_qp(dev); } int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) -- MST From mst at mellanox.co.il Wed Jan 4 04:31:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 14:31:34 +0200 Subject: [openib-general] [PATCH] mthca: fix page shift calculation Message-ID: <20060104123134.GW2790@mellanox.co.il> Fix page shift calculation: for all pages except possibly the last one, the byte beyond the buffer end must be page aligned. 
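As a rough illustration of the alignment rule just stated (and not the mthca driver code itself), the standalone sketch below ORs together every address that has to fall on a page boundary, namely the start of every buffer except the first and the end of every buffer except the last, and then returns the position of the lowest set bit as the largest usable page shift. `struct phys_buf`, `largest_page_shift`, `MIN_SHIFT` and `MAX_SHIFT` are assumptions made for the example, and all addresses are assumed to be aligned to at least `1 << MIN_SHIFT` already.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one physical buffer descriptor. */
struct phys_buf {
	uint64_t addr;
	uint64_t size;
};

#define MIN_SHIFT 12  /* assumed smallest page size: 4 KB */
#define MAX_SHIFT 31  /* assumed upper bound on the page shift */

static int largest_page_shift(const struct phys_buf *buf, int n)
{
	uint64_t mask = 0;
	int i, shift;

	assert(buf != 0 && n > 0);

	for (i = 0; i < n; ++i) {
		/* Every buffer but the first must start on a page boundary. */
		if (i > 0)
			mask |= buf[i].addr;
		/* Every buffer but the last must end on a page boundary. */
		if (i < n - 1)
			mask |= buf[i].addr + buf[i].size;
	}

	/* The lowest set bit in mask limits how large a page can be. */
	for (shift = MIN_SHIFT; shift < MAX_SHIFT; ++shift)
		if (mask & (1ULL << shift))
			break;

	return shift;
}
```

For two 4 KB buffers at 0x0 and 0x2000, checking only the start addresses would allow a shift of 13, so the first 8 KB page would also cover the gap at 0x1000 that belongs to neither buffer; including the end address of the first buffer in the mask limits the shift to 12. A single buffer puts no constraint on the mask in this sketch and simply hits the cap, so real code would presumably need to treat that case separately.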
Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- openib/drivers/infiniband/hw/mthca/mthca_provider.c (revision 4722) +++ openib/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -787,6 +787,8 @@ static struct ib_mr *mthca_reg_phys_mr(s total_size += buffer_list[i].size; if (i > 0) mask |= buffer_list[i].addr; + if(i < (num_phys_buf - 1)) + mask |= buffer_list[i].addr + buffer_list[i].size; } /* Find largest page shift we can use to cover buffers */ -- MST From mst at mellanox.co.il Wed Jan 4 04:48:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jan 2006 14:48:29 +0200 Subject: [openib-general] [PATCH updated] mthca: fix page shift calculation Message-ID: <20060104124829.GY2790@mellanox.co.il> Sorry, the version I posted previously has broken whitespace. --- Fix page shift calculation: for all pages except possibly the last one, the byte beyond the buffer end must be page aligned. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- openib/drivers/infiniband/hw/mthca/mthca_provider.c (revision 4722) +++ openib/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -787,6 +787,8 @@ static struct ib_mr *mthca_reg_phys_mr(s total_size += buffer_list[i].size; if (i > 0) mask |= buffer_list[i].addr; + if (i < num_phys_buf - 1) + mask |= buffer_list[i].addr + buffer_list[i].size; } /* Find largest page shift we can use to cover buffers */ -- MST From halr at voltaire.com Wed Jan 4 05:25:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Jan 2006 08:25:07 -0500 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B7D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B7D@mtlexch01.mtl.com> Message-ID: <1136381106.4339.152.camel@hal.voltaire.com> Hi Eitan, On Tue, 2005-12-20 at 16:27, Eitan Zahavi wrote: > Hi Hal, > > The effect is basically a slowdown in case of non responding or lost > packets. > With 1sec timeout - up to 4sec per lost transaction are added to the SM > bringup time. > > In many clusters I have seen a 100msec was enough - but I guess you have > actually have seen such failures. I see that the timeout is set to 200 msec (and maxsmps 0) in the Mellanox OpenSM configuration. Do you have a problem with increasing the default from 100 to 200 msec (and also changing default maxsmps to 0) ? -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, December 20, 2005 3:38 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 > msec to 1 > > second > > > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > > > With the advent of long distance IB and software SMAs, 100 msec is no > > longer adaquete as a default transaction timeout. Increase this to 1 > > second which so that the default is sufficient in most common cases. > > > > Signed-off-by: Hal Rosenstock > > > > Index: include/opensm/osm_base.h > > =================================================================== > > --- include/opensm/osm_base.h (revision 4549) > > +++ include/opensm/osm_base.h (working copy) > > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > > * > > * SYNOPSIS > > */ > > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > > /***********/ > > > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > > Index: opensm/main.c > > =================================================================== > > --- opensm/main.c (revision 4549) > > +++ opensm/main.c (working copy) > > @@ -153,7 +153,7 @@ show_usage(void) > > " used for transaction timeouts.\n" > > " Specifying -t 0 disables timeouts.\n" > > " Without -t, OpenSM defaults to a timeout value > of\n" > > - " 100 milliseconds.\n\n" ); > > + " 1 second (1000 milliseconds).\n\n" ); > > printf( "-maxsmps \n" > > " This option specifies the number of VL15 SMP > MADs\n" > > " allowed on the wire at any one time.\n" > From eitan at mellanox.co.il Wed Jan 4 06:04:31 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 4 Jan 2006 16:04:31 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100msec to 1 second Message-ID: 
<6AB138A2AB8C8E4A98B9C0C3D52670E30102B42E@mtlexch01.mtl.com> Hi Hal Regarding timeout 200msec is fine with me. Regarding maxsmps - I think it is better to have 1 SMP on the wire. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, January 04, 2006 3:25 PM > To: Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100msec to 1 > second > > Hi Eitan, > > On Tue, 2005-12-20 at 16:27, Eitan Zahavi wrote: > > Hi Hal, > > > > The effect is basically a slowdown in case of non responding or lost > > packets. > > With 1sec timeout - up to 4sec per lost transaction are added to the SM > > bringup time. > > > > In many clusters I have seen a 100msec was enough - but I guess you have > > actually have seen such failures. > > I see that the timeout is set to 200 msec (and maxsmps 0) in the > Mellanox OpenSM configuration. Do you have a problem with increasing the > default from 100 to 200 msec (and also changing default maxsmps to 0) ? > > -- Hal > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Tuesday, December 20, 2005 3:38 PM > > > To: Yael Kalka; Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 > > msec to 1 > > > second > > > > > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > > > > > With the advent of long distance IB and software SMAs, 100 msec is no > > > longer adaquete as a default transaction timeout. 
Increase this to 1 > > > second which so that the default is sufficient in most common cases. > > > > > > Signed-off-by: Hal Rosenstock > > > > > > Index: include/opensm/osm_base.h > > > =================================================================== > > > --- include/opensm/osm_base.h (revision 4549) > > > +++ include/opensm/osm_base.h (working copy) > > > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > > > * > > > * SYNOPSIS > > > */ > > > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > > > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > > > /***********/ > > > > > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > > > Index: opensm/main.c > > > =================================================================== > > > --- opensm/main.c (revision 4549) > > > +++ opensm/main.c (working copy) > > > @@ -153,7 +153,7 @@ show_usage(void) > > > " used for transaction timeouts.\n" > > > " Specifying -t 0 disables timeouts.\n" > > > " Without -t, OpenSM defaults to a timeout value > > of\n" > > > - " 100 milliseconds.\n\n" ); > > > + " 1 second (1000 milliseconds).\n\n" ); > > > printf( "-maxsmps \n" > > > " This option specifies the number of VL15 SMP > > MADs\n" > > > " allowed on the wire at any one time.\n" > > From halr at voltaire.com Wed Jan 4 05:56:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Jan 2006 08:56:00 -0500 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100msec to 1 second In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B42E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B42E@mtlexch01.mtl.com> Message-ID: <1136382959.4339.404.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-01-04 at 09:04, Eitan Zahavi wrote: > Hi Hal > > Regarding timeout 200msec is fine with me. OK. We'll start there. > Regarding maxsmps - I think it is better to have 1 SMP on the wire. Can you explain the inconsistency of this with the Mellanox default of 0 (infinite) ? Why is 1 outstanding SMP better ? 
-- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, January 04, 2006 3:25 PM > > To: Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from > 100msec to 1 > > second > > > > Hi Eitan, > > > > On Tue, 2005-12-20 at 16:27, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > The effect is basically a slowdown in case of non responding or lost > > > packets. > > > With 1sec timeout - up to 4sec per lost transaction are added to the > SM > > > bringup time. > > > > > > In many clusters I have seen a 100msec was enough - but I guess you > have > > > actually seen such failures. > > > > I see that the timeout is set to 200 msec (and maxsmps 0) in the > > Mellanox OpenSM configuration. Do you have a problem with increasing > the > > default from 100 to 200 msec (and also changing default maxsmps to 0) > ? > > > > -- Hal > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Tuesday, December 20, 2005 3:38 PM > > > > To: Yael Kalka; Eitan Zahavi > > > > Cc: openib-general at openib.org > > > > Subject: [PATCH] OpenSM: Extend default transaction timeout from > 100 > > > msec to 1 > > > > second > > > > > > > > OpenSM: Extend default transaction timeout from 100 msec to 1 > second. > > > > > > > > With the advent of long distance IB and software SMAs, 100 msec is > no > > > > longer adequate as a default transaction timeout. Increase this to > 1 > > > > second so that the default is sufficient in most common > cases. 
> > > > > > > > Signed-off-by: Hal Rosenstock > > > > > > > > Index: include/opensm/osm_base.h > > > > > =================================================================== > > > > --- include/opensm/osm_base.h (revision 4549) > > > > +++ include/opensm/osm_base.h (working copy) > > > > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > > > > * > > > > * SYNOPSIS > > > > */ > > > > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > > > > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > > > > /***********/ > > > > > > > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > > > > Index: opensm/main.c > > > > > =================================================================== > > > > --- opensm/main.c (revision 4549) > > > > +++ opensm/main.c (working copy) > > > > @@ -153,7 +153,7 @@ show_usage(void) > > > > " used for transaction timeouts.\n" > > > > " Specifying -t 0 disables timeouts.\n" > > > > " Without -t, OpenSM defaults to a timeout > value > > > of\n" > > > > - " 100 milliseconds.\n\n" ); > > > > + " 1 second (1000 milliseconds).\n\n" ); > > > > printf( "-maxsmps \n" > > > > " This option specifies the number of VL15 SMP > > > MADs\n" > > > > " allowed on the wire at any one time.\n" > > >
From halr at voltaire.com Wed Jan 4 08:00:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Jan 2006 11:00:20 -0500 Subject: [openib-general] [PATCH] OpenSM: Extend default transaction timeout from 100 to 200 msec Message-ID: <1136390419.4339.1383.camel@hal.voltaire.com> OpenSM: Extend default transaction timeout from 100 to 200 msec Signed-off-by: Hal Rosenstock Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 4753) +++ include/opensm/osm_base.h (working copy) @@ -234,7 +234,7 @@ BEGIN_C_DECLS * * SYNOPSIS */ -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 200 /***********/ /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT Index: opensm/main.c =================================================================== --- opensm/main.c (revision 4753) +++ opensm/main.c (working copy) @@ -154,7 +154,7 @@ show_usage(void) " used for transaction timeouts.\n" " Specifying -t 0 disables timeouts.\n" " Without -t, OpenSM defaults to a timeout value of\n" - " 100 milliseconds.\n\n" ); + " 200 milliseconds.\n\n" ); printf( "-maxsmps \n" " This option specifies the number of VL15 SMP MADs\n" " allowed on the wire at any one time.\n" From bardov at gmail.com Wed Jan 4 08:22:57 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Wed, 4 Jan 2006 18:22:57 +0200 Subject: [openib-general] [PATCH] iser: simplify handling of iscsi unsolicited data In-Reply-To: References: Message-ID: Applied, thanks. 
Dan On 1/3/06, Or Gerlitz wrote: > The patch below eliminates the special handling of memory to be used for > iscsi unsolicited data write, instead all the command data is registered > for rdma. The descriptors holding the rdma registration info were much > simplified, the fields rdma_read/write_dto and send/recv_buff_list were > removed from struct iscsi_iser_cmd_task and are now replaced with rdma_regd. > > Signed-off-by: Alex Nezhinsky > Signed-off-by: Or Gerlitz > > > Index: ulp/iser/iser_memory.h > =================================================================== > --- ulp/iser/iser_memory.h (revision 4622) > +++ ulp/iser/iser_memory.h (working copy) > @@ -54,13 +54,6 @@ void iser_reg_single(struct iser_adaptor > struct iser_regd_buf *p_regd_buf, > enum dma_data_direction direction); > > -void iser_reg_single_task(struct iser_adaptor *p_iser_adaptor, > - struct iser_regd_buf *p_regd_buf, > - void *virt_addr, > - dma_addr_t dma_addr, > - unsigned long data_size, > - enum dma_data_direction direction); > - > /* scatterlist */ > int iser_sg_size(struct iser_data_buf *p_mem); > > @@ -70,11 +63,6 @@ void iser_start_rdma_unaligned_sg(struct > void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task); > > /* iser_data_buf */ > -unsigned int iser_data_buf_contig_len(struct iser_data_buf *p_data, > - int skip, > - dma_addr_t *chunk_dma_addr, > - int *chink_sz); > - > unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data, > int skip); > > Index: ulp/iser/iscsi_iser.h > =================================================================== > --- ulp/iser/iscsi_iser.h (revision 4622) > +++ ulp/iser/iscsi_iser.h (working copy) > @@ -277,16 +277,8 @@ struct iscsi_iser_cmd_task { > > unsigned int post_send_count; /* posted send buffers pending completion */ > > - /* buffers, to release when the task is complete */ > - struct list_head send_buff_list; > - struct list_head rcv_buff_list; > - struct iser_dto rdma_read_dto; > - struct iser_dto 
rdma_write_dto; > - > - struct list_head conn_list; /* Tasks list of the conn */ > - struct list_head hash_list; /* Hash table bucket entry */ > - > int dir[ISER_DIRS_NUM]; /* set if direction used */ > + struct iser_regd_buf *rdma_regd[ISER_DIRS_NUM]; /* regd rdma buffer */ > unsigned long data_len[ISER_DIRS_NUM]; /* total data length */ > struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data descriptor */ > struct iser_data_buf data_copy[ISER_DIRS_NUM]; /* contig. copy */ > Index: ulp/iser/iser.h > =================================================================== > --- ulp/iser/iser.h (revision 4622) > +++ ulp/iser/iser.h (working copy) > @@ -63,16 +63,9 @@ > #define ISER_TOTAL_HEADERS_LEN \ > (ISER_HDR_LEN + ISER_PDU_BHS_LENGTH) > > -/* Hash tables */ > -#define HASH_TABLE_SIZE 256 > - > /* Various size limits */ > #define ISER_LOGIN_PHASE_PDU_DATA_LEN (8*1024) /* 8K */ > > -struct hash_table { > - struct list_head bucket_head[HASH_TABLE_SIZE]; > - spinlock_t lock; > -}; > > struct iser_page_vec { > u64 *pages; > @@ -99,9 +92,6 @@ struct iser_regd_buf { > enum dma_data_direction direction; /* direction for dma_unmap */ > unsigned int data_size; > > - > - /* To be chained here, if freeing upon completion is signaled */ > - struct list_head free_upon_comp_list; > /* Reference count, memory freed when decremented to 0 */ > atomic_t ref_count; > }; > @@ -149,8 +139,6 @@ struct iser_global { > > kmem_cache_t *login_cache; > kmem_cache_t *header_cache; > - > - struct hash_table task_hash; /* hash table for tasks */ > }; /* iser_global */ > > extern struct iser_global ig; > Index: ulp/iser/iser_dto.c > =================================================================== > --- ulp/iser/iser_dto.c (revision 4622) > +++ ulp/iser/iser_dto.c (working copy) > @@ -79,152 +79,6 @@ int iser_dto_add_regd_buff(struct iser_d > } > > /** > - * iser_dto_clone_regd_buffs - creates a dto (dst) which refers to a subrange > - * of the memory referenced by another dto (src). 
> - */ > -void iser_dto_clone_regd_buffs(struct iser_dto *p_dst, > - struct iser_dto *p_src, > - unsigned long offset, > - unsigned long size) > -{ > - unsigned long remaining_offset = offset; > - unsigned long remaining_size = size; > - unsigned long regd_buf_size; > - unsigned long used_size; > - int i; > - > - for (i = 0; i < p_src->regd_vector_len; i++) { > - regd_buf_size = p_src->used_sz[i] > 0 ? > - p_src->used_sz[i] : > - p_src->regd[i]->reg.len; > - > - if (remaining_offset < regd_buf_size) { > - used_size = min(remaining_size, > - regd_buf_size - remaining_offset); > - iser_dto_add_regd_buff(p_dst, > - p_src->regd[i], > - USE_OFFSET(p_src-> > - offset[i] + > - remaining_offset), > - USE_SIZE(used_size)); > - remaining_size -= used_size; > - if (remaining_size == 0) > - break; > - else > - remaining_offset = 0; > - } else > - remaining_offset -= regd_buf_size; > - } > - if (remaining_size > 0) > - iser_bug("size to clone:%ld exceeds by %ld the total size of " > - "src DTO:0x%p; dst DTO:0x%p, task:0x%p\n", > - size, remaining_size, p_src, p_dst, p_dst->p_task); > -} > - > -/** > - * iser_dto_add_local_single - > - */ > -void iser_dto_add_local_single(struct iser_adaptor *p_iser_adaptor, > - struct iser_dto *p_dto, > - void *virt_addr, > - dma_addr_t dma_addr, > - unsigned long data_size, > - enum dma_data_direction direction) > -{ > - struct iser_regd_buf *p_regd_buf; > - > - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); > - > - iser_reg_single_task(p_iser_adaptor, p_regd_buf, > - virt_addr, dma_addr, data_size, direction); > - > - iser_dto_add_regd_buff(p_dto, p_regd_buf, > - USE_NO_OFFSET, USE_ENTIRE_SIZE); > -} > - > -/** > - * iser_dto_add_local_sg - adds a scatterlist to a dto intended for local > - * operations only; tries to use registration keys from all-memory > - * registration whenever possible. 
> - */ > -int iser_dto_add_local_sg(struct iser_dto *p_dto, > - struct iser_data_buf *p_mem, > - enum dma_data_direction direction) > -{ > - struct iser_adaptor *p_iser_adaptor = p_dto->p_conn->ib_conn->p_adaptor; > - struct iser_regd_buf *p_regd_buf; > - int cur_buf = 0; > - int err = 0; > - int num_sg; > - > - do { > - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); > - if (p_regd_buf == NULL) { > - iser_err("Failed to alloc regd_buf\n"); > - err = -ENOMEM; > - goto dto_add_local_sg_exit; > - } > - /* if enough place in IOV for all sg entries, use all-memory > - * registration, otherwise register memory */ > - /* DMA_MAP: by now the sg must have been mapped, get the dma addr properly & pass it */ > - if (p_mem->dma_nents - cur_buf < > - MAX_REGD_BUF_VECTOR_LEN - p_dto->regd_vector_len) { > - dma_addr_t chunk_dma_addr; > - int chunk_sz; > - void *chunk_vaddr; > - num_sg = iser_data_buf_contig_len(p_mem, > - cur_buf, /* skip */ > - &chunk_dma_addr, > - &chunk_sz); > - /* DMA_MAP: vaddr not needed for this regd_buf */ > - chunk_vaddr = 0; > - iser_reg_single_task(p_iser_adaptor, p_regd_buf, > - chunk_vaddr, chunk_dma_addr, > - chunk_sz, direction); > - } else { > - struct iser_page_vec *page_vec; > - num_sg = iser_data_buf_aligned_len(p_mem,cur_buf); > - page_vec = iser_page_vec_alloc(p_mem,cur_buf,num_sg); > - if (page_vec == NULL) { > - iser_err("Failed to alloc page_vec\n"); > - iser_regd_buff_release(p_regd_buf); > - err = -ENOMEM; > - goto dto_add_local_sg_exit; > - } > - iser_page_vec_build(p_mem,page_vec,cur_buf,num_sg); > - > - err = iser_reg_phys_mem(p_iser_adaptor, > - page_vec, > - IB_ACCESS_LOCAL_WRITE | > - IB_ACCESS_REMOTE_WRITE | > - IB_ACCESS_REMOTE_READ , > - &p_regd_buf->reg); > - iser_page_vec_free(page_vec); > - if (err) { > - iser_err("Failed to register %d sg entries " > - "starting from %d\n",num_sg,cur_buf); > - iser_regd_buff_release(p_regd_buf); > - goto dto_add_local_sg_exit; > - } > - > - iser_dto_add_regd_buff(p_dto, > - p_regd_buf, > 
- USE_NO_OFFSET, > - USE_ENTIRE_SIZE); > - } > - iser_dto_add_regd_buff(p_dto, p_regd_buf, > - USE_NO_OFFSET, USE_ENTIRE_SIZE); > - iser_dbg("Added regd.buf:0x%p to DTO:0x%p now %d regd.bufs\n", > - p_regd_buf, p_dto, p_dto->regd_vector_len); > - > - cur_buf += num_sg; > - } while (cur_buf < p_mem->size); > - > - dto_add_local_sg_exit: > - return err; > -} > - > -/** > * iser_dto_buffs_release - free all registered buffers > */ > void iser_dto_buffs_release(struct iser_dto *p_dto) > Index: ulp/iser/iser_dto.h > =================================================================== > --- ulp/iser/iser_dto.h (revision 4622) > +++ ulp/iser/iser_dto.h (working copy) > @@ -47,28 +47,11 @@ int iser_dto_add_regd_buff(struct iser_d > struct iser_regd_buf *p_regd_buf, > unsigned long use_offset, > unsigned long use_size); > -void > -iser_dto_clone_regd_buffs(struct iser_dto *p_dst_dto, > - struct iser_dto *p_src_dto, > - unsigned long offset, > - unsigned long size); > > -void iser_dto_buffs_release(struct iser_dto *p_dto); > void iser_dto_free(struct iser_dto *p_dto); > > int iser_dto_completion_error(struct iser_dto *p_dto); > > -void iser_dto_add_local_single(struct iser_adaptor *p_iser_adaptor, > - struct iser_dto *p_dto, > - void *virt_addr, > - dma_addr_t dma_addr, > - unsigned long data_size, > - enum dma_data_direction direction); > - > -int iser_dto_add_local_sg(struct iser_dto *p_dto, > - struct iser_data_buf *p_mem, > - enum dma_data_direction direction); > - > void iser_dto_get_rx_pdu_data(struct iser_dto *p_dto, > unsigned long dto_xfer_len, > struct iscsi_hdr **p_rx_hdr, > Index: ulp/iser/iser_initiator.c > =================================================================== > --- ulp/iser/iser_initiator.c (revision 4622) > +++ ulp/iser/iser_initiator.c (working copy) > @@ -46,11 +46,6 @@ > #include "iser_verbs.h" > #include "iser_memory.h" > > -#define ISCSI_AHSL_MASK 0xFF000000 > -#define ISCSI_DSL_MASK 0x00FFFFFF > -#define ISCSI_INVALID_ITT 0xFFFFFFFF > - > - 
> static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task); > > /** > @@ -60,43 +55,27 @@ static void iser_dma_unmap_task_data(str > * returns 0 on success, -1 on failure > */ > static int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *p_iser_task, > - enum iser_data_dir cmd_dir, > - struct iser_data_buf *p_mem, > - struct iser_regd_buf **regd_buf) > + enum iser_data_dir cmd_dir) > { > struct iser_adaptor *p_iser_adaptor = p_iser_task->conn->ib_conn->p_adaptor; > - struct list_head *p_task_buff_list = NULL; > struct iser_page_vec *page_vec = NULL; > struct iser_regd_buf *p_regd_buf = NULL; > - struct iser_dto *p_dto = NULL; > - enum ib_access_flags priv_flags = 0; > + enum ib_access_flags priv_flags = IB_ACCESS_LOCAL_WRITE; > + struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; > unsigned int page_vec_len = 0; > - struct iser_data_buf *mem_to_reg; > - int cnt_to_reg; > + int cnt_to_reg = 0; > int err = 0; > > - if (cmd_dir == ISER_DIR_IN) { > - iser_dbg("cmd_dir == ISER_DIR_IN\n"); > - p_dto = &p_iser_task->rdma_write_dto; > - p_task_buff_list = &p_iser_task->rcv_buff_list; > - priv_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > - } else if (cmd_dir == ISER_DIR_OUT) { > - iser_dbg("cmd_dir == ISER_DIR_OUT\n"); > - p_dto = &p_iser_task->rdma_read_dto; > - p_task_buff_list = &p_iser_task->send_buff_list; > - priv_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ; > - } else > - iser_bug("Unexpected cmd dir:%d, task:0x%p\n", > - cmd_dir, p_iser_task); > - *regd_buf = NULL; > + if (cmd_dir == ISER_DIR_IN) > + priv_flags |= IB_ACCESS_REMOTE_WRITE; > + else > + priv_flags |= IB_ACCESS_REMOTE_READ; > > + p_iser_task->rdma_regd[cmd_dir] = NULL; > p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); > if (p_regd_buf == NULL) > return -ENOMEM; > > - cnt_to_reg = 0; > - mem_to_reg = p_mem; > - > iser_dbg("p_mem %p p_mem->type %d\n", p_mem,p_mem->type); > > if (p_mem->type != ISER_BUF_TYPE_SINGLE) { > @@ -114,19 +93,19 @@ static int 
iser_reg_rdma_mem(struct iscs > /* unaligned scatterlist, anyway dma map the copy */ > iser_start_rdma_unaligned_sg(p_iser_task, cmd_dir); > p_regd_buf->virt_addr = p_iser_task->data_copy[cmd_dir].p_buf; > - mem_to_reg = &p_iser_task->data_copy[cmd_dir]; > + p_mem = &p_iser_task->data_copy[cmd_dir]; > } > } else { > iser_dbg("converting single to page_vec\n"); > p_regd_buf->virt_addr = p_mem->p_buf; > } > > - page_vec = iser_page_vec_alloc(mem_to_reg,0,cnt_to_reg); > + page_vec = iser_page_vec_alloc(p_mem,0,cnt_to_reg); > if (page_vec == NULL) { > iser_regd_buff_release(p_regd_buf); > return -ENOMEM; > } > - page_vec_len = iser_page_vec_build(mem_to_reg,page_vec, 0, cnt_to_reg); > + page_vec_len = iser_page_vec_build(p_mem, page_vec, 0, cnt_to_reg); > err = iser_reg_phys_mem(p_iser_adaptor, page_vec, priv_flags, > &p_regd_buf->reg); > iser_page_vec_free(page_vec); > @@ -135,57 +114,10 @@ static int iser_reg_rdma_mem(struct iscs > iser_regd_buff_release(p_regd_buf); > return -EINVAL; > } > - *regd_buf = p_regd_buf; > - > - spin_lock_bh(&p_iser_task->task_lock); > - > -/*FIXME p_dto->p_task = p_iser_task; */ > -/*FIXME p_dto->p_conn = p_iser_task->p_conn; */ > - p_dto->regd_vector_len = 0; > - iser_dto_add_regd_buff(p_dto, p_regd_buf, > - USE_NO_OFFSET, USE_ENTIRE_SIZE); > - /* to be released when the task completes */ > - list_add(&p_regd_buf->free_upon_comp_list, p_task_buff_list); > - > - spin_unlock_bh(&p_iser_task->task_lock); > - return 0; > -} > - > -/** > - * Registers memory > - * intended for sending as unsolicited data > - * > - * returns 0 on success, -1 on failure > - */ > -static int iser_reg_unsol(struct iscsi_iser_cmd_task *p_iser_task) > -{ > - struct iser_adaptor *p_iser_adaptor = p_iser_task->conn->ib_conn->p_adaptor; > - struct iser_dto *p_dto = &p_iser_task->rdma_read_dto; > - struct iser_data_buf *p_mem = &p_iser_task->data[ISER_DIR_OUT]; > - int err = 0; > - int i; > - > - if (p_mem->type == ISER_BUF_TYPE_SINGLE) { > - /* DMA_MAP: should pass 
the task? single address has been mapped already!!! */ > - iser_dto_add_local_single(p_iser_adaptor, p_dto, > - p_mem->p_buf, > - p_mem->dma_addr, p_mem->size, > - DMA_TO_DEVICE); > - } > - else { > - /* DMA_MAP: should pass copied and mapped sg instead? */ > - err = iser_dto_add_local_sg(p_dto, p_mem, DMA_TO_DEVICE); > - if (err) { > - iser_err("iser_dto_add_local_sg failed\n"); > - iser_dto_buffs_release(p_dto); > - return err; > - } > - } > - > - /* all registered buffers have been referenced, > - but this dto is not used in any IO */ > - for (i = 0; i < p_dto->regd_vector_len; i++) > - iser_regd_buff_deref(p_dto->regd[i]); > + /* take a reference on this regd buf such that it will not be released * > + * (eg in send dto completion) before we get the scsi response */ > + iser_regd_buff_ref(p_regd_buf); > + p_iser_task->rdma_regd[cmd_dir] = p_regd_buf; > return 0; > } > > @@ -239,12 +171,12 @@ static int iser_prepare_read_cmd(struct > memcpy(&p_iser_task->data[ISER_DIR_IN], buf_in, > sizeof(struct iser_data_buf)); > > - err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_IN, > - &p_iser_task->data[ISER_DIR_IN],&p_regd_buf); > + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_IN); > if (err) { > iser_err("Failed to set up Data-IN RDMA\n"); > return err; > } > + p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_IN]; > ISER_HDR_SET_BITS(p_iser_header, RSV, 1); > ISER_HDR_R_VADDR(p_iser_header) = cpu_to_be64(p_regd_buf->reg.va); > ISER_HDR_R_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); > @@ -311,17 +243,18 @@ iser_prepare_write_cmd(struct iscsi_iser > memcpy(&p_iser_task->data[ISER_DIR_OUT], buf_out, > sizeof(struct iser_data_buf)); > > - if (unsol_sz < edtl) { > - err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_OUT, > - &p_iser_task->data[ISER_DIR_OUT], > - &p_regd_buf); > - if (err != 0) { > - iser_err("Failed to register write cmd RDMA mem\n"); > - return err; > - } > + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_OUT); > + if (err != 0) { > + iser_err("Failed to register 
write cmd RDMA mem\n"); > + return err; > + } > + > + p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_OUT]; > + > + if(unsol_sz < edtl) { > ISER_HDR_SET_BITS(p_iser_header, WSV, 1); > ISER_HDR_W_VADDR(p_iser_header) = cpu_to_be64( > - p_regd_buf->reg.va + unsol_sz); > + p_regd_buf->reg.va + unsol_sz); > ISER_HDR_W_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); > > iser_dbg("Cmd itt:%d, WRITE tags, RKEY:0x%08X " > @@ -329,24 +262,17 @@ iser_prepare_write_cmd(struct iscsi_iser > p_iser_task->itt, p_regd_buf->reg.rkey, > (unsigned long)p_regd_buf->reg.va, > unsol_sz); > - } else { > - err = iser_reg_unsol(p_iser_task); /* DMA_MAP: buf_out is already in task->data[DIR_OUT] */ > - if (err != 0){ > - iser_err("Failed to register write cmd RDMA mem\n"); > - return err; > - } > } > > - /* If there is immediate data, add its register > - buffer reference to the send dto descriptor */ > if (imm_sz > 0) { > iser_dbg("Cmd itt:%d, WRITE, adding imm.data sz: %d\n", > p_iser_task->itt, imm_sz); > - > - iser_dto_clone_regd_buffs(p_send_dto, /* dst */ > - &p_iser_task->rdma_read_dto, > - 0, imm_sz); > + iser_dto_add_regd_buff(p_send_dto, > + p_regd_buf, > + USE_NO_OFFSET, > + USE_SIZE(imm_sz)); > } > + > return 0; > } > > @@ -469,7 +395,8 @@ int iser_send_data_out(struct iscsi_iser > data_seg_len = ntoh24(hdr->dlength); > buf_offset = ntohl(hdr->offset); > > - iser_dbg("%s itt %d dseg_len %d offset %d\n",__func__,(int)itt,(int)data_seg_len,(int)buf_offset); > + iser_dbg("%s itt %d dseg_len %d offset %d\n", > + __func__,(int)itt,(int)data_seg_len,(int)buf_offset); > > /* Allocate send DTO descriptor, headers buf and add it to the DTO */ > p_send_dto = iser_dto_send_create(p_iser_conn, > @@ -486,10 +413,11 @@ int iser_send_data_out(struct iscsi_iser > > p_send_dto->p_task = p_ctask; > > - /* Set-up the registered buffer entries for the data segment */ > - iser_dto_clone_regd_buffs(p_send_dto, /* dst */ > - &p_ctask->rdma_read_dto, > - buf_offset, data_seg_len); > + /* all data was 
registered for RDMA, we can use the lkey */ > + iser_dto_add_regd_buff(p_send_dto, > + p_ctask->rdma_regd[ISER_DIR_OUT], > + USE_OFFSET(buf_offset), > + USE_SIZE(data_seg_len)); > > if (buf_offset + data_seg_len > p_ctask->data_len[ISER_DIR_OUT]) { > iser_err("Offset:%ld & DSL:%ld in Data-Out " > Index: ulp/iser/iser_task.c > =================================================================== > --- ulp/iser/iser_task.c (revision 4622) > +++ ulp/iser/iser_task.c (working copy) > @@ -46,90 +46,16 @@ void iser_task_init_lowpart(struct iscsi > { > spin_lock_init(&p_iser_task->task_lock); > p_iser_task->status = ISER_TASK_STATUS_INIT; > - > - INIT_LIST_HEAD(&p_iser_task->send_buff_list); > - INIT_LIST_HEAD(&p_iser_task->rcv_buff_list); > - > p_iser_task->post_send_count = 0; > - > + > p_iser_task->dir[ISER_DIR_IN] = 0; > p_iser_task->dir[ISER_DIR_OUT] = 0; > - > + > p_iser_task->data_len[ISER_DIR_IN] = 0; > p_iser_task->data_len[ISER_DIR_OUT] = 0; > - > - iser_dto_init(&p_iser_task->rdma_read_dto); > - p_iser_task->rdma_read_dto.p_conn = p_iser_task->conn; > - p_iser_task->rdma_read_dto.p_task = p_iser_task; > - > - iser_dto_init(&p_iser_task->rdma_write_dto); > - p_iser_task->rdma_write_dto.p_conn = p_iser_task->conn; > - p_iser_task->rdma_write_dto.p_task = p_iser_task; > -} > - > -/** > - * iser_task_release_send_buffers - Frees all sent buffers of a > - * task (upon completion) > - */ > -void iser_task_release_send_buffers(struct iscsi_iser_cmd_task *p_iser_task) > -{ > - struct iser_regd_buf *p_regd_buf; > - int tries = 0; > - > - iser_dbg( "Releasing send buffs for iSER task: 0x%p\n", > - p_iser_task); > - > - /* Free all sent buffers from the list */ > - spin_lock_bh(&p_iser_task->task_lock); > - while (!list_empty(&p_iser_task->send_buff_list)) { > - /* Get the next send buffer & remove it from the list */ > - p_regd_buf = > - list_entry(p_iser_task->send_buff_list.next, > - struct iser_regd_buf, free_upon_comp_list); > - 
list_del(&p_regd_buf->free_upon_comp_list); > - spin_unlock_bh(&p_iser_task->task_lock); > - > - if (iser_regd_buff_release(p_regd_buf) != 0) { > - iser_err("Failed to release send buffer after " > - "task complete, task: 0x%p, itt: %d -" > - " references remain\n", > - p_iser_task, p_iser_task->itt); > - > - tries++; /* FIXME: calling schedule */ > - schedule(); > - } > - > - spin_lock_bh(&p_iser_task->task_lock); > - } > - spin_unlock_bh(&p_iser_task->task_lock); > - if (tries) > - iser_err("Released send buff after %d tries\n", tries); > -} > - > -/** > - * iser_task_release_recv_buffers - Frees all receive buffers of > - * a task (upon completion) > - */ > -void iser_task_release_recv_buffers(struct iscsi_iser_cmd_task *p_iser_task) > -{ > - struct iser_regd_buf *p_regd_buf; > - > - spin_lock_bh(&p_iser_task->task_lock); > - while (!list_empty(&p_iser_task->rcv_buff_list)) { > - p_regd_buf = list_entry(p_iser_task->rcv_buff_list.next, > - struct iser_regd_buf, > - free_upon_comp_list); > - list_del(&p_regd_buf->free_upon_comp_list); > - spin_unlock_bh(&p_iser_task->task_lock); > - > - if (iser_regd_buff_release(p_regd_buf) != 0) > - iser_bug("task:0x%p complete, failed to release " > - "recv buf:0x%p, itt:%d - refs remain\n", > - p_iser_task, p_regd_buf, p_iser_task->itt); > - > - spin_lock_bh(&p_iser_task->task_lock); > - } > - spin_unlock_bh(&p_iser_task->task_lock); > + > + p_iser_task->rdma_regd[ISER_DIR_IN] = NULL; > + p_iser_task->rdma_regd[ISER_DIR_OUT] = NULL; > } > > /** > @@ -184,9 +110,22 @@ iser_task_set_status(struct iscsi_iser_c > */ > void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *p_iser_task) > { > + int deferred; > + > if (p_iser_task == NULL) > iser_bug("NULL task descriptor\n"); > > - iser_task_release_send_buffers(p_iser_task); > - iser_task_release_recv_buffers(p_iser_task); > + spin_lock_bh(&p_iser_task->task_lock); > + if (p_iser_task->dir[ISER_DIR_IN]) { > + deferred = 
iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_IN]); > + if (deferred) > + iser_bug("References remain for BUF-IN rdma reg\n"); > + } > + if (p_iser_task->dir[ISER_DIR_OUT] && > + p_iser_task->rdma_regd[ISER_DIR_OUT] != NULL) { > + deferred = iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_OUT]); > + if (deferred) > + iser_bug("References remain for BUF-OUT rdma reg\n"); > + } > + spin_unlock_bh(&p_iser_task->task_lock); > } > Index: ulp/iser/iser_conn.h > =================================================================== > --- ulp/iser/iser_conn.h (revision 4622) > +++ ulp/iser/iser_conn.h (working copy) > @@ -40,9 +40,6 @@ > /* adaptor-related */ > int iser_adaptor_init(struct iser_adaptor *p_iser_adaptor); > int iser_adaptor_release(struct iser_adaptor *p_iser_adaptor); > -struct iser_conn *iser_adaptor_find_conn( > - struct iser_adaptor *p_iser_adaptor, void *ep_handle); > - > > /* internal connection handling */ > void iser_conn_init(struct iser_conn *p_iser_conn); > Index: ulp/iser/iser_task.h > =================================================================== > --- ulp/iser/iser_task.h (revision 4622) > +++ ulp/iser/iser_task.h (working copy) > @@ -37,13 +37,12 @@ > > #include "iser.h" > > -void iser_task_hash_init(struct hash_table *hash_table); > -struct iscsi_iser_cmd_task *iser_task_find(struct iscsi_iser_conn *p_iser_conn, u32 itt); > void iser_task_init_lowpart(struct iscsi_iser_cmd_task *p_iser_task); > +void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *iser_task); > + > void iser_task_post_send_count_inc(struct iscsi_iser_cmd_task *p_iser_task); > int iser_task_post_send_count_dec_and_test(struct iscsi_iser_cmd_task *p_iser_task); > void iser_task_set_status(struct iscsi_iser_cmd_task *p_iser_task, > enum iser_task_status status); > -void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *iser_task); > > #endif /* __ISER_TASK_H__ */ > Index: ulp/iser/iser_memory.c > 
=================================================================== > --- ulp/iser/iser_memory.c (revision 4622) > +++ ulp/iser/iser_memory.c (working copy) > @@ -206,24 +206,6 @@ void iser_reg_single(struct iser_adaptor > p_regd_buf->direction = direction; > } > > -void iser_reg_single_task(struct iser_adaptor *p_iser_adaptor, > - struct iser_regd_buf *p_regd_buf, > - void *virt_addr, > - dma_addr_t dma_addr, > - unsigned long data_size, > - enum dma_data_direction direction) > -{ > - p_regd_buf->reg.lkey = p_iser_adaptor->mr->lkey; > - p_regd_buf->reg.rkey = 0; /* indicate there's no need to unreg */ > - p_regd_buf->reg.len = data_size; > - p_regd_buf->reg.va = dma_addr; > - > - p_regd_buf->dma_addr = 0; > - p_regd_buf->virt_addr = virt_addr; > - p_regd_buf->data_size = data_size; > - p_regd_buf->direction = direction; > -} > - > /** > * iser_sg_size - returns the total data length in sg list > */ > @@ -523,42 +505,6 @@ unsigned int iser_data_buf_aligned_len(s > return ret_len; > } > > -/* > - * determine the maximal contiguous sub-list of a scatter-gather list > - */ > -unsigned int iser_data_buf_contig_len(struct iser_data_buf *p_data, int skip, > - dma_addr_t *chunk_dma_addr, int *chunk_size) > -{ > - unsigned int ret_len = 0; > - > - if (p_data->type == ISER_BUF_TYPE_SINGLE) > - iser_bug("p_data must be sg\n"); > - else { > - struct scatterlist *p_sg = p_data->p_buf; > - int cnt, i; > - > - *chunk_dma_addr = sg_dma_address(&p_sg[skip]); > - *chunk_size = 0; > - > - for (cnt = 0, i = skip; i < p_data->dma_nents; i++, cnt++){ > - if ((cnt > 0) && sg_dma_address(&p_sg[i]) != > - (sg_dma_address(&p_sg[i-1]) + sg_dma_len(&p_sg[i-1]))) { > - ret_len = cnt; > - break; > - } > - *chunk_size += sg_dma_len(&p_sg[i]); > - } > - if (i == p_data->dma_nents) > - ret_len = cnt; > - > - iser_dbg("Found %d contiguous entries out of %d in sg:0x%p, " > - "start dma addr:%ld size:%d\n", > - ret_len, p_data->dma_nents-skip, p_data, > - (long)*chunk_dma_addr, *chunk_size); > - } > 
- return ret_len; > -} > - > /** > * iser_data_buf_memcpy - Copies arbitrary data buffer to a > * contiguous memory region > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Jan 4 13:15:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 13:15:45 -0800 Subject: [openib-general] Re: Re: strange behaviour in svn 4706 libibverbs/init.c with two adapters In-Reply-To: <20060104101231.GO2790@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 4 Jan 2006 12:12:31 +0200") References: <20060104101231.GO2790@mellanox.co.il> Message-ID: Michael> You are right. Roland, I think you need something like this Yes, thanks ... applied. - R. From rdreier at cisco.com Wed Jan 4 13:17:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 13:17:19 -0800 Subject: [openib-general] Re: typo fix in the description of ibv_modify_srq In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7038@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 4 Jan 2006 08:38:07 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7038@mtlexch01.mtl.com> Message-ID: Thanks, applied. From rdreier at cisco.com Wed Jan 4 13:26:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 13:26:55 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: (Eric W. 
Biederman's message of "Mon, 02 Jan 2006 13:35:07 -0700")
References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com>
Message-ID:

    Eric> Given Linus's comments and looking at where you are getting
    Eric> stuck I would recommend you split out support for the
    Eric> nonstandard ipath protocol from the rest of the driver. If
    Eric> the standard infiniband interfaces for kernel bypass are not
    Eric> sufficient for flinging packets then we need to re-examine
    Eric> them.

Yes, this might be a good idea. The "core" driver looks like it is
suffering from really being several things stuck together. It would
probably make things a lot cleaner and easier to maintain if the core
driver just handled synchronizing access to the low-level hardware,
with other stuff split into its own driver. It seems there might even
be enough stuff to split "core" into three drivers: the real core, the
ultra-high-performance MPI transport, and the management/diagnostics
stuff.

Also, there are APIs in the "core" driver that are only exported for a
single user outside the driver -- it would probably make sense to move
that logic directly to where it's used. I'm thinking of things like
ipath_verbs_send() and the whole ipath_copy.c file.

 - R.
From rdreier at cisco.com Wed Jan 4 13:28:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 13:28:29 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <1136321691.10862.61.camel@localhost.localdomain> (Bryan O'Sullivan's message of "Tue, 03 Jan 2006 12:54:50 -0800") References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> <20060103172732.GA9170@kroah.com> <1136321691.10862.61.camel@localhost.localdomain> Message-ID: Bryan> It does when our OpenIB driver is being used. But our Bryan> lower level driver is independent of OpenIB (and is often Bryan> used without the infiniband stuff even configured into the Bryan> kernel), and needs to provide some way for a userspace Bryan> subnet management agent to send and receive packets. Isn't there some way you can use the same SMA (subnet management agent) interface in all the cases? Can ipath_mad.c just go away in favor of your userspace SMA? - R. From rdreier at cisco.com Wed Jan 4 13:41:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 13:41:53 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <43B4614C.7060509@ichips.intel.com> (Sean Hefty's message of "Thu, 29 Dec 2005 14:21:00 -0800") References: <528xwdqn4x.fsf@cisco.com> <43B4614C.7060509@ichips.intel.com> Message-ID: Sean> I'm just now getting back to looking at this issue. If I Sean> understand the problem in the ucm correctly, struct cdev is Sean> freed as part of struct ib_ucm_device after cdev_del() Sean> returns; however, a user could still have a reference on the Sean> cdev. Also, the user could still make calls into the driver. Sean> Is this correct? Sean> If this is the case, isn't more protection needed that Sean> simply preventing access to cdev? I.e. 
what prevents the
    Sean> user from invoking a call that tries to access the
    Sean> underlying ib_device? Does every file operation need
    Sean> synchronization with device removal to ensure that the
    Sean> underlying hardware is still there? (This appears to be
    Sean> what user_mad now does.)

That all sounds right, although to be honest I'd have to take more
time to recreate my old reasoning. The basic idea is that we have to
keep every object around as long as there is a way to reach it. If
you look at the comment in user_mad.c that starts

     * Our lifetime rules for these structs...

you might find some clues about my reasoning...

    Sean> Assuming that my understanding is correct (which is a
    Sean> stretch), it seems that there has to be a better way to
    Sean> handle this that is or can be integrated with the kernel,
    Sean> rather than adding complex reference counting,
    Sean> synchronization, and clean-up code to every driver that
    Sean> wants to handle device removal...

There probably is but I wasn't smart enough to see it. On the other
hand keeping one reference to an object for each way to reach it and
not freeing the object until the last reference is gone seems fairly
natural to me.

 - R.

From rminnich at lanl.gov Wed Jan 4 13:42:53 2006
From: rminnich at lanl.gov (Ronald G Minnich)
Date: Wed, 04 Jan 2006 14:42:53 -0700
Subject: [openib-general] simple rarp code for gen2
Message-ID: <43BC415D.8000704@lanl.gov>

I have some rarp code for gen2. It does not work.

Does anyone have a VERY simple example for RARP over ib? I have some
code that is supposed to work; it appears not to work.

thanks

ron

From rdreier at cisco.com Wed Jan 4 14:37:36 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 04 Jan 2006 14:37:36 -0800
Subject: [openib-general] Re: [PATCH updated] mthca: fix page shift calculation
In-Reply-To: <20060104124829.GY2790@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 4 Jan 2006 14:48:29 +0200") References: <20060104124829.GY2790@mellanox.co.il> Message-ID: Michael> Fix page shift calculation: for all pages except possibly Michael> the last one, the byte beyond the buffer end must be page Michael> aligned. Good catch. But it seems to me that we don't have to worry about the first page either, because of the trick buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); buffer_list[0].addr &= ~0ull << shift; later on. Does the patch below make sense? - R. --- infiniband/hw/mthca/mthca_provider.c (revision 4754) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -778,17 +778,17 @@ static struct ib_mr *mthca_reg_phys_mr(s mask = 0; total_size = 0; for (i = 0; i < num_phys_buf; ++i) { - if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) - return ERR_PTR(-EINVAL); - if (i != 0 && i != num_phys_buf - 1 && - (buffer_list[i].size & ~PAGE_MASK)) - return ERR_PTR(-EINVAL); + if (i != 0) + mask |= buffer_list[i].addr; + if (i != 0 && i != num_phys_buf - 1) + mask |= buffer_list[i].size; total_size += buffer_list[i].size; - if (i > 0) - mask |= buffer_list[i].addr; } + if (mask & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + /* Find largest page shift we can use to cover buffers */ for (shift = PAGE_SHIFT; shift < 31; ++shift) if (num_phys_buf > 1) { From mst at mellanox.co.il Wed Jan 4 14:48:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 5 Jan 2006 00:48:41 +0200 Subject: [openib-general] Re: [PATCH updated] mthca: fix page shift calculation In-Reply-To: References: Message-ID: <20060104224841.GA9839@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH updated] mthca: fix page shift calculation > > Michael> Fix page shift calculation: for all pages except possibly > Michael> the last one, the byte beyond the buffer end must be page > Michael> aligned. > > Good catch. 
But it seems to me that we don't have to worry about the
> first page either, because of the trick
>
> 	buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1);
> 	buffer_list[0].addr &= ~0ull << shift;
>
> later on.

Hmm. Let's suppose I have a first chunk in bytes 1 to 2095, and then
another chunk in bytes 0x100000 to 0x1ffffff - should not we limit
the page size to 4K? Does your proposed change do this?

> Does the patch below make sense?
>
>  - R.
>
> --- infiniband/hw/mthca/mthca_provider.c	(revision 4754)
> +++ infiniband/hw/mthca/mthca_provider.c	(working copy)
> @@ -778,17 +778,17 @@ static struct ib_mr *mthca_reg_phys_mr(s
>  	mask = 0;
>  	total_size = 0;
>  	for (i = 0; i < num_phys_buf; ++i) {
> -		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
> -			return ERR_PTR(-EINVAL);
> -		if (i != 0 && i != num_phys_buf - 1 &&
> -		    (buffer_list[i].size & ~PAGE_MASK))
> -			return ERR_PTR(-EINVAL);
> +		if (i != 0)
> +			mask |= buffer_list[i].addr;
> +		if (i != 0 && i != num_phys_buf - 1)
> +			mask |= buffer_list[i].size;
>
>  		total_size += buffer_list[i].size;
> -		if (i > 0)
> -			mask |= buffer_list[i].addr;
>  	}
>
> +	if (mask & ~PAGE_MASK)
> +		return ERR_PTR(-EINVAL);
> +
>  	/* Find largest page shift we can use to cover buffers */
>  	for (shift = PAGE_SHIFT; shift < 31; ++shift)
>  		if (num_phys_buf > 1) {
>

--
MST

From rdreier at cisco.com Wed Jan 4 14:39:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 04 Jan 2006 14:39:27 -0800
Subject: [openib-general] *** glibc detected *** corrupted double-linked list error
In-Reply-To: (wei huang's message of "Wed, 14 Dec 2005 16:36:38 -0500 (EST)")
References:
Message-ID:

    wei> Hi, We encountered the following error when we call
    wei> ibv_close_device: *** glibc detected *** corrupted
    wei> double-linked list: 0x0000000000a54e10 ***

Any further information on this? Are you still seeing the problem?

 - R.
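For the mthca page shift exchange above, a minimal sketch of the mask
idea may help. It is not the code that was applied. It follows the rule
from Michael's patch description (every buffer except the first must
start on a page boundary, and every buffer except the last must end on
one), folds both conditions into a single mask, and then stops at the
lowest bit set in that mask, since every constrained address is a
multiple of that power of two. The names sketch_page_shift, struct
phys_buf, SKETCH_PAGE_SHIFT and SKETCH_MAX_SHIFT are made up for
illustration, 4K base pages are assumed, and the single-buffer branch
that the quoted hunk cuts off is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K base pages */
#define SKETCH_MAX_SHIFT  31	/* same upper bound as the loop in the patch */

/* Hypothetical stand-in for a physical buffer descriptor. */
struct phys_buf {
	uint64_t addr;	/* physical start address */
	uint64_t size;	/* length in bytes */
};

/*
 * Returns the largest page shift that can describe the buffer list,
 * or -1 if the list cannot be covered with whole base pages.
 * Only the multi-buffer case (n >= 2) is handled in this sketch.
 */
int sketch_page_shift(const struct phys_buf *buf, size_t n)
{
	uint64_t mask = 0;
	int shift;
	size_t i;

	assert(buf != NULL && n >= 2);

	for (i = 0; i < n; ++i) {
		/* every buffer but the first must start page aligned */
		if (i != 0)
			mask |= buf[i].addr;
		/* every buffer but the last must end page aligned */
		if (i != n - 1)
			mask |= buf[i].addr + buf[i].size;
	}

	/* any bit below the base page size means the list is unusable */
	if (mask & ((1ULL << SKETCH_PAGE_SHIFT) - 1))
		return -1;

	/* the lowest set bit of the mask bounds the usable page size */
	for (shift = SKETCH_PAGE_SHIFT; shift < SKETCH_MAX_SHIFT; ++shift)
		if (mask & (1ULL << shift))
			break;

	return shift;
}
```

With buffers {0x1800, 0x800} and {0x100000, 0x100000} the mask is
0x102000, so the sketch settles on shift 13 (8K pages). Leaving the
first buffer's end out of the mask would give 0x100000 and shift 20,
although that buffer ends at 0x2000, which seems to be the kind of case
Michael's question is getting at.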
From rdreier at cisco.com Wed Jan 4 14:42:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 04 Jan 2006 14:42:54 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix WQE size calculation in create-srq In-Reply-To: <20051218075224.GA1741@mellanox.co.il> (Jack Morgenstein's message of "Sun, 18 Dec 2005 09:52:24 +0200") References: <20051218075224.GA1741@mellanox.co.il> Message-ID: Thanks, applied From arlin.r.davis at intel.com Wed Jan 4 16:46:21 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 4 Jan 2006 16:46:21 -0800 Subject: [openib-general] [PATCH] uDAPL openib_cma disconnect processing fix Message-ID: James, Here is a patch to fix up the disconnect event processing and a change to dtest to validate. Tested with dtest and dapltest. -arlin Signed-off-by: Arlin Davis ardavis at ichips.intel.com Index: test/dtest/dtest.c =================================================================== --- test/dtest/dtest.c (revision 4759) +++ test/dtest/dtest.c (working copy) @@ -862,15 +862,31 @@ disconnect_ep() if (connected) { - LOGPRINTF("%d dat_ep_disconnect\n", getpid()); - ret = dat_ep_disconnect( h_ep, DAT_CLOSE_DEFAULT ); - if(ret != DAT_SUCCESS) { - fprintf(stderr, "%d Error dat_ep_disconnect: %s\n", - getpid(),DT_RetToString(ret)); - } - else { + /* + * Only the client needs to call disconnect. The server _should_ be able to + * just wait on the EVD associated with connection events for a disconnect + * request and exit then. 
+ */ + if ( !server ) { + LOGPRINTF("%d dat_ep_disconnect\n", getpid()); + ret = dat_ep_disconnect( h_ep, DAT_CLOSE_DEFAULT ); + if(ret != DAT_SUCCESS) { + fprintf(stderr, "%d Error dat_ep_disconnect: %s\n", + getpid(),DT_RetToString(ret)); + } + else { LOGPRINTF("%d dat_ep_disconnect completed\n", getpid()); + } } + + ret = dat_evd_wait( h_conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore ); + if(ret != DAT_SUCCESS) { + fprintf(stderr, "%d Error dat_evd_wait: %s\n", + getpid(),DT_RetToString(ret)); + } + else { + LOGPRINTF("%d dat_evd_wait for h_conn_evd completed\n", getpid()); + } } /* destroy service point */ Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 4759) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - connection management + * The OpenIB uCMA provider - uCMA connection management * **************************************************************************** * Source Control System Information @@ -287,6 +287,12 @@ static int dapli_cm_active_cb(struct dap break; case RDMA_CM_EVENT_DISCONNECTED: + /* validate EP handle */ + if (!DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) + dapl_evd_connection_callback(conn, + IB_CME_DISCONNECTED, + NULL, + conn->ep); break; default: dapl_dbg_log( @@ -364,6 +370,13 @@ static int dapli_cm_passive_cb(struct da break; case RDMA_CM_EVENT_DISCONNECTED: + /* validate SP handle context */ + if (!DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_PSP) || + !DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_RSP)) + dapls_cr_callback(conn, + IB_CME_DISCONNECTED, + NULL, + conn->sp); break; default: dapl_dbg_log(DAPL_DBG_TYPE_ERR, " passive_cb: " @@ -496,21 +509,10 @@ dapls_ib_disconnect(IN DAPL_EP *ep_ptr, " disconnect: ID %p ret %d\n", ep_ptr->cm_handle, ret); - /* - * uDAPL does NOT expect disconnect callback from provider - * with abrupt close. 
uDAPL will callback with DISC event when - * from provider returns. So, if callback is expected from - * rdma_cma then block and don't post the event during callback. + /* + * DAT event notification occurs from the callback + * Note: will fire even if DREQ goes unanswered on timeout */ - if (close_flags != DAT_CLOSE_ABRUPT_FLAG) - { - if (ep_ptr->cr_ptr) - dapls_cr_callback(conn, IB_CME_DISCONNECTED, NULL, - ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr); - else - dapl_evd_connection_callback(conn, IB_CME_DISCONNECTED, - NULL, ep_ptr); - } return DAT_SUCCESS; } @@ -537,11 +539,8 @@ dapls_ib_disconnect_clean(IN DAPL_EP *ep IN DAT_BOOLEAN active, IN const ib_cm_events_t ib_cm_event) { - /* - * Clean up outstanding connection state - */ - dapls_ib_disconnect(ep_ptr, DAT_CLOSE_ABRUPT_FLAG); - + /* nothing to do */ + return; } /* @@ -592,7 +591,11 @@ dapls_ib_setup_conn_listener(IN DAPL_IA if (rdma_bind_addr(conn->cm_id, (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { - dat_status = dapl_convert_errno(errno,"setup_listener"); + if (errno == -EBUSY) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = + dapl_convert_errno(errno,"setup_listener"); goto bail; } From ogerlitz at voltaire.com Wed Jan 4 23:34:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 5 Jan 2006 09:34:25 +0200 (IST) Subject: [openib-general] [PATCH] iser: leftovers of the mem registration for unsolicited data change Message-ID: some leftovers of the mem registration for unsolicited data change. post_lock field of struct iscsi_iser_conn removed as the xmitsema field serializes connection xmits, removed the unused conn_lock. 
Signed-off-by: Or Gerlitz Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 4759) +++ ulp/iser/iscsi_iser.h (working copy) @@ -113,7 +113,7 @@ struct rdma_cm_id; struct ib_qp; struct iscsi_iser_cmd_task; -#define MAX_REGD_BUF_VECTOR_LEN 28 +#define MAX_REGD_BUF_VECTOR_LEN 2 enum iser_dto_type { ISER_DTO_RCV = 0, /* Receive buffer */ @@ -170,9 +170,6 @@ struct iscsi_iser_conn atomic_t state; /* iSCSI connection state */ int ff_mode_enabled; /* To be removed ??? */ - spinlock_t conn_lock; /* guards the conn and related structures */ - spinlock_t post_lock; /* serializes posting WR to the QP */ - struct list_head adaptor_list; /* entry in the adaptor's conn list */ kmem_cache_t *postrecv_cache; Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 4759) +++ ulp/iser/iser_verbs.c (working copy) @@ -596,9 +596,6 @@ int iser_post_recv(struct iser_dto *p_re if (p_iser_conn == NULL) iser_bug("NULL p_conn in dto: 0x%p\n", p_recv_dto); - if (p_recv_dto->regd_vector_len > MAX_REGD_BUF_VECTOR_LEN) - iser_bug("DTO regd_vector_len exceeds maximal IOV len\n"); - iser_dto_to_iov(p_recv_dto, iov, 2); recv_wr.next = NULL; @@ -606,9 +603,7 @@ int iser_post_recv(struct iser_dto *p_re recv_wr.num_sge = p_recv_dto->regd_vector_len; recv_wr.wr_id = (unsigned long)p_recv_dto; - spin_lock(&p_iser_conn->post_lock); ib_ret = ib_post_recv (p_iser_conn->ib_conn->qp, &recv_wr, &recv_wr_failed); - spin_unlock(&p_iser_conn->post_lock); if (ib_ret) { iser_err("ib_post_recv failed ret=%d\n", ib_ret); @@ -625,8 +620,6 @@ int iser_post_recv(struct iser_dto *p_re */ int iser_start_send(struct iser_dto *p_dto) { - /* #warning we consume way too much stack here, * - * sizeof ib_send_wr=72 + sizeof ib_sge=16 * 28 */ int ib_ret, ret_val = 0; struct ib_send_wr send_wr, *send_wr_failed; struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; @@ -639,10 +632,6 
@@ int iser_start_send(struct iser_dto *p_d if (p_iser_conn == NULL) iser_bug("NULL p_conn in dto: 0x%p\n", p_dto); - if (p_dto->regd_vector_len > MAX_REGD_BUF_VECTOR_LEN) - iser_bug("DTO regd_vector_len %d exceeds maximal IOV len\n", - p_dto->regd_vector_len); - iser_dto_to_iov(p_dto, iov, MAX_REGD_BUF_VECTOR_LEN); send_wr.next = NULL; @@ -652,14 +641,9 @@ int iser_start_send(struct iser_dto *p_d send_wr.opcode = IB_WR_SEND; send_wr.send_flags = p_dto->notify_enable ? IB_SEND_SIGNALED : 0; - if(p_dto->type != ISER_DTO_SEND) - iser_bug("Illegal DTO type for iser_start_dto\n"); - atomic_inc(&p_iser_conn->post_send_buf_count); - spin_lock(&p_iser_conn->post_lock); ib_ret = ib_post_send(p_iser_conn->ib_conn->qp, &send_wr, &send_wr_failed); - spin_unlock(&p_iser_conn->post_lock); if (ib_ret) { iser_err("Failed to start SEND DTO, p_dto: 0x%p, IOV len: %d\n", Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 4759) +++ ulp/iser/iser_conn.c (working copy) @@ -88,9 +88,6 @@ void iser_conn_init(struct iser_conn *p_ void iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn) { /* moved from iser_conn_init */ - spin_lock_init(&p_iser_conn->conn_lock); - spin_lock_init(&p_iser_conn->post_lock); - atomic_set(&p_iser_conn->post_recv_buf_count, 0); atomic_set(&p_iser_conn->post_send_buf_count, 0); From yael at mellanox.co.il Thu Jan 5 00:24:31 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 05 Jan 2006 10:24:31 +0200 Subject: [openib-general] Re[PATCH] Opensm - running on system with 2 hcas Message-ID: <5z1wzngk00.fsf@mtl066.yok.mtl.com> Hi Hal, When trying to run OpenSM on a system with 2 hca cards, we noticed that there is a problem with the osm_vendor_get_all_port_attr. What happens is that we are saving the port 0 for each hca, though this data is relevant for the default port only once. 
The result is that if running with -g 0, we get 5 ports instead of 4, and the third port (which was the data copied as the default port for the second hca) is not valid. The following patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4760) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -637,18 +637,24 @@ osm_vendor_get_all_port_attr( umad_release_port(&def_port); } + j = 0; if (p_attr_array) { /* set the port guid, lid, and sm lid in the port attr struct */ for (i = 0; i < *p_num_ports; i++) { - p_attr_array[i].port_guid = portguids[i]; - p_attr_array[i].lid = lids[i]; - if (i == 0) - p_attr_array[i].sm_lid = sm_lid; + if (i > 0 && portguids[i] == 0) { + continue; + } + p_attr_array[j].port_guid = portguids[i]; + p_attr_array[j].lid = lids[i]; + if (j == 0) + p_attr_array[j].sm_lid = sm_lid; else - p_attr_array[i].sm_lid = p_vend->umad_port.sm_lid; - p_attr_array[i].link_state = linkstates[i]; + p_attr_array[j].sm_lid = p_vend->umad_port.sm_lid; + p_attr_array[j].link_state = linkstates[i]; + j++; } r = 0; + *p_num_ports = j; } else r = IB_INSUFFICIENT_MEMORY; From bardov at gmail.com Thu Jan 5 00:47:49 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Thu, 5 Jan 2006 10:47:49 +0200 Subject: [openib-general] [PATCH] iser: leftovers of the mem registration for unsolicited data change In-Reply-To: References: Message-ID: Applied r4761. Thanks, Dan On 1/5/06, Or Gerlitz wrote: > some leftovers of the mem registration for unsolicited data change. > post_lock field of struct iscsi_iser_conn removed as the xmitsema > field serializes connection xmits, removed the unused conn_lock. 
> > Signed-off-by: Or Gerlitz > > Index: ulp/iser/iscsi_iser.h > =================================================================== > --- ulp/iser/iscsi_iser.h (revision 4759) > +++ ulp/iser/iscsi_iser.h (working copy) > @@ -113,7 +113,7 @@ struct rdma_cm_id; > struct ib_qp; > struct iscsi_iser_cmd_task; > > -#define MAX_REGD_BUF_VECTOR_LEN 28 > +#define MAX_REGD_BUF_VECTOR_LEN 2 > > enum iser_dto_type { > ISER_DTO_RCV = 0, /* Receive buffer */ > @@ -170,9 +170,6 @@ struct iscsi_iser_conn > atomic_t state; /* iSCSI connection state */ > int ff_mode_enabled; /* To be removed ??? */ > > - spinlock_t conn_lock; /* guards the conn and related structures */ > - spinlock_t post_lock; /* serializes posting WR to the QP */ > - > struct list_head adaptor_list; /* entry in the adaptor's conn list */ > > kmem_cache_t *postrecv_cache; > Index: ulp/iser/iser_verbs.c > =================================================================== > --- ulp/iser/iser_verbs.c (revision 4759) > +++ ulp/iser/iser_verbs.c (working copy) > @@ -596,9 +596,6 @@ int iser_post_recv(struct iser_dto *p_re > if (p_iser_conn == NULL) > iser_bug("NULL p_conn in dto: 0x%p\n", p_recv_dto); > > - if (p_recv_dto->regd_vector_len > MAX_REGD_BUF_VECTOR_LEN) > - iser_bug("DTO regd_vector_len exceeds maximal IOV len\n"); > - > iser_dto_to_iov(p_recv_dto, iov, 2); > > recv_wr.next = NULL; > @@ -606,9 +603,7 @@ int iser_post_recv(struct iser_dto *p_re > recv_wr.num_sge = p_recv_dto->regd_vector_len; > recv_wr.wr_id = (unsigned long)p_recv_dto; > > - spin_lock(&p_iser_conn->post_lock); > ib_ret = ib_post_recv (p_iser_conn->ib_conn->qp, &recv_wr, &recv_wr_failed); > - spin_unlock(&p_iser_conn->post_lock); > > if (ib_ret) { > iser_err("ib_post_recv failed ret=%d\n", ib_ret); > @@ -625,8 +620,6 @@ int iser_post_recv(struct iser_dto *p_re > */ > int iser_start_send(struct iser_dto *p_dto) > { > - /* #warning we consume way too much stack here, * > - * sizeof ib_send_wr=72 + sizeof ib_sge=16 * 28 */ > int ib_ret, 
ret_val = 0; > struct ib_send_wr send_wr, *send_wr_failed; > struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; > @@ -639,10 +632,6 @@ int iser_start_send(struct iser_dto *p_d > if (p_iser_conn == NULL) > iser_bug("NULL p_conn in dto: 0x%p\n", p_dto); > > - if (p_dto->regd_vector_len > MAX_REGD_BUF_VECTOR_LEN) > - iser_bug("DTO regd_vector_len %d exceeds maximal IOV len\n", > - p_dto->regd_vector_len); > - > iser_dto_to_iov(p_dto, iov, MAX_REGD_BUF_VECTOR_LEN); > > send_wr.next = NULL; > @@ -652,14 +641,9 @@ int iser_start_send(struct iser_dto *p_d > send_wr.opcode = IB_WR_SEND; > send_wr.send_flags = p_dto->notify_enable ? IB_SEND_SIGNALED : 0; > > - if(p_dto->type != ISER_DTO_SEND) > - iser_bug("Illegal DTO type for iser_start_dto\n"); > - > atomic_inc(&p_iser_conn->post_send_buf_count); > > - spin_lock(&p_iser_conn->post_lock); > ib_ret = ib_post_send(p_iser_conn->ib_conn->qp, &send_wr, &send_wr_failed); > - spin_unlock(&p_iser_conn->post_lock); > > if (ib_ret) { > iser_err("Failed to start SEND DTO, p_dto: 0x%p, IOV len: %d\n", > Index: ulp/iser/iser_conn.c > =================================================================== > --- ulp/iser/iser_conn.c (revision 4759) > +++ ulp/iser/iser_conn.c (working copy) > @@ -88,9 +88,6 @@ void iser_conn_init(struct iser_conn *p_ > void iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn) > { > /* moved from iser_conn_init */ > - spin_lock_init(&p_iser_conn->conn_lock); > - spin_lock_init(&p_iser_conn->post_lock); > - > atomic_set(&p_iser_conn->post_recv_buf_count, 0); > atomic_set(&p_iser_conn->post_send_buf_count, 0); > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Thu Jan 5 03:08:23 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 5 Jan 2006 13:08:23 +0200 (IST) Subject: 
[openib-general] [PATCH] iser: more cleanups following the iscsi_iser/iser merge Message-ID: more cleanups which are possible following the iscsi_iser/iser merge: remove unused struct iscsi_buf, make some func call pathes shorter. Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 4768) +++ ulp/iser/iscsi_iser.h (working copy) @@ -25,6 +25,7 @@ #include #include #include +#include #define AF_ISER 28 /* to be defined properly */ #define ISCSI_ISER_XMIT_CMDS_MAX 128 /* must be power of 2 */ @@ -227,19 +228,11 @@ struct iscsi_iser_queue { int max; /* Max number of elements */ }; -struct iscsi_buf { - struct scatterlist sg; - struct kvec iov; - unsigned int sent; -}; - struct iscsi_iser_mgmt_task { struct iscsi_hdr hdr; uint32_t itt; /* this ITT */ char *data; /* mgmt payload */ int data_count; /* counts data to be sent */ - struct iscsi_buf headbuf; /* header buffer */ - struct iscsi_buf sendbuf; /* in progress buffer */ }; struct iscsi_iser_cmd_task { @@ -348,14 +341,11 @@ int iser_send_data_out(struct iscsi_iser /* terminate a connection */ int iser_conn_term(struct iscsi_iser_conn *p_iscsi_iser_conn); +int iscsi_iser_hdr_recv(struct iscsi_iser_conn *conn, struct iscsi_hdr *hdr, + char *rx_data); -void iscsi_iser_control_notify(struct iscsi_iser_conn *p_iscsi_conn, - struct iscsi_hdr *hdr, char *rx_data); - -void iscsi_iser_conn_term_notify(struct iscsi_iser_conn *p_iscsi_conn); +void iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err); int iscsi_iser_init(void); void iscsi_iser_exit(void); -void iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn); - Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 4768) +++ ulp/iser/iser_conn.c (working copy) @@ -85,15 +85,6 @@ void iser_conn_init(struct iser_conn *p_ init_waitqueue_head(&p_iser_conn->connect_wait_q); } -void 
iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn) -{ - /* moved from iser_conn_init */ - atomic_set(&p_iser_conn->post_recv_buf_count, 0); - atomic_set(&p_iser_conn->post_send_buf_count, 0); - - init_waitqueue_head(&p_iser_conn->disconnect_wait_q); -} - /** * Initializes iSER adaptor structure. * @@ -454,7 +445,8 @@ int iser_complete_conn_termination(struc /* Notify the upper layer about asynch terminations */ if (cur_conn_state == ISER_CONN_ASYNC_TERM) - iscsi_iser_conn_term_notify(p_iser_conn); + iscsi_iser_conn_failure(p_iser_conn, + ISCSI_ERR_CONN_FAILED); if (cur_conn_state == ISER_CONN_SYNC_TERM) wake_up_interruptible(&p_iser_conn->disconnect_wait_q); Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 4768) +++ ulp/iser/iser_initiator.c (working copy) @@ -537,7 +537,7 @@ void iser_rcv_dto_completion(struct iser struct iscsi_iser_cmd_task *p_iser_task = NULL; struct iscsi_hdr *p_hdr; char *rx_data; - int rx_data_size; + int rx_data_size,rc; unsigned int itt; unsigned char opcode; int no_more_task_sends = 0; @@ -581,9 +581,9 @@ void iser_rcv_dto_completion(struct iser } } - iser_dbg("Control notify, DTO:0x%p, as PDU:0x%p\n", p_dto, p_hdr); - - iscsi_iser_control_notify(p_iser_conn, p_hdr, rx_data); + rc = iscsi_iser_hdr_recv(p_iser_conn, p_hdr, rx_data); + if(rc) + iscsi_iser_conn_failure(p_iser_conn, rc); if(p_iser_task != NULL) { spin_lock(&p_iser_task->task_lock); Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 4768) +++ ulp/iser/iscsi_iser.c (working copy) @@ -79,15 +79,6 @@ module_param_named(max_lun, iscsi_max_lu static kmem_cache_t *task_mem_cache; -static inline void iscsi_buf_init_iov(struct iscsi_buf *ibuf, - char *vbuf, int size) -{ - ibuf->sg.page = (void*)vbuf; - ibuf->sg.offset = (unsigned int)-1; - ibuf->sg.length = size; - ibuf->sent = 0; -} - /** * 
iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands * @@ -185,26 +176,18 @@ static int iscsi_iser_mtask_xmit(struct { int error = 0; - debug_scsi("mtask deq [cid %d itt 0x%x]\n", - conn->id, mtask->itt); - - debug_scsi("%s: sending: size = 0x%x, data = 0x%p\n", - __FUNCTION__, - mtask->data_count, - mtask->data); + debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); /* Send the control */ - /* FIXME enough to call with (conn, mtask) */ error = iser_send_control(conn, mtask); - if (error) { + if (error) printk(KERN_ERR "send_control failed\n"); - } return error; } -static void +void iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err) { struct iscsi_iser_session *session = conn->session; @@ -222,36 +205,6 @@ iscsi_iser_conn_failure(struct iscsi_ise debug_iser("%s: exit\n", __FUNCTION__); } -static int iscsi_iser_ctask_xmit_cmd(struct iscsi_iser_conn *conn, - struct iscsi_iser_cmd_task *ctask) -{ - int error = 0; - switch (ctask->sc->sc_data_direction) { - case DMA_TO_DEVICE: - case DMA_FROM_DEVICE: - break; - case DMA_NONE: - break; - default: - printk(KERN_ERR "Illegal data direction (%d)\n", - ctask->sc->sc_data_direction); - error = -1; - goto iscsi_iser_ctask_xmit_cmd_exit; - - } - - /* Send the command */ - error = iser_send_command(conn, ctask); - - if (error) { - printk(KERN_ERR "send_command failed\n"); - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); - } - -iscsi_iser_ctask_xmit_cmd_exit: - return error; -} - static void iscsi_iser_unsolicit_data_init(struct iscsi_iser_conn *conn, struct iscsi_iser_cmd_task *ctask) { @@ -313,7 +266,6 @@ static int iscsi_iser_ctask_xmit_unsol_d error = iser_send_data_out(conn, ctask, &dtask->hdr); if (error) { printk(KERN_ERR "send_data_out failed\n"); - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); goto iscsi_iser_ctask_xmit_unsol_data_exit; } @@ -341,7 +293,7 @@ static int iscsi_iser_ctask_xmit(struct return error; /* Send the cmd PDU */ - error = 
iscsi_iser_ctask_xmit_cmd(conn,ctask); + error = iser_send_command(conn, ctask); if (error) { printk(KERN_ERR "Couldn't send a cmd PDU\n"); goto iscsi_iser_ctask_xmit_exit; @@ -349,14 +301,15 @@ static int iscsi_iser_ctask_xmit(struct /* Send unsolicited data-out PDU(s) if necessary */ if (ctask->unsol_count) { - error = iscsi_iser_ctask_xmit_unsol_data(conn,ctask); - if (error) { + error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); + if (error) printk(KERN_ERR "Couldn't send unsolicited " "data-out PDU(s)\n"); - } } iscsi_iser_ctask_xmit_exit: + if(error) + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); return error; } @@ -563,19 +516,6 @@ fault: return 0; } -static inline void iscsi_buf_init_virt(struct iscsi_buf *ibuf, - char *vbuf, int size) -{ - sg_init_one(&ibuf->sg, (u8 *)vbuf, size); - ibuf->sent = 0; -} - -static inline void iscsi_buf_init_hdr(struct iscsi_iser_conn *conn, - struct iscsi_buf *ibuf, char *vbuf) -{ - iscsi_buf_init_virt(ibuf, vbuf, sizeof(struct iscsi_hdr)); -} - static int iscsi_iser_conn_send_generic(iscsi_connh_t connh, struct iscsi_hdr *hdr, char *data, uint32_t data_size) @@ -627,8 +567,6 @@ static int iscsi_iser_conn_send_generic( memcpy(&mtask->hdr, hdr, sizeof(struct iscsi_hdr)); - iscsi_buf_init_hdr(conn, &mtask->headbuf, (char*)&mtask->hdr); - spin_unlock_bh(&session->lock); if (data_size) { @@ -637,11 +575,6 @@ static int iscsi_iser_conn_send_generic( } else mtask->data_count = 0; - if (mtask->data_count) { - iscsi_buf_init_iov(&mtask->sendbuf, (char*)mtask->data, - mtask->data_count); - } - debug_scsi("mgmtpdu [op 0x%x hdr->itt 0x%x datalen %d]\n", hdr->opcode, hdr->itt, data_size); @@ -1219,8 +1152,9 @@ static iscsi_connh_t iscsi_iser_conn_cre init_MUTEX(&conn->xmitsema); init_waitqueue_head(&conn->ehwait); - /* MERGE_ADDED */ - iser_conn_init_lowpart(conn); + atomic_set(&conn->post_recv_buf_count, 0); + atomic_set(&conn->post_send_buf_count, 0); + init_waitqueue_head(&conn->disconnect_wait_q); return 
iscsi_handle(conn); @@ -1761,7 +1695,7 @@ out: return rc; } -static int +int iscsi_iser_hdr_recv(struct iscsi_iser_conn *conn, struct iscsi_hdr *hdr, char *rx_data) { int rc = 0; @@ -1937,31 +1871,6 @@ iscsi_iser_hdr_recv(struct iscsi_iser_co return rc; } -void iscsi_iser_control_notify(struct iscsi_iser_conn *conn,/*void *conn_h,*/ - struct iscsi_hdr *hdr, char *rx_data) -{ - int rc; - - /* - * Verify and process incoming PDU header. - */ - rc = iscsi_iser_hdr_recv(conn, hdr, rx_data); - if (rc) { - debug_iser("%s: calling iscsi_iser_conn_failure\n", __FUNCTION__); - iscsi_iser_conn_failure(conn, rc); - } -} - -void iscsi_iser_conn_term_notify(struct iscsi_iser_conn *conn) -{ - debug_iser("%s: entry\n", __FUNCTION__); - - debug_iser("%s: calling iscsi_iser_conn_failure\n", __FUNCTION__); - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); - - debug_iser("%s: exit\n", __FUNCTION__); -} - int iscsi_iser_init(void) { int error; From yael at mellanox.co.il Thu Jan 5 04:16:16 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 05 Jan 2006 14:16:16 +0200 Subject: [openib-general] Re[PATCH] Opensm - clean exit on ^C Message-ID: <5zzmmag99r.fsf@mtl066.yok.mtl.com> Hi Hal, I've noticed that sometimes when killing OpenSM using ^C not all threads are killed. The reason for that is that there are threads that mask the signalling, and when removing the ^C handling from OpenSM, these threads still mask the signalling and stay alive as a result. The following patch fixes this. 
Thanks, Yael Signed-off-by: Yael Kalka Index: include/complib/cl_signal_osd.h =================================================================== --- include/complib/cl_signal_osd.h (revision 4760) +++ include/complib/cl_signal_osd.h (working copy) @@ -148,12 +148,14 @@ cl_sig_mask_sigint(void) #ifdef __WIN__ /* we do not mask kill */ #else +#ifndef OSM_VENDOR_INTF_OPENIB sigset_t sigs; sigemptyset(&sigs); sigaddset(&sigs, SIGINT); pthread_sigmask(SIG_BLOCK, &sigs, NULL); - #endif +#endif /* OSM_VENDOR_INTF_OPENIB */ +#endif /* __WIN__ */ } /* *********/ Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4760) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -244,10 +244,6 @@ umad_receiver(void *p_ptr) OSM_LOG_ENTER( p_ur->p_log, umad_receiver ); - sigemptyset(&sigs); - sigaddset(&sigs, SIGINT); - pthread_sigmask(SIG_BLOCK, &sigs, NULL); - for (;;) { if (!umad && !(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) { From eitan at mellanox.co.il Thu Jan 5 04:27:06 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 05 Jan 2006 14:27:06 +0200 Subject: [openib-general] SA cache design In-Reply-To: <43BB1A0F.2080305@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> Message-ID: <43BD109A.3010302@mellanox.co.il> Hi Sean, This is great initiative - tackling an important issue. I am glad you took this on. Please see below. Sean Hefty wrote: > I've been given the task of trying to come up with an implementation for > an SA cache. The intent is to increase the scalability and performance > of the openib stack. My current thoughts on the implementation are > below. Any feedback is welcome. > > To keep the design as flexible as possible, my plan is to implement the > cache in userspace. The interface to the cache would be via MADs. > Clients would send their queries to the sa_cache instead of the SA > itself. 
The format of the MADs would be essentially identical to those > used to query the SA itself. Response MADs would contain any requested > information. If the cache could not satisfy a request, the sa_cache > would query the SA, update its cache, then return a reply. * I think the idea of using MADs to interface with the cache is very good. * User space implementation: This also might be a good tradeoff between coding and debugging versus the the impact on number of connections per second. I hope the impact on performance will not be too big. Maybe we can take the path of implementing in user space and if the performance penalty will be too high we can port to kernel. * Regarding the sentence:"Clients would send their queries to the sa_cache instead of the SA" I would propose that a "SA MAD send switch" be implemented in the core: Such a switch will enable plugging in the SA cache (I would prefer calling it SA local agent due to its extended functionality). Once plugged in, this "SA local agent" should be forwarded all outgoing SA queries. Once it handles the MAD it should be able to inject the response through the core "SA MAD send switch" as if they arrived from the wire. > > The benefits that I see with this approach are: > > + Clients would only need to send requests to the sa_cache. > + The sa_cache can be implemented in stages. Requests that it cannot > handle would just be forwarded to the SA. > + The sa_cache could be implemented on each host, or a select number of > hosts. > + The interface to the sa_cache is similar to that used by the SA. > + The cache would use virtual memory and could be saved to disk. > > Some drawbacks specific to this method are: > > - The MAD interface will result in additional data copies and userspace > to kernel transitions for clients residing on the local system. > - Clients require a mechanism to locate the sa_cache, or need to make > assumptions about its location. 
The proposal for an "SA MAD send switch" in the core will resolve this issue.
No client change will be required, as all MADs are sent through the core,
which will redirect them to the SA agent ...

Functional requirements:

* It is clear that the first SA query to cache is PathRecord. So if a new
  client wants to connect to another node, a new PathRecord query will not
  need to be sent to the SA. However, recent work on QoS has pointed out that
  under some QoS schemes a PathRecord should not be shared by different
  clients or even connections. There are several ways to make such QoS
  schemes scale. Since this is a different discussion topic, I only bring it
  up so that we take into account that caching might also need to be done by
  a complex key (not just SRC/DST ...)
* Forgive me for bringing up the following issue over and over to the group:
  Multicast Join/Leave should be reference counted. The "SA local agent"
  could be the right place for doing this kind of reference counting
  (actually, if it does that it probably needs to be located in the kernel,
  to enable cleanup after killed processes).
* Similarly, "Client re-registration" could be made transparent to clients.

Cache Invalidation:

Several discussions about PathRecord invalidation were spawned in the past.
IMO, it is enough to be notified about death of local processes, remote port
availability (trap 64/65) and multicast group availability (trap 66/67) in
order to invalidate SA cache information. So each SA Agent could register to
obtain this data. But that solution does not scale nicely, as the SA needs to
send a notification to all nodes (but it is reliable: it could resend until
Repressed).

However, the current IBTA definition for InformInfo (the event forwarding
mechanism) does not allow for multicast of Report(Notice). The reason is that
registration for event forwarding is done with Set(InformInfo), which uses
the requester QP and LID as the address for sending the matching report.
A simple way around that limitation could be to enable the SM to
"pre-register" a well-known multicast group target for event forwarding. One
issue, though, would be that UD multicast is not reliable and some
notifications could get lost. A notification sequence number could be used to
catch these missed notifications eventually.

Eitan

From yael at mellanox.co.il Thu Jan 5 04:52:31 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: Thu, 5 Jan 2006 14:52:31 +0200
Subject: [openib-general] RE: Some opensm/osm_vl15intf.c questions
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FCFD@mtlexch01.mtl.com>

Hi Hal,

I've reviewed the code and, as you noted, there is a bug in the code;
actually 2 bugs.

The qp0_mads_outstanding counter is incremented on mads with response
expected. In the "regular" flow this counter is decremented through
__osm_sm_mad_ctrl_retire_trans_mad, after the response was received and the
mad was handled. The qp0_mads_outstanding counter is used for the signalling
of "NO_PENDING_TRANSACTIONS", so I do not think this counter should be
incremented for mads that do not expect a response.

The 2 bugs are, thus:

1. In __osm_vl15_poller, in case of osm_vendor_send failure. This code is
   similar to the one in __osm_sm_mad_ctrl_retire_trans_mad, but it should
   decrement and handle qp0_mads_outstanding only if resp_expected == TRUE.
2. In the shutdown process. Again, only mads with resp_expected == TRUE
   should decrement the counter.

The reason we haven't seen errors due to this is that the main and usual
flow is fine. I will issue a patch for fixing this soon.

Thanks,
Yael

-----Original Message-----
From: Eitan Zahavi
Sent: Saturday, December 31, 2005 11:43 AM
To: 'Hal Rosenstock'; Yael Kalka
Cc: openib-general at openib.org
Subject: RE: Some opensm/osm_vl15intf.c questions

Hi Hal,

As Yael was working on the ref-counting issues (a month or two ago) I will
let her answer. It is very possible we are missing some.
Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, December 30, 2005 6:03 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Some opensm/osm_vl15intf.c questions > > Hi Eitan, > > In chasing an issue with a trap repress not being sent in a certain > scenario, I stumbled across the following questions about > opensm/osm_vl15intf.c. > > 1. osm_vl15_post increments qp0_mads_outstanding when a response is > expected (rfifo) and not when unsolicited (ufifo) (what appears to be > called unicasts): > > osm_vl15_post: > if( p_madw->resp_expected == TRUE ) > { > cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); > cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); > } > else > { > cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); > } > > osm_vl15_shutdown retires all outstanding MADs as follows: > > osm_vl15_shutdown: > while ( p_madw != (osm_madw_t*)cl_qlist_end( &p_vl->ufifo ) ) > { > if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_vl->p_log, OSM_LOG_DEBUG, > "osm_vl15_shutdown: " > "Releasing Response p_madw = %p\n", p_madw ); > } > > osm_mad_pool_put( p_mad_pool, p_madw ); > cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); > > p_madw = (osm_madw_t*)cl_qlist_remove_head( &p_vl->ufifo ); > } > > Either post should increment qp0_mads_outstanding for unsolicited or > shutdown shouldn't decrement it when removing from ufifo. If you agree, > which should it be ? > > 2. In the case of a failure from osm_vendor_send, __osm_vl15_poller > decrements qp0_mads_outstanding regardless of whether a response is > expected. This is inconsistent with the increment. This leads me to > believe that this should also be incremented for unsolicited (unicasts) > as well as those for which responses are expected. 
> Is this correct or am I missing something ?
>
> So my conclusion is that in osm_vl15_post, it should be:
>
>     if( p_madw->resp_expected == TRUE )
>     {
>       cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw );
>     }
>     else
>     {
>       cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw );
>     }
>     cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding );
>
> If you agree, I will generate a patch for this. Thanks.
>
> -- Hal

From mst at mellanox.co.il Thu Jan 5 05:03:59 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 5 Jan 2006 15:03:59 +0200
Subject: [openib-general] [PATCH] libmthca: race condition fix in
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7429@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7429@mtlexch01.mtl.com>
Message-ID: <20060105130359.GH2790@mellanox.co.il>

Jack Morgenstein has discovered the following race condition in libmthca.
We are actually hitting it in testing.

Thread A destroys QP A on the kernel side by calling ibv_cmd_destroy_qp, but
its time slice is over before it removes the QP from the user-space qp_table.

Thread B allocates QP B, receiving a QP number that matches the
just-destroyed QP A in the low 16 bits. Thread B will now overwrite the slot
in qp_table which was used for QP A.

Thread A wakes up and clears the qp_table slot, in effect removing QP B from
qp_table.

As a solution, remove the QP from qp_table before calling
ibv_cmd_destroy_qp. This also makes sense since operations are performed in
the reverse order in create_qp.

---

Race condition fix: keep qp_table in userspace and in kernel consistent by
performing destruction in the reverse order of construction.

Signed-off-by: Michael S.
Tsirkin Index: openib/src/userspace/libmthca/src/verbs.c =================================================================== --- openib.orig/src/userspace/libmthca/src/verbs.c 2005-12-14 23:29:54.000000000 +0200 +++ openib/src/userspace/libmthca/src/verbs.c 2006-01-05 14:54:29.000000000 +0200 @@ -529,10 +529,6 @@ int mthca_destroy_qp(struct ibv_qp *qp) { int ret; - ret = ibv_cmd_destroy_qp(qp); - if (ret) - return ret; - mthca_cq_clean(to_mcq(qp->recv_cq), qp->qp_num, qp->srq ? to_msrq(qp->srq) : NULL); if (qp->send_cq != qp->recv_cq) @@ -546,6 +542,18 @@ int mthca_destroy_qp(struct ibv_qp *qp) pthread_spin_unlock(&to_mcq(qp->recv_cq)->lock); pthread_spin_unlock(&to_mcq(qp->send_cq)->lock); + ret = ibv_cmd_destroy_qp(qp); + if (ret) { + pthread_spin_lock(&to_mcq(qp->send_cq)->lock); + if (qp->send_cq != qp->recv_cq) + pthread_spin_lock(&to_mcq(qp->recv_cq)->lock); + mthca_store_qp(to_mctx(qp->context), qp->qp_num, to_mqp(qp)); + if (qp->send_cq != qp->recv_cq) + pthread_spin_unlock(&to_mcq(qp->recv_cq)->lock); + pthread_spin_unlock(&to_mcq(qp->send_cq)->lock); + return ret; + } + if (mthca_is_memfree(qp->context)) { mthca_free_db(to_mctx(qp->context)->db_tab, MTHCA_DB_TYPE_RQ, to_mqp(qp)->rq.db_index); -- MST From yael at mellanox.co.il Thu Jan 5 05:25:41 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 05 Jan 2006 15:25:41 +0200 Subject: [openib-general] Re[PATCH] Opensm - osm_vl15intf.c fix Message-ID: <5zy81ug622.fsf@mtl066.yok.mtl.com> Hi Hal, Attached is a fix for the qp0_mads_outstanding handling, according to what I've described in the previous mail. 
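To illustrate the rule the fix enforces, here is a minimal sketch, separate from the patch below. The types and function names (struct mad, struct vl15, vl15_post, vl15_retire) are simplified stand-ins invented for this example, and a plain int replaces the cl_atomic_inc/cl_atomic_dec calls the real code uses. The point is only that the decrement must be guarded by the same resp_expected test as the increment.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the OpenSM structures. */
struct mad {
    bool resp_expected;
};

struct vl15 {
    int mads_outstanding;   /* plays the role of qp0_mads_outstanding */
};

/* Posting: only MADs that expect a response count as outstanding. */
void vl15_post(struct vl15 *vl, const struct mad *m)
{
    if (m->resp_expected)
        vl->mads_outstanding++;
}

/*
 * Retiring a MAD after a send failure or at shutdown: mirror the increment.
 * Returns true when the counter has just dropped to zero, which is the point
 * where the caller would signal that no transactions are pending.
 */
bool vl15_retire(struct vl15 *vl, const struct mad *m)
{
    if (!m->resp_expected)
        return false;       /* never counted, so never decremented */

    vl->mads_outstanding--;
    return vl->mads_outstanding == 0;
}
```

With this symmetry, retiring an unsolicited MAD leaves the counter untouched, so it cannot drop below the number of requests that are really waiting for a response.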
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 4760) +++ opensm/osm_vl15intf.c (working copy) @@ -182,7 +182,15 @@ __osm_vl15_poller( qp0_mads_outstanding counter, and if we reached 0 - need to call the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order to wake up the state mgr). + There is one difference from the code in __osm_sm_mad_ctrl_retire_trans_mad. + This code is called on all mads, if osm_vendor_send() failed, unlike + __osm_sm_mad_ctrl_retire_trans_mad which is called only on mads where + resp_expected == TRUE. As a result, the qp0_mads_outstanding counter + should be decremented and handled accordingly only if this is a mad + with resp_expected == TRUE. */ + if ( p_madw->resp_expected == TRUE ) + { outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); osm_log( p_vl->p_log, OSM_LOG_DEBUG, @@ -219,6 +227,7 @@ __osm_vl15_poller( } } } + } else { if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) @@ -514,7 +523,6 @@ osm_vl15_shutdown( } osm_mad_pool_put( p_mad_pool, p_madw ); - cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); p_madw = (osm_madw_t*)cl_qlist_remove_head( &p_vl->ufifo ); } From mst at mellanox.co.il Thu Jan 5 05:44:45 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 5 Jan 2006 15:44:45 +0200 Subject: [openib-general] [PATCH] mthca: add alternate path support Message-ID: <20060105134445.GJ2790@mellanox.co.il> mthca: add alternate path support. Signed-off-by: Dotan Barak Signed-off-by: Michael S. 
Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-05 15:23:18.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-05 15:23:43.000000000 +0200 @@ -547,6 +547,24 @@ return cpu_to_be32(hw_access_flags); } +static void mthca_ah_set(struct ib_ah_attr *ah, struct mthca_qp_path *mthca_ah) +{ + mthca_ah->g_mylmc = ah->src_path_bits & 0x7f; + mthca_ah->rlid = cpu_to_be16(ah->dlid); + mthca_ah->static_rate = !!ah->static_rate; + if (ah->ah_flags & IB_AH_GRH) { + mthca_ah->g_mylmc |= 1 << 7; + mthca_ah->mgid_index = ah->grh.sgid_index; + mthca_ah->hop_limit = ah->grh.hop_limit; + mthca_ah->sl_tclass_flowlabel = + cpu_to_be32((ah->sl << 28) | + (ah->grh.traffic_class << 20) | + (ah->grh.flow_label)); + memcpy(mthca_ah->rgid, ah->grh.dgid.raw, 16); + } else + mthca_ah->sl_tclass_flowlabel = cpu_to_be32(ah->sl << 28); +} + int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) { struct mthca_dev *dev = to_mdev(ibqp->device); @@ -710,28 +728,14 @@ } if (attr_mask & IB_QP_RNR_RETRY) { - qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + qp_context->alt_path.rnr_retry = qp_context->pri_path.rnr_retry = + attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY | + MTHCA_QP_OPTPAR_ALT_RNR_RETRY); } if (attr_mask & IB_QP_AV) { - qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; - qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); - qp_context->pri_path.static_rate = !!attr->ah_attr.static_rate; - if (attr->ah_attr.ah_flags & IB_AH_GRH) { - qp_context->pri_path.g_mylmc |= 1 << 7; - qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; - qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; - qp_context->pri_path.sl_tclass_flowlabel = 
- cpu_to_be32((attr->ah_attr.sl << 28) | - (attr->ah_attr.grh.traffic_class << 20) | - (attr->ah_attr.grh.flow_label)); - memcpy(qp_context->pri_path.rgid, - attr->ah_attr.grh.dgid.raw, 16); - } else { - qp_context->pri_path.sl_tclass_flowlabel = - cpu_to_be32(attr->ah_attr.sl << 28); - } + mthca_ah_set(&attr->ah_attr, &qp_context->pri_path); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); } @@ -740,7 +744,20 @@ qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); } - /* XXX alt_path */ + /* alt_path */ + if (attr_mask & IB_QP_ALT_PATH) { + if (attr->alt_port_num == 0 || attr->alt_port_num > dev->limits.num_ports) { + mthca_dbg(dev, "Alternate port number (%u) is invalid\n", + attr->alt_port_num); + return -EINVAL; + } + + mthca_ah_set(&attr->alt_ah_attr, &qp_context->alt_path); + qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | + attr->alt_port_num << 24); + qp_context->alt_path.ackto = attr->alt_timeout << 3; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ALT_ADDR_PATH); + } /* leave rdd as 0 */ qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); -- MST From jackm at mellanox.co.il Thu Jan 5 05:54:07 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 5 Jan 2006 15:54:07 +0200 Subject: [openib-general] [PATCH] libmthca: fix memory leak in mthca_destroy_qp and mthca_destroy_srq Message-ID: <20060105135407.GA13745@mellanox.co.il> libmthca: fix memory leak in mthca_destroy_qp and mthca_destroy_srq. 
Signed-off-by: Jack Morgenstein Index: last_stable/src/userspace/libmthca/src/verbs.c =================================================================== --- last_stable.orig/src/userspace/libmthca/src/verbs.c 2006-01-05 14:56:55.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/verbs.c 2006-01-05 15:24:41.000000000 +0200 @@ -390,6 +390,7 @@ int mthca_destroy_srq(struct ibv_srq *sr free(to_msrq(srq)->buf); free(to_msrq(srq)->wrid); + free(to_msrq(srq)); return 0; } @@ -565,6 +566,7 @@ int mthca_destroy_qp(struct ibv_qp *qp) free(to_mqp(qp)->buf); free(to_mqp(qp)->wrid); + free(to_mqp(qp)); return 0; } From bardov at gmail.com Thu Jan 5 06:47:43 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Thu, 5 Jan 2006 16:47:43 +0200 Subject: [openib-general] [PATCH] iser: more cleanups following the iscsi_iser/iser merge In-Reply-To: References: Message-ID: Applied r4770. Thanks, Dan On 1/5/06, Or Gerlitz wrote: > more cleanups which are possible following the iscsi_iser/iser merge: > remove unused struct iscsi_buf, make some func call pathes shorter. 
> > > Index: ulp/iser/iscsi_iser.h > =================================================================== > --- ulp/iser/iscsi_iser.h (revision 4768) > +++ ulp/iser/iscsi_iser.h (working copy) > @@ -25,6 +25,7 @@ > #include > #include > #include > +#include > #define AF_ISER 28 /* to be defined properly */ > > #define ISCSI_ISER_XMIT_CMDS_MAX 128 /* must be power of 2 */ > @@ -227,19 +228,11 @@ struct iscsi_iser_queue { > int max; /* Max number of elements */ > }; > > -struct iscsi_buf { > - struct scatterlist sg; > - struct kvec iov; > - unsigned int sent; > -}; > - > struct iscsi_iser_mgmt_task { > struct iscsi_hdr hdr; > uint32_t itt; /* this ITT */ > char *data; /* mgmt payload */ > int data_count; /* counts data to be sent */ > - struct iscsi_buf headbuf; /* header buffer */ > - struct iscsi_buf sendbuf; /* in progress buffer */ > }; > > struct iscsi_iser_cmd_task { > @@ -348,14 +341,11 @@ int iser_send_data_out(struct iscsi_iser > /* terminate a connection */ > int iser_conn_term(struct iscsi_iser_conn *p_iscsi_iser_conn); > > +int iscsi_iser_hdr_recv(struct iscsi_iser_conn *conn, struct iscsi_hdr *hdr, > + char *rx_data); > > -void iscsi_iser_control_notify(struct iscsi_iser_conn *p_iscsi_conn, > - struct iscsi_hdr *hdr, char *rx_data); > - > -void iscsi_iser_conn_term_notify(struct iscsi_iser_conn *p_iscsi_conn); > +void iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err); > > int iscsi_iser_init(void); > void iscsi_iser_exit(void); > > -void iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn); > - > Index: ulp/iser/iser_conn.c > =================================================================== > --- ulp/iser/iser_conn.c (revision 4768) > +++ ulp/iser/iser_conn.c (working copy) > @@ -85,15 +85,6 @@ void iser_conn_init(struct iser_conn *p_ > init_waitqueue_head(&p_iser_conn->connect_wait_q); > } > > -void iser_conn_init_lowpart(struct iscsi_iser_conn *p_iser_conn) > -{ > - /* moved from iser_conn_init */ > - 
atomic_set(&p_iser_conn->post_recv_buf_count, 0); > - atomic_set(&p_iser_conn->post_send_buf_count, 0); > - > - init_waitqueue_head(&p_iser_conn->disconnect_wait_q); > -} > - > /** > * Initializes iSER adaptor structure. > * > @@ -454,7 +445,8 @@ int iser_complete_conn_termination(struc > > /* Notify the upper layer about asynch terminations */ > if (cur_conn_state == ISER_CONN_ASYNC_TERM) > - iscsi_iser_conn_term_notify(p_iser_conn); > + iscsi_iser_conn_failure(p_iser_conn, > + ISCSI_ERR_CONN_FAILED); > > if (cur_conn_state == ISER_CONN_SYNC_TERM) > wake_up_interruptible(&p_iser_conn->disconnect_wait_q); > Index: ulp/iser/iser_initiator.c > =================================================================== > --- ulp/iser/iser_initiator.c (revision 4768) > +++ ulp/iser/iser_initiator.c (working copy) > @@ -537,7 +537,7 @@ void iser_rcv_dto_completion(struct iser > struct iscsi_iser_cmd_task *p_iser_task = NULL; > struct iscsi_hdr *p_hdr; > char *rx_data; > - int rx_data_size; > + int rx_data_size,rc; > unsigned int itt; > unsigned char opcode; > int no_more_task_sends = 0; > @@ -581,9 +581,9 @@ void iser_rcv_dto_completion(struct iser > } > } > > - iser_dbg("Control notify, DTO:0x%p, as PDU:0x%p\n", p_dto, p_hdr); > - > - iscsi_iser_control_notify(p_iser_conn, p_hdr, rx_data); > + rc = iscsi_iser_hdr_recv(p_iser_conn, p_hdr, rx_data); > + if(rc) > + iscsi_iser_conn_failure(p_iser_conn, rc); > > if(p_iser_task != NULL) { > spin_lock(&p_iser_task->task_lock); > Index: ulp/iser/iscsi_iser.c > =================================================================== > --- ulp/iser/iscsi_iser.c (revision 4768) > +++ ulp/iser/iscsi_iser.c (working copy) > @@ -79,15 +79,6 @@ module_param_named(max_lun, iscsi_max_lu > > static kmem_cache_t *task_mem_cache; > > -static inline void iscsi_buf_init_iov(struct iscsi_buf *ibuf, > - char *vbuf, int size) > -{ > - ibuf->sg.page = (void*)vbuf; > - ibuf->sg.offset = (unsigned int)-1; > - ibuf->sg.length = size; > - ibuf->sent = 0; > -} > 
- > /** > * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands > * > @@ -185,26 +176,18 @@ static int iscsi_iser_mtask_xmit(struct > { > int error = 0; > > - debug_scsi("mtask deq [cid %d itt 0x%x]\n", > - conn->id, mtask->itt); > - > - debug_scsi("%s: sending: size = 0x%x, data = 0x%p\n", > - __FUNCTION__, > - mtask->data_count, > - mtask->data); > + debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); > > /* Send the control */ > - /* FIXME enough to call with (conn, mtask) */ > error = iser_send_control(conn, mtask); > > - if (error) { > + if (error) > printk(KERN_ERR "send_control failed\n"); > - } > > return error; > } > > -static void > +void > iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err) > { > struct iscsi_iser_session *session = conn->session; > @@ -222,36 +205,6 @@ iscsi_iser_conn_failure(struct iscsi_ise > debug_iser("%s: exit\n", __FUNCTION__); > } > > -static int iscsi_iser_ctask_xmit_cmd(struct iscsi_iser_conn *conn, > - struct iscsi_iser_cmd_task *ctask) > -{ > - int error = 0; > - switch (ctask->sc->sc_data_direction) { > - case DMA_TO_DEVICE: > - case DMA_FROM_DEVICE: > - break; > - case DMA_NONE: > - break; > - default: > - printk(KERN_ERR "Illegal data direction (%d)\n", > - ctask->sc->sc_data_direction); > - error = -1; > - goto iscsi_iser_ctask_xmit_cmd_exit; > - > - } > - > - /* Send the command */ > - error = iser_send_command(conn, ctask); > - > - if (error) { > - printk(KERN_ERR "send_command failed\n"); > - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); > - } > - > -iscsi_iser_ctask_xmit_cmd_exit: > - return error; > -} > - > static void iscsi_iser_unsolicit_data_init(struct iscsi_iser_conn *conn, > struct iscsi_iser_cmd_task *ctask) > { > @@ -313,7 +266,6 @@ static int iscsi_iser_ctask_xmit_unsol_d > error = iser_send_data_out(conn, ctask, &dtask->hdr); > if (error) { > printk(KERN_ERR "send_data_out failed\n"); > - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); 
> goto iscsi_iser_ctask_xmit_unsol_data_exit; > } > > @@ -341,7 +293,7 @@ static int iscsi_iser_ctask_xmit(struct > return error; > > /* Send the cmd PDU */ > - error = iscsi_iser_ctask_xmit_cmd(conn,ctask); > + error = iser_send_command(conn, ctask); > if (error) { > printk(KERN_ERR "Couldn't send a cmd PDU\n"); > goto iscsi_iser_ctask_xmit_exit; > @@ -349,14 +301,15 @@ static int iscsi_iser_ctask_xmit(struct > > /* Send unsolicited data-out PDU(s) if necessary */ > if (ctask->unsol_count) { > - error = iscsi_iser_ctask_xmit_unsol_data(conn,ctask); > - if (error) { > + error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); > + if (error) > printk(KERN_ERR "Couldn't send unsolicited " > "data-out PDU(s)\n"); > - } > } > > iscsi_iser_ctask_xmit_exit: > + if(error) > + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); > return error; > } > > @@ -563,19 +516,6 @@ fault: > return 0; > } > > -static inline void iscsi_buf_init_virt(struct iscsi_buf *ibuf, > - char *vbuf, int size) > -{ > - sg_init_one(&ibuf->sg, (u8 *)vbuf, size); > - ibuf->sent = 0; > -} > - > -static inline void iscsi_buf_init_hdr(struct iscsi_iser_conn *conn, > - struct iscsi_buf *ibuf, char *vbuf) > -{ > - iscsi_buf_init_virt(ibuf, vbuf, sizeof(struct iscsi_hdr)); > -} > - > static int iscsi_iser_conn_send_generic(iscsi_connh_t connh, > struct iscsi_hdr *hdr, > char *data, uint32_t data_size) > @@ -627,8 +567,6 @@ static int iscsi_iser_conn_send_generic( > > memcpy(&mtask->hdr, hdr, sizeof(struct iscsi_hdr)); > > - iscsi_buf_init_hdr(conn, &mtask->headbuf, (char*)&mtask->hdr); > - > spin_unlock_bh(&session->lock); > > if (data_size) { > @@ -637,11 +575,6 @@ static int iscsi_iser_conn_send_generic( > } else > mtask->data_count = 0; > > - if (mtask->data_count) { > - iscsi_buf_init_iov(&mtask->sendbuf, (char*)mtask->data, > - mtask->data_count); > - } > - > debug_scsi("mgmtpdu [op 0x%x hdr->itt 0x%x datalen %d]\n", > hdr->opcode, hdr->itt, data_size); > > @@ -1219,8 +1152,9 @@ static iscsi_connh_t 
iscsi_iser_conn_cre > init_MUTEX(&conn->xmitsema); > init_waitqueue_head(&conn->ehwait); > > - /* MERGE_ADDED */ > - iser_conn_init_lowpart(conn); > + atomic_set(&conn->post_recv_buf_count, 0); > + atomic_set(&conn->post_send_buf_count, 0); > + init_waitqueue_head(&conn->disconnect_wait_q); > > return iscsi_handle(conn); > > @@ -1761,7 +1695,7 @@ out: > return rc; > } > > -static int > +int > iscsi_iser_hdr_recv(struct iscsi_iser_conn *conn, struct iscsi_hdr *hdr, char *rx_data) > { > int rc = 0; > @@ -1937,31 +1871,6 @@ iscsi_iser_hdr_recv(struct iscsi_iser_co > return rc; > } > > -void iscsi_iser_control_notify(struct iscsi_iser_conn *conn,/*void *conn_h,*/ > - struct iscsi_hdr *hdr, char *rx_data) > -{ > - int rc; > - > - /* > - * Verify and process incoming PDU header. > - */ > - rc = iscsi_iser_hdr_recv(conn, hdr, rx_data); > - if (rc) { > - debug_iser("%s: calling iscsi_iser_conn_failure\n", __FUNCTION__); > - iscsi_iser_conn_failure(conn, rc); > - } > -} > - > -void iscsi_iser_conn_term_notify(struct iscsi_iser_conn *conn) > -{ > - debug_iser("%s: entry\n", __FUNCTION__); > - > - debug_iser("%s: calling iscsi_iser_conn_failure\n", __FUNCTION__); > - iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); > - > - debug_iser("%s: exit\n", __FUNCTION__); > -} > - > int iscsi_iser_init(void) > { > int error; > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Jan 5 06:49:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 09:49:07 -0500 Subject: [openib-general] RE: Some opensm/osm_vl15intf.c questions In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FCFD@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FCFD@mtlexch01.mtl.com> Message-ID: <1136472545.4339.16715.camel@hal.voltaire.com> Hi Yael, On Thu, 
2006-01-05 at 07:52, Yael Kalka wrote: > Hi Hal, > I've reviewed the code and as you noted - there is a bug in the code, > actually 2 bugs. > The qp0_mads_outstanding counter is incremented on mads with response > expected. > In the "regular" flow this counter is decremented through > __osm_sm_mad_ctrl_retire_trans_mad, > after the response was received and the mad was handled. > The qp0_mads_outstanding is used for the signalling of > "NO_PENDING_TRANSACTIONS", so I > do not think this counter should be incremented when the mads are not > ones with response expected. > The 2 bugs are, thus: > 1. In the _osm_vl15_poller in case of osm_vendor_send failure. This code > is similar to > the one in __osm_sm_mad_ctrl_retire_trans_mad, but should decrement > and handle the > qp0_mads_outstanding only if response_expected == TRUE. > 2. In the shutdown process. Again, only mads with response_expected == > TRUE should decrement > the counter. Makes sense. > The reason we haven't seen errors due to this is because the main and > usual flow is fine. > I will issue a patch for fixing this soon. Thanks. -- Hal > Thanks, > Yael > > -----Original Message----- > From: Eitan Zahavi > Sent: Saturday, December 31, 2005 11:43 AM > To: 'Hal Rosenstock'; Yael Kalka > Cc: openib-general at openib.org > Subject: RE: Some opensm/osm_vl15intf.c questions > > > Hi Hal, > > As Yael was working on the ref-counting issues (a month or two ago) I > will let her answer. It is very possible we are missing some. > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Friday, December 30, 2005 6:03 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: Some opensm/osm_vl15intf.c questions > > > > Hi Eitan, > > > > In chasing an issue with a trap repress not being sent in a certain > > scenario, I stumbled across the following questions about > > opensm/osm_vl15intf.c. > > > > 1. osm_vl15_post increments qp0_mads_outstanding when a response is > > expected (rfifo) and not when unsolicited (ufifo) (what appears to be > > called unicasts): > > > > osm_vl15_post: > > if( p_madw->resp_expected == TRUE ) > > { > > cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); > > cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); > > } > > else > > { > > cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); > > } > > > > osm_vl15_shutdown retires all outstanding MADs as follows: > > > > osm_vl15_shutdown: > > while ( p_madw != (osm_madw_t*)cl_qlist_end( &p_vl->ufifo ) ) > > { > > if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) > > { > > osm_log( p_vl->p_log, OSM_LOG_DEBUG, > > "osm_vl15_shutdown: " > > "Releasing Response p_madw = %p\n", p_madw ); > > } > > > > osm_mad_pool_put( p_mad_pool, p_madw ); > > cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); > > > > p_madw = (osm_madw_t*)cl_qlist_remove_head( &p_vl->ufifo ); > > } > > > > Either post should increment qp0_mads_outstanding for unsolicited or > > shutdown shouldn't decrement it when removing from ufifo. If you > agree, > > which should it be ? > > > > 2. In the case of a failure from osm_vendor_send, __osm_vl15_poller > > decrements qp0_mads_outstanding regardless of whether a response is > > expected. This is inconsistent with the increment. This leads me to > > believe that this should also be incremented for unsolicited > (unicasts) > > as well as those for which responses are expected. 
Is this correct or > am > > I missing something ? > > > > So my conclusion is that in osm_vl15_post, it should be: > > > > if( p_madw->resp_expected == TRUE ) > > { > > cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); > > } > > else > > { > > cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); > > } > > cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); > > > > If you agree, I will generate a patch for this. Thanks. > > > > -- Hal From halr at voltaire.com Thu Jan 5 07:09:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 10:09:17 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BB21C6.4050800@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> <1136335761.4331.52415.camel@hal.voltaire.com> <43BB21C6.4050800@ichips.intel.com> Message-ID: <1136473755.4339.16945.camel@hal.voltaire.com> Hi Sean, On Tue, 2006-01-03 at 20:15, Sean Hefty wrote: > Hal Rosenstock wrote: > >>I've been given the task of trying to come up with an implementation for an SA > >>cache. The intent is to increase the scalability and performance of the openib > >>stack. My current thoughts on the implementation are below. Any feedback is > >>welcome. > >> > >>To keep the design as flexible as possible, my plan is to implement the cache in > >>userspace. The interface to the cache would be via MADs. > > > > Would this be another MAD class which mimics the SA class ? > > I hadn't fully figured this out yet. I'm not sure if another MAD class is > needed or not. My goal is to implement this as transparent to the application > as possible without violating the spec, perhaps appearing as an SA on a > different LID. The LID for the (real) SA is determined from PortInfo:MasterSMLID so I don't see how this could be done that way. > >> Clients would send > >>their queries to the sa_cache instead of the SA itself. The format of the MADs > >>would be essentially identical to those used to query the SA itself. 
Response > >>MADs would contain any requested information. If the cache could not satisfy a > >>request, the sa_cache would query the SA, update its cache, then return a reply. > >> > >>The benefits that I see with this approach are: > >> > >>+ Clients would only need to send requests to the sa_cache. > >>+ The sa_cache can be implemented in stages. Requests that it cannot handle > >>would just be forwarded to the SA. > > > > Another option would be for the SA cache to indicate what requests it > > handles (some MADs for this) and have the clients only go to the cache > > for those queries (and direct to the SA for the others). > > I thought about this, but this puts an additional burden on the clients. Sure but how significant is this, especially if the 2 requests look alike with some minor exception(s) like the class. I would think this would make up for eliminating the extra indirection in the case where the cache does not support the request. > Letting the sa_cache forward the request allows it to send the requests to > another sa_cache, rather than directly to the SA. There's some additional > flexibility that we gain in the long term design by forwarding requests. (I'm > thinking of the possibility of having an sa_cache hierarchy.) Sure; a hierarchical cache should scale even better. > >>+ The sa_cache could be implemented on each host, or a select number of hosts. > >>+ The interface to the sa_cache is similar to that used by the SA. > >>+ The cache would use virtual memory and could be saved to disk. > >> > >>Some drawbacks specific to this method are: > >> > >>- The MAD interface will result in additional data copies and userspace to > >>kernel transitions for clients residing on the local system. > >>- Clients require a mechanism to locate the sa_cache, or need to make > >>assumptions about its location. > > > > Would SA caching be a service ID or set of IDs ? > > I'd like the sa_cache to give the appearance of being a standard SA as much as > possible. 
Yes, the closer to the real SA requests the cache requests are the better. > One effect is that an sa_cache may not be able to run on the same > node as the actual SA, Not sure why this would be the case. > but that restriction seems desirable to me. Agreed. > > Are there also issues around cache invalidation ? > > I didn't list cache synchronization as an issue because I couldn't think of any > problems that were specific to this design, versus being a general issue. Yes, this is a general design issue. The whole idea of how requests are matched to the cache (what info is kept in the cache) and the invalidation are keys. Just take PathRecords as one example. -- Hal > - Sean > From bos at pathscale.com Thu Jan 5 07:28:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 05 Jan 2006 07:28:48 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> Message-ID: <1136474928.31922.14.camel@serpentine.pathscale.com> On Wed, 2006-01-04 at 13:26 -0800, Roland Dreier wrote: > Yes, this might be a good idea. The "core" driver looks like it is > suffering from really being several things stuck together. Yes, this is undoubtedly the case; we developed it organically based on our evolving needs, and we're only now (maybe a bit belatedly) stepping back to take a breath and see how things should be logically split out. > Also, there are APIs in the "core" driver that are only exported for a > single user outside the driver -- it would probably make sense to move > that logic directly to where it's used. Right. The purpose of the whole ipath_layer.c file has perhaps been unclear; we've been holding back a network driver that makes use of it, to keep the size of the review patches down. 
Some of the other verbs-related routines in the core driver are in the process of finding a new home, as you suggested. As ever, thanks for the comments. References: <43BB1A0F.2080305@ichips.intel.com> <43BD109A.3010302@mellanox.co.il> Message-ID: <1136474443.4339.17068.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote: > Hi Sean, > > This is a great initiative - tackling an important issue. > I am glad you took this on. > > Please see below. > > Sean Hefty wrote: > > I've been given the task of trying to come up with an implementation for > > an SA cache. The intent is to increase the scalability and performance > > of the openib stack. My current thoughts on the implementation are > > below. Any feedback is welcome. > > > > To keep the design as flexible as possible, my plan is to implement the > > cache in userspace. The interface to the cache would be via MADs. > > Clients would send their queries to the sa_cache instead of the SA > > itself. The format of the MADs would be essentially identical to those > > used to query the SA itself. Response MADs would contain any requested > > information. If the cache could not satisfy a request, the sa_cache > > would query the SA, update its cache, then return a reply. > * I think the idea of using MADs to interface with the cache is very good. > * User space implementation: > This also might be a good tradeoff between coding and debugging versus the > impact on the number of connections per second. I hope the impact on performance > will not be too big. Maybe we can take the path of implementing in user space and > if the performance penalty will be too high we can port to kernel. > * Regarding the sentence:"Clients would send their queries to the sa_cache instead of the SA" > I would propose that a "SA MAD send switch" be implemented in the core: Such a switch > will enable plugging in the SA cache (I would prefer calling it SA local agent due to > its extended functionality). 
Once plugged in, this "SA local agent" should be forwarded all > outgoing SA queries. Once it handles the MAD it should be able to inject the response through > the core "SA MAD send switch" as if they arrived from the wire. > > > > The benefits that I see with this approach are: > > > > + Clients would only need to send requests to the sa_cache. > > + The sa_cache can be implemented in stages. Requests that it cannot > > handle would just be forwarded to the SA. > > + The sa_cache could be implemented on each host, or a select number of > > hosts. > > + The interface to the sa_cache is similar to that used by the SA. > > + The cache would use virtual memory and could be saved to disk. > > > > Some drawbacks specific to this method are: > > > > - The MAD interface will result in additional data copies and userspace > > to kernel transitions for clients residing on the local system. > > - Clients require a mechanism to locate the sa_cache, or need to make > > assumptions about its location. > The proposal for "SA MAD send switch" in the core will resolve this issue. > No client change will be required as all MADs are sent through the core which will > redirect them to the SA agent ... I see this as more granular than a complete switch for the entire class. More like on a per attribute basis. > Functional requirements: > * It is clear that the first SA query to cache is PathRecord. > So if a new client wants to connect to another node a new PathRecord > query will not need to be sent to the SA. However, recent work on QoS has pointed out > that under some QoS schemes PathRecord should not be shared by different clients > or even connections. There are several ways to make such QoS scheme scale. > Since this is a different discussion topic - I only bring this up such that > we take into account caching might also need to be done by a complex key (not just > SRC/DST ...) Per the QoS direction, this complex key is indeed part of the enhanced PathRecord, right ? 
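A minimal sketch of the compound-key idea raised above, assuming a cached path is matched on more than source and destination. Every name here (`path_key`, `path_cache_lookup`, the `service_id` and `sl` fields, the fixed slot count) is a hypothetical chosen for illustration; none of it is taken from OpenIB code or from the PathRecord layout in the IBTA specification:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compound key: two lookups that agree on SRC/DST but
 * differ in QoS-related fields must not share a cached entry. */
struct path_key {
	uint16_t src_lid;
	uint16_t dst_lid;
	uint64_t service_id;	/* assumed QoS discriminator */
	uint8_t  sl;		/* assumed service level */
};

struct path_entry {
	struct path_key key;
	int valid;		/* 0 = free slot */
	uint32_t path_handle;	/* stand-in for the cached record */
};

#define PATH_CACHE_SLOTS 16

struct path_cache {
	struct path_entry slot[PATH_CACHE_SLOTS];
};

/* Compare field by field; memcmp would also compare struct padding. */
static int path_key_equal(const struct path_key *a, const struct path_key *b)
{
	return a->src_lid == b->src_lid && a->dst_lid == b->dst_lid &&
	       a->service_id == b->service_id && a->sl == b->sl;
}

/* Returns the cached entry, or NULL on a miss (caller would query the SA). */
static struct path_entry *path_cache_lookup(struct path_cache *c,
					    const struct path_key *k)
{
	size_t i;

	assert(c != NULL && k != NULL);
	for (i = 0; i < PATH_CACHE_SLOTS; i++)
		if (c->slot[i].valid && path_key_equal(&c->slot[i].key, k))
			return &c->slot[i];
	return NULL;
}

/* Stores a record in the first free slot; returns 0, or -1 if full. */
static int path_cache_insert(struct path_cache *c, const struct path_key *k,
			     uint32_t path_handle)
{
	size_t i;

	assert(c != NULL && k != NULL);
	for (i = 0; i < PATH_CACHE_SLOTS; i++) {
		if (!c->slot[i].valid) {
			c->slot[i].key = *k;
			c->slot[i].path_handle = path_handle;
			c->slot[i].valid = 1;
			return 0;
		}
	}
	return -1;
}
```

In this sketch a request with the same SRC/DST pair but a different `service_id` or `sl` misses the cache and would have to go to the SA, which is the behaviour the QoS remark seems to call for.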
> * Forgive me for bringing the following issue - over and over to the group: > Multicast Join/Leave should be reference counted. The "SA local agent" could be > the right place for doing this kind of reference counting (actually if it does that > it probably needs to be located in the Kernel - to enable cleanup after killed processes). The cache itself may need another level of reference counting (even if invalidation is broadcast). > * Similarly - "Client re-registration" could be made transparent to clients. > > Cache Invalidation: > Several discussions about PathRecord invalidation were spawned in the past. > IMO, it is enough to be notified about death of local processes, remote port availability (trap 64/65) and > multicast group availability (trap 66/67) in order to invalidate SA cache information. I think that it's more complicated than this. As an example, how does the SA cache know whether a cached path record needs to be changed based on traps 64/65 ? It seems to me to need to be tightly tied to the SM/SA for this. > So each SA Agent could register to obtain this data. But that solution does not nicely scale, > as the SA needs to send notification to all nodes (but is reliable - could resend until Repressed). > However, current IBTA definition for InformInfo (event forwarding mechanism) does not > allow for multicast of Report(Notice). The reason is that registration for event forwarding > is done with Set(InformInfo) which uses the requester QP and LID as the address for sending > the matching report. A simple way around that limitation could be to enable the SM to "pre-register" > a well known multicast group target for event forwarding. One issue though, would be that UD multicast > is not reliable and some notifications could get lost. A notification sequence number could be used > to catch these missed notifications eventually. A multicast group could be defined for SA caching. 
The reliable aspects are another matter although the represses could be unicast back to the cache. -- Hal > Eitan > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bos at pathscale.com Thu Jan 5 07:31:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 05 Jan 2006 07:31:18 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> <20051231001051.GB20314@kroah.com> <1135993250.13318.94.camel@serpentine.pathscale.com> <20060103172732.GA9170@kroah.com> <1136321691.10862.61.camel@localhost.localdomain> Message-ID: <1136475079.31922.18.camel@serpentine.pathscale.com> On Wed, 2006-01-04 at 13:28 -0800, Roland Dreier wrote: > Isn't there some way you can use the same SMA (subnet management > agent) interface in all the cases? I'll look into it, but I rather doubt it. > Can ipath_mad.c just go away in > favor of your userspace SMA? Our userspace SMA is a tiny shrivelled thing that expects there to be a real subnet manager out there, so it only needs a very simple interface, and it's decoupled from OpenIB entirely. ipath_mad.c is part of our OpenIB layer, so it can't really go away. Fix mthca_create_eq for when the EQ size is not a power of 2. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14.3-kgdb/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- linux-2.6.14.3-kgdb.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-05 14:10:13.000000000 +0200 +++ linux-2.6.14.3-kgdb/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-05 15:12:57.000000000 +0200 @@ -484,8 +483,7 @@ static int __devinit mthca_create_eq(str u8 intr, struct mthca_eq *eq) { - int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / - PAGE_SIZE; + int npages; u64 *dma_list = NULL; dma_addr_t t; struct mthca_mailbox *mailbox; @@ -496,6 +494,7 @@ static int __devinit mthca_create_eq(str eq->dev = dev; eq->nent = roundup_pow_of_two(max(nent, 2)); + npages = ALIGN(eq->nent * MTHCA_EQ_ENTRY_SIZE, PAGE_SIZE) / PAGE_SIZE; eq->page_list = kmalloc(npages * sizeof *eq->page_list, GFP_KERNEL); -- MST From bos at pathscale.com Thu Jan 5 08:37:43 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 05 Jan 2006 08:37:43 -0800 Subject: [openib-general] PathScale license In-Reply-To: <20051229144221.GA29260@lst.de> References: <1135363454.4328.95007.camel@hal.voltaire.com> <20051228020255.GA3280@cuprite.internal.keyresearch.com> <20051229144221.GA29260@lst.de> Message-ID: <1136479063.31922.44.camel@serpentine.pathscale.com> On Thu, 2005-12-29 at 15:42 +0100, Christoph Hellwig wrote: > > PathScale's use of this language is not original. SGI has used, and perhaps > > originated, the additional language. > XFS has been switched to a normal short GPL boilerplate exactly because > this wording is not okay. FYI, we have decided to drop the additional SGI-style wording from future driver submissions. 
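The mthca_create_eq patch above moves the page-count computation to after the entry count has been rounded up to a power of two. A small userspace sketch of why that order can matter, assuming a 32-byte EQ entry and 4096-byte pages. Both sizes and all helper names are assumptions made for illustration, and the kernel's `roundup_pow_of_two` and `ALIGN` are re-created here in plain C rather than taken from kernel headers:

```c
#include <assert.h>

#define EQ_ENTRY_SIZE 32u	/* assumed stand-in for MTHCA_EQ_ENTRY_SIZE */
#define PAGE_SZ 4096u		/* assumed stand-in for PAGE_SIZE */

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Userspace re-creation of roundup_pow_of_two() for n >= 1. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Old order: pages computed from the caller's nent, before rounding. */
static unsigned int npages_before_fix(unsigned int nent)
{
	return (nent * EQ_ENTRY_SIZE + PAGE_SZ - 1) / PAGE_SZ;
}

/* New order: round nent up first, then size the page list from that. */
static unsigned int npages_after_fix(unsigned int nent)
{
	unsigned int rounded = roundup_pow2(nent < 2 ? 2 : nent);

	return ALIGN_UP(rounded * EQ_ENTRY_SIZE, PAGE_SZ) / PAGE_SZ;
}
```

With these assumed sizes, a request for 257 entries is rounded up to 512 entries, which occupy 4 pages, while the old computation would have sized the page list for 3. When nent is already a power of two the two computations agree.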
References: <1135363454.4328.95007.camel@hal.voltaire.com> <20051228020255.GA3280@cuprite.internal.keyresearch.com> <20051229144221.GA29260@lst.de> <1136479063.31922.44.camel@serpentine.pathscale.com> Message-ID: <43BD4D6C.4090403@lanl.gov> Bryan O'Sullivan wrote: > On Thu, 2005-12-29 at 15:42 +0100, Christoph Hellwig wrote: > > >>>PathScale's use of this language is not original. SGI has used, and perhaps >>>originated, the additional language. > > >>XFS has been switched to a normal short GPL boilerplate exactly because >>this wording is not okay. > > > FYI, we have decided to drop the additional SGI-style wording from > future driver submissions. Thank you bryan, we're really looking forward to working with the pathscale hardware. ron 
From halr at voltaire.com Thu Jan 5 10:10:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 13:10:29 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - osm_vl15intf.c fix In-Reply-To: <5zy81ug622.fsf@mtl066.yok.mtl.com> References: <5zy81ug622.fsf@mtl066.yok.mtl.com> Message-ID: <1136484628.4339.18555.camel@hal.voltaire.com> On Thu, 2006-01-05 at 08:25, Yael Kalka wrote: > Hi Hal, > > Attached is a fix for the qp0_mads_outstanding handling, according to > what I've described in the previous mail. Thanks. Applied with some formatting changes. -- Hal From tom at opengridcomputing.com Thu Jan 5 10:54:25 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 05 Jan 2006 12:54:25 -0600 Subject: [openib-general] [PATCH] iWARP Include File Changes Message-ID: <1136487265.10878.17.camel@trinity.austin.ammasso.com> This patch is for the include files needed to support iWARP and is relative to the trunk. I added some new device capabilities bits that could use some name tweaking -- Thanks, Signed-off-by: Tom Tucker Index: rdma/ib_verbs.h =================================================================== --- rdma/ib_verbs.h (revision 4748) +++ rdma/ib_verbs.h (working copy) @@ -67,7 +67,8 @@ enum ib_node_type { IB_NODE_CA = 1, IB_NODE_SWITCH, - IB_NODE_ROUTER + IB_NODE_ROUTER, + IB_NODE_RNIC }; enum ib_device_cap_flags { @@ -86,6 +87,14 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), + IB_DEVICE_ZERO_STAG = (1<<16), + IB_DEVICE_SEND_W_INV = (1<<17), + IB_DEVICE_MW = (1<<18), + IB_DEVICE_FMR = (1<<19), + IB_DEVICE_SRQ = (1<<20), + IB_DEVICE_ARP = (1<<21), + IB_DEVICE_LLP = (1<<22), }; enum ib_atomic_cap { @@ -824,6 +833,8 @@ u32 flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, Index: 
rdma/iw_cm.h =================================================================== --- rdma/iw_cm.h (revision 0) +++ rdma/iw_cm.h (revision 0) @@ -0,0 +1,152 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; +struct iw_cm_event; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, + IW_CM_EVENT_LLP_DISCONNECT, + IW_CM_EVENT_LLP_RESET, + IW_CM_EVENT_LLP_TIMEOUT, + IW_CM_EVENT_CLOSE +}; + +struct iw_cm_event { + enum iw_cm_event_type event; + int status; + u32 provider_id; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; +}; + +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_ESTABLISHED, /* established */ +}; + +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, + struct iw_cm_event* event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* context to provide to client cb */ + enum iw_cm_state state; + struct ib_device *device; + struct ib_qp *qp; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + u64 provider_id; /* device handle for this conn. */ + iw_event_handler event_handler; /* callback for IW CM Provider events */ +}; + +/** + * iw_create_cm_id - Allocate a communication identifier. + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the communication + * identifier. + * + * Communication identifiers are used to track connection states, + * addr resolution requests, and listen requests. 
+ */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ + +struct iw_cm_verbs { + int (*connect)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*disconnect)(struct iw_cm_id* cm_id, + int abrupt); + + int (*accept)(struct iw_cm_id*, + const void *private_data, + u8 pdata_data_len); + + int (*reject)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*getpeername)(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr); + + int (*create_listen)(struct iw_cm_id* cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id* cm_id); + +}; + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); +void iw_destroy_cm_id(struct iw_cm_id *cm_id); +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_add, + struct sockaddr_in* remote_addr); +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len); +int iw_cm_disconnect(struct iw_cm_id *cm_id); +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); + +#endif /* IW_CM_H */ From tom at opengridcomputing.com Thu Jan 5 10:57:12 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 05 Jan 2006 12:57:12 -0600 Subject: [openib-general] [PATCH] CMA mods for iWARP Message-ID: <1136487432.10878.22.camel@trinity.austin.ammasso.com> This patch is for CMA changes to support iWARP and is relative to the trunk. 
It includes the latest ib_addr generalizations that allowed for some simplification in the rdma_resolve_addr implementation. This patch needs the include file patch to compile. I tested this on 2.6.14.5 with the AMSO1100 iWARP and Voltaire IB adapters. Please review and comment as appropriate. I would love to get this in the trunk -- the merges are killing me. Thanks, Signed-off-by: Tom Tucker Index: cm.c =================================================================== --- cm.c (revision 4748) +++ cm.c (working copy) @@ -3261,6 +3261,9 @@ int ret; u8 i; + if (device->node_type == IB_NODE_RNIC) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) Index: iwcm.c =================================================================== --- iwcm.c (revision 0) +++ iwcm.c (revision 0) @@ -0,0 +1,648 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "cm_msgs.h" + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void iwcm_add_one(struct ib_device *device); +static void iwcm_remove_one(struct ib_device *device); +struct iwcm_id_private; + +static struct ib_client iwcm_client = { + .name = "iwcm", + .add = iwcm_add_one, + .remove = iwcm_remove_one +}; + +static struct { + spinlock_t lock; + struct list_head device_list; + rwlock_t device_lock; + struct workqueue_struct* wq; +} iwcm; + +struct iwcm_device; +struct iwcm_port { + struct iwcm_device *iwcm_dev; + struct sockaddr_in local_addr; + u8 port_num; +}; + +struct iwcm_device { + struct list_head list; + struct ib_device *device; + struct iwcm_port port[0]; +}; + +struct iwcm_id_private { + struct iw_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + + struct rb_node listen_node; + + struct list_head work_list; + atomic_t work_count; +}; + +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private* cm_id; + struct iw_cm_event event; +}; + +/* Called whenever a reference added for a cm_id */ +static inline void 
iwcm_addref_id(struct iwcm_id_private *cm_id_priv) +{ + atomic_inc(&cm_id_priv->refcount); +} + +/* Called whenever releasing a reference to a cm id */ +static inline void iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + if (atomic_dec_and_test(&cm_id_priv->refcount)) + wake_up(&cm_id_priv->wait); +} + +static void cm_event_handler(struct iw_cm_id* cm_id, struct iw_cm_event* event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *iwcm_id_priv; + + iwcm_id_priv = kmalloc(sizeof *iwcm_id_priv, GFP_KERNEL); + if (!iwcm_id_priv) + return ERR_PTR(-ENOMEM); + + memset(iwcm_id_priv, 0, sizeof *iwcm_id_priv); + iwcm_id_priv->id.state = IW_CM_STATE_IDLE; + iwcm_id_priv->id.device = device; + iwcm_id_priv->id.cm_handler = cm_handler; + iwcm_id_priv->id.context = context; + iwcm_id_priv->id.event_handler = cm_event_handler; + + spin_lock_init(&iwcm_id_priv->lock); + init_waitqueue_head(&iwcm_id_priv->wait); + atomic_set(&iwcm_id_priv->refcount, 1); + + return &iwcm_id_priv->id; + +} +EXPORT_SYMBOL(iw_create_cm_id); + +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_LISTEN: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->destroy_listen(cm_id); + break; + + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_ESTABLISHED: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->disconnect(cm_id,1); + break; + + case IW_CM_STATE_IDLE: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + break; + + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + printk(KERN_ERR "%s:%s:%u 
Illegal state %d for iw_cm_id.\n", + __FILE__, __FUNCTION__, __LINE__, cm_id->state); + ; + } + + atomic_dec(&iwcm_id_priv->refcount); + wait_event(iwcm_id_priv->wait, !atomic_read(&iwcm_id_priv->refcount)); + + kfree(iwcm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + if (cm_id->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + return -EBUSY; + } + cm_id->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret != 0) + cm_id->state = IW_CM_STATE_IDLE; + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr) +{ + if (cm_id->device == 0) + return -EINVAL; + + if (cm_id->device->iwcm == 0) + return -EINVAL; + + /* Make sure there's a connection */ + if (cm_id->state != IW_CM_STATE_ESTABLISHED) + return -ENOTCONN; + + return cm_id->device->iwcm->getpeername(cm_id, local_addr, remote_addr); +} +EXPORT_SYMBOL(iw_cm_getpeername); + +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->reject(cm_id, private_data, private_data_len); + cm_id->state = IW_CM_STATE_IDLE; + 
break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->accept(cm_id, private_data, + private_data_len); + if (ret == 0) { + struct iw_cm_event event; + event.event = IW_CM_EVENT_ESTABLISHED; + event.provider_id = cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp) +{ + int ret = -EINVAL; + + if (cm_id) { + cm_id->qp = qp; + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_bind_qp); + +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len) +{ + struct iwcm_id_private* cm_id_priv; + int ret = 0; + unsigned long flags; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EBUSY; + } + cm_id->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, pdata, pdata_len); + if (ret != 0) + 
cm_id->state = IW_CM_STATE_IDLE; + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +int iw_cm_disconnect(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0 || cm_id->qp == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->disconnect(cm_id, 1); + if (ret == 0) { + struct iw_cm_event event; + event.event = IW_CM_EVENT_LLP_DISCONNECT; + event.provider_id = cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +static void iwcm_add_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + struct iwcm_port *port; + unsigned long flags; + u8 i; + + if (device->node_type != IB_NODE_RNIC) + return; + + iwcm_dev = kmalloc(sizeof(*iwcm_dev) + sizeof(*port) * + device->phys_port_cnt, GFP_KERNEL); + if (!iwcm_dev) + return; + + iwcm_dev->device = device; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &iwcm_dev->port[i-1]; + port->iwcm_dev = iwcm_dev; + port->port_num = i; + } + + ib_set_client_data(device, &iwcm_client, iwcm_dev); + + write_lock_irqsave(&iwcm.device_lock, flags); + list_add_tail(&iwcm_dev->list, &iwcm.device_list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + return; +} + +static void iwcm_remove_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + unsigned long flags; + + iwcm_dev = ib_get_client_data(device, &iwcm_client); + if (!iwcm_dev) 
+ return; + + write_lock_irqsave(&iwcm.device_lock, flags); + list_del(&iwcm_dev->list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + + kfree(iwcm_dev); +} + +/* Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and device. These are + * copied when the device is cloned. The event contains the new four tuple. + */ +static int cm_conn_req_handler(struct iwcm_work* work) +{ + struct iw_cm_id* cm_id; + struct iwcm_id_private* cm_id_priv; + int rc; + + /* If the status was not successful, ignore request */ + if (work->event.status) { + printk(KERN_ERR "%s:%d Bad status=%d for connection request ... " + "should be filtered by provider\n", + __FUNCTION__, __LINE__, + work->event.status); + return work->event.status; + } + cm_id = iw_create_cm_id(work->cm_id->id.device, work->cm_id->id.cm_handler, + work->cm_id->id.context); + if (IS_ERR(cm_id)) + return PTR_ERR(cm_id); + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.provider_id = work->event.provider_id; + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; + + /* Call the client CM handler */ + rc = cm_id->cm_handler(cm_id, &work->event); + if (rc) { + cm_id->state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(cm_id); + } + kfree(work); + return 0; +} + +/* + * Handles the transition to established state on the passive side. 
+ */ +static int cm_conn_est_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for established event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +/* + * Handles the reply to our connect request. There are three + * possibilities: + * - If the cm_id is in the wrong state when the event is + * delivered, the event is ignored. [What should we do when the + * provider does something crazy?] + * - If the remote peer accepts the connection, we update the 4-tuple + * in the cm_id with the remote peer info, move the cm_id to the + * ESTABLISHED state and deliver the event to the client. + * - If the remote peer rejects the connection, or there is some + * connection error, move the cm_id to the IDLE state, and deliver + * the event to the client. 
+ */ +static int cm_conn_rep_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_SENT) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for connect reply event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +static int cm_disconnect_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + int ret = 0; + + cm_id_priv = work->cm_id; + + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) + iw_destroy_cm_id(&cm_id_priv->id); + + kfree(work); + return ret; +} + +static void cm_work_handler(void* arg) +{ + struct iwcm_work* work = (struct iwcm_work*)arg; + int rc; + + switch (work->event.event) { + case IW_CM_EVENT_CONNECT_REQUEST: + rc = cm_conn_req_handler(work); + break; + case IW_CM_EVENT_CONNECT_REPLY: + rc = cm_conn_rep_handler(work); + break; + case IW_CM_EVENT_ESTABLISHED: + rc = cm_conn_est_handler(work); + break; + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_CLOSE: + rc = cm_disconnect_handler(work); + break; + } +} + +/* IW CM 
provider event callback handler. This function is called on + * interrupt context. The function builds a work queue element + * and enqueues it for processing on a work queue thread. This allows + * CM client callback functions to block. + */ +static void cm_event_handler(struct iw_cm_id* cm_id, + struct iw_cm_event* event) +{ + struct iwcm_work *work; + struct iwcm_id_private* cm_id_priv; + + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *event; + queue_work(iwcm.wq, &work->work); +} + +static int __init iw_cm_init(void) +{ + memset(&iwcm, 0, sizeof iwcm); + INIT_LIST_HEAD(&iwcm.device_list); + rwlock_init(&iwcm.device_lock); + spin_lock_init(&iwcm.lock); + iwcm.wq = create_workqueue("iw_cm"); + if (!iwcm.wq) + return -ENOMEM; + + return ib_register_client(&iwcm_client); +} + +static void __exit iw_cm_cleanup(void) +{ + ib_unregister_client(&iwcm_client); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); + Index: addr.c =================================================================== --- addr.c (revision 4748) +++ addr.c (working copy) @@ -65,6 +65,9 @@ case ARPHRD_INFINIBAND: dev_addr->dev_type = IB_NODE_CA; break; + case ARPHRD_ETHER: + dev_addr->dev_type = IB_NODE_RNIC; + break; default: return -EADDRNOTAVAIL; } Index: Makefile =================================================================== --- Makefile (revision 4748) +++ Makefile (working copy) @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o iw_cm.o \ ib_sa.o ib_at.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o rdma_ucm.o @@ -14,6 +14,8 @@ ib_cm-y := 
cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o rdma_ucm-y := ucma.o Index: cma.c =================================================================== --- cma.c (revision 4748) +++ cma.c (working copy) @@ -3,6 +3,7 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -31,9 +32,14 @@ #include #include #include +#include +#include +#include +#include #include #include #include +#include #include MODULE_AUTHOR("Guy German"); @@ -102,8 +108,12 @@ int timeout_ms; struct ib_sa_query *query; int query_id; - struct ib_cm_id *cm_id; + union { + struct ib_cm_id *ib; + struct iw_cm_id *iw; + } cm_id; + u32 seq_num; u32 qp_num; enum ib_qp_type qp_type; @@ -239,11 +249,40 @@ return ret; } +static int cma_acquire_iw_dev(struct rdma_id_private* id_priv) +{ + struct rdma_dev_addr* dev_addr = &id_priv->id.route.addr.dev_addr; + struct cma_device* cma_dev; + int ret = -ENOENT; + + down(&mutex); + list_for_each_entry(cma_dev, &dev_list, list) { + if (memcmp(dev_addr->src_dev_addr, + &cma_dev->node_guid, + sizeof(cma_dev->node_guid)) == 0) { + + /* If we find the device, then check if this + * is an iWARP device. 
If it is, then attach + */ + if (cma_dev->device->node_type == IB_NODE_RNIC) { + cma_attach_to_dev(id_priv, cma_dev); + ret = 0; + break; + } + } + } + up(&mutex); + + return ret; +} + static int cma_acquire_dev(struct rdma_id_private *id_priv) { switch (id_priv->id.route.addr.dev_addr.dev_type) { case IB_NODE_CA: return cma_acquire_ib_dev(id_priv); + case IB_NODE_RNIC: + return cma_acquire_iw_dev(id_priv); default: return -ENODEV; } @@ -306,6 +345,16 @@ IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -325,6 +374,9 @@ case IB_NODE_CA: ret = cma_init_ib_qp(id_priv, qp); break; + case IB_NODE_RNIC: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -412,7 +464,7 @@ id_priv = container_of(id, struct rdma_id_private, id); switch (id_priv->id.device->node_type) { case IB_NODE_CA: - ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; @@ -567,8 +619,8 @@ { cma_exch(id_priv, CMA_DESTROYING); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) + ib_destroy_cm_id(id_priv->cm_id.ib); list_del(&id_priv->listen_list); if (id_priv->cma_dev) @@ -624,9 +676,20 @@ state = cma_exch(id_priv, CMA_DESTROYING); cma_cancel_operation(id_priv, state); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { + switch (id->device->node_type) { + case IB_NODE_RNIC: + 
iw_destroy_cm_id(id_priv->cm_id.iw); + break; + default: + ib_destroy_cm_id(id_priv->cm_id.ib); + break; + } + + id_priv->cm_id.ib = NULL; + } + if (id_priv->cma_dev) { down(&mutex); cma_detach_from_dev(id_priv); @@ -652,15 +715,15 @@ ret = cma_modify_qp_rts(&id_priv->id); if (ret) goto reject; - - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); if (ret) goto reject; return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -676,7 +739,7 @@ return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -737,7 +800,7 @@ private_data_len); if (ret) { /* Destroy the CM ID by returning a non-zero value. */ - id_priv->cm_id = NULL; + id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); cma_release_remove(id_priv); rdma_destroy_id(&id_priv->id); @@ -819,7 +882,7 @@ goto out; } - conn_id->cm_id = cm_id; + conn_id->cm_id.ib = cm_id; cm_id->context = conn_id; cm_id->cm_handler = cma_ib_handler; @@ -829,7 +892,7 @@ IB_CM_REQ_PRIVATE_DATA_SIZE - offset); if (ret) { /* Destroy the CM ID by returning a non-zero value. 
*/ - conn_id->cm_id = NULL; + conn_id->cm_id.ib = NULL; cma_exch(conn_id, CMA_DESTROYING); cma_release_remove(conn_id); rdma_destroy_id(&conn_id->id); @@ -874,6 +937,115 @@ } } +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event_type = 0; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (event->event) { + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_CLOSE: + event_type = RDMA_CM_EVENT_DISCONNECTED; + break; + + case IW_CM_EVENT_CONNECT_REQUEST: + BUG_ON(1); + break; + + case IW_CM_EVENT_CONNECT_REPLY: { + if (event->status) + event_type = RDMA_CM_EVENT_REJECTED; + else + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + } + + case IW_CM_EVENT_ESTABLISHED: + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + } + + ret = cma_notify_user(id_priv, + event_type, + event->status, + event->private_data, + event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id* new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in* sin; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + /* New connection inherits device from parent */ + down(&mutex); + cma_attach_to_dev(conn_id, listen_id->cma_dev); + up(&mutex); + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + cma_release_remove(listen_id); + return ret; +} + static int cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_private_data_compare compare_data; @@ -881,28 +1053,52 @@ __be64 svc_id; int ret; - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) - return PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); addr = &id_priv->id.route.addr.src_addr; svc_id = cma_get_service_id(id_priv->id.ps, addr); if (cma_any_addr(addr)) - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); else { cma_set_compare_data(addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); } if (ret) { - ib_destroy_cm_id(id_priv->cm_id); - id_priv->cm_id = NULL; + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; } return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv) +{ + int ret; + struct sockaddr_in* sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, 10 /* backlog */); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_duplicate_listen(struct rdma_id_private *id_priv) { struct rdma_id_private *cur_id_priv; @@ -988,6 +1184,9 @@ case IB_NODE_CA: ret = cma_ib_listen(id_priv); break; + case IB_NODE_RNIC: + ret = cma_iw_listen(id_priv); + break; default: 
ret = -ENOSYS; break; @@ -1067,6 +1266,45 @@ return (id_priv->query_id < 0) ? id_priv->query_id : 0; } +static void iw_route_handler(void* data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ROUTE_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } + out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, iw_route_handler, work); + queue_work(rdma_wq, &work->work); + + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1081,6 +1319,9 @@ case IB_NODE_CA: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case IB_NODE_RNIC: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1221,12 +1462,36 @@ return ret; } +static void iw_addr_handler(void* data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { struct rdma_id_private *id_priv; 
enum cma_state expected_state; - int ret; + int ret = 0; id_priv = container_of(id, struct rdma_id_private, id); if (id_priv->cma_dev) { @@ -1341,10 +1606,10 @@ memcpy(private_data + offset, conn_param->private_data, conn_param->private_data_len); - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) { - ret = PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) { + ret = PTR_ERR(id_priv->cm_id.ib); goto out; } @@ -1371,12 +1636,45 @@ req.max_cm_retries = CMA_MAX_CM_RETRIES; req.srq = id_priv->srq ? 1 : 0; - ret = ib_send_cm_req(id_priv->cm_id, &req); + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); out: kfree(private_data); return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id* cm_id; + struct sockaddr_in* sin; + int ret; + + if (id_priv->id.qp == NULL) + return -EINVAL; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + iw_cm_bind_qp(cm_id, id_priv->id.qp); + + ret = iw_cm_connect(cm_id, conn_param->private_data, + conn_param->private_data_len); + +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1396,6 +1694,9 @@ case IB_NODE_CA: ret = cma_connect_ib(id_priv, conn_param); break; + case IB_NODE_RNIC: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1433,7 +1734,7 @@ rep.rnr_retry_count = conn_param->rnr_retry_count; rep.srq = id_priv->srq ? 
1 : 0; - return ib_send_cm_rep(id_priv->cm_id, &rep); + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) @@ -1458,6 +1759,12 @@ else ret = cma_rep_recv(id_priv); break; + case IB_NODE_RNIC: { + iw_cm_bind_qp(id_priv->cm_id.iw, id_priv->id.qp); + ret = iw_cm_accept(id_priv->cm_id.iw, conn_param->private_data, + conn_param->private_data_len); + break; + } default: ret = -ENOSYS; break; @@ -1486,9 +1793,15 @@ switch (id->device->node_type) { case IB_NODE_CA: - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); break; + + case IB_NODE_RNIC: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; + default: ret = -ENOSYS; break; @@ -1513,9 +1826,12 @@ switch (id->device->node_type) { case IB_NODE_CA: /* Initiate or respond to a disconnect. */ - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) - ib_send_cm_drep(id_priv->cm_id, NULL, 0); + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case IB_NODE_RNIC: + ret = iw_cm_disconnect(id_priv->cm_id.iw); + break; default: break; } Index: mad.c =================================================================== --- mad.c (revision 4748) +++ mad.c (working copy) @@ -2655,7 +2655,9 @@ { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == IB_NODE_RNIC) + return; + else if (device->node_type == IB_NODE_SWITCH) { start = 0; end = 0; } else { @@ -2702,7 +2704,9 @@ { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == IB_NODE_RNIC) + return; + else if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { From nacc at us.ibm.com Thu Jan 5 11:17:28 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 5 Jan 2006 11:17:28 -0800 Subject: 
[openib-general] ppc64 build failure (4778) Message-ID: <20060105191728.GA23370@us.ibm.com> Hi all, New build failure today: drivers/infiniband/core/sysfs.c: In function `ib_device_hotplug': drivers/infiniband/core/sysfs.c:440: warning: implicit declaration of function `add_hotplug_env_var' drivers/infiniband/core/sysfs.c: At top level: drivers/infiniband/core/sysfs.c:653: error: unknown field `hotplug' specified in initializer drivers/infiniband/core/sysfs.c:653: warning: initialization from incompatible pointer type Thanks, Nish From halr at voltaire.com Thu Jan 5 11:10:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 14:10:54 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - running on system with 2 hcas In-Reply-To: <5z1wzngk00.fsf@mtl066.yok.mtl.com> References: <5z1wzngk00.fsf@mtl066.yok.mtl.com> Message-ID: <1136488253.4336.62.camel@hal.voltaire.com> On Thu, 2006-01-05 at 03:24, Yael Kalka wrote: > Hi Hal, > > When trying to run OpenSM on a system with 2 hca cards, we noticed > that there is a problem with the osm_vendor_get_all_port_attr. > What happens is that we are saving the port 0 for each hca, though > this data is relevant for the default port only once. > The result is that if running with -g 0, we get 5 ports instead of 4, > and the third port (which was the data copied as the default port for > the second hca) is not valid. > The following patch fixes this. Thanks. Applied by hand (please double check). Your patch was rejected. This seems to happen a fair bit.
-- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4760) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -637,18 +637,24 @@ osm_vendor_get_all_port_attr( > umad_release_port(&def_port); > } > > + j = 0; > if (p_attr_array) { > /* set the port guid, lid, and sm lid in the port attr struct */ > for (i = 0; i < *p_num_ports; i++) { > - p_attr_array[i].port_guid = portguids[i]; > - p_attr_array[i].lid = lids[i]; > - if (i == 0) > - p_attr_array[i].sm_lid = sm_lid; > + if (i > 0 && portguids[i] == 0) { > + continue; > + } > + p_attr_array[j].port_guid = portguids[i]; > + p_attr_array[j].lid = lids[i]; > + if (j == 0) > + p_attr_array[j].sm_lid = sm_lid; > else > - p_attr_array[i].sm_lid = p_vend->umad_port.sm_lid; > - p_attr_array[i].link_state = linkstates[i]; > + p_attr_array[j].sm_lid = p_vend->umad_port.sm_lid; > + p_attr_array[j].link_state = linkstates[i]; > + j++; > } > r = 0; > + *p_num_ports = j; > } else > r = IB_INSUFFICIENT_MEMORY; > From rdreier at cisco.com Thu Jan 5 11:38:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jan 2006 11:38:33 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix memory leak in mthca_destroy_qp and mthca_destroy_srq In-Reply-To: <20060105135407.GA13745@mellanox.co.il> (Jack Morgenstein's message of "Thu, 5 Jan 2006 15:54:07 +0200") References: <20060105135407.GA13745@mellanox.co.il> Message-ID: Thanks, applied. 
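For readers following the osm_vendor_ibumad.c patch quoted above: the fix walks the raw per-port arrays with one index and writes the output array with a second index, so that repeated default-port entries (those with a zero GUID after the first entry) are dropped and the reported port count shrinks to match. A minimal standalone sketch of that two-index compaction follows. The struct layout, the function name compact_ports and its parameters are assumptions made for illustration; they are not the OpenSM or libibumad API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the vendor port attribute structure. */
struct port_attr {
    uint64_t port_guid;
    uint16_t lid;
    uint16_t sm_lid;
    uint8_t  link_state;
};

/*
 * Copy n raw entries into out[], skipping every entry after the first
 * whose GUID is zero (a repeated default-port slot). The read index i
 * always advances; the write index j advances only for kept entries.
 * Returns the number of entries written, which the caller would store
 * back as the new port count.
 */
static size_t compact_ports(struct port_attr *out,
                            const uint64_t *guids,
                            const uint16_t *lids,
                            const uint8_t *states,
                            size_t n,
                            uint16_t default_sm_lid,
                            uint16_t port_sm_lid)
{
    size_t i, j = 0;

    for (i = 0; i < n; i++) {
        if (i > 0 && guids[i] == 0)
            continue;               /* drop duplicate default-port slot */

        out[j].port_guid  = guids[i];
        out[j].lid        = lids[i];
        /* As in the patch, only the first kept entry gets the default SM LID. */
        out[j].sm_lid     = (j == 0) ? default_sm_lid : port_sm_lid;
        out[j].link_state = states[i];
        j++;
    }
    return j;
}
```

The point of the design is that the output is indexed by j everywhere, including the sm_lid test, so a skipped input slot never leaves a hole in the result.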
From rdreier at cisco.com Thu Jan 5 11:44:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jan 2006 11:44:47 -0800 Subject: [openib-general] Re: ppc64 build failure (4778) In-Reply-To: <20060105191728.GA23370@us.ibm.com> (Nishanth Aravamudan's message of "Thu, 5 Jan 2006 11:17:28 -0800") References: <20060105191728.GA23370@us.ibm.com> Message-ID: Nishanth> Hi all, New build failure today: Nishanth> drivers/infiniband/core/sysfs.c: In function Nishanth> `ib_device_hotplug': Nishanth> drivers/infiniband/core/sysfs.c:440: warning: implicit Nishanth> declaration of function `add_hotplug_env_var' Looks like add_hotplug_env_var was renamed to add_uevent_var in the upstream kernel. I checked in the following hack (replacing the old hack for 2.6.15 ;): --- include/rdma/ib_verbs.h (revision 4754) +++ include/rdma/ib_verbs.h (working copy) @@ -48,12 +48,11 @@ #include #include -/* XXX remove this compatibility hack when 2.6.15 is released */ +/* XXX remove this compatibility hack when 2.6.16 is released */ #include -#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15) -#define class_device_create(cls, parent, devt, device, fmt, arg...) 
\ - class_device_create(cls, devt, device, fmt, ## arg) +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) +#define add_hotplug_env_var add_uevent_var #endif /* XXX end of hack */ union ib_gid { From nacc at us.ibm.com Thu Jan 5 11:55:31 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 5 Jan 2006 11:55:31 -0800 Subject: [openib-general] Re: ppc64 build failure (4778) In-Reply-To: References: <20060105191728.GA23370@us.ibm.com> Message-ID: <20060105195531.GA2064@us.ibm.com> On 05.01.2006 [11:44:47 -0800], Roland Dreier wrote: > Nishanth> Hi all, New build failure today: > > Nishanth> drivers/infiniband/core/sysfs.c: In function > Nishanth> `ib_device_hotplug': > Nishanth> drivers/infiniband/core/sysfs.c:440: warning: implicit > Nishanth> declaration of function `add_hotplug_env_var' > > Looks like add_hotplug_env_var was renamed to add_uevent_var in the > upstream kernel. I checked in the following hack (replacing the old > hack for 2.6.15 ;): I'll rerun the tests now to see what happens. Also, since it turned out to be quite simple, I've updated my script to be able to test any Linus kernel tree with any svn revision (should help trace down regressions when they occur). Thanks, Nish From Richard.Frank at oracle.com Thu Jan 5 11:59:43 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 05 Jan 2006 14:59:43 -0500 Subject: [openib-general] SDP - What are the platforms that support SDP ? Message-ID: <1136491183.5216.10.camel@localhost.localdomain> Besides OpenIB for Linux and Windows ? Solaris ? AIX ? HP-UX ? Are there any plans for interoperability tests / have any completed ? From iod00d at hp.com Thu Jan 5 12:27:54 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 5 Jan 2006 12:27:54 -0800 Subject: [openib-general] SDP - What are the platforms that support SDP ?
In-Reply-To: <1136491183.5216.10.camel@localhost.localdomain> References: <1136491183.5216.10.camel@localhost.localdomain> Message-ID: <20060105202754.GC23796@esmail.cup.hp.com> On Thu, Jan 05, 2006 at 02:59:43PM -0500, Richard Frank wrote: > Besides OpenIB for Linux and Windows ? > HP-UX ? Almost certainly not for HPUX. Oracle should plan on continuing to use existing IT-API interface. I'm told "it's known to work" and meets HP's requirements (which RDS does not AFAICT). grant From rpandit at silverstorm.com Thu Jan 5 12:54:38 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Thu, 5 Jan 2006 12:54:38 -0800 Subject: [openib-general] Failure in reset HCA with backport-svn4507-to-2.6.9 Message-ID: <96f8e60e0601051254h36ec656bscbc6f456401ec2c9@mail.gmail.com> Has anybody seen this problem before? modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:03:00.0 ACPI: PCI interrupt 0000:03:00.0[A] -> GSI 24 (level, low) -> IRQ 233 ib_mthca 0000:03:00.0: PCI device did not come back after reset, aborting. ib_mthca 0000:03:00.0: Failed to reset HCA, aborting. I'm using 'infiniband-backport-svn4507-to-2.6.9' on RHEL 4 U2 kernel (2.6.9-22.EL). To confirm that the HCA wasn't bad, I tried to bring up SilverStorm stack and that came up fine - port goes Active etc... Here is what lspci shows before and after "modprobe ib_mthca" Before: 02:06.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) After: 02:06.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) ^^^^^ 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) I presume that is because of the problem during reset. Any ideas why it's having problems resetting the card? 
thanks,

Ranjit

From rdreier at cisco.com Thu Jan 5 13:06:44 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jan 2006 13:06:44 -0800
Subject: [openib-general] Re: [PATCH updated] mthca: fix page shift calculation
In-Reply-To: <20060104224841.GA9839@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 5 Jan 2006 00:48:41 +0200")
References: <20060104224841.GA9839@mellanox.co.il>
Message-ID:

    Michael> Hmm. Lets suppose I have a first chunk in bytes 1 to
    Michael> 2095, and then another chunk in bytes 0x100000 to
    Michael> 0x1ffffff - should not we limit the page size to 4K? Does
    Michael> your proposed change do this?

Yes, you're right again. It seems like we can get rid of some special
casing and just do things like this:

--- infiniband/hw/mthca/mthca_provider.c (revision 4754)
+++ infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -771,24 +771,20 @@ static struct ib_mr *mthca_reg_phys_mr(s
 	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK))
 		return ERR_PTR(-EINVAL);
 
-	if (num_phys_buf > 1 &&
-	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK))
-		return ERR_PTR(-EINVAL);
-
 	mask = 0;
 	total_size = 0;
 	for (i = 0; i < num_phys_buf; ++i) {
-		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
-			return ERR_PTR(-EINVAL);
-		if (i != 0 && i != num_phys_buf - 1 &&
-		    (buffer_list[i].size & ~PAGE_MASK))
-			return ERR_PTR(-EINVAL);
+		if (i != 0)
+			mask |= buffer_list[i].addr;
+		if (i != num_phys_buf - 1)
+			mask |= buffer_list[i].addr + buffer_list[i].size;
 
 		total_size += buffer_list[i].size;
-		if (i > 0)
-			mask |= buffer_list[i].addr;
 	}
 
+	if (mask & ~PAGE_MASK)
+		return ERR_PTR(-EINVAL);
+
 	/* Find largest page shift we can use to cover buffers */
 	for (shift = PAGE_SHIFT; shift < 31; ++shift)
 		if (num_phys_buf > 1) {

From rdreier at cisco.com Thu Jan 5 13:08:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jan 2006 13:08:39 -0800
Subject: [openib-general] Re: ppc64 build failure (4778)
In-Reply-To:
<20060105195531.GA2064@us.ibm.com> (Nishanth Aravamudan's message of "Thu, 5 Jan 2006 11:55:31 -0800") References: <20060105191728.GA23370@us.ibm.com> <20060105195531.GA2064@us.ibm.com> Message-ID: Sorry, my previous "fix" was totally bogus. I checked something in that should compile OK against both 2.6.15 and mainline. - R. From sobebike at gmail.com Thu Jan 5 13:09:33 2006 From: sobebike at gmail.com (Jimmy Hill) Date: Thu, 5 Jan 2006 15:09:33 -0600 Subject: [openib-general] dat_evd_wait & dat_cno_wait not exercised in dtest.c Message-ID: It appears that dtest.c does not ever exercise the dat_cno_wait() or dat_evd_wait() calls. The "polling" flag is initialized to "1" at declaration time. Specifying "-p" on the command line will set "polling" to "1", but there is nothing that sets "polling" to "0". Therefore "polling" is always "1" and all of the dat_cno_wait() and dat_evd_wait() calls are skipped regardless if the consumer leaves off the "-p" or specifies "-c". Initializing "polling" to "0" would fix the problem. It appears that dat_evd_wait() was intended to be the default with polling or dat_cno_wait as optional methods invoked via command line arguments. However, with the current code, it appears that only polling is being exercised. -- jimmy Jimmy Hill jimmy.hill at us.ibm.com sobebike at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From nacc at us.ibm.com Thu Jan 5 13:14:21 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 5 Jan 2006 13:14:21 -0800 Subject: [openib-general] Re: ppc64 build failure (4778) In-Reply-To: References: <20060105191728.GA23370@us.ibm.com> <20060105195531.GA2064@us.ibm.com> Message-ID: <20060105211421.GB2064@us.ibm.com> On 05.01.2006 [13:08:39 -0800], Roland Dreier wrote: > Sorry, my previous "fix" was totally bogus. I checked something in > that should compile OK against both 2.6.15 and mainline. 
I assume that was what caused this: drivers/infiniband/core/sysfs.c:653: error: unknown field `hotplug' specified in initializer drivers/infiniband/core/sysfs.c:653: warning: initialization from incompatible pointer type make[3]: *** [drivers/infiniband/core/sysfs.o] Error 1 make[2]: *** [drivers/infiniband/core] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 I'll redo those runs again ;) Thanks, Nish From trimmer at silverstorm.com Thu Jan 5 13:41:02 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 5 Jan 2006 16:41:02 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0952@mercury.infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > I've been given the task of trying to come up with an > implementation for an SA > cache. The intent is to increase the scalability and > performance of the openib > stack. My current thoughts on the implementation are below. > Any feedback is > welcome. Sean, This is great. This is a feature which I find near and dear and is very important to large fabric scalability. If you look in contrib in the infinicon area, you will see a version of a SA replica which we implemented in the linux_discovery tree. The version in SVN is a little dated, but has the major features and capabilities. If you find it useful I could provide a more updated version of that component for your reference. 
Some features of it (which you should consider or possibly use as
reference code):

- It maintains a full replica of:
	- All Node Records
	- Path Records relevant to this Node (where this node is Source)
	- Device Management Agent records for IOUs, IOCs and Service Records
	- even for a large cluster, the footprint of the above will be < 1MB
- It is implemented in kernel mode
	- while user mode may help during initial debug, it will be
	  important for kernel mode ULPs such as SRP, IPoIB and SDP to
	  also make use of these records
- It is in fact a replica, not a cache. It maintains an up to date
  replica using the following techniques
	- registers for SA GID in/out of service notices
		- such notices when received trigger a query of
		  information about that node only
	- schedules a periodic full SA query
		- if notices are successfully registered for, the query
		  is at a slow pace (once every 10 minutes is default,
		  but it's configurable)
		- if notices are not successfully registered for, the
		  query is at a faster pace (once a minute, but it's
		  configurable)
		- since notices are unreliable, the periodic sweep is
		  needed to cover for lost notices, however the SA
		  should resend notices which are not responded to
- In addition for CAs it performs IOU, IOC and Service record queries
  and replicates them
	- this allows for very fast access to IOU/IOC/Service record
	  info by drivers like SRP
	- hence allowing for faster reconnection and failure recovery
	  handling
- It can handle SA outages and still respond to queries while the SA
  is down, the SA is slow, or while the synchronization process is
  being performed (eg. it does all its queries to a temporary replica
  then updates the main replica, hence if the queries fail or take a
  long time, the main replica is still available and reasonably
  accurate).
- I like the idea of using the same API for SA queries and allowing an
  SA mux to choose to query the replica or the actual SA.
  Hence if later versions choose to extend what is maintained in the
  replica, it would be transparent to applications
- The API could allow for a flag to force a query against the replica
  or against the actual SA, with the default being to allow the
  "SA mux" to select which to use

>
> To keep the design as flexible as possible, my plan is to
> implement the cache in
> userspace. The interface to the cache would be via MADs.
> Clients would send
> their queries to the sa_cache instead of the SA itself. The
> format of the MADs
> would be essentially identical to those used to query the SA
> itself. Response
> MADs would contain any requested information. If the cache
> could not satisfy a
> request, the sa_cache would query the SA, update its cache,
> then return a reply.

- in our stack we had a separate more advanced SA query API (referred
  to as the Subnet Driver API). This has evolved significantly since
  the old Intel IbAccess days, but still has similarities. It handled
  all the details of the query including retries (as specified by the
  caller), timeouts and even multi-level queries (get path records
  based on Node Guids, etc). It also handled the RMPP aspects and hid
  the intermediate RMPP headers and control protocol. You may want to
  consider defining and using such an API instead of MADs, lest the
  user of the SA replica need to also implement RMPP itself. Given
  such an API the implementation could choose to query the actual SA
  or the replica and hide the RMPP details in the SA query case.
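
As a rough illustration of the replica-or-SA selection described in
the list above, here is a minimal sketch in C. Every name in it
(sa_query_target, sa_mux_route and so on) is hypothetical and is not
taken from the infinicon contrib tree or from any OpenIB interface.
The path query is reduced to a pair of 64-bit port GUIDs for brevity,
which is a simplification of what a real PathRecord query carries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical caller hint: which backend should answer the query. */
enum sa_query_target {
	SA_QUERY_AUTO,          /* default: let the mux decide */
	SA_QUERY_FORCE_REPLICA, /* always answer from the local replica */
	SA_QUERY_FORCE_SA       /* always go to the real SA */
};

enum sa_route {
	SA_ROUTE_REPLICA,
	SA_ROUTE_SA
};

/* Simplified path query, for illustration only. */
struct sa_path_query {
	uint64_t src_port_guid;
	uint64_t dst_port_guid;
};

/*
 * Decide where a path query should be answered.  In this sketch the
 * replica is assumed to hold only paths whose source is the local
 * node, so anything else falls through to the real SA.
 */
static enum sa_route sa_mux_route(const struct sa_path_query *q,
				  uint64_t local_port_guid,
				  bool replica_synced,
				  enum sa_query_target target)
{
	if (target == SA_QUERY_FORCE_SA)
		return SA_ROUTE_SA;
	if (target == SA_QUERY_FORCE_REPLICA)
		return SA_ROUTE_REPLICA;

	/* Default: use the replica only when it can actually answer. */
	if (replica_synced && q->src_port_guid == local_port_guid)
		return SA_ROUTE_REPLICA;

	return SA_ROUTE_SA;
}
```

With a routing function like this sitting behind a single query call,
a caller that passes SA_QUERY_AUTO would not need to know whether the
answer came from the replica or from the wire.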
Todd Rimmer From rpandit at silverstorm.com Thu Jan 5 13:48:49 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Thu, 5 Jan 2006 13:48:49 -0800 Subject: [openib-general] Re: Failure in reset HCA with backport-svn4507-to-2.6.9 In-Reply-To: <96f8e60e0601051254h36ec656bscbc6f456401ec2c9@mail.gmail.com> References: <96f8e60e0601051254h36ec656bscbc6f456401ec2c9@mail.gmail.com> Message-ID: <96f8e60e0601051348m1dbe7797waa2157620a31de06@mail.gmail.com> I'm having the same problem with the kernel that Redhat posted "kernel-smp-2.6.9-22.14.EL.OpenIB_3965.3.i686.rpm" from: http://people.redhat.com/dledford/Infiniband/RHEL-4/kernel/2.6.9-22.14.EL.OpenIB_3965.3/i686/ Again, I know that the card is functional. Ranjit On 1/5/06, Ranjit Pandit wrote: > Has anybody seen this problem before? > > modprobe ib_mthca > > ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) > ib_mthca: Initializing 0000:03:00.0 > ACPI: PCI interrupt 0000:03:00.0[A] -> GSI 24 (level, low) -> IRQ 233 > ib_mthca 0000:03:00.0: PCI device did not come back after reset, aborting. > ib_mthca 0000:03:00.0: Failed to reset HCA, aborting. > > I'm using 'infiniband-backport-svn4507-to-2.6.9' on RHEL 4 U2 kernel > (2.6.9-22.EL). > > To confirm that the HCA wasn't bad, I tried to bring up SilverStorm > stack and that came up fine - port goes Active etc... > > Here is what lspci shows before and after "modprobe ib_mthca" > > Before: > 02:06.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) > 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) > > After: > 02:06.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) > > ^^^^^ > 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) > > > I presume that is because of the problem during reset. > > Any ideas why it's having problems resetting the card? 
> > thanks, > > Ranjit > From sean.hefty at intel.com Thu Jan 5 13:51:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 13:51:45 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43BD109A.3010302@mellanox.co.il> Message-ID: >* Regarding the sentence:"Clients would send their queries to the sa_cache >instead of the SA" > I would propose that a "SA MAD send switch" be implemented in the core: Such >a switch > will enable plugging in the SA cache (I would prefer calling it SA local >agent due to > its extended functionality). Once plugged in, this "SA local agent" should >be forwarded all > outgoing SA queries. Once it handles the MAD it should be able to inject the >response through > the core "SA MAD send switch" as if they arrived from the wire. This was my thought as well. I hesitated to refer to the cache as a local agent, since that's an implementation detail. I want to allow the possibility for the cache to reside on another system. For the initial implementation, the cache would be local however. >Functional requirements: >* It is clear that the first SA query to cache is PathRecord. This will be the first cached query in the initial check-in. > So if a new client wants to connect to another node a new PathRecord > query will not need to be sent to the SA. However, recent work on QoS has >pointed out > that under some QoS schemes PathRecord should not be shared by different >clients I'm not sure that QoS handling is the responsibility of the cache. The module requesting the path records should probably deal with this. >* Forgive me for bringing the following issue - over and over to the group: > Multicast Join/Leave should be reference counted. The "SA local agent" could >be > the right place for doing this kind of reference counting (actually if it >does that > it probably needs to be located in the Kernel - to enable cleanup after >killed processes). 
I agree that this is a problem, but my preference would be for a
dedicated kernel module to handle multicast join/leave requests.

- Sean

From nacc at us.ibm.com Thu Jan 5 14:02:48 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Thu, 5 Jan 2006 14:02:48 -0800
Subject: [openib-general] Re: ppc64 build failure (4778)
In-Reply-To:
References: <20060105191728.GA23370@us.ibm.com> <20060105195531.GA2064@us.ibm.com>
Message-ID: <20060105220248.GC2064@us.ibm.com>

On 05.01.2006 [13:08:39 -0800], Roland Dreier wrote:
> Sorry, my previous "fix" was totally bogus. I checked something in
> that should compile OK against both 2.6.15 and mainline.

with 4785, I still get:

drivers/infiniband/core/sysfs.c: In function `ib_device_uevent':
drivers/infiniband/core/sysfs.c:447: warning: implicit declaration of function `add_hotplug_env_var'
drivers/infiniband/core/sysfs.c: At top level:
drivers/infiniband/core/sysfs.c:662: error: unknown field `hotplug' specified in initializer
drivers/infiniband/core/sysfs.c:662: warning: initialization from incompatible pointer type
make[3]: *** [drivers/infiniband/core/sysfs.o] Error 1
make[2]: *** [drivers/infiniband/core] Error 2
make[1]: *** [drivers/infiniband] Error 2

Thanks,
Nish

From robert.j.woodruff at intel.com Thu Jan 5 14:03:36 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 5 Jan 2006 14:03:36 -0800
Subject: [openib-general] Re: Failure in reset HCA withbackport-svn4507-to-2.6.9
Message-ID: <1AC79F16F5C5284499BB9591B33D6F000683D980@orsmsx408>

>I'm having the same problem with the kernel that Redhat posted
>"kernel-smp-2.6.9-22.14.EL.OpenIB_3965.3.i686.rpm"

I have not seen any reset issues with 4507 or the
Redhat 3965 kernel on a mellanox SDR PCI-E card
on EM64T platform. I have also run 4507 on an IPF platform with
PCI-X cards.

What version of fw does the card have ?

We did see some weird problems with performance when the
card F/W was not up to date, but have not seen any reset problems.

woody From sean.hefty at intel.com Thu Jan 5 14:04:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 14:04:49 -0800 Subject: [openib-general] SA cache design In-Reply-To: <1136473755.4339.16945.camel@hal.voltaire.com> Message-ID: >> I hadn't fully figured this out yet. I'm not sure if another MAD class is >> needed or not. My goal is to implement this as transparent to the >application >> as possible without violating the spec, perhaps appearing as an SA on a >> different LID. > >The LID for the (real) SA is determined from PortInfo:MasterSMLID so I >don't see how this could be done that way. I didn't think that it was a requirement that the SA share the same LID as the SM. - Sean From sean.hefty at intel.com Thu Jan 5 14:07:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 14:07:59 -0800 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0952@mercury.infiniconsys.com> Message-ID: >Sean, This is great. This is a feature which I find near and dear and is very >important to large fabric scalability. If you look in contrib in the infinicon >area, you will see a version of a SA replica which we implemented in the >linux_discovery tree. The version in SVN is a little dated, but has the major >features and capabilities. If you find it useful I could provide a more >updated version of that component for your reference. Thanks - I will look at the version that is there. 
- Sean From rpandit at silverstorm.com Thu Jan 5 14:40:12 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Thu, 5 Jan 2006 14:40:12 -0800 Subject: [openib-general] Re: Failure in reset HCA withbackport-svn4507-to-2.6.9 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000683D980@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F000683D980@orsmsx408> Message-ID: <96f8e60e0601051440t69782459k113307dd82a1b559@mail.gmail.com> fw rev: 3.03.0003rc16b On 1/5/06, Woodruff, Robert J wrote: > >I'm having the same problem with the kernel that Redhat posted > > >"kernel-smp-2.6.9-22.14.EL.OpenIB_3965.3.i686.rpm" > > I have not seen any reset issues with 4507 or the > Redhat 3965 kernel on a mellanox SDR PCI-E card > on EM64T platform. I have also ran 4507 on an IPF platform with > PCI-X cards. > > What version of fw does the card have ? > > We did see some weird problems with performance when the > card F/W was not up to date, but have not seen any reset problems. > > woody > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From robert.j.woodruff at intel.com Thu Jan 5 14:46:42 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 5 Jan 2006 14:46:42 -0800 Subject: [openib-general] Re: Failure in reset HCA withbackport-svn4507-to-2.6.9 In-Reply-To: <96f8e60e0601051440t69782459k113307dd82a1b559@mail.gmail.com> Message-ID: Ranjit wrote, >fw rev: 3.03.0003rc16b I think that 4.7 is the latest (at least for the PCI-E cards) and 3.2 ( for the PCI-X cards) and I think the latest FW is required for correct operation with the openIB stack. Michael is this correct ? 
woody From trimmer at silverstorm.com Thu Jan 5 14:53:00 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 5 Jan 2006 17:53:00 -0500 Subject: [openib-general] Re: Failure in reset HCAwithbackport-svn4507-to-2.6.9 Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0953@mercury.infiniconsys.com> > From: Bob Woodruff [mailto:robert.j.woodruff at intel.com] > > Ranjit wrote, > >fw rev: 3.03.0003rc16b > > I think that 4.7 is the latest (at least for the PCI-E cards) > and 3.2 ( for the PCI-X cards) > and I think the latest FW is required for correct operation > with the openIB > stack. > Michael is this correct ? 3.3.3 is the latest firmware (as of last week) for PCI-X cards. 3.2 is actually a few months old. Today Mellanox just posted 3.3.5 for PCI-X cards, however I doubt Ranjit needs 3.3.5 since Open IB has run successfully for others before 3.3.5 was available. Internally with our stack we have successfully done extensive testing with 3.3.3 Todd Rimmer From halr at voltaire.com Thu Jan 5 14:47:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 17:47:13 -0500 Subject: [openib-general] SA cache design In-Reply-To: References: Message-ID: <1136501232.4336.1382.camel@hal.voltaire.com> On Thu, 2006-01-05 at 16:51, Sean Hefty wrote: > I agree that this is a problem, but I my preference would be for a dedicated > kernel module to handle multicast join/leave requests. In addition to multicast, it's also service records and event subscriptions too. 
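
The reference counting discussed in this thread could look roughly
like the sketch below. It is illustrative only: the table, the
function names and the fixed-size array are hypothetical, group
identity is reduced to a 64-bit key rather than a full MGID, and the
return values merely indicate when a real join or leave request would
have to be sent to the SA. The same pattern would presumably apply to
service records and event subscriptions.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 16 /* arbitrary size for the sketch */

struct mcast_ref {
	uint64_t group_id;     /* simplified stand-in for an MGID */
	unsigned int refcount; /* 0 means the slot is free */
};

static struct mcast_ref mcast_table[MAX_GROUPS];

/*
 * Take a reference on a group.
 * Returns 1 if this is the first local user (an SA join is needed),
 * 0 if the node is already a member, -1 if the table is full.
 */
static int mcast_ref_join(uint64_t group_id)
{
	struct mcast_ref *free_slot = NULL;
	int i;

	for (i = 0; i < MAX_GROUPS; i++) {
		struct mcast_ref *e = &mcast_table[i];

		if (e->refcount && e->group_id == group_id) {
			e->refcount++;
			return 0;
		}
		if (!e->refcount && !free_slot)
			free_slot = e;
	}
	if (!free_slot)
		return -1;

	free_slot->group_id = group_id;
	free_slot->refcount = 1;
	return 1;
}

/*
 * Drop a reference on a group.
 * Returns 1 if the last local user left (an SA leave is needed),
 * 0 if other users remain, -1 if the group was never joined.
 */
static int mcast_ref_leave(uint64_t group_id)
{
	int i;

	for (i = 0; i < MAX_GROUPS; i++) {
		struct mcast_ref *e = &mcast_table[i];

		if (e->refcount && e->group_id == group_id)
			return --e->refcount == 0 ? 1 : 0;
	}
	return -1;
}
```

Cleanup after a killed process would then amount to calling the leave
function once for every reference that process still held.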
-- Hal From sean.hefty at intel.com Thu Jan 5 15:00:40 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 15:00:40 -0800 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0952@mercury.infiniconsys.com> Message-ID: >- It is implemented in kernel mode > - while user mode may help during initial debug, it will be important for > kernel mode ULPs such as SRP, IPoIB and SDP to also make use of >these records Your kernel footprint is smaller than I expected, which is good. Note that with a MAD interface, kernel modules would still have access to any cached data. I also wanted to stick with usermode to allow saving the cache to disk, so that it would be available immediately after a reboot. (My assumption being that changes to the network topology would be rare, so we could optimize around a stable network design.) As a related topic, there will be a separate SA client interface defined that will generate SA query MADs for the user. - Sean From halr at voltaire.com Thu Jan 5 14:54:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 17:54:52 -0500 Subject: [openib-general] SA cache design In-Reply-To: References: Message-ID: <1136501691.4336.1447.camel@hal.voltaire.com> On Thu, 2006-01-05 at 17:04, Sean Hefty wrote: > >> I hadn't fully figured this out yet. I'm not sure if another MAD class is > >> needed or not. My goal is to implement this as transparent to the > >application > >> as possible without violating the spec, perhaps appearing as an SA on a > >> different LID. > > > >The LID for the (real) SA is determined from PortInfo:MasterSMLID so I > >don't see how this could be done that way. > > I didn't think that it was a requirement that the SA share the same LID as the > SM. For the precise language, see C15-0-1.24 p. 
923 IBA 1.2:

C15-0.1.24: It shall be possible to determine the location of SA from
any endport by sending a GMP to QP1 (the GSI) of the node identified by
the endport's PortInfo:MasterSMLID, using in the GMP the base LID of the
endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the
well-known Q_Key (0x8001_0000), and whichever of the default P_Keys
(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM
(Table 183 Initialization on page 868).

so I overstated it a bit but this needs to be obeyed.

Also,
C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to
C15-0.1.24: shall return all information needed to communicate with
Subnet Administration. Alternatively, valid GMPs for SA sent according
to C15-0.1.24: shall either return redirection responses providing all
such information, or shall be normally processed by SA.

-- Hal

From rdreier at cisco.com Thu Jan 5 15:15:10 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jan 2006 15:15:10 -0800
Subject: [openib-general] Re: ppc64 build failure (4778)
In-Reply-To: <20060105220248.GC2064@us.ibm.com> (Nishanth Aravamudan's message of "Thu, 5 Jan 2006 14:02:48 -0800")
References: <20060105191728.GA23370@us.ibm.com> <20060105195531.GA2064@us.ibm.com> <20060105220248.GC2064@us.ibm.com>
Message-ID:

Ugh, the latest git tree still has version 2.6.15, so there's no way
to tell if add_hotplug_env_var() has changed or not. This will fix
itself once 2.6.16-rc1 comes out in about 10 days, but I don't know of
a good way to fix it until then.

You can hack drivers/infiniband/core/sysfs.c by hand to change the

    #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16)

to

    #if 0

but then that won't build against stock 2.6.15.

I'm out of ideas for now...

 - R.

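
To make the problem above concrete, here is a small user-space sketch
of the version test. KERNEL_VERSION is redefined locally with the same
encoding the kernel headers use, and compat_branch_selected() is a
hypothetical helper for illustration, not something from sysfs.c. It
shows that a git tree which still reports 2.6.15 cannot be told apart
from stock 2.6.15 by version code alone.

```c
/* Same encoding as KERNEL_VERSION in the kernel's <linux/version.h>. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Hypothetical helper: nonzero when a guard of the form
 *     #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16)
 * would be taken for the given version code.
 */
static int compat_branch_selected(int linux_version_code)
{
	return linux_version_code < KERNEL_VERSION(2, 6, 16);
}
```

Both stock 2.6.15 and a post-rename git tree would pass
KERNEL_VERSION(2, 6, 15) here, so the guard selects the same branch
for both even though only one of them has add_uevent_var.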
From sean.hefty at intel.com Thu Jan 5 15:24:30 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 15:24:30 -0800 Subject: [openib-general] SA cache design In-Reply-To: <1136501691.4336.1447.camel@hal.voltaire.com> Message-ID: >For the precise language, see C15-0-1.24 p. 923 IBA 1.2: > > >C15-0.1.24: It shall be possible to determine the location of SA from >any >endport by sending a GMP to QP1 (the GSI) of the node identified by the >endport's PortInfo:MasterSMLID, using in the GMP the base LID of the >endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the >well-known Q_Key (0x8001_0000), and whichever of the default P_Keys >(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM >(Table 183 Initialization on page 868). > >so I overstated it a bit but this needs to be obeyed. Could each of the requests be redirected to different nodes? I can envision how the sa_cache could eventually build towards a distributed SA. >C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15- >0.1.24: shall return all information needed to communicate with Subnet >Administration. Alternatively, valid GMPs for SA sent according to C15- >0.1.24: shall either return redirection responses providing all such >information, or shall be normally processed by SA. Thanks for the references. - Sean From halr at voltaire.com Thu Jan 5 15:25:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jan 2006 18:25:19 -0500 Subject: [openib-general] SA cache design In-Reply-To: References: Message-ID: <1136503336.4336.1654.camel@hal.voltaire.com> On Thu, 2006-01-05 at 18:24, Sean Hefty wrote: > >For the precise language, see C15-0-1.24 p. 
923 IBA 1.2: > > > > > >C15-0.1.24: It shall be possible to determine the location of SA from > >any > >endport by sending a GMP to QP1 (the GSI) of the node identified by the > >endport's PortInfo:MasterSMLID, using in the GMP the base LID of the > >endport as the SLID, the endport's PortInfo:MasterSMSL as the SL, the > >well-known Q_Key (0x8001_0000), and whichever of the default P_Keys > >(0xFFFF or 0x7FFF) was placed in the endport's P_Key Table by the SM > >(Table 183 Initialization on page 868). > > > >so I overstated it a bit but this needs to be obeyed. > > Could each of the requests be redirected to different nodes? Yes. > I can envision how > the sa_cache could eventually build towards a distributed SA. I think a distributed SA is more like it rather than an SA cache. -- Hal > >C15-0.1.25: A SubnAdmGet(ClassPortInfo) sent according to C15- > >0.1.24: shall return all information needed to communicate with Subnet > >Administration. Alternatively, valid GMPs for SA sent according to C15- > >0.1.24: shall either return redirection responses providing all such > >information, or shall be normally processed by SA. > > Thanks for the references. > > - Sean > From ardavis at ichips.intel.com Thu Jan 5 15:34:47 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 05 Jan 2006 15:34:47 -0800 Subject: [openib-general] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's In-Reply-To: References: Message-ID: <43BDAD17.9050301@ichips.intel.com> Kanevsky, Arkady wrote: > Arlin, > nice proposal, thanks. > I have one high level question and a few specific technical ones. > > 1. Why do you want to provide this functionality via extension instead > of part of new DAT spec, say 2.0? > This will allow Consumers to use all events, operations, and > Provider/IA functionality uniformly instead > of via 2 separate layers. 
> This will also ensure that this basic
> functionality can be provided by all DAPL Provider
> the same way on DAPL and DAT layers.
> DAPL 2.0 is not done yet so we have time to incorporate that.
> DAPL 2.0 already introduced new functionality which is easy to beef up
> for your proposal.
> See DAT_DTOS for example. DAT_EVENT is also modified to handle remote
> invalidation so
> a small addition for Immediate data and Atomic ops is a sensible addition.
> This should simplify proposal significantly. As you will not need to
> introduce any new
> EXT structures.

As mentioned on the con-call, there are two separate items to consider
while looking at the proposal. The first is the ability to extend DAT
for specific provider value-add and the second is to validate the need
for general atomic and immediate data functionality in the basic set of
API's for all providers. I included atomics and immediate data as
examples since it is specific to one provider (IB), it includes
operations that require new ops, events, and event data types, and it
also provides a working model to validate the extension model from
request to completion events.

I would like to concentrate on getting consensus on the extension
proposal first if possible. Just try to think of the actual operations
as some opaque dat_ext_foobar_op().

>
> In general, extension route was intended for RNIC|HCA providers to
> expose HW capabilities beyond
> IBTA, iWARP and VIA standards. The standard RDMA functionality is best
> handled via spec addition.
> DAT 2.0 does it for FMR, remote and local memory invalidation as well
> as others.

True, but the extension route is not fully defined, documented, nor
implemented. This is what I would like to work on getting completed in
time for 2.0 if possible.

BTW: The existing implementation actually uses dapl_provider->extension
to store the hca_ptr but the specification states that it is reserved
for the provider's private use (8.2.1 in DAPL1.2 spec).
This is why I had to define another extension_func in the patch.

>
> I had posted a complete list of changes/addition to DAT 2.0 about a
> month ago.
> But we had not discussed yet version change from 1.3 to 2.0 nor how
> much backwards compatibility spec
> will provide.
>
> 2. What is IMMED_EVENT? is it just immediate data without any payload one?
> I suggest changing the name so it will not use "EVENT". Just call it
> NO_PAYLOAD.
> Do you want to support 2 different ways to deliver immediate data?
> One in event and one in data payload?
> Why? I would think that just an event way will do.

This was modeled after the immediate data discussions on the DAT
reflector based on iWARP requirements.

http://groups.yahoo.com/group/dat-discussions/message/3285

>
> 3. I suggest beefing up DAT_DTO_COMPLETION_EVENT_DATA and DAT_DTOS
> to convey which operation completed and return Immediate data if
> complete operation had immediate data.
> Since we already modified these 2 struct as part of DAT 2.0 change
> lets add your proposal to the change.
> This will allow Consumers to use single approach to deal with
> completions, extension to the current one
> but not a structural one. No need for DAT_EXTENSION_DATA,
> DAT_EXT_EVENT_TYPE, DAT_EXT_OP
> nor the whole mechanism for extended ops.

You still need extension types for the "other" value-add
operations/events that will not be accepted as standard and are vendor
specific.

I would like to defer the rest of the questions for now since they
touch on actual operations and not the extension mechanism. Although, I
do need to think about how to extend memory registration privileges.
Any suggestions?

> 4. What is the purpose of DAT_EXT_WRITE_CONFIRM_FLAG? Is it to expose
> IB round trip semantic?
> iWARP does not support immediate data. One can try to format payload
> to pass immediate data.
> Is that what you had in mind?
>
> What is the semantic meaning of the completion with this flag set?
> without flag set?
> Are extended flags are additonal values for COMPLETION_FLAGS? 2.4.1 > talks about extended flags > but where they are passed in is not defined. > DAT 2.0 extended them already for FMR barrier. I would prefer to > follow that route rather than creating a separate > extension completion flags. > > 5. Why do you need RECV_IMMED? If Immed data is delivered in event no > new Recv operation is needed. > If Consumer asks for immediate data in payload where in payload will > it be? > If this is needed for local match for remote RDMA_Write to handle > immediate data lets state so. > > What happens for mismatch between local and remote op? That is recv > was posted for Send and RDMA_Write > "arrived"? Vice Versa? > > 6. I see extension for immediate data for rdma_write but not for send. > Is this deliberate? If we are going > to extend DAT semantic to support Immediate data we can as well > support the full IBTA/iWARP functionality for it. > > 7. Currently memory registration do not support access to LMR or RMR > by Atomic ops. > Do you propose to extend the meaning of current MEM_PRIV for LMR and > RMR to covers atomic accesses > or add new values to LMR_MEM_PRIV and RMR_MEM_PRIV for atomic > operation support? > > 8. Any alignment requirements for memory used for atomic ops? > > 9. Any correlation requirements for SRQ buffers to support recv with > immediate data? > > Have a great holidays, > Arkady > > > Arkady Kanevsky email: arkady at netapp.com > > > Network Appliance Inc. phone: 781-768-5395 > > 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 > > Waltham, MA 02451 central phone: 781-768-5300 > > From trimmer at silverstorm.com Thu Jan 5 15:36:13 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 5 Jan 2006 18:36:13 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> > From: Sean Hefty [mailto:sean.hefty at intel.com] > Your kernel footprint is smaller than I expected, which is > good. The key is that while there are O(N^2) path records in a fabric, only O(N) are of interest to a given node. Hence if you only replicate entries where this node is the source the size of the replica is significantly smaller. If someone is curious and wants to see all path records in the system, that would be a query you would let go through to the SA (and it would be a very infrequent query since no real world app, beyond fabric debug tools, would care about the paths which don't involve the node making the query). This of course implies the "SA Mux" must analyze more than just the attribute ID to determine if the replica can handle the query. But the memory savings is well worth the extra level of filtering. > Note that with > a MAD interface, kernel modules would still have access to > any cached data. I > also wanted to stick with usermode to allow saving the cache > to disk, so that it > would be available immediately after a reboot. (My > assumption being that > changes to the network topology would be rare, so we could > optimize around a > stable network design.) It is risky to assume that PathRecords would stay the same across a node reboot. It is very likely that the SM could assign different LIDs or if the node is down for an extended period other things in the fabric could have significantly changed. > > As a related topic, there will be a separate SA client > interface defined that > will generate SA query MADs for the user. 
Given the complexity of the RMPP protocol and the subtle bugs which everyone has encountered while implementing and debugging it (timeouts, retries, abort, window size management, class header offset, etc), it would be best to limit the number of copies of this protocol within the system. Keeping the RMPP details hidden just in the kernel would be best. An analogy might be the way sockets hides the details of the TCP/IP protocol from applications. While I'm not aware of any changes in the works, we all remember the significant changes which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area. If any similar significant revision to the protocol occurred it would be best to have it all implemented in just one place. my $0.02 Todd Rimmer From nacc at us.ibm.com Thu Jan 5 15:37:46 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 5 Jan 2006 15:37:46 -0800 Subject: [openib-general] Re: ppc64 build failure (4778) In-Reply-To: References: <20060105191728.GA23370@us.ibm.com> <20060105195531.GA2064@us.ibm.com> <20060105220248.GC2064@us.ibm.com> Message-ID: <20060105233746.GD2064@us.ibm.com> On 05.01.2006 [15:15:10 -0800], Roland Dreier wrote: > Ugh, the latest git tree still has version 2.6.15, so there's no way > to tell if add_hotplug_env_var() has changed or not. This will fix > itself once 2.6.16-rc1 comes out in about 10 days, but I don't know of > a good way to fix it until then. > > You can hack drivers/infiniband/core/sysfs.c by hand to change the > > #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) > > to > > #if 0 > > but then that won't build against stock 2.6.15. True -- but with my handy-dandy modified script, I can just build current svn against 2.6.15 until 2.6.16-rc1 comes out. Yay :) I have valid (mostly) results from 4 separate runs of all the tests relative to the svn tree. I will try to get them in a reasonable format soon. 
Thanks, Nish From rdreier at cisco.com Thu Jan 5 16:12:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jan 2006 16:12:11 -0800 Subject: [openib-general] Re: [PATCH] mthca: check return value in mthca_dev_lim call In-Reply-To: <20051219120049.GA4858@mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 Dec 2005 14:00:49 +0200") References: <20051219120049.GA4858@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Jan 5 16:13:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jan 2006 16:13:53 -0800 Subject: [openib-general] Re: [PATCH] mthca: check port validity in modify_qp In-Reply-To: <20060101104952.GA3082@mellanox.co.il> (Jack Morgenstein's message of "Sun, 1 Jan 2006 12:49:52 +0200") References: <20060101104952.GA3082@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Jan 5 16:17:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jan 2006 16:17:45 -0800 Subject: [openib-general] Re: [PATCH] mthca: create_eq with size not a pow of 2 In-Reply-To: <20060105162946.GM2790@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 5 Jan 2006 18:29:46 +0200") References: <20060105162946.GM2790@mellanox.co.il> Message-ID: Thanks, applied. From rolandd at cisco.com Thu Jan 5 16:19:41 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 00:19:41 +0000 Subject: [openib-general] [git patch review 1/4] IB/mthca: fix WQE size calculation in create-srq Message-ID: <1136506781419-a5fabee982034082@cisco.com> Thinko: 64 bytes is the minimum SRQ WQE size (not the maximum). 
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_srq.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) 1d7d2f6f476cf7aa65f9f740a6c932fb75608110 diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index f7d2342..e7e153d 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -201,7 +201,7 @@ int mthca_alloc_srq(struct mthca_dev *de if (mthca_is_memfree(dev)) srq->max = roundup_pow_of_two(srq->max + 1); - ds = min(64UL, + ds = max(64UL, roundup_pow_of_two(sizeof (struct mthca_next_seg) + srq->max_gs * sizeof (struct mthca_data_seg))); srq->wqe_shift = long_log2(ds); -- 0.99.9n From rolandd at cisco.com Thu Jan 5 16:19:41 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 00:19:41 +0000 Subject: [openib-general] [git patch review 2/4] IB/mthca: check return value in mthca_dev_lim call In-Reply-To: <1136506781419-a5fabee982034082@cisco.com> Message-ID: <1136506781419-2b71405b820f1e9d@cisco.com> Check error return on call to mthca_dev_lim for Tavor (as is done for memfree). 
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_main.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) aa2f9367790ad81ef51d3f667124227ca3003d3b diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 6f94b25..8b00d9a 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -261,6 +261,10 @@ static int __devinit mthca_init_tavor(st } err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_disable; + } profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; -- 0.99.9n From rolandd at cisco.com Thu Jan 5 16:19:41 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 00:19:41 +0000 Subject: [openib-general] [git patch review 3/4] IB/mthca: check port validity in modify_qp In-Reply-To: <1136506781419-2b71405b820f1e9d@cisco.com> Message-ID: <1136506781419-5ed071c9c7a80e29@cisco.com> Modify_qp should check that the physical port number provided is a legal value. 
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) 38d1e793471d95728219f500bbb8bd25658d73b0 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index d786ef4..ea45fa4 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -621,6 +621,12 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } + if ((attr_mask & IB_QP_PORT) && + (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) { + mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num); + return -EINVAL; + } + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", -- 0.99.9n From rolandd at cisco.com Thu Jan 5 16:19:41 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 00:19:41 +0000 Subject: [openib-general] [git patch review 4/4] IB/mthca: create_eq with size not a power of 2 In-Reply-To: <1136506781419-5ed071c9c7a80e29@cisco.com> Message-ID: <1136506781420-ac8ffb517687652c@cisco.com> Fix mthca_create_eq for when the EQ size is not a power of 2. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_eq.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) 466200562ccd80f728f7ef602d2b97b4fdedd566 diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 34d68e5..e8a948f 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -484,8 +484,7 @@ static int __devinit mthca_create_eq(str u8 intr, struct mthca_eq *eq) { - int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / - PAGE_SIZE; + int npages; u64 *dma_list = NULL; dma_addr_t t; struct mthca_mailbox *mailbox; @@ -496,6 +495,7 @@ static int __devinit mthca_create_eq(str eq->dev = dev; eq->nent = roundup_pow_of_two(max(nent, 2)); + npages = ALIGN(eq->nent * MTHCA_EQ_ENTRY_SIZE, PAGE_SIZE) / PAGE_SIZE; eq->page_list = kmalloc(npages * sizeof *eq->page_list, GFP_KERNEL); -- 0.99.9n From Richard.Frank at oracle.com Thu Jan 5 16:38:47 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 05 Jan 2006 19:38:47 -0500 Subject: [openib-general] SDP - What are the platforms that support SDP ? In-Reply-To: <20060105202754.GC23796@esmail.cup.hp.com> References: <1136491183.5216.10.camel@localhost.localdomain> <20060105202754.GC23796@esmail.cup.hp.com> Message-ID: <1136507927.5216.22.camel@localhost.localdomain> What platforms does IT-API inter-operate with ? OK - for HPUX we can fall back to normal TCP / sockets - for the stream mode cases. Any platform that supports SDP will have a distinct performance advantage - especially if it supports zero copy. W.R.T. RDS - we are moving to RDS as a replacement for IT-API / uDAPL / and standard UDP. Again any platform with support for RDS will have a significant performance advantage. We will fall back to running with UDP on HPUX. On Thu, 2006-01-05 at 12:27 -0800, Grant Grundler wrote: > On Thu, Jan 05, 2006 at 02:59:43PM -0500, Richard Frank wrote: > > Besides OpenIB for Linux and Windows ? 
> > HP-UX ? > > Almost certainly not for HPUX. > Oracle should plan on continuing to use existing IT-API interface. > I'm told "it's known to work" and meets HP's requirements > (which RDS does not AFAICT). > > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Jan 5 16:41:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 16:41:59 -0800 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> Message-ID: >> Note that with >> a MAD interface, kernel modules would still have access to >> any cached data. I >> also wanted to stick with usermode to allow saving the cache >> to disk, so that it >> would be available immediately after a reboot. (My >> assumption being that >> changes to the network topology would be rare, so we could >> optimize around a >> stable network design.) >It is risky to assume that PathRecords would stay the same across a node >reboot. It is very likely that the SM could assign different LIDs or if the >node is down for an extended period other things in the fabric could have >significantly changed. OpenSM currently maintains LIDs between system reboots, and I believe that this is desirable for fast fabric bring-up. And I believe that this is a desirable feature for any SM to have. In any case, a local LID change is trivial to detect and can easily be used to invalidate the entire cache. Likewise, the cache could automatically be flushed if not updated for some specified time period, or if some other defined event occurred - such as a GUID change on the local HCA. Overall, I think that the risk here is low. 
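The invalidation rules listed above (local LID change, local HCA GUID change, cache not refreshed within some time limit) could be sketched roughly as follows. This is only an illustration: struct sa_cache_stamp, sa_cache_is_stale() and the max_age parameter are hypothetical names and do not come from the OpenIB tree.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stamp saved to disk next to the cached path records. */
struct sa_cache_stamp {
	uint16_t local_lid;    /* LID of the local port when the cache was written */
	uint64_t local_guid;   /* GUID of the local HCA port at that time */
	time_t   last_update;  /* when the cache was last refreshed from the SA */
};

/*
 * Return true if the saved cache must be thrown away:
 *  - the local LID changed (e.g. the SM reassigned LIDs),
 *  - the local GUID changed (e.g. the HCA was swapped),
 *  - the cache has not been refreshed for more than max_age seconds.
 */
static bool sa_cache_is_stale(const struct sa_cache_stamp *saved,
			      uint16_t cur_lid, uint64_t cur_guid,
			      time_t now, time_t max_age)
{
	if (saved->local_lid != cur_lid)
		return true;
	if (saved->local_guid != cur_guid)
		return true;
	if (now - saved->last_update > max_age)
		return true;
	return false;
}
```

A cache loaded after reboot would run this check once before serving any query, and fall back to querying the SA if it reports stale.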
>> As a related topic, there will be a separate SA client >> interface defined that >> will generate SA query MADs for the user. >Given the complexity of the RMPP protocol and the subtle bugs which everyone >has encountered while implementing and debugging it (timeouts, retries, abort, >window size management, class header offset, etc), it would be best to limit >the number of copies of this protocol within the system. Keeping the RMPP >details hidden just in the kernel would be best. An analogy might be the way >sockets hides the details of the TCP/IP protocol from applications. While I'm >not aware of any changes in the works, we all remember the significant changes >which occurred between IBTA 1.0 and IBTA 1.1 in the RMPP area. If any similar >significant revision to the protocol occurred it would be best to have it all >implemented in just one place. RMPP is implemented by the MAD layer, and is hidden to any clients using the MAD services. There will still only be a single RMPP implementation in the stack. - Sean From sean.hefty at intel.com Thu Jan 5 16:43:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jan 2006 16:43:39 -0800 Subject: [openib-general] SDP - What are the platforms that support SDP ? In-Reply-To: <1136507927.5216.22.camel@localhost.localdomain> Message-ID: >Any platform that supports SDP will have a distinct performance >advantage - especially if it supports zero copy. > >W.R.T. RDS - we are moving to RDS as a replacement for IT-API / uDAPL / >and standard UDP. Are you planning on using SDP or RDS? What platforms will have RDS that will not have SDP? - Sean From Thomas.Duffy.99 at alumni.brown.edu Thu Jan 5 16:53:54 2006 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 5 Jan 2006 16:53:54 -0800 Subject: [openib-general] SDP - What are the platforms that support SDP ? 
In-Reply-To: <1136491183.5216.10.camel@localhost.localdomain> References: <1136491183.5216.10.camel@localhost.localdomain> Message-ID: On Jan 5, 2006, at 11:59 AM, Richard Frank wrote: > Besides OpenIB for Linux and Windows ? > > Solaris ? Sun had a fully working SDP for Solaris that got cut from Solaris 10 right at the last moment. I am not sure what the current status is, but I know there were folks that were trying to resurrect it for Solaris 11 if not a 10 update. Perhaps the code could be put out as part of OpenSolaris? -tduffy From Richard.Frank at oracle.com Thu Jan 5 17:07:50 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 05 Jan 2006 20:07:50 -0500 Subject: [openib-general] SDP - What are the platforms that support SDP ? In-Reply-To: References: Message-ID: <1136509670.5216.34.camel@localhost.localdomain> We need both - each for different Oracle clients / functionality with respective connection models / modes of operation (stream vs datagram). BTW - Oracle currently uses TCP streams / SDP for Client / middle tier connectivity to the database. We use UDP / RDS within the database for inter database instance communication. We are planning on using TCP streams / SDP for additional functionality - specifically for its AIO + zero copy capability - on platforms that support it. On Thu, 2006-01-05 at 16:43 -0800, Sean Hefty wrote: > >Any platform that supports SDP will have a distinct performance > >advantage - especially if it supports zero copy. > > > >W.R.T. RDS - we are moving to RDS as a replacement for IT-API / uDAPL / > >and standard UDP. > > Are you planning on using SDP or RDS? What platforms will have RDS that will > not have SDP? 
> > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From iod00d at hp.com Thu Jan 5 17:40:45 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 5 Jan 2006 17:40:45 -0800 Subject: [openib-general] Re: Failure in reset HCA withbackport-svn4507-to-2.6.9 In-Reply-To: References: <96f8e60e0601051440t69782459k113307dd82a1b559@mail.gmail.com> Message-ID: <20060106014045.GI23796@esmail.cup.hp.com> On Thu, Jan 05, 2006 at 02:46:42PM -0800, Bob Woodruff wrote: > Ranjit wrote, > >fw rev: 3.03.0003rc16b > > I think that 4.7 is the latest (at least for the PCI-E cards) > and 3.2 ( for the PCI-X cards) Latest for PCI-X is 3.3.5. Latest for PCI-e is 4.7.4.
The 3.3.3 is likely to refer to a PCI-X card. See the openib wiki page: https://openib.org/tiki/tiki-index.php?page=MellanoxHcaFirmware (also has link to mellanox firmware support site) grant > and I think the latest FW is required for correct operation with the openIB > stack. > Michael is this correct ? > > woody > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Richard.Frank at oracle.com Thu Jan 5 18:34:33 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 05 Jan 2006 21:34:33 -0500 Subject: [openib-general] SDP - What are the platforms that support SDP ? In-Reply-To: <1136509670.5216.34.camel@localhost.localdomain> References: <1136509670.5216.34.camel@localhost.localdomain> Message-ID: <1136514873.5216.48.camel@localhost.localdomain> We also use TCP streams for disaster recovery archiving involving very large amounts of data. We would like to move this to SDP via AIO too. On Thu, 2006-01-05 at 20:07 -0500, Richard Frank wrote: > We need both - each for different Oracle clients / functionality with > respective connection models / modes of operation (stream vs datagram). > > BTW - Oracle currently uses TCP streams / SDP for Client / middle tier > connectivity to the database. We use UDP / RDS within the database for > inter database instance communication. > > We are planning on using TCP streams / SDP for additional functionality > - specifically for its AIO + zero copy capability - on platforms that > support it. > > > On Thu, 2006-01-05 at 16:43 -0800, Sean Hefty wrote: > > >Any platform that supports SDP will have a distinct performance > > >advantage - especially if it supports zero copy. > > > > > >W.R.T. RDS - we are moving to RDS as a replacement for IT-API / uDAPL / > > >and standard UDP. > > > > Are you planning on using SDP or RDS? 
What platforms will have RDS that will > > not have SDP? > > > > - Sean > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From huanwei at cse.ohio-state.edu Thu Jan 5 19:21:36 2006 From: huanwei at cse.ohio-state.edu (wei huang) Date: Thu, 5 Jan 2006 22:21:36 -0500 (EST) Subject: [openib-general] *** glibc detected *** corrupted double-linked list error In-Reply-To: Message-ID: Hi Roland, Sorry we were distracted by some other work so I did not respond your email. Yes, unfortunately I still see the problem. I get the core dump, but I am not sure what exactly information is helpful to you. Anyway, here is some of the output from the core dump: ================================================================== #x0-gen2# /home/4/huangwei/new_cache/mvapich2-new-cache/test/mpi/rma> ../../../bin/mpiexec -gdb -n 2 ./test2 0-1: (gdb) core core.4576 0-1: Core was generated by `./test2'. 0-1: Program terminated with signal 6, Aborted. 0-1: Reading symbols from /usr/local/lib/libibverbs.so.1...done. 0-1: Loaded symbols for /usr/local/lib/libibverbs.so.1 0-1: Reading symbols from /lib64/tls/libpthread.so.0...done. 0-1: Loaded symbols for /lib64/tls/libpthread.so.0 0-1: Reading symbols from /lib64/tls/librt.so.1...done. 0-1: Loaded symbols for /lib64/tls/librt.so.1 0-1: Reading symbols from /lib64/tls/libc.so.6...done. 0-1: Loaded symbols for /lib64/tls/libc.so.6 0-1: Reading symbols from /usr/lib64/libsysfs.so.1...done. 0-1: Loaded symbols for /usr/lib64/libsysfs.so.1 0-1: Reading symbols from /lib64/libdl.so.2...done. 
0-1: Loaded symbols for /lib64/libdl.so.2 0-1: Reading symbols from /lib64/ld-linux-x86-64.so.2...done. 0-1: Loaded symbols for /lib64/ld-linux-x86-64.so.2 0-1: Reading symbols from /usr/local/lib/infiniband/mthca.so...done. 0-1: Loaded symbols for /usr/local/lib/infiniband/mthca.so 0-1: Reading symbols from /lib64/libnss_files.so.2...done. 0-1: Loaded symbols for /lib64/libnss_files.so.2 0: #0 0x0000003dd9a2e4dd in ?? () 1: #0 0x0000003dd9a2e4dd in raise () from /lib64/tls/libc.so.6 0-1: (gdb) where 0: #0 0x0000003dd9a2e4dd in ?? () 1: #0 0x0000003dd9a2e4dd in raise () from /lib64/tls/libc.so.6 0: #1 0x0000003dd9a2fc8e in ?? () 1: #1 0x0000003dd9a2fc8e in abort () from /lib64/tls/libc.so.6 0: #2 0x0000000000000020 in ?? () 1: #2 0x0000003dd9a62b41 in __libc_message () from /lib64/tls/libc.so.6 0: #3 0x0000000000000000 in ?? () 1: #3 0x0000003dd9a67da1 in malloc_consolidate () from /lib64/tls/libc.so.6 0: (gdb) 1: #4 0x0000003dd9a684d6 in _int_free () from /lib64/tls/libc.so.6 1: #5 0x0000003dd9a68a06 in free () from /lib64/tls/libc.so.6 1: #6 0x00002aaaaace7765 in mthca_free_db_tab (db_tab=0x5b88f0) 1: at src/memfree.c:201 1: #7 0x00002aaaaace79b3 in mthca_free_context (ibctx=0x5b6b50) 1: at src/mthca.c:206 1: #8 0x00002aaaaaaafce0 in ibv_close_device (context=0x5b6b50) 1: at src/device.c:151 1: #9 0x00000000004424ec in MPIDI_CH3I_RMDA_finalize () at rdma_iba_init.c:897 1: #10 0x000000000043e9c5 in MPIDI_CH3_Finalize () at ch3_finalize.c:43 1: #11 0x0000000000422e10 in MPID_Finalize () at mpid_finalize.c:157 1: #12 0x000000000040f536 in PMPI_Finalize () at finalize.c:145 1: #13 0x00000000004035a2 in main (argc=1, argv=0x7fffffc86558) at test2.c:74 Would you please let me know how I can provide more valuable information? Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. 
of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Wed, 4 Jan 2006, Roland Dreier wrote: > wei> Hi, We encountered the following error when we call > wei> ibv_close_device: *** glibc detected *** corrupted > wei> double-linked list: 0x0000000000a54e10 *** > > Any further information on this? Are you still seeing the problem? > > - R. > From eitan at mellanox.co.il Fri Jan 6 00:53:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 10:53:10 +0200 Subject: [openib-general] SA cache design In-Reply-To: References: Message-ID: <43BE2FF6.2060506@mellanox.co.il> Hi Sean, Please see below. Sean Hefty wrote: >>* Regarding the sentence:"Clients would send their queries to the sa_cache >>instead of the SA" >> I would propose that a "SA MAD send switch" be implemented in the core: Such >>a switch >> will enable plugging in the SA cache (I would prefer calling it SA local >>agent due to >> its extended functionality). Once plugged in, this "SA local agent" should >>be forwarded all >> outgoing SA queries. Once it handles the MAD it should be able to inject the >>response through >> the core "SA MAD send switch" as if they arrived from the wire. > > > This was my thought as well. I hesitated to refer to the cache as a local > agent, since that's an implementation detail. I want to allow the possibility > for the cache to reside on another system. For the initial implementation, the > cache would be local however. So if the cache is on another host - a new kind of MAD will have to be sent on behalf of the original request? > > >>Functional requirements: >>* It is clear that the first SA query to cache is PathRecord. > > > This will be the first cached query in the initial check-in. > > >> So if a new client wants to connect to another node a new PathRecord >> query will not need to be sent to the SA. 
However, recent work on QoS has >>pointed out >> that under some QoS schemes PathRecord should not be shared by different >>clients > > > I'm not sure that QoS handling is the responsibility of the cache. The module > requesting the path records should probably deal with this. In IB, QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, PathBits (LMC bits). So where traditionally we had a PathRecord requested for each Src->Dst port, we will now need to track at least: Src->Dst * #QoS-levels. (A non-optimal implementation will require even more: #Src->Dst * #Clients * #Servers * #Services.) > > >>* Forgive me for bringing the following issue - over and over to the group: >> Multicast Join/Leave should be reference counted. The "SA local agent" could >>be >> the right place for doing this kind of reference counting (actually if it >>does that >> it probably needs to be located in the Kernel - to enable cleanup after >>killed processes). > > > I agree that this is a problem, but I my preference would be for a dedicated > kernel module to handle multicast join/leave requests. Since we already sniff into the SA queries, it makes sense to have the same code also handle other functionality that requires sniffing into the SA requests. As Hal points out, this involves ServiceRecord, Multicast Join/Leave and InformInfo requests. Multicast Join/Leave actually behaves like a cache: if a "join" to the same MGID already took place (no leave yet), then there is no need to send the new request to the SA.
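To make the reference-counting idea above concrete, here is a minimal sketch, assuming a small fixed-size table keyed by the 16-byte MGID. The names (mcast_ref, mcast_join, mcast_leave, MAX_GROUPS) are hypothetical and only illustrate the technique; a real agent would also need locking and per-process cleanup.

```c
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 8	/* arbitrary bound, just for this sketch */

struct mcast_ref {
	uint8_t      mgid[16];
	unsigned int refs;	/* 0 means the slot is free */
};

static struct mcast_ref groups[MAX_GROUPS];

static struct mcast_ref *find_group(const uint8_t mgid[16])
{
	for (int i = 0; i < MAX_GROUPS; i++)
		if (groups[i].refs && memcmp(groups[i].mgid, mgid, 16) == 0)
			return &groups[i];
	return NULL;
}

/*
 * Returns 1 if a join request must be sent to the SA (first local joiner),
 * 0 if the join is satisfied locally, -1 if the table is full.
 */
static int mcast_join(const uint8_t mgid[16])
{
	struct mcast_ref *g = find_group(mgid);

	if (g) {
		g->refs++;
		return 0;
	}
	for (int i = 0; i < MAX_GROUPS; i++) {
		if (groups[i].refs == 0) {
			memcpy(groups[i].mgid, mgid, 16);
			groups[i].refs = 1;
			return 1;
		}
	}
	return -1;
}

/*
 * Returns 1 if a leave request must be sent to the SA (last local member
 * left), 0 if other local members remain, -1 if the group was not joined.
 */
static int mcast_leave(const uint8_t mgid[16])
{
	struct mcast_ref *g = find_group(mgid);

	if (!g)
		return -1;
	if (--g->refs == 0)
		return 1;
	return 0;
}
```

Only the first join and the last leave for a given MGID would reach the SA; everything in between is absorbed locally.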
> > - Sean > From eitan at mellanox.co.il Fri Jan 6 01:06:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 11:06:15 +0200 Subject: [openib-general] SA cache design In-Reply-To: References: Message-ID: <43BE3307.1090606@mellanox.co.il> Hi Sean, Todd, Although I like the "replica" idea for its "query" performance boost - I suspect it will actually do not scale for very large networks: Each node has to query for the entire database would cause N^2 load on the SA. After any change (which do happen with higher probability on large networks) the SA will need to send each Report to N targets. We already have some bad experience with large clusters SA query issues, like the one reported by Roland "searching for SRP targets using PortInfo capability mask". Eitan Sean Hefty wrote: >>- It is implemented in kernel mode >> - while user mode may help during initial debug, it will be important > > for > >> kernel mode ULPs such as SRP, IPoIB and SDP to also make use of >>these records > > > Your kernel footprint is smaller than I expected, which is good. Note that with > a MAD interface, kernel modules would still have access to any cached data. I > also wanted to stick with usermode to allow saving the cache to disk, so that it > would be available immediately after a reboot. (My assumption being that > changes to the network topology would be rare, so we could optimize around a > stable network design.) > > As a related topic, there will be a separate SA client interface defined that > will generate SA query MADs for the user. 
> > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From trimmer at silverstorm.com Fri Jan 6 05:50:33 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 6 Jan 2006 08:50:33 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0955@mercury.infiniconsys.com> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Hi Sean, Todd, > > Although I like the "replica" idea for its "query" > performance boost - I suspect it will actually do not scale > for very large > networks: Each node has to query for the entire database > would cause N^2 load on the SA. > After any change (which do happen with higher probability on > large networks) the SA will need to send each Report to N targets. > > We already have some bad experience with large clusters SA > query issues, like the one reported by Roland > "searching for SRP targets using PortInfo capability mask". > Our experience has been the exact opposite. There is an initial load on the SA to populate the replica (which we have used various techniques to reduce, such as backing off when the SA reports Busy, having a random time offset of start of query, etc.), but the boost occurs when a new application starts, such as an MPI using the SA/CM to establish connections as per the IBTA spec. A 1000 process MPI job would have each process make 999 queries to the SA at job startup time. This causes a burst of 999,000 sets of SA queries (most will involve both Node Record and Path record queries so it will really be 2x this amount), BEFORE the MPI job can actually start. As Open IB moves forward to implement QOS and other features, MPI will have to use the SA to get its path records.
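The startup burst described above is simple all-to-all arithmetic: each of N processes queries the SA about its N - 1 peers. A minimal sketch, assuming one set of queries per peer and a hypothetical helper name startup_queries:

```c
/*
 * Number of SA queries issued when every process in a job looks up
 * every other process. records_per_peer is 1 for a single query per
 * peer, 2 when both a Node Record and a Path Record are fetched.
 * Hypothetical helper, for illustration only.
 */
unsigned long startup_queries(unsigned long nprocs,
                              unsigned long records_per_peer)
{
    if (nprocs == 0)
        return 0;
    return nprocs * (nprocs - 1) * records_per_peer;
}
```

For a 1000 process job this gives 1000 * 999 = 999,000 sets of queries, or 1,998,000 individual queries when each set contains two records.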
If you study MVAPICH at present, it merely exchanges LIDs between nodes and hardcodes all the other QOS parameters (or uses environment variables to set the same value for all processes). In a true QOS and congestion management environment it will instead have to use the CM/SA. We have been using this replica technique quite successfully for 2-3 years now. Our MPI has used the SA/CM for connection establishment for just as long. As was pointed out, most fabrics will be quite stable. Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup. Todd Rimmer From halr at voltaire.com Fri Jan 6 05:43:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 08:43:06 -0500 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> Message-ID: <1136554985.4336.8389.camel@hal.voltaire.com> On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > This of course implies the "SA Mux" must analyze more than just > the attribute ID to determine if the replica can handle the query. > But the memory savings is well worth the extra level of filtering. If the SA cache does this, it seems it would be pretty simple to return this info in an attribute to the client so the client would know when to go to the cache/replica and when to go direct to the SA in the case where only certain queries are supported. Wouldn't this be advantageous when the replica doesn't support all queries ?
-- Hal From trimmer at silverstorm.com Fri Jan 6 06:05:52 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 6 Jan 2006 09:05:52 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> > From: Hal Rosenstock [mailto:halr at voltaire.com] > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > This of course implies the "SA Mux" must analyze more than just > > the attribute ID to determine if the replica can handle the query. > > But the memory savings is well worth the extra level of filtering. > > If the SA cache does this, it seems it would be pretty simple > to return > this info in an attribute to the client so the client would > know when to > go to the cache/replica and when to go direct to the SA in the case > where only certain queries are supported. Wouldn't this be > advantageous > when the replica doesn't support all queries ? Why put the burden on the application. give the query to the Mux. With an optional flag indicating a prefered "routing" (choices of: to SA, to replica, let Mux decide). Then let it decide. As you suggest it may be simplest to let the Mux try the replica and on failure fallback to the SA transparent to the app (sort of the way SDP intercepts socket ops and falls back to TCP/IP when SDP isn't appropriate). Todd R. 
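The routing choice described above (an optional preference flag, with the Mux falling back to the SA when the replica cannot answer) might look roughly like the sketch below. Everything here is an assumption made for illustration: the enum and struct names, the component-mask bits, and the policy that the replica only answers PathRecord queries carrying both an SGID and a DGID. The point is only that the decision looks at more than the attribute ID and that the fallback is invisible to the caller.

```c
#include <stdint.h>

/* Hypothetical names throughout; none of these come from an existing API. */
enum route_pref   { ROUTE_MUX_DECIDES, ROUTE_TO_SA, ROUTE_TO_REPLICA };
enum route_target { TARGET_SA, TARGET_REPLICA };

#define ATTR_PATH_RECORD 0x0035u    /* illustrative attribute ID value */
#define Q_HAS_SGID       (1u << 0)  /* hypothetical component-mask bits */
#define Q_HAS_DGID       (1u << 1)

struct sa_query {
    uint16_t attr_id;    /* which SA attribute is being queried */
    uint32_t comp_mask;  /* which fields the caller filled in */
};

/*
 * Assumed replica policy: it only holds PathRecords and can only answer
 * when both endpoints are given. A real replica would have its own rules.
 */
static int replica_can_answer(const struct sa_query *q)
{
    uint32_t need = Q_HAS_SGID | Q_HAS_DGID;

    return q->attr_id == ATTR_PATH_RECORD && (q->comp_mask & need) == need;
}

/*
 * Pick a destination for the query. An explicit ROUTE_TO_SA is always
 * honoured; otherwise the replica is used when it can answer and the SA
 * is the transparent fallback.
 */
enum route_target mux_route(const struct sa_query *q, enum route_pref pref)
{
    if (pref == ROUTE_TO_SA)
        return TARGET_SA;
    return replica_can_answer(q) ? TARGET_REPLICA : TARGET_SA;
}
```

In this sketch ROUTE_TO_REPLICA and ROUTE_MUX_DECIDES behave the same; a stricter variant could instead return an error when the replica is requested explicitly but cannot answer.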
From halr at voltaire.com Fri Jan 6 06:09:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 09:09:57 -0500 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> Message-ID: <1136556596.4336.8644.camel@hal.voltaire.com> On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > > This of course implies the "SA Mux" must analyze more than just > > > the attribute ID to determine if the replica can handle the query. > > > But the memory savings is well worth the extra level of filtering. > > > > If the SA cache does this, it seems it would be pretty simple > > to return > > this info in an attribute to the client so the client would > > know when to > > go to the cache/replica and when to go direct to the SA in the case > > where only certain queries are supported. Wouldn't this be > > advantageous > > when the replica doesn't support all queries ? > > Why put the burden on the application. give the query to the Mux. That's what I'm suggesting. Rather than a binary switch mux, a more granular one which determines how to route the outgoing SA request. > With an optional flag indicating a prefered "routing" (choices of: to SA, > to replica, let Mux decide). Then let it decide. As you suggest it may > be simplest to let the Mux try the replica and on failure fallback > to the SA transparent to the app (sort of the way SDP intercepts > socket ops and falls back to TCP/IP when SDP isn't appropriate). It depends on whether the replica/cache forwards unsupported requests on or responds with not supported back to the client as to how this is handled. Sean was proposing the forward on model and a binary switch at the client. 
I think this is more granular and can be mux'd only with the knowledge of what a replica/cache supports (not sure about dealing with different replica/caches supporting a different set of queries; need to think more on how the caches are located, etc.). You are mentioning a third model here. -- Hal From halr at voltaire.com Fri Jan 6 06:29:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 09:29:08 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BE2FF6.2060506@mellanox.co.il> References: <43BE2FF6.2060506@mellanox.co.il> Message-ID: <1136557747.4336.8823.camel@hal.voltaire.com> Hi Eitan, [snip...] > >> So if a new client wants to connect to another node a new PathRecord > >> query will not need to be sent to the SA. However, recent work on QoS has > >>pointed out > >> that under some QoS schemes PathRecord should not be shared by different > >>clients > > > > > > I'm not sure that QoS handling is the responsibility of the cache. The module > > requesting the path records should probably deal with this. > In IB QoS properties are mainly the PathRecord parameters: SL, Rate, MTU, PathBits (LMC bits). > So if traditionally we had PathRecord requested for each Src->Dst port now we will need to > track at least: > Src->Dst * #QoS-levels. (a non optimal implementation will require even more: #Src->Dst * #Clients * #Servers * #Services). Perhaps QoS requests (I'm referring to those with the new proposed key) are not cached as I think this may end up with the cache needing to know the path record policies). I would propose deferring this aspect until the new QoS work is a little firmer and the cache direction in OpenIB is also a little firmer (e.g. QoS = phase 2 or beyond of this work). [snip...] 
-- Hal From mshefty at ichips.intel.com Fri Jan 6 09:45:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 09:45:34 -0800 Subject: [openib-general] Patch for possible bug in ib_create_ah_from_wc() In-Reply-To: <1136340365.5081.125.camel@brick.internal.keyresearch.com> References: <1136340365.5081.125.camel@brick.internal.keyresearch.com> Message-ID: <43BEACBE.5000909@ichips.intel.com> Ralph Campbell wrote: > It looks like ib_create_ah_from_wc() doesn't create the correct > return address (AH) when there is a GRH present (source & dest GIDs > need to be swapped). Your fix looks correct to me. Can you please resend with a signed-off-by line? - Sean From mshefty at ichips.intel.com Fri Jan 6 10:06:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 10:06:23 -0800 Subject: [openib-general] Re: [PATCH] CMA mods for iWARP In-Reply-To: <1136487432.10878.22.camel@trinity.austin.ammasso.com> References: <1136487432.10878.22.camel@trinity.austin.ammasso.com> Message-ID: <43BEB19F.5030603@ichips.intel.com> Tom Tucker wrote: > This patch is for CMA changes to support iWARP and is relative to the > trunk. It includes the latest ib_addr generalizations that allowed for > some simplification in the rdma_resolve_addr implementation. This patch > needs the include file patch to compile. Thanks - I would prefer hold off committing these changes for a few days until we can submit a patch to merge the existing rdma_cm and ib_addr upstream. Only a couple of minor nits below. 
> +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) > +{ > + struct rdma_id_private *id_priv = iw_id->context; > + enum rdma_cm_event_type event_type = 0; > + int ret = 0; > + > + atomic_inc(&id_priv->dev_remove); > + > + switch (event->event) { > + case IW_CM_EVENT_LLP_DISCONNECT: > + case IW_CM_EVENT_LLP_RESET: > + case IW_CM_EVENT_LLP_TIMEOUT: > + case IW_CM_EVENT_CLOSE: > + event_type = RDMA_CM_EVENT_DISCONNECTED; > + break; > + > + case IW_CM_EVENT_CONNECT_REQUEST: > + BUG_ON(1); > + break; Maybe move the BUG_ON to a default case? > +static int cma_iw_listen(struct rdma_id_private *id_priv) > +{ > + int ret; > + struct sockaddr_in* sin; > + > + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, > + iw_conn_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id.iw)) > + return PTR_ERR(id_priv->cm_id.iw); > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; > + id_priv->cm_id.iw->local_addr = *sin; > + > + ret = iw_cm_listen(id_priv->cm_id.iw, 10 /* backlog */); rdma_listen() has a backlog parameter that can be used here. - Sean From mshefty at ichips.intel.com Fri Jan 6 10:19:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 10:19:59 -0800 Subject: [openib-general] merge rdma_cm and ib_addr upstream Message-ID: <43BEB4CF.1020103@ichips.intel.com> Roland, I think that we're ready to try to merge the CMA upstream. I don't believe that any of the kernel ULP consumers are ready to be merged yet, but we could submit the userspace CMA support as a consumer of the API. What is your recommendation on how best to proceed? 
- Sean From ralphc at pathscale.com Fri Jan 6 10:50:32 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 06 Jan 2006 10:50:32 -0800 Subject: [openib-general] [PATCH] bug in ib_create_ah_from_wc() when creating GRH reply address Message-ID: <1136573432.5081.164.camel@brick.internal.keyresearch.com> ib_create_ah_from_wc() doesn't create the correct return address (AH) when there is a GRH present (source & dest GIDs need to be swapped). Signed-off-by: Ralph Campbell Index: gen2/trunk/src/linux-kernel/infiniband/core/verbs.c =================================================================== --- verbs.c (revision 4718) +++ verbs.c (working copy) @@ -106,9 +106,9 @@ if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; - ah_attr.grh.dgid = grh->dgid; + ah_attr.grh.dgid = grh->sgid; - ret = ib_find_cached_gid(pd->device, &grh->sgid, &port_num, + ret = ib_find_cached_gid(pd->device, &grh->dgid, &port_num, &gid_index); if (ret) return ERR_PTR(ret); -- Ralph Campbell From mshefty at ichips.intel.com Fri Jan 6 10:52:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 10:52:21 -0800 Subject: [openib-general] [PATCH] bug in ib_create_ah_from_wc() when creating GRH reply address In-Reply-To: <1136573432.5081.164.camel@brick.internal.keyresearch.com> References: <1136573432.5081.164.camel@brick.internal.keyresearch.com> Message-ID: <43BEBC65.8090104@ichips.intel.com> Ralph Campbell wrote: > ib_create_ah_from_wc() doesn't create the correct > return address (AH) when there is a GRH present (source & dest GIDs > need to be swapped). Thanks - committed. 
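The swap fixed by the patch above can be shown in isolation. The sketch below uses simplified, hypothetical structures (grh_hdr, reply_addr) rather than the kernel's own types: when replying to a packet that carried a GRH, the reply's destination GID is the received source GID, and the received destination GID is the local GID whose index has to be looked up.

```c
#include <string.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct gid        { unsigned char raw[16]; };
struct grh_hdr    { struct gid sgid; struct gid dgid; };     /* as received */
struct reply_addr { struct gid dgid; struct gid local_gid; };

/* Byte-wise comparison of two GIDs. */
int gid_equal(const struct gid *a, const struct gid *b)
{
    return memcmp(a->raw, b->raw, sizeof a->raw) == 0;
}

/*
 * Build the address for a reply: send to whoever sent the packet, and
 * use the GID we were addressed by as our own source GID.
 */
void build_reply_addr(const struct grh_hdr *rx, struct reply_addr *out)
{
    out->dgid      = rx->sgid;  /* reply goes back to the sender */
    out->local_gid = rx->dgid;  /* our GID; its index is what gets looked up */
}
```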
- Sean From mshefty at ichips.intel.com Fri Jan 6 10:59:15 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 10:59:15 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43BE2FF6.2060506@mellanox.co.il> References: <43BE2FF6.2060506@mellanox.co.il> Message-ID: <43BEBE03.6030309@ichips.intel.com> Eitan Zahavi wrote: > So if the cache is on another host - a new kind of MAD will have to be > sent on behalf of > the original request? I was thinking more in terms of redirection. > In IB QoS properties are mainly the PathRecord parameters: SL, Rate, > MTU, PathBits (LMC bits). > So if traditionally we had PathRecord requested for each Src->Dst port > now we will need to track at least: > Src->Dst * #QoS-levels. (a non optimal implementation will require even > more: #Src->Dst * #Clients * #Servers * #Services). I understand you now. Can someone familiar with the opensm code tell me how difficult it would be to extract out the code that tracks the subnet data and responds to queries? - Sean From halr at voltaire.com Fri Jan 6 11:00:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 14:00:48 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BEBE03.6030309@ichips.intel.com> References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> Message-ID: <1136574047.4336.11293.camel@hal.voltaire.com> On Fri, 2006-01-06 at 13:59, Sean Hefty wrote: > Eitan Zahavi wrote: > > So if the cache is on another host - a new kind of MAD will have to be > > sent on behalf of > > the original request? > > I was thinking more in terms of redirection. > > > In IB QoS properties are mainly the PathRecord parameters: SL, Rate, > > MTU, PathBits (LMC bits). > > So if traditionally we had PathRecord requested for each Src->Dst port > > now we will need to track at least: > > Src->Dst * #QoS-levels. (a non optimal implementation will require even > > more: #Src->Dst * #Clients * #Servers * #Services). 
> > I understand you now. I'm not sure about the granularity this needs tracking at. > Can someone familiar with the opensm code tell me how difficult it would be to > extract out the code that tracks the subnet data and responds to queries? Although I don't think that is difficult, IMO it is more a matter of whether you want to buy into the architecture with the component and vendor libraries. I can help with this if this is the direction chosen. I would make this another build option. The other question is how this would be changed so that when the data is not present the real SA is queried. -- Hal From eitan at mellanox.co.il Fri Jan 6 11:50:33 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 21:50:33 +0200 Subject: [openib-general] SA cache design In-Reply-To: <1136554985.4336.8389.camel@hal.voltaire.com> References: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> <1136554985.4336.8389.camel@hal.voltaire.com> Message-ID: <43BECA09.2080502@mellanox.co.il> Hal Rosenstock wrote: > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > >>This of course implies the "SA Mux" must analyze more than just >>the attribute ID to determine if the replica can handle the query. >>But the memory savings is well worth the extra level of filtering. > > > If the SA cache does this, it seems it would be pretty simple to return > this info in an attribute to the client so the client would know when to > go to the cache/replica and when to go direct to the SA in the case > where only certain queries are supported. Wouldn't this be advantageous > when the replica doesn't support all queries ? I think we want to make the client totally unaware to the existence of the cache. So the cache itself will simply forward the message (maybe changing TID). 
> > -- Hal From Arkady.Kanevsky at netapp.com Fri Jan 6 11:54:00 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 6 Jan 2006 14:54:00 -0500 Subject: [openib-general] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's Message-ID: comments inline. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Thursday, January 05, 2006 6:35 PM > To: Kanevsky, Arkady > Cc: Arlin Davis; Lentini, James; > dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: Re: [openib-general] RE: [RFC][PATCH] OpenIB uDAPL > extension proposal - sample immed data and atomic api's > > Kanevsky, Arkady wrote: > > > Arlin, > > nice proposal, thanks. > > I have one high level question and a few specific technical ones. > > > > 1. Why do you want to provide this functionality via > extension instead > > of part of new DAT spec, say 2.0? > > This will allow Consumers to use all events, operations, and > > Provider/IA functionality uniformly instead of via 2 > separate layers. > > This will also ensure that this basic funcionality can be > provided by > > all DAPL Provider the same way on DAPL and DAT layers. > > DAPL 2.0 is not done yet so we have time to incorporate that. > > DAPL 2.0 already introduced new functionality which is easy > to beef up > > for your proposal. > > See DAT_DTOS for example. DAT_EVENT is also modified to > handle remote > > invalidation so a small addition for Immediate data and > Atoimc ops is > > a sensible addition.
> > This should simplify proposal significantly. As you will > not need to > > introduce any new EXT structures. > > As mentioned on the con-call, there are two separate items to > consider while looking at the proposal. The first is the > ability to extend DAT for specific provider value-add and the > second is to validate the need for general atomic and > immediate data functionality in the basic set of API's for > all providers. I included atomics and immediate data as > examples since it is specific to one provider (IB), it > includes operations that require new ops, events, and event > data types, and it also provides a working model to validate > the extension model from request to completion events. I > would like to concentrate on getting consensus on the > extension proposal first if possible. Just try to think of > the actual operations as some opaque dat_ext_foobar_op(). The thing that bothers me is that we already have several APIs that are transport specific. While some are possible to implement on other transports, others, like Socket CM, cannot. So I view both of your specific extensions as transport specific and hence prefer to add them as normal APIs not extensions. The secondary goal is that Provider can add extensions without requiring changes to DAT. These fall into 3 categories. 1. New memory types including privileges and protection attributes. We can add "extension" entry to these structures. We need to check if this is sufficient. Think of shared memory for example. I am assuming no changes to PZ. 2. New DTOs. The main issue is not DTOs but their completions and async errors. This is why Immediate data is better handled by incorporating into DAT spec while atomic can be handled by extensions. That is, completion will return "extension" and Consumer will do the secondary switch on the extension type. Extension should not impact backwards compatibility. We had not looked at errors.
But assuming a simple model that async errors break connection and we can return "extension error" with extensions defining new reason. Again details need to be polished. 3. new connection types or CM models... New connections seems to have little impact on existing API assuming that EP type can be extended. The new connection can even restrict which DTO they can handle. CM model is more problematic. Arlin, it would be nice to consider some of your other extensions that are not transport specific to see how it will fit before we make the final decision. This should give us idea how extensible DAT "extension" model is. > > > > > In general, extension route was intended for RNIC|HCA providers to > > expose HW capabilities beyond IBTA, iWARP and VIA standards. The > > standard RDMA functionality is best handle via spec addition. > > DAT 2.0 does it for FMR, remote and local memory > invalidation as well > > as others. > > True, but the extension route is not fully defined, > documented, nor implemented. This is what I would like to > work on getting completed in time for 2.0 if possible. > > BTW: The existing implementation actually uses > dapl_provider->extension to store the hca_ptr but the > specification states that it is reserved for the providers > private use (8.2.1 in DAPL1.2 spec). This is why I had to > defined another extension_func in the patch. > > > > > I had posted a complete list of changes/addition to DAT 2.0 about a > > month ago. > > But we had not discussed yet version change from 1.3 to 2.0 nor how > > much backwards compatibility spec will provide. > > > > 2. What is IMMED_EVENT? is it just immediate data without > any payload one? > > I suggest chnaging the name so it will not use "EVENT". > Just call it > > NO_PAYLOAD. > > Do you want to support 2 different way to delivery immediate data? > > One in event and one in data payload? > > Why? I would think that just an event way will do. 
> > This was modeled after the immediate data discussions on the > DAT reflector based on iWARP requirements. > > http://groups.yahoo.com/group/dat-discussions/message/3285 > I recall it now. I want to consider a few usage cases. 1. Existing app running on the Provider with extensions. Want to make sure we do not require any App changes beyond recompile due to extensions. 2. App wants to be modified to use Immediate data. How big impact it has on existing code. For example buffer size allocation and completion handling for immediate data over existing connection. 2a. Can application take advantage if it knows that Provider will return immediate data in event? 2b. Immediate data inline only? 3. Ditto for atomic operations over existing connection. > > > > 3. I suggest beefing up DAT_DTO_COMPLETION_EVENT_DATA and > DAT_DTOS to > > convey which operation completed and return Immediate data > if complete > > operation had immediate data. > > Since we already modified these 2 struct as part of DAT 2.0 change > > lets add your proposal to the change. > > This will allow Consumers to use single approach to deal with > > completions, extension to the current one but not a > structural one. No > > need for DAT_EXTENSION_DATA, DAT_EXT_EVENT_TYPE, DAT_EXT_OP nor the > > whole mechanism for extended ops. > > You still need extension types for the "other" value-add > operations/evnts that will not be accepted as standard and > are vendor specific. > > I would like to defer the rest of the questions for now since > they touch on actual operations and not the extension > mechanism. Although, I do need to think about how to extend > memory registration privledges. Any suggestions? Going with your generic extension design we add "extension" entry to relevant data structures. And then outside DAT define the structure for its values which can be extensible. This imply that adding extension by Provider will force apps to be recompiled. I hope this is enough. 
I am assuming that apps use values not position the structures. > > > 4. What is the purpose of DAT_EXT_WRITE_CONFIRM_FLAG? Is it > to expose > > IB round trip semantic? > > iWARP does not support immediate data. One can try to > format payload > > to pass immediate data. > > Is that what you had in mind? > > > > What is the semantic meaning of the completion with this flag set? > > without flag set? > > Are extended flags are additonal values for COMPLETION_FLAGS? 2.4.1 > > talks about extended flags but where they are passed in is not > > defined. > > DAT 2.0 extended them already for FMR barrier. I would prefer to > > follow that route rather than creating a separate extension > completion > > flags. > > > > 5. Why do you need RECV_IMMED? If Immed data is delivered > in event no > > new Recv operation is needed. > > If Consumer asks for immediate data in payload where in > payload will > > it be? > > If this is needed for local match for remote RDMA_Write to handle > > immediate data lets state so. > > > > What happens for mismatch between local and remote op? That is recv > > was posted for Send and RDMA_Write "arrived"? Vice Versa? > > > > 6. I see extension for immediate data for rdma_write but > not for send. > > Is this deliberate? If we are going > > to extend DAT semantic to support Immediate data we can as well > > support the full IBTA/iWARP functionality for it. > > > > 7. Currently memory registration do not support access to > LMR or RMR > > by Atomic ops. > > Do you propose to extend the meaning of current MEM_PRIV > for LMR and > > RMR to covers atomic accesses or add new values to LMR_MEM_PRIV and > > RMR_MEM_PRIV for atomic operation support? > > > > 8. Any alignment requirements for memory used for atomic ops? > > > > 9. Any correlation requirements for SRQ buffers to support > recv with > > immediate data? > > > > Have a great holidays, > > Arkady > > > > > > Arkady Kanevsky email: arkady at netapp.com > > > > > > Network Appliance Inc. 
phone: 781-768-5395 > > > > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > > > > Waltham, MA 02451 central phone: 781-768-5300 > > > > > > From eitan at mellanox.co.il Fri Jan 6 11:55:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 21:55:03 +0200 Subject: [openib-general] SA cache design In-Reply-To: <43BEBE03.6030309@ichips.intel.com> References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> Message-ID: <43BECB17.8020007@mellanox.co.il> Sean Hefty wrote: > Eitan Zahavi wrote: > >> So if the cache is on another host - a new kind of MAD will have to >> be sent on behalf of >> the original request? > > > I was thinking more in terms of redirection. > Today none of the clients support redirection. It would take significant duplicated effort on the client front to support that. >> In IB QoS properties are mainly the PathRecord parameters: SL, Rate, >> MTU, PathBits (LMC bits). >> So if traditionally we had PathRecord requested for each Src->Dst port >> now we will need to track at least: >> Src->Dst * #QoS-levels. (a non optimal implementation will require >> even more: #Src->Dst * #Clients * #Servers * #Services). > > > I understand you now. > > Can someone familiar with the opensm code tell me how difficult it would > be to extract out the code that tracks the subnet data and responds to > queries? I guess you mean the code that is answering to PathRecord queries? It is possible to extract the "SMDB" objects and duplicate that database. I am not sure it is such a good idea. What if the SM is not OpenSM? 
> > - Sean From eitan at mellanox.co.il Fri Jan 6 12:00:54 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 22:00:54 +0200 Subject: [openib-general] SA cache design In-Reply-To: <1136556596.4336.8644.camel@hal.voltaire.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> <1136556596.4336.8644.camel@hal.voltaire.com> Message-ID: <43BECC76.60803@mellanox.co.il> I agree with Todd: a key is to keep the client unaware of the mux existence. So the same client can be run on system without the cache. Hal Rosenstock wrote: > On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: > >>>From: Hal Rosenstock [mailto:halr at voltaire.com] >>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: >>> >>>>This of course implies the "SA Mux" must analyze more than just >>>>the attribute ID to determine if the replica can handle the query. >>>>But the memory savings is well worth the extra level of filtering. >>> >>>If the SA cache does this, it seems it would be pretty simple >>>to return >>>this info in an attribute to the client so the client would >>>know when to >>>go to the cache/replica and when to go direct to the SA in the case >>>where only certain queries are supported. Wouldn't this be >>>advantageous >>>when the replica doesn't support all queries ? >> >>Why put the burden on the application. give the query to the Mux. > > > That's what I'm suggesting. Rather than a binary switch mux, a more > granular one which determines how to route the outgoing SA request. > > >> With an optional flag indicating a prefered "routing" (choices of: to SA, >>to replica, let Mux decide). Then let it decide. As you suggest it may >>be simplest to let the Mux try the replica and on failure fallback >>to the SA transparent to the app (sort of the way SDP intercepts >>socket ops and falls back to TCP/IP when SDP isn't appropriate). 
> > > It depends on whether the replica/cache forwards unsupported requests on > or responds with not supported back to the client as to how this is > handled. Sean was proposing the forward on model and a binary switch at > the client. I think this is more granular and can be mux'd only with the > knowledge of what a replica/cache supports (not sure about dealing with > different replica/caches supporting a different set of queries; need to > think more on how the caches are located, etc.). You are mentioning a > third model here. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Fri Jan 6 12:04:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 06 Jan 2006 12:04:00 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43BECB17.8020007@mellanox.co.il> References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> <43BECB17.8020007@mellanox.co.il> Message-ID: <43BECD30.8030501@ichips.intel.com> Eitan Zahavi wrote: >> Can someone familiar with the opensm code tell me how difficult it >> would be to extract out the code that tracks the subnet data and >> responds to queries? > > I guess you mean the code that is answering to PathRecord queries? Yes - that along with answering other queries. > It is possible to extract the "SMDB" objects and duplicate that database. > I am not sure it is such a good idea. What if the SM is not OpenSM? I was thinking in terms of code re-use, and not in terms of which SM was running. Interfacing to the SM would be through standard queries. 
- Sean From rdreier at cisco.com Fri Jan 6 12:06:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 12:06:50 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43BEB4CF.1020103@ichips.intel.com> (Sean Hefty's message of "Fri, 06 Jan 2006 10:19:59 -0800") References: <43BEB4CF.1020103@ichips.intel.com> Message-ID: Sean> Roland, I think that we're ready to try to merge the CMA Sean> upstream. I don't believe that any of the kernel ULP Sean> consumers are ready to be merged yet, but we could submit Sean> the userspace CMA support as a consumer of the API. Sean> What is your recommendation on how best to proceed? We should generate a patch (or series of patches depending on how big it ends up being) against Linus's latest tree and post it to linux-kernel and openib-general for review. It would be fine if you post it, or I can do it if you don't feel like it. - R. From halr at voltaire.com Fri Jan 6 11:59:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 14:59:28 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BECA09.2080502@mellanox.co.il> References: <5D78D28F88822E4D8702BB9EEF1A43670A0954@mercury.infiniconsys.com> <1136554985.4336.8389.camel@hal.voltaire.com> <43BECA09.2080502@mellanox.co.il> Message-ID: <1136577567.4336.11873.camel@hal.voltaire.com> On Fri, 2006-01-06 at 14:50, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > > > >>This of course implies the "SA Mux" must analyze more than just > >>the attribute ID to determine if the replica can handle the query. > >>But the memory savings is well worth the extra level of filtering. > > > > > > If the SA cache does this, it seems it would be pretty simple to return > > this info in an attribute to the client so the client would know when to > > go to the cache/replica and when to go direct to the SA in the case > > where only certain queries are supported.
Wouldn't this be advantageous > > when the replica doesn't support all queries ? > I think we want to make the client totally unaware of the > existence of the cache. Perhaps. I would express this differently: the client should be as unaware as possible (muxing on a per-attribute basis to direct the request seems reasonably straightforward). > So the cache itself will simply forward the message (maybe changing TID). Yes, the transformation at the cache should be as trivial as possible. I would like to eliminate the doubling up of packets when unnecessary (for requests that the cache does not support rather than ones it does support but does not have the information). -- Hal > > -- Hal > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Jan 6 12:03:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 15:03:38 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BECC76.60803@mellanox.co.il> References: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> <1136556596.4336.8644.camel@hal.voltaire.com> <43BECC76.60803@mellanox.co.il> Message-ID: <1136577688.4336.11886.camel@hal.voltaire.com> On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: > I agree with Todd: a key point is to keep the client unaware of the mux's existence, > so the same client can run on a system without the cache. Define same client ? I would consider it the same SA client, directing requests differently depending on how the mux is configured, where that configuration comes from a query to the cache (if it exists) about its capabilities.
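A minimal sketch, in C, of the more granular mux discussed in this thread: the mux holds a per-attribute capability bitmap (as it might be learned from a capability query to the cache) and routes each outgoing SA request either to the replica or directly to the SA, with an optional client hint along the lines Todd suggested. Every name below (struct sa_mux, sa_mux_route, the ATTR_* indices, the HINT_* values) is hypothetical and made up for illustration; none of it is an existing OpenIB interface, and the attribute indices are not SA wire attribute IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute indices for illustration; NOT SA wire attribute IDs. */
enum sa_attr {
	ATTR_PATH_RECORD,
	ATTR_NODE_RECORD,
	ATTR_SERVICE_RECORD,
	ATTR_MCMEMBER_RECORD,
	ATTR_COUNT
};

/* Where the mux sends an outgoing request. */
enum sa_route { ROUTE_TO_SA, ROUTE_TO_REPLICA };

/* Optional client preference (assumed; the thread only proposes the idea). */
enum sa_route_hint { HINT_MUX_DECIDES, HINT_FORCE_SA, HINT_FORCE_REPLICA };

struct sa_mux {
	bool replica_present;	/* false on a system without a cache */
	uint32_t supported;	/* bit n set => replica answers attribute n */
};

/* Record that the replica reported support for one attribute. */
void sa_mux_set_supported(struct sa_mux *mux, enum sa_attr attr)
{
	assert(mux != NULL && attr < ATTR_COUNT);
	mux->supported |= UINT32_C(1) << attr;
}

/*
 * Decide the route for one request. Without a replica everything goes to
 * the SA, so the same client code works with or without the cache.
 */
enum sa_route sa_mux_route(const struct sa_mux *mux, enum sa_attr attr,
			   enum sa_route_hint hint)
{
	assert(mux != NULL);

	if (!mux->replica_present || hint == HINT_FORCE_SA)
		return ROUTE_TO_SA;
	if (hint == HINT_FORCE_REPLICA)
		return ROUTE_TO_REPLICA;
	if (attr < ATTR_COUNT && (mux->supported & (UINT32_C(1) << attr)))
		return ROUTE_TO_REPLICA;

	/* Unsupported at the replica: go direct, avoiding a doubled packet. */
	return ROUTE_TO_SA;
}
```

Routing unsupported attributes straight to the SA is what avoids the doubling up of packets mentioned above; a try-the-replica-then-fall-back model would instead make that last branch a retry path rather than a routing decision.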
-- Hal > Hal Rosenstock wrote: > > On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: > > > >>>From: Hal Rosenstock [mailto:halr at voltaire.com] > >>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > >>> > >>>>This of course implies the "SA Mux" must analyze more than just > >>>>the attribute ID to determine if the replica can handle the query. > >>>>But the memory savings is well worth the extra level of filtering. > >>> > >>>If the SA cache does this, it seems it would be pretty simple > >>>to return > >>>this info in an attribute to the client so the client would > >>>know when to > >>>go to the cache/replica and when to go direct to the SA in the case > >>>where only certain queries are supported. Wouldn't this be > >>>advantageous > >>>when the replica doesn't support all queries ? > >> > >>Why put the burden on the application? Give the query to the Mux. > > > > > > That's what I'm suggesting. Rather than a binary switch mux, a more > > granular one which determines how to route the outgoing SA request. > > > > > >> With an optional flag indicating a preferred "routing" (choices of: to SA, > >>to replica, let Mux decide). Then let it decide. As you suggest it may > >>be simplest to let the Mux try the replica and on failure fall back > >>to the SA transparent to the app (sort of the way SDP intercepts > >>socket ops and falls back to TCP/IP when SDP isn't appropriate). > > > > > > It depends on whether the replica/cache forwards unsupported requests on > > or responds with not supported back to the client as to how this is > > handled. Sean was proposing the forward on model and a binary switch at > > the client. I think this is more granular and can be mux'd only with the > > knowledge of what a replica/cache supports (not sure about dealing with > > different replica/caches supporting a different set of queries; need to > > think more on how the caches are located, etc.). You are mentioning a > > third model here.
> > > > -- Hal > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Fri Jan 6 12:19:37 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 06 Jan 2006 14:19:37 -0600 Subject: [openib-general] [PATCH] CMA and iWARP Message-ID: <1136578777.14108.6.camel@trinity.austin.ammasso.com> Enclosed is a combined include file and core patch for iWARP support in CMA. This patch includes changes per your last review. Signed-off-by: Tom Tucker Index: core/cm.c =================================================================== --- core/cm.c (revision 4748) +++ core/cm.c (working copy) @@ -3261,6 +3261,9 @@ int ret; u8 i; + if (device->node_type == IB_NODE_RNIC) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) Index: core/iwcm.c =================================================================== --- core/iwcm.c (revision 0) +++ core/iwcm.c (revision 0) @@ -0,0 +1,648 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "cm_msgs.h" + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void iwcm_add_one(struct ib_device *device); +static void iwcm_remove_one(struct ib_device *device); +struct iwcm_id_private; + +static struct ib_client iwcm_client = { + .name = "iwcm", + .add = iwcm_add_one, + .remove = iwcm_remove_one +}; + +static struct { + spinlock_t lock; + struct list_head device_list; + rwlock_t device_lock; + struct workqueue_struct* wq; +} iwcm; + +struct iwcm_device; +struct iwcm_port { + struct iwcm_device *iwcm_dev; + struct sockaddr_in local_addr; + u8 port_num; +}; + +struct iwcm_device { + struct list_head list; + struct ib_device *device; + struct iwcm_port port[0]; +}; + +struct iwcm_id_private { + struct iw_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + + struct rb_node listen_node; + + struct list_head work_list; + atomic_t work_count; +}; + +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private* cm_id; + struct iw_cm_event event; +}; + +/* Called whenever a reference added for a cm_id */ +static inline void iwcm_addref_id(struct iwcm_id_private *cm_id_priv) +{ + atomic_inc(&cm_id_priv->refcount); +} + +/* Called whenever releasing a reference to a cm id */ +static inline void iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + if (atomic_dec_and_test(&cm_id_priv->refcount)) + wake_up(&cm_id_priv->wait); +} + +static void cm_event_handler(struct iw_cm_id* cm_id, struct iw_cm_event* event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *iwcm_id_priv; + + iwcm_id_priv = kmalloc(sizeof *iwcm_id_priv, GFP_KERNEL); + if (!iwcm_id_priv) + return ERR_PTR(-ENOMEM); + + memset(iwcm_id_priv, 0, sizeof *iwcm_id_priv); + iwcm_id_priv->id.state = 
IW_CM_STATE_IDLE; + iwcm_id_priv->id.device = device; + iwcm_id_priv->id.cm_handler = cm_handler; + iwcm_id_priv->id.context = context; + iwcm_id_priv->id.event_handler = cm_event_handler; + + spin_lock_init(&iwcm_id_priv->lock); + init_waitqueue_head(&iwcm_id_priv->wait); + atomic_set(&iwcm_id_priv->refcount, 1); + + return &iwcm_id_priv->id; + +} +EXPORT_SYMBOL(iw_create_cm_id); + +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_LISTEN: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->destroy_listen(cm_id); + break; + + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_ESTABLISHED: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->disconnect(cm_id,1); + break; + + case IW_CM_STATE_IDLE: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + break; + + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + printk(KERN_ERR "%s:%s:%u Illegal state %d for iw_cm_id.\n", + __FILE__, __FUNCTION__, __LINE__, cm_id->state); + ; + } + + atomic_dec(&iwcm_id_priv->refcount); + wait_event(iwcm_id_priv->wait, !atomic_read(&iwcm_id_priv->refcount)); + + kfree(iwcm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + if (cm_id->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + return -EBUSY; + } + cm_id->state = 
IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret != 0) + cm_id->state = IW_CM_STATE_IDLE; + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr) +{ + if (cm_id->device == 0) + return -EINVAL; + + if (cm_id->device->iwcm == 0) + return -EINVAL; + + /* Make sure there's a connection */ + if (cm_id->state != IW_CM_STATE_ESTABLISHED) + return -ENOTCONN; + + return cm_id->device->iwcm->getpeername(cm_id, local_addr, remote_addr); +} +EXPORT_SYMBOL(iw_cm_getpeername); + +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->reject(cm_id, private_data, private_data_len); + cm_id->state = IW_CM_STATE_IDLE; + break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->accept(cm_id, private_data, + private_data_len); + if (ret == 0) { + struct iw_cm_event 
event; + event.event = IW_CM_EVENT_ESTABLISHED; + event.provider_id = cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp) +{ + int ret = -EINVAL; + + if (cm_id) { + cm_id->qp = qp; + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_bind_qp); + +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len) +{ + struct iwcm_id_private* cm_id_priv; + int ret = 0; + unsigned long flags; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EBUSY; + } + cm_id->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, pdata, pdata_len); + if (ret != 0) + cm_id->state = IW_CM_STATE_IDLE; + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +int iw_cm_disconnect(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0 || cm_id->qp == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->disconnect(cm_id, 1); + if (ret == 0) { + struct iw_cm_event event; + event.event = IW_CM_EVENT_LLP_DISCONNECT; + event.provider_id = 
cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +static void iwcm_add_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + struct iwcm_port *port; + unsigned long flags; + u8 i; + + if (device->node_type != IB_NODE_RNIC) + return; + + iwcm_dev = kmalloc(sizeof(*iwcm_dev) + sizeof(*port) * + device->phys_port_cnt, GFP_KERNEL); + if (!iwcm_dev) + return; + + iwcm_dev->device = device; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &iwcm_dev->port[i-1]; + port->iwcm_dev = iwcm_dev; + port->port_num = i; + } + + ib_set_client_data(device, &iwcm_client, iwcm_dev); + + write_lock_irqsave(&iwcm.device_lock, flags); + list_add_tail(&iwcm_dev->list, &iwcm.device_list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + return; +} + +static void iwcm_remove_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + unsigned long flags; + + iwcm_dev = ib_get_client_data(device, &iwcm_client); + if (!iwcm_dev) + return; + + write_lock_irqsave(&iwcm.device_lock, flags); + list_del(&iwcm_dev->list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + + kfree(iwcm_dev); +} + +/* Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and device. These are + * copied when the device is cloned. The event contains the new four tuple. 
+ */ +static int cm_conn_req_handler(struct iwcm_work* work) +{ + struct iw_cm_id* cm_id; + struct iwcm_id_private* cm_id_priv; + int rc; + + /* If the status was not successful, ignore request */ + if (work->event.status) { + printk(KERN_ERR "%s:%d Bad status=%d for connection request ... " + "should be filtered by provider\n", + __FUNCTION__, __LINE__, + work->event.status); + return work->event.status; + } + cm_id = iw_create_cm_id(work->cm_id->id.device, work->cm_id->id.cm_handler, + work->cm_id->id.context); + if (IS_ERR(cm_id)) + return PTR_ERR(cm_id); + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.provider_id = work->event.provider_id; + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; + + /* Call the client CM handler */ + rc = cm_id->cm_handler(cm_id, &work->event); + if (rc) { + cm_id->state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(cm_id); + } + kfree(work); + return 0; +} + +/* + * Handles the transition to established state on the passive side. 
+ */ +static int cm_conn_est_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for established event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +/* + * Handles the reply to our connect request. There are three + * possibilities: + * - If the cm_id is in the wrong state when the event is + * delivered, the event is ignored. [What should we do when the + * provider does something crazy?] + * - If the remote peer accepts the connection, we update the 4-tuple + * in the cm_id with the remote peer info, move the cm_id to the + * ESTABLISHED state and deliver the event to the client. + * - If the remote peer rejects the connection, or there is some + * connection error, move the cm_id to the IDLE state, and deliver + * the event to the client. 
+ */ +static int cm_conn_rep_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_SENT) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for connect reply event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +static int cm_disconnect_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + int ret = 0; + + cm_id_priv = work->cm_id; + + cm_id_priv->id.state = IW_CM_STATE_IDLE; + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) + iw_destroy_cm_id(&cm_id_priv->id); + + kfree(work); + return ret; +} + +static void cm_work_handler(void* arg) +{ + struct iwcm_work* work = (struct iwcm_work*)arg; + int rc; + + switch (work->event.event) { + case IW_CM_EVENT_CONNECT_REQUEST: + rc = cm_conn_req_handler(work); + break; + case IW_CM_EVENT_CONNECT_REPLY: + rc = cm_conn_rep_handler(work); + break; + case IW_CM_EVENT_ESTABLISHED: + rc = cm_conn_est_handler(work); + break; + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_CLOSE: + rc = cm_disconnect_handler(work); + break; + } +} + +/* IW CM 
provider event callback handler. This function is called on + * interrupt context. The function builds a work queue element + * and enqueues it for processing on a work queue thread. This allows + * CM client callback functions to block. + */ +static void cm_event_handler(struct iw_cm_id* cm_id, + struct iw_cm_event* event) +{ + struct iwcm_work *work; + struct iwcm_id_private* cm_id_priv; + + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *event; + queue_work(iwcm.wq, &work->work); +} + +static int __init iw_cm_init(void) +{ + memset(&iwcm, 0, sizeof iwcm); + INIT_LIST_HEAD(&iwcm.device_list); + rwlock_init(&iwcm.device_lock); + spin_lock_init(&iwcm.lock); + iwcm.wq = create_workqueue("iw_cm"); + if (!iwcm.wq) + return -ENOMEM; + + return ib_register_client(&iwcm_client); +} + +static void __exit iw_cm_cleanup(void) +{ + ib_unregister_client(&iwcm_client); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); + Index: core/addr.c =================================================================== --- core/addr.c (revision 4748) +++ core/addr.c (working copy) @@ -65,6 +65,9 @@ case ARPHRD_INFINIBAND: dev_addr->dev_type = IB_NODE_CA; break; + case ARPHRD_ETHER: + dev_addr->dev_type = IB_NODE_RNIC; + break; default: return -EADDRNOTAVAIL; } Index: core/Makefile =================================================================== --- core/Makefile (revision 4748) +++ core/Makefile (working copy) @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o iw_cm.o \ ib_sa.o ib_at.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o rdma_ucm.o 
@@ -14,6 +14,8 @@ ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o rdma_ucm-y := ucma.o Index: core/cma.c =================================================================== --- core/cma.c (revision 4748) +++ core/cma.c (working copy) @@ -3,6 +3,7 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -31,9 +32,14 @@ #include #include #include +#include +#include +#include +#include #include #include #include +#include #include MODULE_AUTHOR("Guy German"); @@ -102,8 +108,12 @@ int timeout_ms; struct ib_sa_query *query; int query_id; - struct ib_cm_id *cm_id; + union { + struct ib_cm_id *ib; + struct iw_cm_id *iw; + } cm_id; + u32 seq_num; u32 qp_num; enum ib_qp_type qp_type; @@ -239,11 +249,40 @@ return ret; } +static int cma_acquire_iw_dev(struct rdma_id_private* id_priv) +{ + struct rdma_dev_addr* dev_addr = &id_priv->id.route.addr.dev_addr; + struct cma_device* cma_dev; + int ret = -ENOENT; + + down(&mutex); + list_for_each_entry(cma_dev, &dev_list, list) { + if (memcmp(dev_addr->src_dev_addr, + &cma_dev->node_guid, + sizeof(cma_dev->node_guid)) == 0) { + + /* If we find the device, then check if this + * is an iWARP device. 
If it is, then attach + */ + if (cma_dev->device->node_type == IB_NODE_RNIC) { + cma_attach_to_dev(id_priv, cma_dev); + ret = 0; + break; + } + } + } + up(&mutex); + + return ret; +} + static int cma_acquire_dev(struct rdma_id_private *id_priv) { switch (id_priv->id.route.addr.dev_addr.dev_type) { case IB_NODE_CA: return cma_acquire_ib_dev(id_priv); + case IB_NODE_RNIC: + return cma_acquire_iw_dev(id_priv); default: return -ENODEV; } @@ -306,6 +345,16 @@ IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -325,6 +374,9 @@ case IB_NODE_CA: ret = cma_init_ib_qp(id_priv, qp); break; + case IB_NODE_RNIC: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -412,7 +464,7 @@ id_priv = container_of(id, struct rdma_id_private, id); switch (id_priv->id.device->node_type) { case IB_NODE_CA: - ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; @@ -567,8 +619,8 @@ { cma_exch(id_priv, CMA_DESTROYING); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) + ib_destroy_cm_id(id_priv->cm_id.ib); list_del(&id_priv->listen_list); if (id_priv->cma_dev) @@ -624,9 +676,20 @@ state = cma_exch(id_priv, CMA_DESTROYING); cma_cancel_operation(id_priv, state); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { + switch (id->device->node_type) { + case IB_NODE_RNIC: + 
iw_destroy_cm_id(id_priv->cm_id.iw); + break; + default: + ib_destroy_cm_id(id_priv->cm_id.ib); + break; + } + + id_priv->cm_id.ib = NULL; + } + if (id_priv->cma_dev) { down(&mutex); cma_detach_from_dev(id_priv); @@ -652,15 +715,15 @@ ret = cma_modify_qp_rts(&id_priv->id); if (ret) goto reject; - - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); if (ret) goto reject; return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -676,7 +739,7 @@ return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -737,7 +800,7 @@ private_data_len); if (ret) { /* Destroy the CM ID by returning a non-zero value. */ - id_priv->cm_id = NULL; + id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); cma_release_remove(id_priv); rdma_destroy_id(&id_priv->id); @@ -819,7 +882,7 @@ goto out; } - conn_id->cm_id = cm_id; + conn_id->cm_id.ib = cm_id; cm_id->context = conn_id; cm_id->cm_handler = cma_ib_handler; @@ -829,7 +892,7 @@ IB_CM_REQ_PRIVATE_DATA_SIZE - offset); if (ret) { /* Destroy the CM ID by returning a non-zero value. 
*/ - conn_id->cm_id = NULL; + conn_id->cm_id.ib = NULL; cma_exch(conn_id, CMA_DESTROYING); cma_release_remove(conn_id); rdma_destroy_id(&conn_id->id); @@ -874,6 +937,116 @@ } } +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event_type = 0; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (event->event) { + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_CLOSE: + event_type = RDMA_CM_EVENT_DISCONNECTED; + break; + + case IW_CM_EVENT_CONNECT_REPLY: { + if (event->status) + event_type = RDMA_CM_EVENT_REJECTED; + else + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + } + + case IW_CM_EVENT_ESTABLISHED: + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + + default: + BUG_ON(1); + break; + + } + + ret = cma_notify_user(id_priv, + event_type, + event->status, + event->private_data, + event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id* new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in* sin; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + /* New connection inherits device from parent */ + down(&mutex); + cma_attach_to_dev(conn_id, listen_id->cma_dev); + up(&mutex); + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + cma_release_remove(listen_id); + return ret; +} + static int cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_private_data_compare compare_data; @@ -881,28 +1054,52 @@ __be64 svc_id; int ret; - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) - return PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); addr = &id_priv->id.route.addr.src_addr; svc_id = cma_get_service_id(id_priv->id.ps, addr); if (cma_any_addr(addr)) - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); else { cma_set_compare_data(addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); } if (ret) { - ib_destroy_cm_id(id_priv->cm_id); - id_priv->cm_id = NULL; + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; } return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) +{ + int ret; + struct sockaddr_in* sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_duplicate_listen(struct rdma_id_private *id_priv) { struct rdma_id_private *cur_id_priv; @@ -988,6 +1185,9 @@ case IB_NODE_CA: ret = cma_ib_listen(id_priv); break; + case IB_NODE_RNIC: + ret = cma_iw_listen(id_priv, backlog); + 
break; default: ret = -ENOSYS; break; @@ -1067,6 +1267,45 @@ return (id_priv->query_id < 0) ? id_priv->query_id : 0; } +static void iw_route_handler(void* data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ROUTE_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } + out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, iw_route_handler, work); + queue_work(rdma_wq, &work->work); + + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1081,6 +1320,9 @@ case IB_NODE_CA: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case IB_NODE_RNIC: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1221,12 +1463,36 @@ return ret; } +static void iw_addr_handler(void* data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { struct 
rdma_id_private *id_priv; enum cma_state expected_state; - int ret; + int ret = 0; id_priv = container_of(id, struct rdma_id_private, id); if (id_priv->cma_dev) { @@ -1341,10 +1607,10 @@ memcpy(private_data + offset, conn_param->private_data, conn_param->private_data_len); - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) { - ret = PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) { + ret = PTR_ERR(id_priv->cm_id.ib); goto out; } @@ -1371,12 +1637,45 @@ req.max_cm_retries = CMA_MAX_CM_RETRIES; req.srq = id_priv->srq ? 1 : 0; - ret = ib_send_cm_req(id_priv->cm_id, &req); + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); out: kfree(private_data); return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id* cm_id; + struct sockaddr_in* sin; + int ret; + + if (id_priv->id.qp == NULL) + return -EINVAL; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + iw_cm_bind_qp(cm_id, id_priv->id.qp); + + ret = iw_cm_connect(cm_id, conn_param->private_data, + conn_param->private_data_len); + +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1396,6 +1695,9 @@ case IB_NODE_CA: ret = cma_connect_ib(id_priv, conn_param); break; + case IB_NODE_RNIC: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1433,7 +1735,7 @@ rep.rnr_retry_count = conn_param->rnr_retry_count; rep.srq = id_priv->srq ? 
1 : 0; - return ib_send_cm_rep(id_priv->cm_id, &rep); + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) @@ -1458,6 +1760,12 @@ else ret = cma_rep_recv(id_priv); break; + case IB_NODE_RNIC: { + iw_cm_bind_qp(id_priv->cm_id.iw, id_priv->id.qp); + ret = iw_cm_accept(id_priv->cm_id.iw, conn_param->private_data, + conn_param->private_data_len); + break; + } default: ret = -ENOSYS; break; @@ -1486,9 +1794,15 @@ switch (id->device->node_type) { case IB_NODE_CA: - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); break; + + case IB_NODE_RNIC: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; + default: ret = -ENOSYS; break; @@ -1513,9 +1827,12 @@ switch (id->device->node_type) { case IB_NODE_CA: /* Initiate or respond to a disconnect. */ - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) - ib_send_cm_drep(id_priv->cm_id, NULL, 0); + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case IB_NODE_RNIC: + ret = iw_cm_disconnect(id_priv->cm_id.iw); + break; default: break; } Index: core/mad.c =================================================================== --- core/mad.c (revision 4748) +++ core/mad.c (working copy) @@ -2655,7 +2655,9 @@ { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == IB_NODE_RNIC) + return; + else if (device->node_type == IB_NODE_SWITCH) { start = 0; end = 0; } else { @@ -2702,7 +2704,9 @@ { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == IB_NODE_RNIC) + return; + else if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h 
(revision 4748) +++ include/rdma/ib_verbs.h (working copy) @@ -67,7 +67,8 @@ enum ib_node_type { IB_NODE_CA = 1, IB_NODE_SWITCH, - IB_NODE_ROUTER + IB_NODE_ROUTER, + IB_NODE_RNIC }; enum ib_device_cap_flags { @@ -86,6 +87,14 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), + IB_DEVICE_ZERO_STAG = (1<<16), + IB_DEVICE_SEND_W_INV = (1<<17), + IB_DEVICE_MW = (1<<18), + IB_DEVICE_FMR = (1<<19), + IB_DEVICE_SRQ = (1<<20), + IB_DEVICE_ARP = (1<<21), + IB_DEVICE_LLP = (1<<22), }; enum ib_atomic_cap { @@ -824,6 +833,8 @@ u32 flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, Index: include/rdma/iw_cm.h =================================================================== --- include/rdma/iw_cm.h (revision 0) +++ include/rdma/iw_cm.h (revision 0) @@ -0,0 +1,152 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; +struct iw_cm_event; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, + IW_CM_EVENT_LLP_DISCONNECT, + IW_CM_EVENT_LLP_RESET, + IW_CM_EVENT_LLP_TIMEOUT, + IW_CM_EVENT_CLOSE +}; + +struct iw_cm_event { + enum iw_cm_event_type event; + int status; + u32 provider_id; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; +}; + +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_ESTABLISHED, /* established */ +}; + +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, + struct iw_cm_event* event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* context to provide to client cb */ + enum iw_cm_state state; + struct ib_device *device; + struct ib_qp *qp; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + u64 provider_id; /* device handle for this conn. 
*/ + iw_event_handler event_handler; /* callback for IW CM Provider events */ +}; + +/** + * iw_create_cm_id - Allocate a communication identifier. + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the communication + * identifier. + * + * Communication identifiers are used to track connection states, + * addr resolution requests, and listen requests. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ + +struct iw_cm_verbs { + int (*connect)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*disconnect)(struct iw_cm_id* cm_id, + int abrupt); + + int (*accept)(struct iw_cm_id*, + const void *private_data, + u8 pdata_data_len); + + int (*reject)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*getpeername)(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr); + + int (*create_listen)(struct iw_cm_id* cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id* cm_id); + +}; + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); +void iw_destroy_cm_id(struct iw_cm_id *cm_id); +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_add, + struct sockaddr_in* remote_addr); +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len); +int iw_cm_disconnect(struct iw_cm_id *cm_id); 
+int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); + +#endif /* IW_CM_H */ From eitan at mellanox.co.il Fri Jan 6 12:11:34 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 22:11:34 +0200 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0955@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A43670A0955@mercury.infiniconsys.com> Message-ID: <43BECEF6.50602@mellanox.co.il> Hi Todd, So you agree we will need to design "replica" buildup scalability features into the solution ( to avoid the bring-up load on the SA) ? Why would a caching system not work here? Instead of replicating the data. The caching concept allows for the SA to still be in the loop by invalidating the cache or through cache entries lifetime policy. The reason I think a total replica (distribution of the SA) would eventually be problematic is that as we approach QoS solutions, some need for path record use and retirement is going to show up. What if the SM decides to change SL2VL maps due to new QoS requirement. We will need a more complicated "synchronization" or invalidation technique to push that kind of data into the "replica" SAs. Eitan Rimmer, Todd wrote: >>From: Eitan Zahavi [mailto:eitan at mellanox.co.il] >>Hi Sean, Todd, >> >>Although I like the "replica" idea for its "query" >>performance boost - I suspect it will actually do not scale >>for very large >>networks: Each node has to query for the entire database >>would cause N^2 load on the SA. >>After any change (which do happen with higher probability on >>large networks) the SA will need to send each Report to N targets. >> >>We already have some bad experience with large clusters SA >>query issues, like the one reported by Roland >>"searching for SRP targets using PortInfo capability mask". >> > > Our experience has been the exact opposite. 
> There is an initial load on the SA to populate the replica (which we have reduced with various techniques, such as backing off when the SA reports Busy, starting the query at a random time offset, etc). The boost occurs when a new application starts, such as an MPI using the SA/CM to establish connections as per the IBTA spec. A 1000 process MPI job would have each process make 999 queries to the SA at job startup time. This causes a burst of 999,000 sets of SA queries (most will involve both Node Record and Path Record queries, so it will really be 2x this amount) BEFORE the MPI job can actually start.
>
> As Open IB moves forward to implement QOS and other features, MPI will have to use the SA to get its path records. If you study MVAPICH at present, it merely exchanges LIDs between nodes and hardcodes all the other QOS parameters (or uses environment variables to apply the same value to all processes). In a true QOS and congestion management environment it will instead have to use the CM/SA.
>
> We have been using this replica technique quite successfully for 2-3 years now. Our MPI has used the SA/CM for connection establishment for just as long.
>
> As was pointed out, most fabrics will be quite stable. Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup.
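As a rough check of the arithmetic in the quoted paragraph above, here is a minimal sketch in C. It assumes every one of N processes queries the SA once for each of its N-1 peers, and that each such "set" consists of a fixed number of record queries (2 in the example: Node Record plus Path Record). The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Number of query "sets" hitting the SA at job startup, assuming each of
 * n processes looks up every other process exactly once: n * (n - 1).
 */
static uint64_t sa_query_sets(uint64_t n)
{
	return n == 0 ? 0 : n * (n - 1);
}

/*
 * Number of individual SA queries, assuming every set issues the same
 * number of record queries (queries_per_set is an assumption; the
 * message above suggests 2 for most sets).
 */
static uint64_t sa_queries(uint64_t n, unsigned int queries_per_set)
{
	return sa_query_sets(n) * queries_per_set;
}
```

For a 1000 process job, sa_query_sets(1000) gives 999,000 sets and sa_queries(1000, 2) gives 1,998,000 queries per application start. A replica pays its population cost once rather than on every start.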
> > Todd Rimmer > From eitan at mellanox.co.il Fri Jan 6 12:13:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 22:13:03 +0200 Subject: [openib-general] SA cache design In-Reply-To: <1136577688.4336.11886.camel@hal.voltaire.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> <1136556596.4336.8644.camel@hal.voltaire.com> <43BECC76.60803@mellanox.co.il> <1136577688.4336.11886.camel@hal.voltaire.com> Message-ID: <43BECF4F.6080807@mellanox.co.il> Hal Rosenstock wrote: > On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: > >>I agree with Todd: a key is to keep the client unaware of the mux existence. >>So the same client can be run on system without the cache. > > > Define same client ? I would consider it the same SA client directing > requests differently based on how the mux is configured based on a query > to the cache (if it exists) as to its capabilities. SA Client can be embedded in an application - any program that can send mads can be an SA client. > > -- Hal > > >>Hal Rosenstock wrote: >> >>>On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: >>> >>> >>>>>From: Hal Rosenstock [mailto:halr at voltaire.com] >>>>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: >>>>> >>>>> >>>>>>This of course implies the "SA Mux" must analyze more than just >>>>>>the attribute ID to determine if the replica can handle the query. >>>>>>But the memory savings is well worth the extra level of filtering. >>>>> >>>>>If the SA cache does this, it seems it would be pretty simple >>>>>to return >>>>>this info in an attribute to the client so the client would >>>>>know when to >>>>>go to the cache/replica and when to go direct to the SA in the case >>>>>where only certain queries are supported. Wouldn't this be >>>>>advantageous >>>>>when the replica doesn't support all queries ? >>>> >>>>Why put the burden on the application. give the query to the Mux. >>> >>> >>>That's what I'm suggesting. 
Rather than a binary switch mux, a more >>>granular one which determines how to route the outgoing SA request. >>> >>> >>> >>>> With an optional flag indicating a prefered "routing" (choices of: to SA, >>>>to replica, let Mux decide). Then let it decide. As you suggest it may >>>>be simplest to let the Mux try the replica and on failure fallback >>>>to the SA transparent to the app (sort of the way SDP intercepts >>>>socket ops and falls back to TCP/IP when SDP isn't appropriate). >>> >>> >>>It depends on whether the replica/cache forwards unsupported requests on >>>or responds with not supported back to the client as to how this is >>>handled. Sean was proposing the forward on model and a binary switch at >>>the client. I think this is more granular and can be mux'd only with the >>>knowledge of what a replica/cache supports (not sure about dealing with >>>different replica/caches supporting a different set of queries; need to >>>think more on how the caches are located, etc.). You are mentioning a >>>third model here. >>> >>>-- Hal >>> >>>_______________________________________________ >>>openib-general mailing list >>>openib-general at openib.org >>>http://openib.org/mailman/listinfo/openib-general >>> >>>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From eitan at mellanox.co.il Fri Jan 6 12:15:29 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 06 Jan 2006 22:15:29 +0200 Subject: [openib-general] SA cache design In-Reply-To: <43BECD30.8030501@ichips.intel.com> References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> <43BECB17.8020007@mellanox.co.il> <43BECD30.8030501@ichips.intel.com> Message-ID: <43BECFE1.4000609@mellanox.co.il> Sean Hefty wrote: > Eitan Zahavi wrote: > >>> Can someone familiar with the opensm code tell me how difficult it >>> would be to extract out the code that tracks the subnet data and >>> responds to queries? 
>> >> >> I guess you mean the code that is answering to PathRecord queries? > > > Yes - that along with answering other queries. > >> It is possible to extract the "SMDB" objects and duplicate that database. >> I am not sure it is such a good idea. What if the SM is not OpenSM? > > > I was thinking in terms of code re-use, and not in terms of which SM was > running. Interfacing to the SM would be through standard queries. The issue is that answering PathRecords queries can have impact on further algorithms the SM takes. It might not be enough to know the topology, SL2VL, LFT, MFT to answer PathRecord attributes... > > - Sean From rpandit at silverstorm.com Fri Jan 6 12:26:07 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Fri, 6 Jan 2006 12:26:07 -0800 Subject: [openib-general] Re: Failure in reset HCA withbackport-svn4507-to-2.6.9 In-Reply-To: <20060106014045.GI23796@esmail.cup.hp.com> References: <96f8e60e0601051440t69782459k113307dd82a1b559@mail.gmail.com> <20060106014045.GI23796@esmail.cup.hp.com> Message-ID: <96f8e60e0601061226i3ad174feka58c66da97f5663e@mail.gmail.com> I'm running with 3.3.3 which is pretty recent. The "Redhat 3965 IB kernel" came up fine on an Intel Shasta box (don't know the exact model# etc). It looks like the problem is system dependent and reproducible on Dell 2650's. Has anybody lately tested on a Dell 2650? Btw, if I comment out mthca_reset() in mthca_main.c, then the drivers load and ports go active on the 2650. I suggest somebody review the reset path in mthca... In the past we have had problems reseting Tavor on some platforms and chose not to reset at driver load time. Ranjit On 1/5/06, Grant Grundler wrote: > On Thu, Jan 05, 2006 at 02:46:42PM -0800, Bob Woodruff wrote: > > Ranjit wrote, > > >fw rev: 3.03.0003rc16b > > > > I think that 4.7 is the latest (at least for the PCI-E cards) > > and 3.2 ( for the PCI-X cards) > > Latest for PCI-X is 3.3.5. > Latest for PCI-e is 4.7.4. 
> The 3.3.3 is likely to refer to a PCI-X card. > > See the openib wiki page: > https://openib.org/tiki/tiki-index.php?page=MellanoxHcaFirmware > > (also has link to mellanox firmware support site) > > grant > > > and I think the latest FW is required for correct operation with the openIB > > stack. > > Michael is this correct ? > > > > woody > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Jan 6 12:21:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 15:21:11 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BECF4F.6080807@mellanox.co.il> References: <5D78D28F88822E4D8702BB9EEF1A4367D12C70@mercury.infiniconsys.com> <1136556596.4336.8644.camel@hal.voltaire.com> <43BECC76.60803@mellanox.co.il> <1136577688.4336.11886.camel@hal.voltaire.com> <43BECF4F.6080807@mellanox.co.il> Message-ID: <1136578870.4336.12079.camel@hal.voltaire.com> On Fri, 2006-01-06 at 15:13, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Fri, 2006-01-06 at 15:00, Eitan Zahavi wrote: > > > >>I agree with Todd: a key is to keep the client unaware of the mux existence. > >>So the same client can be run on system without the cache. > > > > > > Define same client ? I would consider it the same SA client directing > > requests differently based on how the mux is configured based on a query > > to the cache (if it exists) as to its capabilities. > SA Client can be embedded in an application - any program that can send mads > can be an SA client. 
Such (non OpenIB) clients would not take advantage of the cache. That seems like the tradeoff for not the duplicated forwarding of the request. Guess I'm in the minority thinking that this might be worthwhile. -- Hal > > > > > -- Hal > > > > > >>Hal Rosenstock wrote: > >> > >>>On Fri, 2006-01-06 at 09:05, Rimmer, Todd wrote: > >>> > >>> > >>>>>From: Hal Rosenstock [mailto:halr at voltaire.com] > >>>>>On Thu, 2006-01-05 at 18:36, Rimmer, Todd wrote: > >>>>> > >>>>> > >>>>>>This of course implies the "SA Mux" must analyze more than just > >>>>>>the attribute ID to determine if the replica can handle the query. > >>>>>>But the memory savings is well worth the extra level of filtering. > >>>>> > >>>>>If the SA cache does this, it seems it would be pretty simple > >>>>>to return > >>>>>this info in an attribute to the client so the client would > >>>>>know when to > >>>>>go to the cache/replica and when to go direct to the SA in the case > >>>>>where only certain queries are supported. Wouldn't this be > >>>>>advantageous > >>>>>when the replica doesn't support all queries ? > >>>> > >>>>Why put the burden on the application. give the query to the Mux. > >>> > >>> > >>>That's what I'm suggesting. Rather than a binary switch mux, a more > >>>granular one which determines how to route the outgoing SA request. > >>> > >>> > >>> > >>>> With an optional flag indicating a prefered "routing" (choices of: to SA, > >>>>to replica, let Mux decide). Then let it decide. As you suggest it may > >>>>be simplest to let the Mux try the replica and on failure fallback > >>>>to the SA transparent to the app (sort of the way SDP intercepts > >>>>socket ops and falls back to TCP/IP when SDP isn't appropriate). > >>> > >>> > >>>It depends on whether the replica/cache forwards unsupported requests on > >>>or responds with not supported back to the client as to how this is > >>>handled. Sean was proposing the forward on model and a binary switch at > >>>the client. 
I think this is more granular and can be mux'd only with the > >>>knowledge of what a replica/cache supports (not sure about dealing with > >>>different replica/caches supporting a different set of queries; need to > >>>think more on how the caches are located, etc.). You are mentioning a > >>>third model here. > >>> > >>>-- Hal > >>> > >>>_______________________________________________ > >>>openib-general mailing list > >>>openib-general at openib.org > >>>http://openib.org/mailman/listinfo/openib-general > >>> > >>>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > From halr at voltaire.com Fri Jan 6 12:26:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jan 2006 15:26:58 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43BECB17.8020007@mellanox.co.il> References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> <43BECB17.8020007@mellanox.co.il> Message-ID: <1136579012.4336.12113.camel@hal.voltaire.com> On Fri, 2006-01-06 at 14:55, Eitan Zahavi wrote: > I guess you mean the code that is answering to PathRecord queries? > It is possible to extract the "SMDB" objects and duplicate that database. > I am not sure it is such a good idea. What if the SM is not OpenSM? I would view that the database is an SADB with the actual pathrecords as one example rather than the SMDB from which they are calculated. I think Sean is interested in the SA packet query/response code here so avoid recreating this and that the backend would be stripped out. Sean, is that accurate ? 
-- Hal

From mshefty at ichips.intel.com Fri Jan 6 12:38:57 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 06 Jan 2006 12:38:57 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <43BB1A0F.2080305@ichips.intel.com>
References: <43BB1A0F.2080305@ichips.intel.com>
Message-ID: <43BED561.5090407@ichips.intel.com>

Sean Hefty wrote:
> - The MAD interface will result in additional data copies and userspace
>   to kernel transitions for clients residing on the local system.
> - Clients require a mechanism to locate the sa_cache, or need to make
>   assumptions about its location.

Based on some comments from people, I believe that we can handle the latter problem when the sa_cache/sa_replica/sa_whateveryouwanttocallit registers with the MAD layer. Ib_mad can record an sa_lid and sa_sl as part of a device's port attributes. These would initially be set the same as sm_lid and sm_sl. When a client registers to receive unsolicited SA MADs, the attributes would be updated accordingly. ib_sa and other clients sending MADs to the SA would use these values in place of the SM values.

I'm not fond of the idea of pushing an SA switch into the MAD layer, since this makes it more difficult for the actual cache to query the SA directly.

Another approach that may work better long term is treating the cache as a redirected SA request. Something along the lines of:

http://openib.org/pipermail/openib-general/2005-September/011349.html

(but with a restricted implementation for now) might also work.
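The port-attribute idea described above can be sketched roughly as follows. This is a standalone illustration, not code from ib_mad: the struct and function names are hypothetical, only the field names sm_lid, sm_sl, sa_lid and sa_sl come from the message, and the unregister step is an added assumption about what should happen when the cache goes away.

```c
#include <stdint.h>

/* Hypothetical per-port record holding both the SM and the SA address. */
struct port_sa_addr {
	uint16_t sm_lid;	/* LID of the subnet manager */
	uint8_t  sm_sl;		/* SL used to reach the subnet manager */
	uint16_t sa_lid;	/* where SA clients send their queries */
	uint8_t  sa_sl;		/* SL used for SA queries */
};

/* Initially the SA address is the same as the SM address. */
static void port_sa_addr_init(struct port_sa_addr *p,
			      uint16_t sm_lid, uint8_t sm_sl)
{
	p->sm_lid = sm_lid;
	p->sm_sl  = sm_sl;
	p->sa_lid = sm_lid;
	p->sa_sl  = sm_sl;
}

/*
 * Called when a cache or replica registers for unsolicited SA MADs:
 * SA clients are pointed at the cache, the SM address is left alone.
 */
static void port_sa_addr_register_cache(struct port_sa_addr *p,
					uint16_t cache_lid, uint8_t cache_sl)
{
	p->sa_lid = cache_lid;
	p->sa_sl  = cache_sl;
}

/* Assumption: when the cache unregisters, fall back to the SM address. */
static void port_sa_addr_unregister_cache(struct port_sa_addr *p)
{
	p->sa_lid = p->sm_lid;
	p->sa_sl  = p->sm_sl;
}
```

Keeping sm_lid and sm_sl untouched means a cache could presumably still address the real SA through them, which is the concern raised about placing an SA switch inside the MAD layer.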
- Sean

From mshefty at ichips.intel.com Fri Jan 6 12:40:10 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 06 Jan 2006 12:40:10 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <1136579012.4336.12113.camel@hal.voltaire.com>
References: <43BE2FF6.2060506@mellanox.co.il> <43BEBE03.6030309@ichips.intel.com> <43BECB17.8020007@mellanox.co.il> <1136579012.4336.12113.camel@hal.voltaire.com>
Message-ID: <43BED5AA.5040107@ichips.intel.com>

Hal Rosenstock wrote:
> I would view that the database is an SADB with the actual pathrecords as
> one example rather than the SMDB from which they are calculated. I think
> Sean is interested in the SA packet query/response code here so avoid
> recreating this and that the backend would be stripped out. Sean, is
> that accurate ?

Hal is correct.

- Sean

From eitan at mellanox.co.il Fri Jan 6 12:55:59 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 6 Jan 2006 22:55:59 +0200
Subject: [openib-general] SA cache design
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B461@mtlexch01.mtl.com>

Hi Sean

I am still confused about the exact requirement.
But the reference is: osm/opensm/osm_sa_path_record.c

The rest of the queries are handled by osm_sa_*.c (but not the code in _ctrl.c).

osm_sa_class_port_info.c
osm_sa_response.c
osm_sa_node_record.c
osm_sa_service_record.c
osm_sa_informinfo.c
osm_sa_path_record.c
osm_sa_slvl_record.c
osm_sa_lft_record.c
osm_sa_lft_record_ctrl.c
osm_sa_sminfo_record.c
osm_sa_link_record.c
osm_sa_pkey_record.c
osm_sa_vlarb_record.c
osm_sa_mad_ctrl.c
osm_sa_portinfo_record.c
osm_sa_mcmember_record.c

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Friday, January 06, 2006 10:40 PM
> To: Hal Rosenstock
> Cc: Eitan Zahavi; openib
> Subject: Re: [openib-general] SA cache design
>
> Hal Rosenstock wrote:
> > I would view that the database is an SADB with the actual pathrecords as
> > one example rather than the SMDB from which they are calculated. I think
> > Sean is interested in the SA packet query/response code here so avoid
> > recreating this and that the backend would be stripped out. Sean, is
> > that accurate ?
>
> Hal is correct.
>
> - Sean

From rdreier at cisco.com Fri Jan 6 12:57:42 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 12:57:42 -0800
Subject: [openib-general] Re: [PATCH] mthca: max_inline_data tweaks
In-Reply-To: <20060101103414.GU4907@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 1 Jan 2006 12:34:14 +0200")
References: <20060101103414.GU4907@mellanox.co.il>
Message-ID:

Thanks, applied.

From rdreier at cisco.com Fri Jan 6 13:01:35 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 13:01:35 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix for SQEr-to-RTS transition in modify-qp
In-Reply-To: <20060102094203.GA5607@mellanox.co.il> (Jack Morgenstein's message of "Mon, 2 Jan 2006 11:42:03 +0200")
References: <20060102094203.GA5607@mellanox.co.il>
Message-ID:

Thanks, applied.

From rdreier at cisco.com Fri Jan 6 13:04:53 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 13:04:53 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix for RTR-to-RTS transition in modify-qp
In-Reply-To: <20060102094340.GB5607@mellanox.co.il> (Jack Morgenstein's message of "Mon, 2 Jan 2006 11:43:40 +0200")
References: <20060102094340.GB5607@mellanox.co.il>
Message-ID:

Thanks, applied.

By the way, we really should consolidate this translation table into the core verbs so that it's not duplicated in every driver.
I think that both ehca and ipath have this same bug, cut and pasted from mthca. From rdreier at cisco.com Fri Jan 6 13:11:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 13:11:18 -0800 Subject: [openib-general] Re: [PATCH] mthca_mcg: multiple fixes In-Reply-To: <20051226150804.GW4907@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 26 Dec 2005 17:08:04 +0200") References: <20051226150804.GW4907@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Fri Jan 6 13:13:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 13:13:40 -0800 Subject: [openib-general] Re: [PATCH] mthca: fill vendor_err in completion with error In-Reply-To: <20051226120957.GR4907@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 26 Dec 2005 14:09:57 +0200") References: <20051226120957.GR4907@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Fri Jan 6 13:24:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 13:24:22 -0800 Subject: [openib-general] Re: [PATCH] mthca: add alternate path support In-Reply-To: <20060105134445.GJ2790@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 5 Jan 2006 15:44:45 +0200") References: <20060105134445.GJ2790@mellanox.co.il> Message-ID: Thanks a lot, applied. I had been meaning to implement this so you saved me some work. From tom at opengridcomputing.com Fri Jan 6 14:07:32 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 06 Jan 2006 16:07:32 -0600 Subject: [openib-general] [PATCH] CMA and iWARP In-Reply-To: <1136578777.14108.6.camel@trinity.austin.ammasso.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> Message-ID: <1136585252.14108.9.camel@trinity.austin.ammasso.com> Sean: I just noticed that the iw_addr_handler function is not called. This code became dead when the address resolution logic was combined between IB and IW. This function can be removed... 
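A rough sketch of why the handler ends up without a caller, assuming the combined flow classifies the device once during address resolution and then branches only on node type (as the addr.c and cma_acquire_dev hunks in the quoted patch suggest). Every name here is a hypothetical userspace stand-in, not a kernel symbol, and the return codes 1 and 2 are placeholders for the real IB and iWARP acquire paths:

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's ARPHRD_* and IB_NODE_* constants. */
enum hw_type   { HW_INFINIBAND, HW_ETHER, HW_OTHER };
enum node_type { NODE_UNKNOWN, NODE_CA, NODE_RNIC };

/* Shared address-resolution step: classify the device exactly once. */
static int classify(enum hw_type hw, enum node_type *out)
{
    switch (hw) {
    case HW_INFINIBAND:
        *out = NODE_CA;
        return 0;
    case HW_ETHER:
        *out = NODE_RNIC;
        return 0;
    default:
        return -EADDRNOTAVAIL;
    }
}

/* Transport-specific step: the only place that still needs to branch. */
static int acquire_dev(enum node_type type)
{
    switch (type) {
    case NODE_CA:
        return 1;   /* placeholder for the IB acquire path */
    case NODE_RNIC:
        return 2;   /* placeholder for the iWARP acquire path */
    default:
        return -ENODEV;
    }
}

/* One resolve path for both transports; no per-transport address handler. */
static int resolve(enum hw_type hw)
{
    enum node_type type = NODE_UNKNOWN;
    int ret = classify(hw, &type);

    if (ret)
        return ret;
    return acquire_dev(type);
}
```

With this shape, a transport-specific address callback has nothing left to do, which matches the observation that iw_addr_handler is never invoked.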
On Fri, 2006-01-06 at 14:19 -0600, Tom Tucker wrote: > Enclosed is a combined include file and core patch for iWARP support in CMA. This > patch includes changes per your last review. > > > Signed-off-by: Tom Tucker > > Index: core/cm.c > =================================================================== > --- core/cm.c (revision 4748) > +++ core/cm.c (working copy) > @@ -3261,6 +3261,9 @@ > int ret; > u8 i; > > + if (device->node_type == IB_NODE_RNIC) > + return; > + > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > device->phys_port_cnt, GFP_KERNEL); > if (!cm_dev) > Index: core/iwcm.c > =================================================================== > --- core/iwcm.c (revision 0) > +++ core/iwcm.c (revision 0) > @@ -0,0 +1,648 @@ > +/* > + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2004 Topspin Corporation. All rights reserved. > + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. > + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > + > +#include "cm_msgs.h" > + > +MODULE_AUTHOR("Tom Tucker"); > +MODULE_DESCRIPTION("iWARP CM"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +static void iwcm_add_one(struct ib_device *device); > +static void iwcm_remove_one(struct ib_device *device); > +struct iwcm_id_private; > + > +static struct ib_client iwcm_client = { > + .name = "iwcm", > + .add = iwcm_add_one, > + .remove = iwcm_remove_one > +}; > + > +static struct { > + spinlock_t lock; > + struct list_head device_list; > + rwlock_t device_lock; > + struct workqueue_struct* wq; > +} iwcm; > + > +struct iwcm_device; > +struct iwcm_port { > + struct iwcm_device *iwcm_dev; > + struct sockaddr_in local_addr; > + u8 port_num; > +}; > + > +struct iwcm_device { > + struct list_head list; > + struct ib_device *device; > + struct iwcm_port port[0]; > +}; > + > +struct iwcm_id_private { > + struct iw_cm_id id; > + > + spinlock_t lock; > + wait_queue_head_t wait; > + atomic_t refcount; > + > + struct rb_node listen_node; > + > + struct list_head work_list; > + atomic_t work_count; > +}; > + > +struct iwcm_work { > + struct work_struct 
work; > + struct iwcm_id_private* cm_id; > + struct iw_cm_event event; > +}; > + > +/* Called whenever a reference is added for a cm_id */ > +static inline void iwcm_addref_id(struct iwcm_id_private *cm_id_priv) > +{ > + atomic_inc(&cm_id_priv->refcount); > +} > + > +/* Called whenever releasing a reference to a cm_id */ > +static inline void iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > +{ > + if (atomic_dec_and_test(&cm_id_priv->refcount)) > + wake_up(&cm_id_priv->wait); > +} > + > +static void cm_event_handler(struct iw_cm_id* cm_id, struct iw_cm_event* event); > + > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + > + iwcm_id_priv = kmalloc(sizeof *iwcm_id_priv, GFP_KERNEL); > + if (!iwcm_id_priv) > + return ERR_PTR(-ENOMEM); > + > + memset(iwcm_id_priv, 0, sizeof *iwcm_id_priv); > + iwcm_id_priv->id.state = IW_CM_STATE_IDLE; > + iwcm_id_priv->id.device = device; > + iwcm_id_priv->id.cm_handler = cm_handler; > + iwcm_id_priv->id.context = context; > + iwcm_id_priv->id.event_handler = cm_event_handler; > + > + spin_lock_init(&iwcm_id_priv->lock); > + init_waitqueue_head(&iwcm_id_priv->wait); > + atomic_set(&iwcm_id_priv->refcount, 1); > + > + return &iwcm_id_priv->id; > + > +} > +EXPORT_SYMBOL(iw_create_cm_id); > + > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_LISTEN: > + cm_id->state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > + break; > + > + case IW_CM_STATE_CONN_RECV: > + case IW_CM_STATE_CONN_SENT: > + case IW_CM_STATE_ESTABLISHED: > + cm_id->state = IW_CM_STATE_IDLE; > +
spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->disconnect(cm_id,1); > + break; > + > + case IW_CM_STATE_IDLE: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + break; > + > + default: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + printk(KERN_ERR "%s:%s:%u Illegal state %d for iw_cm_id.\n", > + __FILE__, __FUNCTION__, __LINE__, cm_id->state); > + ; > + } > + > + atomic_dec(&iwcm_id_priv->refcount); > + wait_event(iwcm_id_priv->wait, !atomic_read(&iwcm_id_priv->refcount)); > + > + kfree(iwcm_id_priv); > +} > +EXPORT_SYMBOL(iw_destroy_cm_id); > + > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + if (cm_id->state != IW_CM_STATE_IDLE) { > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + return -EBUSY; > + } > + cm_id->state = IW_CM_STATE_LISTEN; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > + if (ret != 0) > + cm_id->state = IW_CM_STATE_IDLE; > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_listen); > + > +int iw_cm_getpeername(struct iw_cm_id *cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr) > +{ > + if (cm_id->device == 0) > + return -EINVAL; > + > + if (cm_id->device->iwcm == 0) > + return -EINVAL; > + > + /* Make sure there's a connection */ > + if (cm_id->state != IW_CM_STATE_ESTABLISHED) > + return -ENOTCONN; > + > + return cm_id->device->iwcm->getpeername(cm_id, local_addr, remote_addr); > +} > +EXPORT_SYMBOL(iw_cm_getpeername); > + > +int iw_cm_reject(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned 
long flags; > + int ret; > + > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_CONN_RECV: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->reject(cm_id, private_data, private_data_len); > + cm_id->state = IW_CM_STATE_IDLE; > + break; > + default: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = -EINVAL; > + } > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_reject); > + > +int iw_cm_accept(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret; > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_CONN_RECV: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->accept(cm_id, private_data, > + private_data_len); > + if (ret == 0) { > + struct iw_cm_event event; > + event.event = IW_CM_EVENT_ESTABLISHED; > + event.provider_id = cm_id->provider_id; > + event.status = 0; > + event.local_addr = cm_id->local_addr; > + event.remote_addr = cm_id->remote_addr; > + event.private_data = 0; > + event.private_data_len = 0; > + cm_event_handler(cm_id, &event); > + } > + > + break; > + default: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = -EINVAL; > + } > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_accept); > + > +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp) > +{ > + int ret = -EINVAL; > + > + if (cm_id) { > + cm_id->qp = qp; > + ret = 0; > + } > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_bind_qp); > + > +int iw_cm_connect(struct iw_cm_id *cm_id, > 
+ const void* pdata, u8 pdata_len) > +{ > + struct iwcm_id_private* cm_id_priv; > + int ret = 0; > + unsigned long flags; > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + if (cm_id->state != IW_CM_STATE_IDLE) { > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + return -EBUSY; > + } > + cm_id->state = IW_CM_STATE_CONN_SENT; > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + ret = cm_id->device->iwcm->connect(cm_id, pdata, pdata_len); > + if (ret != 0) > + cm_id->state = IW_CM_STATE_IDLE; > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_connect); > + > +int iw_cm_disconnect(struct iw_cm_id *cm_id) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret; > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0 || cm_id->qp == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_ESTABLISHED: > + cm_id->state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->disconnect(cm_id, 1); > + if (ret == 0) { > + struct iw_cm_event event; > + event.event = IW_CM_EVENT_LLP_DISCONNECT; > + event.provider_id = cm_id->provider_id; > + event.status = 0; > + event.local_addr = cm_id->local_addr; > + event.remote_addr = cm_id->remote_addr; > + event.private_data = 0; > + event.private_data_len = 0; > + cm_event_handler(cm_id, &event); > + } > + > + break; > + default: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = -EINVAL; > + } > + > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_disconnect); > + > +static void iwcm_add_one(struct ib_device *device) > +{ > + struct iwcm_device *iwcm_dev; > + struct iwcm_port *port; > + unsigned long flags; > + u8 i; > + > + if (device->node_type 
!= IB_NODE_RNIC) > + return; > + > + iwcm_dev = kmalloc(sizeof(*iwcm_dev) + sizeof(*port) * > + device->phys_port_cnt, GFP_KERNEL); > + if (!iwcm_dev) > + return; > + > + iwcm_dev->device = device; > + > + for (i = 1; i <= device->phys_port_cnt; i++) { > + port = &iwcm_dev->port[i-1]; > + port->iwcm_dev = iwcm_dev; > + port->port_num = i; > + } > + > + ib_set_client_data(device, &iwcm_client, iwcm_dev); > + > + write_lock_irqsave(&iwcm.device_lock, flags); > + list_add_tail(&iwcm_dev->list, &iwcm.device_list); > + write_unlock_irqrestore(&iwcm.device_lock, flags); > + return; > +} > + > +static void iwcm_remove_one(struct ib_device *device) > +{ > + struct iwcm_device *iwcm_dev; > + unsigned long flags; > + > + iwcm_dev = ib_get_client_data(device, &iwcm_client); > + if (!iwcm_dev) > + return; > + > + write_lock_irqsave(&iwcm.device_lock, flags); > + list_del(&iwcm_dev->list); > + write_unlock_irqrestore(&iwcm.device_lock, flags); > + > + kfree(iwcm_dev); > +} > + > +/* Handles an inbound connect request. The function creates a new > + * iw_cm_id to represent the new connection and inherits the client > + * callback function and other attributes from the listening parent. > + * > + * The work item contains a pointer to the listen_cm_id and the event. The > + * listen_cm_id contains the client cm_handler, context and device. These are > + * copied when the device is cloned. The event contains the new four tuple. > + */ > +static int cm_conn_req_handler(struct iwcm_work* work) > +{ > + struct iw_cm_id* cm_id; > + struct iwcm_id_private* cm_id_priv; > + int rc; > + > + /* If the status was not successful, ignore request */ > + if (work->event.status) { > + printk(KERN_ERR "%s:%d Bad status=%d for connection request ... 
" > + "should be filtered by provider\n", > + __FUNCTION__, __LINE__, > + work->event.status); > + return work->event.status; > + } > + cm_id = iw_create_cm_id(work->cm_id->id.device, work->cm_id->id.cm_handler, > + work->cm_id->id.context); > + if (IS_ERR(cm_id)) > + return PTR_ERR(cm_id); > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.provider_id = work->event.provider_id; > + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; > + > + /* Call the client CM handler */ > + rc = cm_id->cm_handler(cm_id, &work->event); > + if (rc) { > + cm_id->state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(cm_id); > + } > + kfree(work); > + return 0; > +} > + > +/* > + * Handles the transition to established state on the passive side. > + */ > +static int cm_conn_est_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + if (cm_id_priv->id.state != IW_CM_STATE_CONN_RECV) { > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for established event\n", > + __FUNCTION__, __LINE__, cm_id_priv->id.state); > + ret = -EINVAL; > + goto error_out; > + } > + > + if (work->event.status == 0) { > + cm_id_priv = work->cm_id; > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; > + } else > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } > + > + error_out: > + kfree(work); > + 
return ret; > +} > + > +/* > + * Handles the reply to our connect request. There are three > + * possibilities: > + * - If the cm_id is in the wrong state when the event is > + * delivered, the event is ignored. [What should we do when the > + * provider does something crazy?] > + * - If the remote peer accepts the connection, we update the 4-tuple > + * in the cm_id with the remote peer info, move the cm_id to the > + * ESTABLISHED state and deliver the event to the client. > + * - If the remote peer rejects the connection, or there is some > + * connection error, move the cm_id to the IDLE state, and deliver > + * the event to the client. > + */ > +static int cm_conn_rep_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + if (cm_id_priv->id.state != IW_CM_STATE_CONN_SENT) { > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for connect reply event\n", > + __FUNCTION__, __LINE__, cm_id_priv->id.state); > + ret = -EINVAL; > + goto error_out; > + } > + > + if (work->event.status == 0) { > + cm_id_priv = work->cm_id; > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; > + } else > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } > + > + error_out: > + kfree(work); > + return ret; > +} > + > +static int cm_disconnect_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > 
+ > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) > + iw_destroy_cm_id(&cm_id_priv->id); > + > + kfree(work); > + return ret; > +} > + > +static void cm_work_handler(void* arg) > +{ > + struct iwcm_work* work = (struct iwcm_work*)arg; > + int rc; > + > + switch (work->event.event) { > + case IW_CM_EVENT_CONNECT_REQUEST: > + rc = cm_conn_req_handler(work); > + break; > + case IW_CM_EVENT_CONNECT_REPLY: > + rc = cm_conn_rep_handler(work); > + break; > + case IW_CM_EVENT_ESTABLISHED: > + rc = cm_conn_est_handler(work); > + break; > + case IW_CM_EVENT_LLP_DISCONNECT: > + case IW_CM_EVENT_LLP_TIMEOUT: > + case IW_CM_EVENT_LLP_RESET: > + case IW_CM_EVENT_CLOSE: > + rc = cm_disconnect_handler(work); > + break; > + } > +} > + > +/* IW CM provider event callback handler. This function is called in > + * interrupt context. The function builds a work queue element > + * and enqueues it for processing on a work queue thread. This allows > + * CM client callback functions to block.
> + */ > +static void cm_event_handler(struct iw_cm_id* cm_id, > + struct iw_cm_event* event) > +{ > + struct iwcm_work *work; > + struct iwcm_id_private* cm_id_priv; > + > + work = kmalloc(sizeof *work, GFP_ATOMIC); > + if (!work) > + return; > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + INIT_WORK(&work->work, cm_work_handler, work); > + work->cm_id = cm_id_priv; > + work->event = *event; > + queue_work(iwcm.wq, &work->work); > +} > + > +static int __init iw_cm_init(void) > +{ > + memset(&iwcm, 0, sizeof iwcm); > + INIT_LIST_HEAD(&iwcm.device_list); > + rwlock_init(&iwcm.device_lock); > + spin_lock_init(&iwcm.lock); > + iwcm.wq = create_workqueue("iw_cm"); > + if (!iwcm.wq) > + return -ENOMEM; > + > + return ib_register_client(&iwcm_client); > +} > + > +static void __exit iw_cm_cleanup(void) > +{ > + ib_unregister_client(&iwcm_client); > +} > + > +module_init(iw_cm_init); > +module_exit(iw_cm_cleanup); > + > Index: core/addr.c > =================================================================== > --- core/addr.c (revision 4748) > +++ core/addr.c (working copy) > @@ -65,6 +65,9 @@ > case ARPHRD_INFINIBAND: > dev_addr->dev_type = IB_NODE_CA; > break; > + case ARPHRD_ETHER: > + dev_addr->dev_type = IB_NODE_RNIC; > + break; > default: > return -EADDRNOTAVAIL; > } > Index: core/Makefile > =================================================================== > --- core/Makefile (revision 4748) > +++ core/Makefile (working copy) > @@ -1,6 +1,6 @@ > EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib > > -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ > +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o iw_cm.o \ > ib_sa.o ib_at.o ib_addr.o rdma_cm.o > obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o > obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o rdma_ucm.o > @@ -14,6 +14,8 @@ > > ib_cm-y := cm.o > > +iw_cm-y := iwcm.o > + > rdma_cm-y := cma.o > > rdma_ucm-y := 
ucma.o > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 4748) > +++ core/cma.c (working copy) > @@ -3,6 +3,7 @@ > * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. > * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. > * Copyright (c) 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > * > * This Software is licensed under one of the following licenses: > * > @@ -31,9 +32,14 @@ > #include > #include > #include > +#include > +#include > +#include > +#include > #include > #include > #include > +#include > #include > > MODULE_AUTHOR("Guy German"); > @@ -102,8 +108,12 @@ > int timeout_ms; > struct ib_sa_query *query; > int query_id; > - struct ib_cm_id *cm_id; > > + union { > + struct ib_cm_id *ib; > + struct iw_cm_id *iw; > + } cm_id; > + > u32 seq_num; > u32 qp_num; > enum ib_qp_type qp_type; > @@ -239,11 +249,40 @@ > return ret; > } > > +static int cma_acquire_iw_dev(struct rdma_id_private* id_priv) > +{ > + struct rdma_dev_addr* dev_addr = &id_priv->id.route.addr.dev_addr; > + struct cma_device* cma_dev; > + int ret = -ENOENT; > + > + down(&mutex); > + list_for_each_entry(cma_dev, &dev_list, list) { > + if (memcmp(dev_addr->src_dev_addr, > + &cma_dev->node_guid, > + sizeof(cma_dev->node_guid)) == 0) { > + > + /* If we find the device, then check if this > + * is an iWARP device. 
If it is, then attach > + */ > + if (cma_dev->device->node_type == IB_NODE_RNIC) { > + cma_attach_to_dev(id_priv, cma_dev); > + ret = 0; > + break; > + } > + } > + } > + up(&mutex); > + > + return ret; > +} > + > static int cma_acquire_dev(struct rdma_id_private *id_priv) > { > switch (id_priv->id.route.addr.dev_addr.dev_type) { > case IB_NODE_CA: > return cma_acquire_ib_dev(id_priv); > + case IB_NODE_RNIC: > + return cma_acquire_iw_dev(id_priv); > default: > return -ENODEV; > } > @@ -306,6 +345,16 @@ > IB_QP_PKEY_INDEX | IB_QP_PORT); > } > > +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); > +} > + > int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, > struct ib_qp_init_attr *qp_init_attr) > { > @@ -325,6 +374,9 @@ > case IB_NODE_CA: > ret = cma_init_ib_qp(id_priv, qp); > break; > + case IB_NODE_RNIC: > + ret = cma_init_iw_qp(id_priv, qp); > + break; > default: > ret = -ENOSYS; > break; > @@ -412,7 +464,7 @@ > id_priv = container_of(id, struct rdma_id_private, id); > switch (id_priv->id.device->node_type) { > case IB_NODE_CA: > - ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, > + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, > qp_attr_mask); > if (qp_attr->qp_state == IB_QPS_RTR) > qp_attr->rq_psn = id_priv->seq_num; > @@ -567,8 +619,8 @@ > { > cma_exch(id_priv, CMA_DESTROYING); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - ib_destroy_cm_id(id_priv->cm_id); > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > + ib_destroy_cm_id(id_priv->cm_id.ib); > > list_del(&id_priv->listen_list); > if (id_priv->cma_dev) > @@ -624,9 +676,20 @@ > state = cma_exch(id_priv, CMA_DESTROYING); > cma_cancel_operation(id_priv, state); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - 
ib_destroy_cm_id(id_priv->cm_id); > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { > > + switch (id->device->node_type) { > + case IB_NODE_RNIC: > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > + default: > + ib_destroy_cm_id(id_priv->cm_id.ib); > + break; > + } > + > + id_priv->cm_id.ib = NULL; > + } > + > if (id_priv->cma_dev) { > down(&mutex); > cma_detach_from_dev(id_priv); > @@ -652,15 +715,15 @@ > ret = cma_modify_qp_rts(&id_priv->id); > if (ret) > goto reject; > - > - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); > + > + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); > if (ret) > goto reject; > > return 0; > reject: > cma_modify_qp_err(&id_priv->id); > - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, NULL, 0); > return ret; > } > @@ -676,7 +739,7 @@ > return 0; > reject: > cma_modify_qp_err(&id_priv->id); > - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, NULL, 0); > return ret; > } > @@ -737,7 +800,7 @@ > private_data_len); > if (ret) { > /* Destroy the CM ID by returning a non-zero value. */ > - id_priv->cm_id = NULL; > + id_priv->cm_id.ib = NULL; > cma_exch(id_priv, CMA_DESTROYING); > cma_release_remove(id_priv); > rdma_destroy_id(&id_priv->id); > @@ -819,7 +882,7 @@ > goto out; > } > > - conn_id->cm_id = cm_id; > + conn_id->cm_id.ib = cm_id; > cm_id->context = conn_id; > cm_id->cm_handler = cma_ib_handler; > > @@ -829,7 +892,7 @@ > IB_CM_REQ_PRIVATE_DATA_SIZE - offset); > if (ret) { > /* Destroy the CM ID by returning a non-zero value. 
*/ > - conn_id->cm_id = NULL; > + conn_id->cm_id.ib = NULL; > cma_exch(conn_id, CMA_DESTROYING); > cma_release_remove(conn_id); > rdma_destroy_id(&conn_id->id); > @@ -874,6 +937,116 @@ > } > } > > +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) > +{ > + struct rdma_id_private *id_priv = iw_id->context; > + enum rdma_cm_event_type event_type = 0; > + int ret = 0; > + > + atomic_inc(&id_priv->dev_remove); > + > + switch (event->event) { > + case IW_CM_EVENT_LLP_DISCONNECT: > + case IW_CM_EVENT_LLP_RESET: > + case IW_CM_EVENT_LLP_TIMEOUT: > + case IW_CM_EVENT_CLOSE: > + event_type = RDMA_CM_EVENT_DISCONNECTED; > + break; > + > + case IW_CM_EVENT_CONNECT_REPLY: { > + if (event->status) > + event_type = RDMA_CM_EVENT_REJECTED; > + else > + event_type = RDMA_CM_EVENT_ESTABLISHED; > + break; > + } > + > + case IW_CM_EVENT_ESTABLISHED: > + event_type = RDMA_CM_EVENT_ESTABLISHED; > + break; > + > + default: > + BUG_ON(1); > + break; > + > + } > + > + ret = cma_notify_user(id_priv, > + event_type, > + event->status, > + event->private_data, > + event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. 
*/ > + id_priv->cm_id.iw = NULL; > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return ret; > + } > + > + cma_release_remove(id_priv); > + return ret; > +} > + > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id* new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in* sin; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id for the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context, > + RDMA_PS_TCP); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); > + conn_id->state = CMA_CONNECT; > + > + /* New connection inherits device from parent */ > + down(&mutex); > + cma_attach_to_dev(conn_id, listen_id->cma_dev); > + up(&mutex); > + > + conn_id->cm_id.iw = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_iw_handler; > + > + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; > + *sin = iw_event->local_addr; > + > + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + > + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value.
*/ > + conn_id->cm_id.iw = NULL; > + cma_exch(conn_id, CMA_DESTROYING); > + cma_release_remove(conn_id); > + rdma_destroy_id(&conn_id->id); > + } > + > +out: > + cma_release_remove(listen_id); > + return ret; > +} > + > static int cma_ib_listen(struct rdma_id_private *id_priv) > { > struct ib_cm_private_data_compare compare_data; > @@ -881,28 +1054,52 @@ > __be64 svc_id; > int ret; > > - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, > + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, > id_priv); > - if (IS_ERR(id_priv->cm_id)) > - return PTR_ERR(id_priv->cm_id); > + if (IS_ERR(id_priv->cm_id.ib)) > + return PTR_ERR(id_priv->cm_id.ib); > > addr = &id_priv->id.route.addr.src_addr; > svc_id = cma_get_service_id(id_priv->id.ps, addr); > if (cma_any_addr(addr)) > - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); > + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); > else { > cma_set_compare_data(addr, &compare_data); > - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); > + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); > } > > if (ret) { > - ib_destroy_cm_id(id_priv->cm_id); > - id_priv->cm_id = NULL; > + ib_destroy_cm_id(id_priv->cm_id.ib); > + id_priv->cm_id.ib = NULL; > } > > return ret; > } > > +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) > +{ > + int ret; > + struct sockaddr_in* sin; > + > + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, > + iw_conn_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id.iw)) > + return PTR_ERR(id_priv->cm_id.iw); > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; > + id_priv->cm_id.iw->local_addr = *sin; > + > + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); > + > + if (ret) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = NULL; > + } > + > + return ret; > +} > + > static int cma_duplicate_listen(struct rdma_id_private *id_priv) > { > struct rdma_id_private 
*cur_id_priv; > @@ -988,6 +1185,9 @@ > case IB_NODE_CA: > ret = cma_ib_listen(id_priv); > break; > + case IB_NODE_RNIC: > + ret = cma_iw_listen(id_priv, backlog); > + break; > default: > ret = -ENOSYS; > break; > @@ -1067,6 +1267,45 @@ > return (id_priv->query_id < 0) ? id_priv->query_id : 0; > } > > +static void iw_route_handler(void* data) > +{ > + struct cma_work *work = data; > + struct rdma_id_private *id_priv = work->id; > + > + kfree(work); > + > + atomic_inc(&id_priv->dev_remove); > + > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) > + goto out; > + > + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ROUTE_RESOLVED, 0, NULL, 0)) { > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > + out: > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > +} > + > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > +{ > + struct cma_work *work; > + > + work = kmalloc(sizeof *work, GFP_KERNEL); > + if (!work) > + return -ENOMEM; > + > + work->id = id_priv; > + INIT_WORK(&work->work, iw_route_handler, work); > + queue_work(rdma_wq, &work->work); > + > + return 0; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -1081,6 +1320,9 @@ > case IB_NODE_CA: > ret = cma_resolve_ib_route(id_priv, timeout_ms); > break; > + case IB_NODE_RNIC: > + ret = cma_resolve_iw_route(id_priv, timeout_ms); > + break; > default: > ret = -ENOSYS; > break; > @@ -1221,12 +1463,36 @@ > return ret; > } > > +static void iw_addr_handler(void* data) > +{ > + struct cma_work *work = data; > + struct rdma_id_private *id_priv = work->id; > + > + kfree(work); > + > + atomic_inc(&id_priv->dev_remove); > + > + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) > + goto out; > + > + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { > + cma_exch(id_priv, 
CMA_DESTROYING); > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > +out: > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > +} > + > int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > struct sockaddr *dst_addr, int timeout_ms) > { > struct rdma_id_private *id_priv; > enum cma_state expected_state; > - int ret; > + int ret = 0; > > id_priv = container_of(id, struct rdma_id_private, id); > if (id_priv->cma_dev) { > @@ -1341,10 +1607,10 @@ > memcpy(private_data + offset, conn_param->private_data, > conn_param->private_data_len); > > - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > id_priv); > - if (IS_ERR(id_priv->cm_id)) { > - ret = PTR_ERR(id_priv->cm_id); > + if (IS_ERR(id_priv->cm_id.ib)) { > + ret = PTR_ERR(id_priv->cm_id.ib); > goto out; > } > > @@ -1371,12 +1637,45 @@ > req.max_cm_retries = CMA_MAX_CM_RETRIES; > req.srq = id_priv->srq ? 
1 : 0; > > - ret = ib_send_cm_req(id_priv->cm_id, &req); > + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); > out: > kfree(private_data); > return ret; > } > > +static int cma_connect_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_id* cm_id; > + struct sockaddr_in* sin; > + int ret; > + > + if (id_priv->id.qp == NULL) > + return -EINVAL; > + > + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); > + if (IS_ERR(cm_id)) { > + ret = PTR_ERR(cm_id); > + goto out; > + } > + > + id_priv->cm_id.iw = cm_id; > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; > + cm_id->local_addr = *sin; > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; > + cm_id->remote_addr = *sin; > + > + iw_cm_bind_qp(cm_id, id_priv->id.qp); > + > + ret = iw_cm_connect(cm_id, conn_param->private_data, > + conn_param->private_data_len); > + > +out: > + return ret; > +} > + > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > @@ -1396,6 +1695,9 @@ > case IB_NODE_CA: > ret = cma_connect_ib(id_priv, conn_param); > break; > + case IB_NODE_RNIC: > + ret = cma_connect_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1433,7 +1735,7 @@ > rep.rnr_retry_count = conn_param->rnr_retry_count; > rep.srq = id_priv->srq ? 
1 : 0; > > - return ib_send_cm_rep(id_priv->cm_id, &rep); > + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); > } > > int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > @@ -1458,6 +1760,12 @@ > else > ret = cma_rep_recv(id_priv); > break; > + case IB_NODE_RNIC: { > + iw_cm_bind_qp(id_priv->cm_id.iw, id_priv->id.qp); > + ret = iw_cm_accept(id_priv->cm_id.iw, conn_param->private_data, > + conn_param->private_data_len); > + break; > + } > default: > ret = -ENOSYS; > break; > @@ -1486,9 +1794,15 @@ > > switch (id->device->node_type) { > case IB_NODE_CA: > - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, private_data, private_data_len); > break; > + > + case IB_NODE_RNIC: > + ret = iw_cm_reject(id_priv->cm_id.iw, > + private_data, private_data_len); > + break; > + > default: > ret = -ENOSYS; > break; > @@ -1513,9 +1827,12 @@ > switch (id->device->node_type) { > case IB_NODE_CA: > /* Initiate or respond to a disconnect. 
*/ > - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) > - ib_send_cm_drep(id_priv->cm_id, NULL, 0); > + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) > + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); > break; > + case IB_NODE_RNIC: > + ret = iw_cm_disconnect(id_priv->cm_id.iw); > + break; > default: > break; > } > Index: core/mad.c > =================================================================== > --- core/mad.c (revision 4748) > +++ core/mad.c (working copy) > @@ -2655,7 +2655,9 @@ > { > int start, end, i; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == IB_NODE_RNIC) > + return; > + else if (device->node_type == IB_NODE_SWITCH) { > start = 0; > end = 0; > } else { > @@ -2702,7 +2704,9 @@ > { > int i, num_ports, cur_port; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == IB_NODE_RNIC) > + return; > + else if (device->node_type == IB_NODE_SWITCH) { > num_ports = 1; > cur_port = 0; > } else { > Index: include/rdma/ib_verbs.h > =================================================================== > --- include/rdma/ib_verbs.h (revision 4748) > +++ include/rdma/ib_verbs.h (working copy) > @@ -67,7 +67,8 @@ > enum ib_node_type { > IB_NODE_CA = 1, > IB_NODE_SWITCH, > - IB_NODE_ROUTER > + IB_NODE_ROUTER, > + IB_NODE_RNIC > }; > > enum ib_device_cap_flags { > @@ -86,6 +87,14 @@ > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > IB_DEVICE_SRQ_RESIZE = (1<<13), > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), > + IB_DEVICE_ZERO_STAG = (1<<16), > + IB_DEVICE_SEND_W_INV = (1<<17), > + IB_DEVICE_MW = (1<<18), > + IB_DEVICE_FMR = (1<<19), > + IB_DEVICE_SRQ = (1<<20), > + IB_DEVICE_ARP = (1<<21), > + IB_DEVICE_LLP = (1<<22), > }; > > enum ib_atomic_cap { > @@ -824,6 +833,8 @@ > > u32 flags; > > + struct iw_cm_verbs *iwcm; > + > int (*query_device)(struct ib_device *device, > struct ib_device_attr *device_attr); > int (*query_port)(struct ib_device *device, > Index: include/rdma/iw_cm.h > 
=================================================================== > --- include/rdma/iw_cm.h (revision 0) > +++ include/rdma/iw_cm.h (revision 0) > @@ -0,0 +1,152 @@ > +/* > + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > +#if !defined(IW_CM_H) > +#define IW_CM_H > + > +#include > +#include > + > +struct iw_cm_id; > +struct iw_cm_event; > + > +enum iw_cm_event_type { > + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ > + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ > + IW_CM_EVENT_ESTABLISHED, > + IW_CM_EVENT_LLP_DISCONNECT, > + IW_CM_EVENT_LLP_RESET, > + IW_CM_EVENT_LLP_TIMEOUT, > + IW_CM_EVENT_CLOSE > +}; > + > +struct iw_cm_event { > + enum iw_cm_event_type event; > + int status; > + u32 provider_id; > + struct sockaddr_in local_addr; > + struct sockaddr_in remote_addr; > + void *private_data; > + u8 private_data_len; > +}; > + > +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, > + struct iw_cm_event *event); > + > +enum iw_cm_state { > + IW_CM_STATE_IDLE, /* unbound, inactive */ > + IW_CM_STATE_LISTEN, /* listen waiting for connect */ > + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ > + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ > + IW_CM_STATE_ESTABLISHED, /* established */ > +}; > + > +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, > + struct iw_cm_event* event); > +struct iw_cm_id { > + iw_cm_handler cm_handler; /* client callback function */ > + void *context; /* context to provide to client cb */ > + enum iw_cm_state state; > + struct ib_device *device; > + struct ib_qp *qp; > + struct sockaddr_in local_addr; > + struct sockaddr_in remote_addr; > + u64 provider_id; /* device handle for this conn. */ > + iw_event_handler event_handler; /* callback for IW CM Provider events */ > +}; > + > +/** > + * iw_create_cm_id - Allocate a communication identifier. > + * @device: Device associated with the cm_id. All related communication will > + * be associated with the specified device. > + * @cm_handler: Callback invoked to notify the user of CM events. > + * @context: User specified context associated with the communication > + * identifier. 
> + * > + * Communication identifiers are used to track connection states, > + * addr resolution requests, and listen requests. > + */ > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context); > + > +/* This is provided in the event generated when > + * a remote peer accepts our connect request > + */ > + > +struct iw_cm_verbs { > + int (*connect)(struct iw_cm_id* cm_id, > + const void* private_data, > + u8 private_data_len); > + > + int (*disconnect)(struct iw_cm_id* cm_id, > + int abrupt); > + > + int (*accept)(struct iw_cm_id*, > + const void *private_data, > + u8 pdata_data_len); > + > + int (*reject)(struct iw_cm_id* cm_id, > + const void* private_data, > + u8 private_data_len); > + > + int (*getpeername)(struct iw_cm_id* cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr); > + > + int (*create_listen)(struct iw_cm_id* cm_id, > + int backlog); > + > + int (*destroy_listen)(struct iw_cm_id* cm_id); > + > +}; > + > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context); > +void iw_destroy_cm_id(struct iw_cm_id *cm_id); > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); > +int iw_cm_getpeername(struct iw_cm_id *cm_id, > + struct sockaddr_in* local_add, > + struct sockaddr_in* remote_addr); > +int iw_cm_reject(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len); > +int iw_cm_accept(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len); > +int iw_cm_connect(struct iw_cm_id *cm_id, > + const void* pdata, u8 pdata_len); > +int iw_cm_disconnect(struct iw_cm_id *cm_id); > +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); > + > +#endif /* IW_CM_H */ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general

From iod00d at hp.com Fri Jan 6 15:16:05 2006
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 6 Jan 2006 15:16:05 -0800
Subject: [openib-general] Re: Failure in reset HCA with backport-svn4507-to-2.6.9
In-Reply-To: <96f8e60e0601061226i3ad174feka58c66da97f5663e@mail.gmail.com>
References: <96f8e60e0601051440t69782459k113307dd82a1b559@mail.gmail.com>
	<20060106014045.GI23796@esmail.cup.hp.com>
	<96f8e60e0601061226i3ad174feka58c66da97f5663e@mail.gmail.com>
Message-ID: <20060106231605.GC29572@esmail.cup.hp.com>

On Fri, Jan 06, 2006 at 12:26:07PM -0800, Ranjit Pandit wrote:
> I'm running with 3.3.3 which is pretty recent.

yeah - I've not had/exposed any problems with 3.3.3 either.

> It looks like the problem is system dependent and reproducible on Dell 2650's.
> Has anybody lately tested on a Dell 2650?

Dell 2650 is advertised to have Serverworks "Grand Champion LE" (GC-LE)
chipset. Maybe this quirk is relevant? (I doubt it but it's possible.)
See drivers/pci/quirks.c:

	...
	static void __init quirk_svw_msi(struct pci_dev *dev)
	{
		pci_msi_quirk = 1;
		printk(KERN_WARNING "PCI: MSI quirk detected. pci_msi_quirk set.\n");
	}
	DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, PCI_DEVICE_ID_SERVERWORKS_GCNB_LE, quirk_svw_msi );
	#endif /* CONFIG_X86_IO_APIC */
	...

> Btw, if I comment out mthca_reset() in mthca_main.c, then the drivers
> load and ports go active on the 2650.
>
> I suggest somebody review the reset path in mthca... In the past we
> have had problems resetting Tavor on some platforms and chose not to
> reset at driver load time.

I'm all for resetting the card and re-initializing as long as it
doesn't perturb the rest of the cluster. I've had initialization
problems with tg3 driver in the past. They turned out to be bugs
in the driver init path making assumptions about the state of
the card as handed off by firmware. Rolling BIOS driver (EFI in
this case) was risky.

If it's really chipset related, it's likely a timing/ordering problem
where mthca isn't enforcing ordering when it should or not waiting
long enough for the card to recover. Adding printk's is usually
sufficient to figure out if it's a timing problem.

grant

From rdreier at cisco.com Fri Jan 6 16:15:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 16:15:16 -0800
Subject: [openib-general] *** glibc detected *** corrupted double-linked list error
In-Reply-To: (wei huang's message of "Thu, 5 Jan 2006 22:21:36 -0500 (EST)")
References:
Message-ID:

It seems that the free() in mthca_free_db_tab() is detecting some
corruption in the internal glibc data structures. However, I don't
see any obvious bug around here.

Is it possible that something in your application is causing
corruption in the glibc allocator before this? You could try running
the application with the environment variable MALLOC_CHECK_ set to 1
and see what it prints.

 - R.

From rdreier at cisco.com Fri Jan 6 16:20:41 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 16:20:41 -0800
Subject: [openib-general] Re: [PATCH] uverbs: error handling fixes
In-Reply-To: <20051227091153.GC4907@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 27 Dec 2005 11:11:54 +0200")
References: <20051227091153.GC4907@mellanox.co.il>
Message-ID:

Thanks, applied.

From rdreier at cisco.com Fri Jan 6 16:25:07 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Jan 2006 16:25:07 -0800
Subject: [openib-general] Uninitialized structure field in ib_uverbs_create_ah()
In-Reply-To: <1135890006.5081.60.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Thu, 29 Dec 2005 13:00:06 -0800")
References: <1135890006.5081.60.camel@brick.internal.keyresearch.com>
Message-ID:

Thanks, applied. In the future please include a Signed-off-by line
with your patch.

 - R.

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 1/8] IB/mthca: max_inline_data handling tweaks
Message-ID: <1136593542999-4f2f4395a7bd3191@cisco.com>

Fix a case where copying max_inline_data from a successful create_qp
capabilities output to create_qp input could cause EINVAL error:

mthca_set_qp_size must check max_inline_data directly against
max_desc_sz; checking qp->sq.max_gs is wrong since max_inline_data
depends on the qp type and does not involve max_sg.

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_qp.c | 62 +++++++++++++++++++-------------
 1 files changed, 36 insertions(+), 26 deletions(-)

5b3bc7a68171138d52b1b62012c37ac888895460
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index ea45fa4..fd60cf3 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -890,18 +890,13 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	return err;
 }
 
-static void mthca_adjust_qp_caps(struct mthca_dev *dev,
-				 struct mthca_pd *pd,
-				 struct mthca_qp *qp)
+static int mthca_max_data_size(struct mthca_dev *dev, struct mthca_qp *qp, int desc_sz)
 {
-	int max_data_size;
-
 	/*
 	 * Calculate the maximum size of WQE s/g segments, excluding
 	 * the next segment and other non-data segments.
 	 */
-	max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) -
-		sizeof (struct mthca_next_seg);
+	int max_data_size = desc_sz - sizeof (struct mthca_next_seg);
 
 	switch (qp->transport) {
 	case MLX:
@@ -920,11 +915,24 @@ static void mthca_adjust_qp_caps(struct
 		break;
 	}
 
+	return max_data_size;
+}
+
+static inline int mthca_max_inline_data(struct mthca_pd *pd, int max_data_size)
+{
 	/* We don't support inline data for kernel QPs (yet). */
-	if (!pd->ibpd.uobject)
-		qp->max_inline_data = 0;
-	else
-		qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE;
+	return pd->ibpd.uobject ? max_data_size - MTHCA_INLINE_HEADER_SIZE : 0;
+}
+
+static void mthca_adjust_qp_caps(struct mthca_dev *dev,
+				 struct mthca_pd *pd,
+				 struct mthca_qp *qp)
+{
+	int max_data_size = mthca_max_data_size(dev, qp,
+						min(dev->limits.max_desc_sz,
+						    1 << qp->sq.wqe_shift));
+
+	qp->max_inline_data = mthca_max_inline_data(pd, max_data_size);
 
 	qp->sq.max_gs = min_t(int, dev->limits.max_sg,
 			      max_data_size / sizeof (struct mthca_data_seg));
@@ -1191,13 +1199,23 @@ static int mthca_alloc_qp_common(struct
 }
 
 static int mthca_set_qp_size(struct mthca_dev *dev, struct ib_qp_cap *cap,
-			     struct mthca_qp *qp)
+			     struct mthca_pd *pd, struct mthca_qp *qp)
 {
+	int max_data_size = mthca_max_data_size(dev, qp, dev->limits.max_desc_sz);
+
 	/* Sanity check QP size before proceeding */
-	if (cap->max_send_wr > dev->limits.max_wqes ||
-	    cap->max_recv_wr > dev->limits.max_wqes ||
-	    cap->max_send_sge > dev->limits.max_sg ||
-	    cap->max_recv_sge > dev->limits.max_sg)
+	if (cap->max_send_wr > dev->limits.max_wqes ||
+	    cap->max_recv_wr > dev->limits.max_wqes ||
+	    cap->max_send_sge > dev->limits.max_sg ||
+	    cap->max_recv_sge > dev->limits.max_sg ||
+	    cap->max_inline_data > mthca_max_inline_data(pd, max_data_size))
+		return -EINVAL;
+
+	/*
+	 * For MLX transport we need 2 extra S/G entries:
+	 * one for the header and one for the checksum at the end
+	 */
+	if (qp->transport == MLX && cap->max_recv_sge + 2 > dev->limits.max_sg)
 		return -EINVAL;
 
 	if (mthca_is_memfree(dev)) {
@@ -1216,14 +1234,6 @@ static int mthca_set_qp_size(struct mthc
 					MTHCA_INLINE_CHUNK_SIZE) /
 			      sizeof (struct mthca_data_seg));
 
-	/*
-	 * For MLX transport we need 2 extra S/G entries:
-	 * one for the header and one for the checksum at the end
-	 */
-	if ((qp->transport == MLX && qp->sq.max_gs + 2 > dev->limits.max_sg) ||
-	    qp->sq.max_gs > dev->limits.max_sg || qp->rq.max_gs > dev->limits.max_sg)
-		return -EINVAL;
-
 	return 0;
 }
 
@@ -1238,7 +1248,7 @@ int mthca_alloc_qp(struct mthca_dev *dev
 {
 	int err;
 
-	err = mthca_set_qp_size(dev, cap, qp);
+	err = mthca_set_qp_size(dev, cap, pd, qp);
 	if (err)
 		return err;
 
@@ -1281,7 +1291,7 @@ int mthca_alloc_sqp(struct mthca_dev *de
 	u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1;
 	int err;
 
-	err = mthca_set_qp_size(dev, cap, &sqp->qp);
+	err = mthca_set_qp_size(dev, cap, pd, &sqp->qp);
 	if (err)
 		return err;
 
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 5/8] IB/mthca: Fill in vendor_err field in completion with error
In-Reply-To: <1136593542999-1260d5ab9345c4eb@cisco.com>
Message-ID: <1136593543000-0e0e6d306b8be206@cisco.com>

Fill vendor_err field in completion with error.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_cq.c | 12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

0f8e8f9607d77ffc1f9820446dfcf781e96fdfd4
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index fcef8dc..96f1a86 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -128,12 +128,12 @@ struct mthca_err_cqe {
 	__be32 my_qpn;
 	u32 reserved1[3];
 	u8 syndrome;
-	u8 reserved2;
+	u8 vendor_err;
 	__be16 db_cnt;
-	u32 reserved3;
+	u32 reserved2;
 	__be32 wqe;
 	u8 opcode;
-	u8 reserved4[2];
+	u8 reserved3[2];
 	u8 owner;
 };
 
@@ -342,8 +342,8 @@ static int handle_error_cqe(struct mthca
 	}
 
 	/*
-	 * For completions in error, only work request ID, status (and
-	 * freed resource count for RD) have to be set.
+	 * For completions in error, only work request ID, status, vendor error
+	 * (and freed resource count for RD) have to be set.
 	 */
 	switch (cqe->syndrome) {
 	case SYNDROME_LOCAL_LENGTH_ERR:
@@ -405,6 +405,8 @@ static int handle_error_cqe(struct mthca
 		break;
 	}
 
+	entry->vendor_err = cqe->vendor_err;
+
 	/*
 	 * Mem-free HCAs always generate one CQE per WQE, even in the
 	 * error case, so we don't have to check the doorbell count, etc.
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 2/8] IB/mthca: fix for SQEr-to-RTS transition in modify QP
In-Reply-To: <1136593542999-4f2f4395a7bd3191@cisco.com>
Message-ID: <1136593542999-61fb4d9a5e85dd1e@cisco.com>

Fixes to SQEr->RTS transition in modify_qp:
1. The flag IB_QP_ACCESS_FLAGS is optional for UC qps
2. The SQEr state is not supported for RC qps

Signed-off-by: Jack Morgenstein
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_qp.c | 5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

0364ffc3e8c441d4185e3eb41ecc61dbb09614e4
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index fd60cf3..623f514 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -476,9 +476,8 @@ static const struct {
 			.opt_param = {
 				[UD] = (IB_QP_CUR_STATE |
 					IB_QP_QKEY),
-				[UC] = IB_QP_CUR_STATE,
-				[RC] = (IB_QP_CUR_STATE |
-					IB_QP_MIN_RNR_TIMER),
+				[UC] = (IB_QP_CUR_STATE |
+					IB_QP_ACCESS_FLAGS),
 				[MLX] = (IB_QP_CUR_STATE |
 					 IB_QP_QKEY),
 			}
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 3/8] IB/mthca: fix for RTR-to-RTS transition in modify QP
In-Reply-To: <1136593542999-61fb4d9a5e85dd1e@cisco.com>
Message-ID: <1136593542999-f3246e38ef6bb0f5@cisco.com>

PKEY_INDEX is not a legal parameter in the RTR->RTS transition.

Signed-off-by: Jack Morgenstein
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_qp.c | 2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

0d3b525fff40475e58dab9176740d2efc5f37838
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 623f514..ff2def3 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -383,12 +383,10 @@ static const struct {
 				[UC] = (IB_QP_CUR_STATE |
 					IB_QP_ALT_PATH |
 					IB_QP_ACCESS_FLAGS |
-					IB_QP_PKEY_INDEX |
 					IB_QP_PATH_MIG_STATE),
 				[RC] = (IB_QP_CUR_STATE |
 					IB_QP_ALT_PATH |
 					IB_QP_ACCESS_FLAGS |
-					IB_QP_PKEY_INDEX |
 					IB_QP_MIN_RNR_TIMER |
 					IB_QP_PATH_MIG_STATE),
 				[MLX] = (IB_QP_CUR_STATE |
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 7/8] IB/uverbs: Fix reference counting on error paths
In-Reply-To: <1136593543000-c8b76b848fc384d6@cisco.com>
Message-ID: <1136593543000-bf2926ca65fa9af8@cisco.com>

If an operation fails after incrementing an object's reference count,
then it should decrement the reference count on the error path.

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

 drivers/infiniband/core/uverbs_cmd.c | 7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

b4ca1a3f8ca24033d7b7ef595faef97d9f8b2326
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a57d021..6985a57 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -489,6 +489,7 @@ err_idr:
 
 err_unreg:
 	ib_dereg_mr(mr);
+	atomic_dec(&pd->usecnt);
 
 err_up:
 	up(&ib_uverbs_idr_mutex);
@@ -935,6 +936,11 @@ err_idr:
 
 err_destroy:
 	ib_destroy_qp(qp);
+	atomic_dec(&pd->usecnt);
+	atomic_dec(&attr.send_cq->usecnt);
+	atomic_dec(&attr.recv_cq->usecnt);
+	if (attr.srq)
+		atomic_dec(&attr.srq->usecnt);
 
 err_up:
 	up(&ib_uverbs_idr_mutex);
@@ -1729,6 +1735,7 @@ err_idr:
 
 err_destroy:
 	ib_destroy_srq(srq);
+	atomic_dec(&pd->usecnt);
 
 err_up:
 	up(&ib_uverbs_idr_mutex);
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 6/8] IB/mthca: Add support for automatic path migration (APM)
In-Reply-To: <1136593543000-0e0e6d306b8be206@cisco.com>
Message-ID: <1136593543000-c8b76b848fc384d6@cisco.com>

Add code to modify QP operation to handle setting alternate paths for
connected QPs.

Signed-off-by: Dotan Barak
Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_qp.c | 57 +++++++++++++++++++++-----------
 1 files changed, 37 insertions(+), 20 deletions(-)

4de144bf721e46e7ccc8fed45b20a640cc364904
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index ff2def3..564b6d5 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -549,6 +549,25 @@ static __be32 get_hw_access_flags(struct
 	return cpu_to_be32(hw_access_flags);
 }
 
+static void mthca_path_set(struct ib_ah_attr *ah, struct mthca_qp_path *path)
+{
+	path->g_mylmc = ah->src_path_bits & 0x7f;
+	path->rlid = cpu_to_be16(ah->dlid);
+	path->static_rate = !!ah->static_rate;
+
+	if (ah->ah_flags & IB_AH_GRH) {
+		path->g_mylmc |= 1 << 7;
+		path->mgid_index = ah->grh.sgid_index;
+		path->hop_limit = ah->grh.hop_limit;
+		path->sl_tclass_flowlabel =
+			cpu_to_be32((ah->sl << 28) |
+				    (ah->grh.traffic_class << 20) |
+				    (ah->grh.flow_label));
+		memcpy(path->rgid, ah->grh.dgid.raw, 16);
+	} else
+		path->sl_tclass_flowlabel = cpu_to_be32(ah->sl << 28);
+}
+
 int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask)
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
@@ -712,28 +731,14 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_RNR_RETRY) {
-		qp_context->pri_path.rnr_retry = attr->rnr_retry << 5;
-		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY);
+		qp_context->alt_path.rnr_retry = qp_context->pri_path.rnr_retry =
+			attr->rnr_retry << 5;
+		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY |
+							MTHCA_QP_OPTPAR_ALT_RNR_RETRY);
 	}
 
 	if (attr_mask & IB_QP_AV) {
-		qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f;
-		qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid);
-		qp_context->pri_path.static_rate = !!attr->ah_attr.static_rate;
-		if (attr->ah_attr.ah_flags & IB_AH_GRH) {
-			qp_context->pri_path.g_mylmc |= 1 << 7;
-			qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index;
-			qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit;
-			qp_context->pri_path.sl_tclass_flowlabel =
-				cpu_to_be32((attr->ah_attr.sl << 28) |
-					    (attr->ah_attr.grh.traffic_class << 20) |
-					    (attr->ah_attr.grh.flow_label));
-			memcpy(qp_context->pri_path.rgid,
-			       attr->ah_attr.grh.dgid.raw, 16);
-		} else {
-			qp_context->pri_path.sl_tclass_flowlabel =
-				cpu_to_be32(attr->ah_attr.sl << 28);
-		}
+		mthca_path_set(&attr->ah_attr, &qp_context->pri_path);
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH);
 	}
 
@@ -742,7 +747,19 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT);
 	}
 
-	/* XXX alt_path */
+	if (attr_mask & IB_QP_ALT_PATH) {
+		if (attr->alt_port_num == 0 || attr->alt_port_num > dev->limits.num_ports) {
+			mthca_dbg(dev, "Alternate port number (%u) is invalid\n",
+				  attr->alt_port_num);
+			return -EINVAL;
+		}
+
+		mthca_path_set(&attr->alt_ah_attr, &qp_context->alt_path);
+		qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index |
+							      attr->alt_port_num << 24);
+		qp_context->alt_path.ackto = attr->alt_timeout << 3;
+		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ALT_ADDR_PATH);
+	}
 
 	/* leave rdd as 0 */
 	qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num);
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:43 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:43 +0000
Subject: [openib-general] [git patch review 8/8] IB/uverbs: set ah_flags when creating address handle
In-Reply-To: <1136593543000-bf2926ca65fa9af8@cisco.com>
Message-ID: <1136593543000-e3ddf87c14250050@cisco.com>

AH attribute's ah_flags need to be set according to the is_global flag
passed in from userspace.

Signed-off-by: Roland Dreier

---

 drivers/infiniband/core/uverbs_cmd.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

ea5d4a6ad2bfd1006790666981645cab43d3afbd
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6985a57..12d6cc0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1454,6 +1454,7 @@ ssize_t ib_uverbs_create_ah(struct ib_uv
 	attr.sl = cmd.attr.sl;
 	attr.src_path_bits = cmd.attr.src_path_bits;
 	attr.static_rate = cmd.attr.static_rate;
+	attr.ah_flags = cmd.attr.is_global ? IB_AH_GRH : 0;
 	attr.port_num = cmd.attr.port_num;
 	attr.grh.flow_label = cmd.attr.grh.flow_label;
 	attr.grh.sgid_index = cmd.attr.grh.sgid_index;
--
0.99.9n

From rolandd at cisco.com Fri Jan 6 16:25:42 2006
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 07 Jan 2006 00:25:42 +0000
Subject: [openib-general] [git patch review 4/8] IB/mthca: multiple fixes for multicast group handling
In-Reply-To: <1136593542999-f3246e38ef6bb0f5@cisco.com>
Message-ID: <1136593542999-1260d5ab9345c4eb@cisco.com>

Multicast group management fixes:

. Fix leak of mailbox memory in error handling on multicast group
  operations.

. Free AMGM indices at detach and in attach error handling.

. Fix amount to shift for aligning next_gid_index in mailbox: it
  starts at bit 6, not bit 5.

. Allocate AMGM index after end of MGM table, in the range num_mgms
  to multicast table size - 1. Add some BUG_ON checks to catch cases
  where the index falls in the MGM hash area.

. Initialize the list of QPs in a newly-allocated group from AMGM to 0
  This is necessary since when a group is moved from AMGM to MGM (in
  the case where the MGM entry has been emptied of QPs), the AMGM
  entry is not reset to 0 (and we don't want an extra command to do
  that).

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

 drivers/infiniband/hw/mthca/mthca_mcg.c | 54 ++++++++++++++++++++-----------
 1 files changed, 35 insertions(+), 19 deletions(-)

5ceb74557c71465cf8f6fda050aac00e53f9ad3d
diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c
index 2fc449d..77bc6c7 100644
--- a/drivers/infiniband/hw/mthca/mthca_mcg.c
+++ b/drivers/infiniband/hw/mthca/mthca_mcg.c
@@ -111,7 +111,8 @@ static int find_mgm(struct mthca_dev *de
 		goto out;
 	if (status) {
 		mthca_err(dev, "READ_MGM returned status %02x\n", status);
-		return -EINVAL;
+		err = -EINVAL;
+		goto out;
 	}
 
 	if (!memcmp(mgm->gid, zero_gid, 16)) {
@@ -126,7 +127,7 @@ static int find_mgm(struct mthca_dev *de
 			goto out;
 
 		*prev = *index;
-		*index = be32_to_cpu(mgm->next_gid_index) >> 5;
+		*index = be32_to_cpu(mgm->next_gid_index) >> 6;
 	} while (*index);
 
 	*index = -1;
@@ -153,8 +154,10 @@ int mthca_multicast_attach(struct ib_qp
 		return PTR_ERR(mailbox);
 	mgm = mailbox->buf;
 
-	if (down_interruptible(&dev->mcg_table.sem))
-		return -EINTR;
+	if (down_interruptible(&dev->mcg_table.sem)) {
+		err = -EINTR;
+		goto err_sem;
+	}
 
 	err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index);
 	if (err)
@@ -181,9 +184,8 @@ int mthca_multicast_attach(struct ib_qp
 			err = -EINVAL;
 			goto out;
 		}
-
+		memset(mgm, 0, sizeof *mgm);
 		memcpy(mgm->gid, gid->raw, 16);
-		mgm->next_gid_index = 0;
 	}
 
 	for (i = 0; i < MTHCA_QP_PER_MGM; ++i)
@@ -209,6 +211,7 @@ int mthca_multicast_attach(struct ib_qp
 	if (status) {
 		mthca_err(dev, "WRITE_MGM returned status %02x\n", status);
 		err = -EINVAL;
+		goto out;
 	}
 
 	if (!link)
@@ -223,7 +226,7 @@ int mthca_multicast_attach(struct ib_qp
 		goto out;
 	}
 
-	mgm->next_gid_index = cpu_to_be32(index << 5);
+	mgm->next_gid_index = cpu_to_be32(index << 6);
 
 	err = mthca_WRITE_MGM(dev, prev, mailbox, &status);
 	if (err)
@@ -234,7 +237,12 @@ int mthca_multicast_attach(struct ib_qp
 	}
 
  out:
+	if (err && link && index != -1) {
+		BUG_ON(index < dev->limits.num_mgms);
+		mthca_free(&dev->mcg_table.alloc, index);
+	}
 	up(&dev->mcg_table.sem);
+ err_sem:
 	mthca_free_mailbox(dev, mailbox);
 	return err;
 }
@@ -255,8 +263,10 @@ int mthca_multicast_detach(struct ib_qp
 		return PTR_ERR(mailbox);
 	mgm = mailbox->buf;
 
-	if (down_interruptible(&dev->mcg_table.sem))
-		return -EINTR;
+	if (down_interruptible(&dev->mcg_table.sem)) {
+		err = -EINTR;
+		goto err_sem;
+	}
 
 	err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index);
 	if (err)
@@ -305,13 +315,11 @@ int mthca_multicast_detach(struct ib_qp
 	if (i != 1)
 		goto out;
 
-	goto out;
-
 	if (prev == -1) {
 		/* Remove entry from MGM */
-		if (be32_to_cpu(mgm->next_gid_index) >> 5) {
-			err = mthca_READ_MGM(dev,
-					     be32_to_cpu(mgm->next_gid_index) >> 5,
+		int amgm_index_to_free = be32_to_cpu(mgm->next_gid_index) >> 6;
+		if (amgm_index_to_free) {
+			err = mthca_READ_MGM(dev, amgm_index_to_free,
 					     mailbox, &status);
 			if (err)
 				goto out;
@@ -332,9 +340,13 @@ int mthca_multicast_detach(struct ib_qp
 			err = -EINVAL;
 			goto out;
 		}
+		if (amgm_index_to_free) {
+			BUG_ON(amgm_index_to_free < dev->limits.num_mgms);
+			mthca_free(&dev->mcg_table.alloc, amgm_index_to_free);
+		}
 	} else {
 		/* Remove entry from AMGM */
-		index = be32_to_cpu(mgm->next_gid_index) >> 5;
+		int curr_next_index = be32_to_cpu(mgm->next_gid_index) >> 6;
 		err = mthca_READ_MGM(dev, prev, mailbox, &status);
 		if (err)
 			goto out;
@@ -344,7 +356,7 @@ int mthca_multicast_detach(struct ib_qp
 			goto out;
 		}
 
-		mgm->next_gid_index = cpu_to_be32(index << 5);
+		mgm->next_gid_index = cpu_to_be32(curr_next_index << 6);
 
 		err = mthca_WRITE_MGM(dev, prev, mailbox, &status);
 		if (err)
@@ -354,10 +366,13 @@ int mthca_multicast_detach(struct ib_qp
 			err = -EINVAL;
 			goto out;
 		}
+		BUG_ON(index < dev->limits.num_mgms);
+		mthca_free(&dev->mcg_table.alloc, index);
 	}
 
  out:
 	up(&dev->mcg_table.sem);
+ err_sem:
 	mthca_free_mailbox(dev, mailbox);
 	return err;
 }
@@ -365,11 +380,12 @@ int mthca_multicast_detach(struct ib_qp
 int __devinit mthca_init_mcg_table(struct mthca_dev *dev)
 {
 	int err;
+	int table_size
= dev->limits.num_mgms + dev->limits.num_amgms; err = mthca_alloc_init(&dev->mcg_table.alloc, - dev->limits.num_amgms, - dev->limits.num_amgms - 1, - 0); + table_size, + table_size - 1, + dev->limits.num_mgms); if (err) return err; -- 0.99.9n From huanwei at cse.ohio-state.edu Fri Jan 6 16:31:51 2006 From: huanwei at cse.ohio-state.edu (wei huang) Date: Fri, 6 Jan 2006 19:31:51 -0500 (EST) Subject: [openib-general] *** glibc detected *** corrupted double-linked list error In-Reply-To: Message-ID: Yeah ... We are also suspecting this. But I have no clue yet. That's why I ask you what might be the cause for this error. Anyway, we will try to apply your suggestion and see what happens. Thanks! Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Fri, 6 Jan 2006, Roland Dreier wrote: > It seems that the free() in mthca_free_db_tab() is detecting some > corruption in the internal glibc data structures. However I don't see > any obvious bug around here. Is it possible that something in your > application is causing corruption in the glibc allocator before this? > > You could try running the application with the environment variable > MALLOC_CHECK_ set to 1 and see what it prints. > > - R. > From rdreier at cisco.com Fri Jan 6 16:42:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 16:42:40 -0800 Subject: [openib-general] Re: [PATCH] core/ib_uverbs: fix error flow in ib_uverbs_create_cq In-Reply-To: <20051219131736.GA8822@mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 Dec 2005 15:17:36 +0200") References: <20051219131736.GA8822@mellanox.co.il> Message-ID: Good catch. I think we also should fail the create CQ operation if userspace asks for a completion channel and we don't find it, so I committed the patch below. - R. 
--- infiniband/core/uverbs_cmd.c (revision 4798) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -594,13 +594,18 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - if (cmd.comp_channel >= 0) - ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); if (!uobj) return -ENOMEM; + if (cmd.comp_channel >= 0) { + ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); + if (!ev_file) { + ret = -EINVAL; + goto err; + } + } + uobj->uobject.user_handle = cmd.user_handle; uobj->uobject.context = file->ucontext; uobj->uverbs_file = file; @@ -664,6 +669,8 @@ err_up: ib_destroy_cq(cq); err: + if (ev_file) + ib_uverbs_release_ucq(file, ev_file, uobj); kfree(uobj); return ret; } From rdreier at cisco.com Fri Jan 6 16:46:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 16:46:07 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fill vendor_err in completion with error In-Reply-To: <20051226121018.GS4907@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 26 Dec 2005 14:10:18 +0200") References: <20051226121018.GS4907@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Fri Jan 6 16:52:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 16:52:15 -0800 Subject: [openib-general] Re: [PATCH] libmthca: race condition fix in In-Reply-To: <20060105130359.GH2790@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 5 Jan 2006 15:03:59 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7429@mtlexch01.mtl.com> <20060105130359.GH2790@mellanox.co.il> Message-ID: Thanks, excellent catch. Applied. 
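A note on the ib_uverbs_create_cq() error-flow fix discussed earlier in this thread: the pattern is to allocate the object first, look up the optional completion channel second, treat a failed lookup as -EINVAL, and drop the channel reference on every later error path. What follows is a minimal userspace sketch of that pattern under stated assumptions, not the kernel code. The names struct channel, struct cq_obj, lookup_channel(), release_channel() and the fail_late parameter are hypothetical stand-ins invented for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real uverbs types. */
struct channel { int refs; };
struct cq_obj  { struct channel *chan; };

/* Hypothetical lookup: returns the channel for fd with a reference taken,
 * or NULL if fd does not name a channel. */
static struct channel *lookup_channel(struct channel *table, int n, int fd)
{
	if (fd < 0 || fd >= n)
		return NULL;
	table[fd].refs++;
	return &table[fd];
}

/* Drop the reference taken by lookup_channel(). */
static void release_channel(struct channel *chan)
{
	chan->refs--;
}

/*
 * fd < 0 means "no completion channel requested".
 * fail_late simulates an error that happens after the lookup succeeded.
 */
static int create_cq(struct channel *table, int n, int fd, int fail_late,
		     struct cq_obj **out)
{
	struct channel *chan = NULL;
	struct cq_obj *obj;
	int ret;

	obj = malloc(sizeof *obj);
	if (!obj)
		return -ENOMEM;

	if (fd >= 0) {
		chan = lookup_channel(table, n, fd);
		if (!chan) {
			/* Channel was requested but not found: fail,
			 * do not silently create a CQ without it. */
			ret = -EINVAL;
			goto err;
		}
	}

	if (fail_late) {
		ret = -EIO;
		goto err;
	}

	obj->chan = chan;
	*out = obj;
	return 0;

err:
	/* Single exit path: release the channel only if we hold it. */
	if (chan)
		release_channel(chan);
	free(obj);
	return ret;
}
```

Because chan starts out as NULL and is only set after a successful lookup, the single err label can serve both the "lookup failed" and the "failed after lookup" cases without leaking a reference.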
From rdreier at cisco.com Fri Jan 6 16:55:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 16:55:45 -0800 Subject: [openib-general] minor bug in pingpong programs In-Reply-To: <1135808092.5081.7.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Wed, 28 Dec 2005 14:14:52 -0800") References: <1135808092.5081.7.camel@brick.internal.keyresearch.com> Message-ID: Thanks, applied. From rdreier at cisco.com Fri Jan 6 17:02:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 06 Jan 2006 17:02:34 -0800 Subject: [openib-general] Re: srq_pingpong with many QPs and events may never ends In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA2E7B@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 28 Dec 2005 16:52:33 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA2E7B@mtlexch01.mtl.com> Message-ID: Thanks, I applied a slightly different version of this. I preferred not to rename wc -> wc_arr, and I added an error if there are not enough receives for all the QPs. Please let me know if I made any mistake here. Index: libibverbs/ChangeLog =================================================================== --- libibverbs/ChangeLog (revision 4802) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-01-06 Roland Dreier + + * examples/srq_pingpong.c (main): Fix SRQ example to avoid + problems with many QPs and events. Based on a patch from Dotan + Barak (who also found the problem). 
+ 2006-01-06 Ralph Campbell * examples/rc_pingpong.c (main), examples/srq_pingpong.c (main), Index: libibverbs/examples/srq_pingpong.c =================================================================== --- libibverbs/examples/srq_pingpong.c (revision 4802) +++ libibverbs/examples/srq_pingpong.c (working copy) @@ -511,6 +511,7 @@ int main(int argc, char *argv[]) { struct ibv_device **dev_list; struct ibv_device *ib_dev; + struct ibv_wc *wc; struct pingpong_context *ctx; struct pingpong_dest my_dest[MAX_QP]; struct pingpong_dest *rem_dest; @@ -526,6 +527,7 @@ int main(int argc, char *argv[]) int use_event = 0; int routs; int rcnt, scnt; + int num_wc; int i; srand48(getpid() * time(NULL)); @@ -603,6 +605,16 @@ int main(int argc, char *argv[]) return 1; } + if (num_qp > rx_depth) { + fprintf(stderr, "rx_depth %d is too small for %d QPs -- " + "must have at least one receive per QP.\n", + rx_depth, num_qp); + return 1; + } + + num_wc = num_qp + rx_depth; + wc = alloca(num_wc * sizeof *wc); + page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); @@ -714,11 +726,10 @@ int main(int argc, char *argv[]) } { - struct ibv_wc wc[2]; int ne, qp_ind; do { - ne = ibv_poll_cq(ctx->cq, 2, wc); + ne = ibv_poll_cq(ctx->cq, num_wc, wc); if (ne < 0) { fprintf(stderr, "poll CQ failed %d\n", ne); return 1; @@ -745,7 +756,7 @@ int main(int argc, char *argv[]) break; case PINGPONG_RECV_WRID: - if (--routs <= 1) { + if (--routs <= num_qp) { routs += pp_post_recv(ctx, ctx->rx_depth - routs); if (routs < ctx->rx_depth) { fprintf(stderr, From rolandd at cisco.com Sat Jan 7 09:07:08 2006 From: rolandd at cisco.com (Roland Dreier) Date:
Sat, 07 Jan 2006 17:07:08 +0000 Subject: [openib-general] [git patch review 2/2] IB: Set GIDs correctly in ib_create_ah_from_wc() In-Reply-To: <1136653628658-9dad0e46bb1d8cba@cisco.com> Message-ID: <1136653628658-5a13a5a36ef5dae2@cisco.com> ib_create_ah_from_wc() doesn't create the correct return address (AH) when there is a GRH present (source & dest GIDs need to be swapped). Signed-off-by: Ralph Campbell Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- drivers/infiniband/core/verbs.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) 4f8448dfe8d3804fadad90c9b77494238b4a4eae diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 4c15e11..c857361 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -107,9 +107,9 @@ struct ib_ah *ib_create_ah_from_wc(struc if (wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; - ah_attr.grh.dgid = grh->dgid; + ah_attr.grh.dgid = grh->sgid; - ret = ib_find_cached_gid(pd->device, &grh->sgid, &port_num, + ret = ib_find_cached_gid(pd->device, &grh->dgid, &port_num, &gid_index); if (ret) return ERR_PTR(ret); -- 0.99.9n From rolandd at cisco.com Sat Jan 7 09:07:08 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 07 Jan 2006 17:07:08 +0000 Subject: [openib-general] [git patch review 1/2] IB/uverbs: Release event file reference on ib_uverbs_create_cq() error Message-ID: <1136653628658-9dad0e46bb1d8cba@cisco.com> ib_uverbs_create_cq() should release the completion channel event file if an error occurs after it looks it up. Also, if userspace asks for a completion channel and we don't find it, an error should be returned instead of silently creating a CQ without a completion channel. 
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs_cmd.c | 13 ++++++++++--- 1 files changed, 10 insertions(+), 3 deletions(-) ac4e7b35579de55db50d602a472858867808a9c3 diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 12d6cc0..a02c5a0 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -594,13 +594,18 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - if (cmd.comp_channel >= 0) - ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); if (!uobj) return -ENOMEM; + if (cmd.comp_channel >= 0) { + ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); + if (!ev_file) { + ret = -EINVAL; + goto err; + } + } + uobj->uobject.user_handle = cmd.user_handle; uobj->uobject.context = file->ucontext; uobj->uverbs_file = file; @@ -664,6 +669,8 @@ err_up: ib_destroy_cq(cq); err: + if (ev_file) + ib_uverbs_release_ucq(file, ev_file, uobj); kfree(uobj); return ret; } -- 0.99.9n From jackm at mellanox.co.il Sat Jan 7 23:02:48 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 8 Jan 2006 09:02:48 +0200 Subject: [openib-general] RE: [PATCH] core/ib_uverbs: fix error flow in ib_uverbs_create_cq Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3010C7637@mtlexch01.mtl.com> Agreed, good catch! Jack -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Saturday, January 07, 2006 2:43 AM To: Jack Morgenstein Cc: rolandd at cisco.com; openib-general at openib.org Subject: Re: [PATCH] core/ib_uverbs: fix error flow in ib_uverbs_create_cq Good catch. I think we also should fail the create CQ operation if userspace asks for a completion channel and we don't find it, so I committed the patch below. - R. 
--- infiniband/core/uverbs_cmd.c (revision 4798) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -594,13 +594,18 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - if (cmd.comp_channel >= 0) - ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); if (!uobj) return -ENOMEM; + if (cmd.comp_channel >= 0) { + ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); + if (!ev_file) { + ret = -EINVAL; + goto err; + } + } + uobj->uobject.user_handle = cmd.user_handle; uobj->uobject.context = file->ucontext; uobj->uverbs_file = file; @@ -664,6 +669,8 @@ err_up: ib_destroy_cq(cq); err: + if (ev_file) + ib_uverbs_release_ucq(file, ev_file, uobj); kfree(uobj); return ret; } From panda at cse.ohio-state.edu Sun Jan 8 06:10:04 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 8 Jan 2006 09:10:04 -0500 (EST) Subject: [openib-general] We are ready with a Gen2 version of MVAPICH2 Message-ID: <200601081410.k08EA53K002400@xi.cse.ohio-state.edu> Hi Roland, Hal and Woody, This is to let you know that we are ready with a Gen2 version of MVAPICH2. We plan to upload a stripped down version of this new release to the OpenIB SVN at the following location: https://openib.org/svn/gen2/trunk/src/userspace/mpi/ We will create a mvapich2-gen2 directory under the above path and upload the files. Please let us know if the above path is correct and we will proceed. 
Thanks, DK From rdreier at cisco.com Sun Jan 8 14:25:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jan 2006 14:25:25 -0800 Subject: [openib-general] [git pull] InfiniBand updates Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Dotan Barak: IB/mthca: Add support for automatic path migration (APM) Jack Morgenstein: IB/mthca: fix QP size limits for mem-free HCAs IB/umad: fix memory leaks IB/mthca: fix memory user DB table leak IB/mthca: check RDMA limits IB/mthca: correct log2 calculation IB/mthca: don't change driver's copy of attributes if modify QP fails IB/mthca: Fix SRQ cleanup during QP destroy IB/mthca: Fix IB_QP_ACCESS_FLAGS handling. IB/mthca: Fix corner cases in max_rd_atomic value handling in modify QP IB/mthca: fix WQE size calculation in create-srq IB/mthca: check return value in mthca_dev_lim call IB/mthca: check port validity in modify_qp IB/mthca: max_inline_data handling tweaks IB/mthca: fix for SQEr-to-RTS transition in modify QP IB/mthca: fix for RTR-to-RTS transition in modify QP IB/mthca: multiple fixes for multicast group handling IB/uverbs: Fix reference counting on error paths IB/uverbs: Release event file reference on ib_uverbs_create_cq() error Michael S. 
Tsirkin: IB/mthca: Fix thinko in mthca_table_find() IB/mthca: create_eq with size not a power of 2 IB/mthca: Fill in vendor_err field in completion with error Ralph Campbell: IB/uverbs: set ah_flags when creating address handle IB: Set GIDs correctly in ib_create_ah_from_wc() Sean Hefty: IB/cm: correct reported reject code IB/cm: avoid reusing local ID drivers/infiniband/core/cm.c | 16 +- drivers/infiniband/core/user_mad.c | 4 drivers/infiniband/core/uverbs_cmd.c | 21 ++ drivers/infiniband/core/verbs.c | 4 drivers/infiniband/hw/mthca/mthca_cmd.c | 12 + drivers/infiniband/hw/mthca/mthca_cq.c | 23 ++ drivers/infiniband/hw/mthca/mthca_eq.c | 4 drivers/infiniband/hw/mthca/mthca_main.c | 4 drivers/infiniband/hw/mthca/mthca_mcg.c | 54 ++++-- drivers/infiniband/hw/mthca/mthca_memfree.c | 4 drivers/infiniband/hw/mthca/mthca_qp.c | 265 +++++++++++++++------------ drivers/infiniband/hw/mthca/mthca_srq.c | 2 12 files changed, 250 insertions(+), 163 deletions(-) From rdreier at cisco.com Sun Jan 8 14:37:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jan 2006 14:37:20 -0800 Subject: [openib-general] Re: We are ready with a Gen2 version of MVAPICH2 In-Reply-To: <200601081410.k08EA53K002400@xi.cse.ohio-state.edu> (Dhabaleswar Panda's message of "Sun, 8 Jan 2006 09:10:04 -0500 (EST)") References: <200601081410.k08EA53K002400@xi.cse.ohio-state.edu> Message-ID: Dhabaleswar> Hi Roland, Hal and Woody, This is to let you know Dhabaleswar> that we are ready with a Gen2 version of MVAPICH2. Dhabaleswar> We plan to upload a stripped down version of this new Dhabaleswar> release to the OpenIB SVN at the following location: Dhabaleswar> https://openib.org/svn/gen2/trunk/src/userspace/mpi/ Dhabaleswar> We will create a mvapich2-gen2 directory under the Dhabaleswar> above path and upload the files. This is exactly where your existing source tree is checked in. I assume you are planning on simply checking in the changes since your last update. If so, that should be fine. 
However, let me reiterate what I said the last time this came up. The openib subversion repository should not be used simply as a distribution channel; if you are not using the subversion repository as your real source code control system, then I don't think it's appropriate to dump code drops in every few months. - R. From panda at cse.ohio-state.edu Sun Jan 8 16:25:59 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 8 Jan 2006 19:25:59 -0500 (EST) Subject: [openib-general] Re: We are ready with a Gen2 version of MVAPICH2 In-Reply-To: from "Roland Dreier" at Jan 08, 2006 02:37:20 PM Message-ID: <200601090025.k090Px91013968@xi.cse.ohio-state.edu> Hi Roland, Thanks for your prompt reply. > Dhabaleswar> Hi Roland, Hal and Woody, This is to let you know > Dhabaleswar> that we are ready with a Gen2 version of MVAPICH2. > > Dhabaleswar> We plan to upload a stripped down version of this new > Dhabaleswar> release to the OpenIB SVN at the following location: > > Dhabaleswar> https://openib.org/svn/gen2/trunk/src/userspace/mpi/ > > Dhabaleswar> We will create a mvapich2-gen2 directory under the > Dhabaleswar> above path and upload the files. > > This is exactly where your existing source tree is checked in. I > assume you are planning on simply checking in the changes since your > last update. If so, that should be fine. I believe there is a minor confusion here. We have been continuously updating the mvapich-gen2 version at the existing location. There is no problem here. What I am talking about here is the new mvapich2-gen2 (MPI-2) version we have developed. The existing one at the SVN is the mvapich-gen2 (MPI-1) version. Thus, we will have two directories there (one for MPI-1 version and the other one for MPI-2 version). Let me know if this clarifies the situation. > However, let me reiterate what I said the last time this came up. 
The > openib subversion repository should not be used simply as a > distribution channel; if you are not using the subversion repository > as your real source code control system, then I don't think it's > appropriate to dump code drops in every few months. Thanks for reiterating the comments. I completely agree with it. As indicated above, for the existing mvapich-gen2 (MPI-1) version, we have been using the SVN repository as the real source code control system (not as the distribution channel) and we plan to continue this trend. We plan to do the same with the new mvapich2-gen2 (MPI-2) version with the SVN repository too. Let me know if you have any additional questions. Otherwise, we will proceed with the check-in of the new MPI-2 version at the above location under a new directory. Thanks, DK > - R. > From rdreier at cisco.com Sun Jan 8 16:49:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jan 2006 16:49:10 -0800 Subject: [openib-general] Re: We are ready with a Gen2 version of MVAPICH2 In-Reply-To: <200601090025.k090Px91013968@xi.cse.ohio-state.edu> (Dhabaleswar Panda's message of "Sun, 8 Jan 2006 19:25:59 -0500 (EST)") References: <200601090025.k090Px91013968@xi.cse.ohio-state.edu> Message-ID: Dhabaleswar> What I am talking about here is the new mvapich2-gen2 Dhabaleswar> (MPI-2) version we have developed. The existing one Dhabaleswar> at the SVN is the mvapich-gen2 (MPI-1) version. Thus, Dhabaleswar> we will have two directories there (one for MPI-1 Dhabaleswar> version and the other one for MPI-2 version). Sorry for the confusion -- I missed the 2nd "2" in the directory name. Yes, that is definitely fine. Dhabaleswar> Thanks for reiterating the comments. I completely Dhabaleswar> agree with it. 
As indicated above, for the existing Dhabaleswar> mvapich-gen2 (MPI-1) version, we have been using the Dhabaleswar> SVN repository as the real source code control system Dhabaleswar> (not as the distribution channel) and we plan to Dhabaleswar> continue this trend. We plan to do the same with the Dhabaleswar> new mvapich2-gen2 (MPI-2) version with the SVN Dhabaleswar> repository too. Hmm... the full diffstat of all the changes to mpid made since the initial checkin is: $ svn diff -r3157 mpi/mvapich-gen2/mpid/ch_gen2|diffstat -p2 mpid/ch_gen2/dreg.c | 1 mpid/ch_gen2/ibverbs_const.h | 1 mpid/ch_gen2/vbuf.h | 2 - mpid/ch_gen2/viainit.c | 51 +++++++++++++++++++++++++++++++++++++------ mpid/ch_gen2/viaparam.c | 11 ++++++++- mpid/ch_gen2/viaparam.h | 1 6 files changed, 57 insertions(+), 10 deletions(-) Is that really all the MPI-1 development you've done since August '05? (There are also some big changes like deleting most of the extra files and renaming a few documents, but that's all the real changes that I see) - R. From panda at cse.ohio-state.edu Sun Jan 8 18:12:54 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 8 Jan 2006 21:12:54 -0500 (EST) Subject: [openib-general] Re: We are ready with a Gen2 version of MVAPICH2 In-Reply-To: from "Roland Dreier" at Jan 08, 2006 04:49:10 PM Message-ID: <200601090212.k092CtYd015963@xi.cse.ohio-state.edu> Hi Roland, > Dhabaleswar> What I am talking about here is the new mvapich2-gen2 > Dhabaleswar> (MPI-2) version we have developed. The existing one > Dhabaleswar> at the SVN is the mvapich-gen2 (MPI-1) version. Thus, > Dhabaleswar> we will have two directories there (one for MPI-1 > Dhabaleswar> version and the other one for MPI-2 version). > > Sorry for the confusion -- I missed the 2nd "2" in the directory > name. Yes, that is definitely fine. Thanks. > Dhabaleswar> Thanks for reiterating the comments. I completely > Dhabaleswar> agree with it. 
As indicated above, for the existing > Dhabaleswar> mvapich-gen2 (MPI-1) version, we have been using the > Dhabaleswar> SVN repository as the real source code control system > Dhabaleswar> (not as the distribution channel) and we plan to > Dhabaleswar> continue this trend. We plan to do the same with the > Dhabaleswar> new mvapich2-gen2 (MPI-2) version with the SVN > Dhabaleswar> repository too. > > Hmm... the full diffstat of all the changes to mpid made since the > initial checkin is: > > $ svn diff -r3157 mpi/mvapich-gen2/mpid/ch_gen2|diffstat -p2 > mpid/ch_gen2/dreg.c | 1 > mpid/ch_gen2/ibverbs_const.h | 1 > mpid/ch_gen2/vbuf.h | 2 - > mpid/ch_gen2/viainit.c | 51 +++++++++++++++++++++++++++++++++++++------ > mpid/ch_gen2/viaparam.c | 11 ++++++++- > mpid/ch_gen2/viaparam.h | 1 > 6 files changed, 57 insertions(+), 10 deletions(-) > > Is that really all the MPI-1 development you've done since August '05? > (There are also some big changes like deleting most of the extra files > and renaming a few documents, but that's all the real changes that I see) Yes, these are all the changes done to the `gen2' version of mvapich so far. FYI, currently, there are two different code bases for mvapich: one for `vapi' and one for `gen2' with different version numbers and features. They have evolved like that in the past and currently we are working on unifying these two code bases into a single one so that all features will be available for both vapi and gen2 stacks. This unified code base will be available in early February and will be updated at the OpenIB SVN. The extra files were deleted after initial check-in because OpenIB folks wanted a stripped down version with a smaller code size. On the vapi-side (on the vapi code base), a lot of changes have been happening continuously (including the latest 0.9.6 release made in early December). However, since this code base is different and can not work with the Gen2 stack, these changes have not been reflected on the OpenIB SVN. 
After these two code bases are unified in early February, all features/changes will be reflected on both vapi and gen2 stacks and will be reflected on the OpenIB SVN. After that you will see frequent changes there.

On the mvapich2 side, we have taken a different approach wrt gen2. The new mvapich2-gen2 version (which we are trying to put at the OpenIB SVN) has the unified code base for both vapi and gen2 (i.e., all features available on both stacks). Thus, all future changes (features and bug fixes) will be reflected on this unified version at OpenIB SVN.

Hope this clarifies your questions. Let me know if you have any additional questions.

Thanks,

DK

> - R.
>

From nacc at us.ibm.com Sun Jan 8 20:59:48 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Sun, 8 Jan 2006 20:59:48 -0800
Subject: [openib-general] Userspace testing results (many kernels, many svn trees)
Message-ID: <20060109045948.GH2064@us.ibm.com>

Hello all,

Here are more results, where each section's heading indicates server-client size (e.g. 32-64 is a 32-bit server and a 64-bit client), only related to userspace; that is, both machines are running (identical) 64-bit kernels. Each row is headed by the particular kernel which was booted on the two machines. (If the kernel is suffixed with svn, then it indicates that the OpenIB kernel components were used. If not, then it is a mainline tree without modification.) The userspace (and kernel, where applicable) svn revision is listed following the kernel version. This e-mail is ridiculously long, but there is *a lot* of data here... Enjoy!

FYI, all of the errors in the footnotes are from the client-side. If you would like to see what the server-side said, please just ask.

netpipe over IB

32-32
                                avg b/w (Mbps)  peak b/w (Mbps)
rdma_write
  2.6.15-rc7-git4-svn (4662)    1036.73         1839.97
  2.6.15-rc7-git5-svn (4663)    1036.18         1839.98
  2.6.15-rc7-git6-svn (4670)    1036.08         1839.99
  2.6.15-rc7-git6-svn (4692)    1035.65         1839.97
  2.6.15-svn (4714)             1036.25         1839.93
  2.6.15-svn (4785)             1035.81         1840
  2.6.15-svn (4789)             1036.48         1839.97
  2.6.15 (4789)                 1035.46         1839.98
  2.6.15-svn (4803)             1035.57         1840.01
  2.6.15 (4803)                 1035.67         1839.97
rdma_write with immediate
  Errors across the board [1]
send_recv
  Errors across the board [2]
send_recv with immediate
  Errors across the board [3]

32-64
                                avg b/w (Mbps)  peak b/w (Mbps)
rdma_write
  Errors across the board [4]
rdma_write with immediate
  Errors across the board [1]
send_recv
  Errors across the board [2]
send_recv with immediate
  Errors across the board [3]

64-32
                                avg b/w (Mbps)  peak b/w (Mbps)
rdma_write
  Errors across the board [4]
rdma_write with immediate
  Errors across the board [1]
send_recv
  Errors across the board [2]
send_recv with immediate
  Errors across the board [3]

64-64
                                avg b/w (Mbps)  peak b/w (Mbps)
rdma_write
  2.6.15-rc7-git4-svn (4662)    1037.22         1840.01
  2.6.15-rc7-git5-svn (4663)    Errors [5]
  2.6.15-rc7-git6-svn (4670)    1037.58         1839.99
  2.6.15-rc7-git6-svn (4692)    1037.28         1839.97
  2.6.15-svn (4714)             1036.83         1839.98
  2.6.15-svn (4785)             1037.32         1840.01
  2.6.15-svn (4789)             1036.99         1840
  2.6.15 (4789)                 1036.74         1839.97
  2.6.15-svn (4803)             1037.82         1839.95
  2.6.15 (4803)                 1038.05         1839.95
rdma_write with immediate
  Errors across the board [1]
send_recv
  Errors across the board [2]
send_recv with immediate
  Errors across the board [3]

pingpong

32-32
                                b/w (Mbps)
rc
  2.6.15-rc7-git4-svn (4662)    962.96
  2.6.15-rc7-git5-svn (4663)    959.49
  2.6.15-rc7-git6-svn (4670)    956.59
  2.6.15-rc7-git6-svn (4692)    961.30
  2.6.15-svn (4714)             957.58
  2.6.15-svn (4785)             961.83
  2.6.15-svn (4789)             962.79
  2.6.15-svn (4803)             960.01
  2.6.15 (4789)                 883.95
  2.6.15 (4803)                 957.97
srq
  2.6.15-rc7-git4-svn (4662)    3318.61
  2.6.15-rc7-git5-svn (4663)    3247.41
  2.6.15-rc7-git6-svn (4670)    3320.13
  2.6.15-rc7-git6-svn (4692)    3294.26
  2.6.15-svn (4714)             3267.16
  2.6.15-svn (4785)             3293.27
  2.6.15-svn (4789)             3347.09
  2.6.15 (4789)                 3277.46
  2.6.15-svn (4803)             3386.00
  2.6.15 (4803)                 3144.87
uc
  2.6.15-rc7-git4-svn (4662)    966.24
  2.6.15-rc7-git5-svn (4663)    968.46
  2.6.15-rc7-git6-svn (4670)    965.45
  2.6.15-rc7-git6-svn (4692)    964.91
  2.6.15-svn (4714)             970.53
  2.6.15-svn (4785)             962.19
  2.6.15-svn (4789)             967.02
  2.6.15 (4789)                 969.64
  2.6.15-svn (4803)             964.62
  2.6.15 (4803)                 967.62
ud
  2.6.15-rc7-git4-svn (4662)    466.30
  2.6.15-rc7-git5-svn (4663)    465.32
  2.6.15-rc7-git6-svn (4670)    465.64
  2.6.15-rc7-git6-svn (4692)    465.42
  2.6.15-svn (4714)             463.22
  2.6.15-svn (4785)             464.58
  2.6.15-svn (4789)             465.56
  2.6.15 (4789)                 465.97
  2.6.15-svn (4803)             462.77
  2.6.15 (4803)                 462.94

32-64
                                b/w (Mbps)
rc
  2.6.15-rc7-git4-svn (4662)    975.59
  2.6.15-rc7-git5-svn (4663)    971.97
  2.6.15-rc7-git6-svn (4670)    972.89
  2.6.15-rc7-git6-svn (4692)    974.41
  2.6.15-svn (4714)             968.46
  2.6.15-svn (4785)             972.20
  2.6.15-svn (4789)             967.26
  2.6.15 (4789)                 973.15
  2.6.15-svn (4803)             968.55
  2.6.15 (4803)                 968.26
srq
  2.6.15-rc7-git4-svn (4662)    3347.26
  2.6.15-rc7-git5-svn (4663)    3301.56
  2.6.15-rc7-git6-svn (4670)    Error [5]
  2.6.15-rc7-git6-svn (4692)    3330.42
  2.6.15-svn (4714)             3337.54
  2.6.15-svn (4785)             334.15
  2.6.15-svn (4789)             3378.32
  2.6.15 (4789)                 3397.06
  2.6.15-svn (4803)             3427.79
  2.6.15 (4803)                 3429.95
uc
  2.6.15-rc7-git4-svn (4662)    977.55
  2.6.15-rc7-git5-svn (4663)    963.25
  2.6.15-rc7-git6-svn (4670)    979.67
  2.6.15-rc7-git6-svn (4692)    977.61
  2.6.15-svn (4714)             976.04
  2.6.15-svn (4785)             974.37
  2.6.15-svn (4789)             974.40
  2.6.15 (4789)                 978.41
  2.6.15-svn (4803)             975.75
  2.6.15 (4803)                 977.76
ud
  2.6.15-rc7-git4-svn (4662)    469.33
  2.6.15-rc7-git5-svn (4663)    468.64
  2.6.15-rc7-git6-svn (4670)    468.90
  2.6.15-rc7-git6-svn (4692)    468.07
  2.6.15-svn (4714)             468.14
  2.6.15-svn (4785)             467.97
  2.6.15-svn (4789)             467.30
  2.6.15 (4789)                 468.58
  2.6.15-svn (4803)             462.95
  2.6.15 (4803)                 466.87

64-32
                                b/w (Mbps)
rc
  2.6.15-rc7-git4-svn (4662)    974.95
  2.6.15-rc7-git5-svn (4663)    972.01
  2.6.15-rc7-git6-svn (4670)    974.17
  2.6.15-rc7-git6-svn (4692)    970.75
  2.6.15-svn (4714)             970.04
  2.6.15-svn (4785)             964.25
  2.6.15-svn (4789)             973.72
  2.6.15 (4789)                 974.80
  2.6.15-svn (4803)             969.02
  2.6.15 (4803)                 968.62
srq
  2.6.15-rc7-git4-svn (4662)    3312.58
  2.6.15-rc7-git5-svn (4663)    3354.46
  2.6.15-rc7-git6-svn (4670)    3344.36
  2.6.15-rc7-git6-svn (4692)    3300.56
  2.6.15-svn (4714)             3337.71
  2.6.15-svn (4785)             3364.79
  2.6.15-svn (4789)             Error [5]
  2.6.15 (4789)                 3307.39
  2.6.15-svn (4803)             3430.13
  2.6.15 (4803)                 3415.47
uc
  2.6.15-rc7-git4-svn (4662)    973.99
  2.6.15-rc7-git5-svn (4663)    975.11
  2.6.15-rc7-git6-svn (4670)    981.45
  2.6.15-rc7-git6-svn (4692)    978.91
  2.6.15-svn (4714)             977.61
  2.6.15-svn (4785)             974.28
  2.6.15-svn (4789)             976.14
  2.6.15 (4789)                 975.51
  2.6.15-svn (4803)             973.96
  2.6.15 (4803)                 972.13
ud
  2.6.15-rc7-git4-svn (4662)    469.70
  2.6.15-rc7-git5-svn (4663)    467.81
  2.6.15-rc7-git6-svn (4670)    Error [5]
  2.6.15-rc7-git6-svn (4692)    467.97
  2.6.15-svn (4714)             469.01
  2.6.15-svn (4785)             468.09
  2.6.15-svn (4789)             468.41
  2.6.15 (4789)                 468.94
  2.6.15-svn (4803)             467.44
  2.6.15 (4803)                 467.53

64-64
                                b/w (Mbps)
rc
  2.6.15-rc7-git4-svn (4662)    980.93
  2.6.15-rc7-git5-svn (4663)    Error [5]
  2.6.15-rc7-git6-svn (4670)    982.37
  2.6.15-rc7-git6-svn (4692)    982.76
  2.6.15-svn (4714)             968.68
  2.6.15-svn (4785)             983.86
  2.6.15-svn (4789)             982.61
  2.6.15 (4789)                 868.61
  2.6.15-svn (4803)             982.68
  2.6.15 (4803)                 981.84
srq
  2.6.15-rc7-git4-svn (4662)    3379.36
  2.6.15-rc7-git5-svn (4663)    Error [5]
  2.6.15-rc7-git6-svn (4670)    3303.73
  2.6.15-rc7-git6-svn (4692)    3354.80
  2.6.15-svn (4714)             3345.55
  2.6.15-svn (4785)             3376.58
  2.6.15-svn (4789)             3408.89
  2.6.15 (4789)                 2546.47
  2.6.15-svn (4803)             3461.84
  2.6.15 (4803)                 3450.72
uc
  2.6.15-rc7-git4-svn (4662)    987.23
  2.6.15-rc7-git5-svn (4663)    Error [5]
  2.6.15-rc7-git6-svn (4670)    989.21
  2.6.15-rc7-git6-svn (4692)    989.24
  2.6.15-svn (4714)             985.30
  2.6.15-svn (4785)             984.91
  2.6.15-svn (4789)             986.27
  2.6.15 (4789)                 983.80
  2.6.15-svn (4803)             985.37
  2.6.15 (4803)                 990.13
ud
  2.6.15-rc7-git4-svn (4662)    472.34
  2.6.15-rc7-git5-svn (4663)    Error [5]
  2.6.15-rc7-git6-svn (4670)    471.73
  2.6.15-rc7-git6-svn (4692)    471.38
  2.6.15-svn (4714)             471.28
  2.6.15-svn (4785)             471.79
  2.6.15-svn (4789)             472.37
  2.6.15 (4789)                 471.68
  2.6.15-svn (4803)             471.04
  2.6.15 (4803)                 470.53

perftest latency [6]

32-32
                                typical (us)    min             max
rdma_lat
  2.6.15-rc7-git4-svn (4662)    3.26954e+09     3.21451e+09     9.55805e+10
  2.6.15-rc7-git5-svn (4663)    3.26015e+09     3.20109e+09     4.08654e+10
  2.6.15-rc7-git6-svn (4670)    3.26552e+09     3.21586e+09     4.27914e+10
  2.6.15-rc7-git6-svn (4692)    3.2682e+09      3.2172e+09      4.30169e+10
  2.6.15-svn (4714)             3.26686e+09     3.20512e+09     4.5814e+10
  2.6.15 (4785)                 3.27491e+09     3.21988e+09     4.00909e+10
  2.6.15-svn (4789)             3.26283e+09     3.20378e+09     4.42879e+10
  2.6.15 (4789)                 3.26686e+09     3.21586e+09     4.35323e+10
  2.6.15-svn (4803)             3.27491e+09     3.22659e+09     5.65581e+10
  2.6.15 (4803)                 3.27625e+09     3.2172e+09      3.89313e+10
read_lat
  2.6.15-svn (4803)             7274600857.60   7041062010.88   21923123691.52
  2.6.15 (4803)                 7282653921.28   7065221201.92   20535312384.00
send_lat
  2.6.15-svn (4803)             4954647429.12   4771440230.40   43512045240.21
  2.6.15 (4803)                 4969411379.20   4799625953.28   46604421693.32
write_lat
  2.6.15-svn (4803)             3291018690.56   3235989422.08   44894487838.72
  2.6.15 (4803)                 3286992158.72   3234647244.80   42607417753.60

32-64
                                typical (us)    min             max
rdma_lat
  2.6.15-rc7-git4-svn (4662)    0.750625        0.7375          10.2784
  2.6.15-rc7-git5-svn (4663)    0.753437        0.73875         10.1547
  2.6.15-rc7-git6-svn (4670)    0.754687        0.740938        10.4453
  2.6.15-rc7-git6-svn (4692)    0.753437        0.74            9.07281
  2.6.15-svn (4714)             0.75625         0.744375        9.80438
  2.6.15-svn (4785)             0.742812        0.74125         9.77406
  2.6.15-svn (4789)             0.756875        0.744687        9.83906
  2.6.15 (4789)                 0.753125        0.74            9.15125
  2.6.15-svn (4803)             0.753437        0.741875        12.2256
  2.6.15 (4803)                 0.75875         0.745           11.3766
read_lat
  2.6.15-svn (4803)             1.63            1.58            8.59
  2.6.15 (4803)                 1.63            1.59            8.98
send_lat
  2.6.15-svn (4803)             1.12            1.09            10.51
  2.6.15 (4803)                 1.13            1.09            9.44
write_lat
  2.6.15-svn (4803)             0.75            0.74            10.24
  2.6.15 (4803)                 0.76            0.74            11.84

64-32
                                typical (us)    min             max
rdma_lat
2.6.15-rc7-git4-svn (4662) 3.22659e+09 3.17693e+09 4.30155e+10 2.6.15-rc7-git5-svn (4663) 3.22257e+09 3.16485e+09 4.63898e+10 2.6.15-rc7-git6-svn (4670) 3.21586e+09 3.1662e+09 4.29565e+10 2.6.15-rc7-git6-svn (4692) 3.23733e+09 3.17022e+09 4.34558e+10 2.6.15-svn (4714) 3.22794e+09 3.17425e+09 1.09938e+11 2.6.15-svn (4785) 3.23499e+09 3.17828e+09 4.08023e+10 2.6.15-svn (4789) 3.22928e+09 3.17425e+09 5.84103e+10 2.6.15 (4789) 3.2172e+09 3.17425e+09 5.45797e+10 2.6.15-svn (4803) 3.22391e+09 3.17291e+09 4.00372e+10 2.6.15 (4803) 3.23331e+09 3.17559e+09 1.30541e+12 read_lat 2.6.15-svn (4803) 7296075694.08 7041062010.88 71653476270.08 2.6.15 (4803) 7282653921.28 7046430720.00 20787641712.64 send_lat 2.6.15-svn (4803) 4818416435.20 4674803466.24 46003126271.89 2.6.15 (4803) 4821100789.76 4689567416.32 42077257727.89 write_lat 2.6.15-svn (4803) 3245384663.04 3201092812.80 44909251788.80 2.6.15 (4803) 3244042485.76 3193039749.12 59760443392.00 64-64 typical (us) min max rdma_lat 2.6.15-rc7-git4-svn (4662) Errors [5] 2.6.15-rc7-git5-svn (4663) Errors [5] 2.6.15-rc7-git6-svn (4670) 0.745625 0.731563 10.5003 2.6.15-rc7-git6-svn (4692) 0.745313 0.73375 10.3 2.6.15-svn (4714) 0.747812 0.736875 24.7016 2.6.15-svn (4785) 0.745 0.735313 10.5237 2.6.15-svn (4789) 0.7425 0.73125 9.94781 2.6.15 (4789) 0.745938 0.736563 11.0213 2.6.15-svn (4803) 0.742188 0.730938 9.91563 2.6.15 (4803) 0.7425 0.730625 15.0122 read_lat 2.6.15-svn (4803) 1.63 1.58 8.61 2.6.15 (4803) 1.63 1.58 44.82 send_lat 2.6.15-svn (4803) 1.10 1.07 11.21 2.6.15 (4803) 1.10 1.07 10.89 write_lat 2.6.15-svn (4803) 0.74 0.73 11.00 2.6.15 (4803) 0.75 0.74 10.53 perftest bandwidth 32-32 peak (MBps) avg rdma_bw 2.6.15-rc7-git4-svn (4662) 4.34461e-07 4.34461e-07 2.6.15-rc7-git5-svn (4663) 4.34457e-07 4.34457e-07 2.6.15-rc7-git6-svn (4670) 4.34455e-07 4.34454e-07 2.6.15-rc7-git6-svn (4692) 4.3451e-07 4.3451e-07 2.6.15-svn (4714) 4.34508e-07 4.34508e-07 2.6.15 (4785) 4.34518e-07 4.34518e-07 2.6.15-svn (4789) 4.34452e-07 
4.34452e-07 2.6.15 (4789) 4.34453e-07 4.34453e-07 2.6.15-svn (4803) 4.34503e-07 4.34502e-07 2.6.15 (4803) 4.34446e-07 4.34446e-07 read_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 send_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 write_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 32-64 peak (MBps) avg rdma_bw 2.6.15-rc7-git4-svn (4662) 1866.19 1866.17 2.6.15-rc7-git5-svn (4663) 1866.16 1866.16 2.6.15-rc7-git6-svn (4670) 1865.88 1865.87 2.6.15-rc7-git6-svn (4692) 1865.98 1865.97 2.6.15-svn (4714) 1865.88 1865.86 2.6.15-svn (4785) 1866.16 1866.14 2.6.15-svn (4789) 1865.95 1865.95 2.6.15 (4789) 1865.95 1865.92 2.6.15-svn (4803) 1865.92 1865.89 2.6.15 (4803) 1865.92 1865.9 read_bw 2.6.15-svn (4803) 1840.20 1840.19 2.6.15 (4803) 1840.74 1840.71 send_bw 2.6.15-svn (4803) 1840.16 1840.16 2.6.15 (4803) 1840.37 1840.36 write_bw 2.6.15-svn (4803) 1841.55 1841.52 2.6.15 (4803) 1841.32 1841.31 64-32 peak (MBps) avg rdma_bw 2.6.15-rc7-git4-svn (4662) 4.3449e-07 4.34489e-07 2.6.15-rc7-git5-svn (4663) 4.34456e-07 4.34456e-07 2.6.15-rc7-git6-svn (4670) 4.34458e-07 4.34458e-07 2.6.15-rc7-git6-svn (4692) 4.34507e-07 4.34507e-07 2.6.15-svn (4714) 4.34463e-07 4.34463e-07 2.6.15-svn (4785) 4.34454e-07 4.34454e-07 2.6.15-svn (4789) 4.34455e-07 4.34455e-07 2.6.15 (4789) 4.34443e-07 4.34443e-07 2.6.15-svn (4803) 4.34458e-07 4.34458e-07 2.6.15 (4803) 4.34501e-07 4.34501e-07 read_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 send_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 write_bw 2.6.15-svn (4803) 0.00 0.00 2.6.15 (4803) 0.00 0.00 64-64 peak (MBps) avg rdma_bw 2.6.15-rc7-git4-svn (4662) Errors [5] 2.6.15-rc7-git5-svn (4663) Errors [5] 2.6.15-rc7-git6-svn (4670) 1865.95 1865.93 2.6.15-rc7-git6-svn (4692) 1865.95 1865.94 2.6.15-svn (4714) 1865.92 1865.91 2.6.15-svn (4785) 1865.92 1865.9 2.6.15-svn (4789) 1865.92 1865.89 2.6.15 (4789) 1866.16 1866.15 2.6.15-svn (4803) 1866.19 1866.16 2.6.15 (4803) 1865.95 1865.93 read_bw 2.6.15-svn (4803) 
1840.50 1840.48 2.6.15 (4803) 1840.47 1840.46 send_bw 2.6.15-svn (4803) 1840.40 1840.38 2.6.15 (4803) 1840.20 1840.20 write_bw 2.6.15-svn (4803) 1841.32 1841.31 2.6.15 (4803) 1841.35 1841.32 Thanks, Nish [1] Preposting asynchronous receives (required for Infiniband) Error, local polling may only be used with RDMA Write. Try using vapi polling or event completion Using RDMA Write communications with immediate data [2] Preposting asynchronous receives (required for Infiniband) Error, local polling may only be used with RDMA Write. Try using vapi polling or event completion Using Send/Receive communications [3] Preposting asynchronous receives (required for Infiniband) Error, local polling may only be used with RDMA Write. Try using vapi polling or event completion Using Send/Receive communications with immediate data [4] Preposting asynchronous receives (required for Infiniband) NetPIPE: error writing or reading synchronization string: Connection reset by peer Using RDMA Write communications This seems to be an actual error, not necessarily a programmatic one like [1]-[3]. It seems netPIPE over IB verbs is unhappy with mixing the size of the server client (does ok when they are the same, though). [5] These errors are due to our test grid, and thus are not indicative of errors in the code. [6] Just like with the results I posted earlier, all the perftest results are seriously wrong for 32-bit clients (with both 32-bit and 64-bit servers). I am not sure who else to notify beyond the general list (is there a corresponding MAINTAINERS files like in the kernel proper for the OpenIB code?) From mst at mellanox.co.il Sun Jan 8 22:33:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 08:33:01 +0200 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060109045948.GH2064@us.ibm.com> References: <20060109045948.GH2064@us.ibm.com> Message-ID: <20060109063300.GA19748@mellanox.co.il> Quoting r. 
Nishanth Aravamudan :
> Just like with the results I posted earlier, all the perftest results
> are seriously wrong for 32-bit clients (with both 32-bit and 64-bit
> servers). I am not sure who else to notify beyond the general list (is
> there a corresponding MAINTAINERS files like in the kernel proper for
> the OpenIB code?)

That would be me - sorry about the delay, I'll take a look at this.
Thanks a lot, Nishanth!
This work is very much appreciated.

--
MST

From nacc at us.ibm.com Sun Jan 8 22:50:12 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Sun, 8 Jan 2006 22:50:12 -0800
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060109063300.GA19748@mellanox.co.il>
References: <20060109045948.GH2064@us.ibm.com> <20060109063300.GA19748@mellanox.co.il>
Message-ID: <20060109065012.GJ2064@us.ibm.com>

On 09.01.2006 [08:33:01 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan :
> > Just like with the results I posted earlier, all the perftest results
> > are seriously wrong for 32-bit clients (with both 32-bit and 64-bit
> > servers). I am not sure who else to notify beyond the general list (is
> > there a corresponding MAINTAINERS files like in the kernel proper for
> > the OpenIB code?)
>
> That would be me - sorry about the delay, I'll take a look at this.
> Thanks a lot, Nishanth!
> This work is very much appreciated.

No worries, hope the problem is not too hard to fix :)

Thanks,
Nish

From yael at mellanox.co.il Mon Jan 9 04:29:12 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 09 Jan 2006 14:29:12 +0200
Subject: [openib-general] Re[PATCH] Opensm - fix for osm_sa_portinfo_record.c
Message-ID: <5zwth9fuuf.fsf@mtl066.yok.mtl.com>

Hi Hal,

During some tests we've noticed that not all compmask fields are
properly checked and compared in the portInfo record query.
Attached is a patch with the missing checks, along with some relevant
set/get functions added to ib_types.h.
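
As an illustration only (not part of the attached patch), the kind of
check being added can be sketched as below: a field is compared only
when its component-mask bit is set in the query. The names
demo_port_info_t, DEMO_COMPMASK_MASTERSMSL, demo_get_master_smsl and
demo_match, and the bit value, are hypothetical stand-ins and not the
real OpenSM definitions; only the low-nibble masking follows the
Master SMSL getter in the patch.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a PortInfo attribute. */
typedef struct {
    uint8_t mtu_smsl; /* low nibble holds the Master SMSL */
} demo_port_info_t;

/* Hypothetical component-mask bit selecting the Master SMSL field. */
#define DEMO_COMPMASK_MASTERSMSL (1ULL << 0)

/* Return the Master SMSL, stored in the low 4 bits of mtu_smsl. */
static inline uint8_t demo_get_master_smsl(const demo_port_info_t *p_pi)
{
    return (uint8_t)(p_pi->mtu_smsl & 0x0F);
}

/*
 * Return 1 if the record matches the query, 0 otherwise.
 * The field is compared only when its bit is set in comp_mask.
 */
static int demo_match(uint64_t comp_mask,
                      const demo_port_info_t *p_query,
                      const demo_port_info_t *p_rec)
{
    if ((comp_mask & DEMO_COMPMASK_MASTERSMSL) &&
        demo_get_master_smsl(p_query) != demo_get_master_smsl(p_rec))
        return 0;
    return 1;
}
```

With the mask bit clear the field is ignored, so records that differ
only in that nibble still match.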
Thanks, Yael Signed-off-by: Yael Kalka Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 4809) +++ include/iba/ib_types.h (working copy) @@ -3960,6 +3960,33 @@ ib_port_info_get_vl_cap( * * SEE ALSO *********/ +/****f* IBA Base: Types/ib_port_info_get_init_type +* NAME +* ib_port_info_get_init_type +* +* DESCRIPTION +* Gets the VL Capability of a port. +* +* SYNOPSIS +*/ +static inline uint8_t +ib_port_info_get_init_type( + IN const ib_port_info_t* const p_pi) +{ + return(p_pi->vl_cap & 0x0F); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. +* +* RETURN VALUES +* InitType field +* +* NOTES +* +* SEE ALSO +*********/ /****f* IBA Base: Types/ib_port_info_get_op_vls * NAME * ib_port_info_get_op_vls @@ -4457,7 +4484,6 @@ ib_path_get_ipd( * SEE ALSO *********/ - /****f* IBA Base: Types/ib_port_info_get_mtu_cap * NAME * ib_port_info_get_mtu_cap @@ -4546,6 +4572,65 @@ ib_port_info_set_neighbor_mtu( * SEE ALSO *********/ +/****f* IBA Base: Types/ib_port_info_get_master_smsl +* NAME +* ib_port_info_get_master_smsl +* +* DESCRIPTION +* Returns the encoded value for the Master SMSL at this port. +* +* SYNOPSIS +*/ +static inline uint8_t +ib_port_info_get_master_smsl( + IN const ib_port_info_t* const p_pi ) +{ + return( (uint8_t)(p_pi->mtu_smsl & 0x0F) ); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. +* +* RETURN VALUES +* Returns the encoded value for the Master SMSL at this port. +* +* NOTES +* +* SEE ALSO +*********/ +/****f* IBA Base: Types/ib_port_info_set_master_smsl +* NAME +* ib_port_info_set_master_smsl +* +* DESCRIPTION +* Sets the Master SMSL value in the PortInfo attribute. +* +* SYNOPSIS +*/ +static inline void +ib_port_info_set_master_smsl( + IN ib_port_info_t* const p_pi, + IN const uint8_t smsl ) +{ + p_pi->mtu_smsl = (uint8_t)((p_pi->mtu_smsl & 0xF0) | smsl ); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. 
+* +* mtu +* [in] Encoded Master SMSL value to set +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +*********/ + /****f* IBA Base: Types/ib_port_info_set_timeout * NAME * ib_port_info_set_timeout @@ -4981,6 +5066,60 @@ ib_port_info_set_mpb( * * NOTES * +* SEE ALSO +*********/ +/****f* IBA Base: Types/ib_port_info_get_local_phy_err_thd +* NAME +* ib_port_info_get_local_phy_err_thd +* +* DESCRIPTION +* Returns the Phy Link Threshold +* +* SYNOPSIS +*/ +static inline uint8_t +ib_port_info_get_local_phy_err_thd( + IN const ib_port_info_t* const p_pi ) +{ + return (uint8_t)( (p_pi->error_threshold & 0xF0) >> 4); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. +* +* RETURN VALUES +* Returns the Phy Link error threshold assigned to this port. +* +* NOTES +* +* SEE ALSO +*********/ +/****f* IBA Base: Types/ib_port_info_get_overrun_err_thd +* NAME +* ib_port_info_get_local_overrun_err_thd +* +* DESCRIPTION +* Returns the Credits Overrun Errors Threshold +* +* SYNOPSIS +*/ +static inline uint8_t +ib_port_info_get_overrun_err_thd( + IN const ib_port_info_t* const p_pi ) +{ + return (uint8_t)(p_pi->error_threshold & 0x0F); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. +* +* RETURN VALUES +* Returns the Credits Overrun errors threshold assigned to this port. 
+* +* NOTES +* * SEE ALSO *********/ Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 4809) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -347,6 +347,12 @@ __osm_sa_pir_check_physp( ib_port_info_get_link_down_def_state( p_pi ) ) goto Exit; } + if ( comp_mask & IB_PIR_COMPMASK_MKEYPROTBITS ) + { + if( ib_port_info_get_mpb( p_comp_pi ) != + ib_port_info_get_mpb( p_pi ) ) + goto Exit; + } if( comp_mask & IB_PIR_COMPMASK_LMC ) { if( ib_port_info_get_lmc( p_comp_pi ) != @@ -371,6 +377,24 @@ __osm_sa_pir_check_physp( ib_port_info_get_neighbor_mtu( p_pi ) ) goto Exit; } + if( comp_mask & IB_PIR_COMPMASK_MASTERSMSL ) + { + if( ib_port_info_get_master_smsl( p_comp_pi ) != + ib_port_info_get_master_smsl( p_pi ) ) + goto Exit; + } + if( comp_mask & IB_PIR_COMPMASK_VLCAP ) + { + if( ib_port_info_get_vl_cap( p_comp_pi ) != + ib_port_info_get_vl_cap( p_pi ) ) + goto Exit; + } + if( comp_mask & IB_PIR_COMPMASK_INITTYPE ) + { + if( ib_port_info_get_init_type( p_comp_pi ) != + ib_port_info_get_init_type( p_pi ) ) + goto Exit; + } if( comp_mask & IB_PIR_COMPMASK_VLHIGHLIMIT ) { if( p_comp_pi->vl_high_limit != p_pi->vl_high_limit ) @@ -392,6 +416,12 @@ __osm_sa_pir_check_physp( ib_port_info_get_mtu_cap( p_pi ) ) goto Exit; } + if( comp_mask & IB_PIR_COMPMASK_VLSTALLCNT ) + { + if( ib_port_info_get_vl_stall_count( p_comp_pi ) != + ib_port_info_get_vl_stall_count( p_pi ) ) + goto Exit; + } if (comp_mask & IB_PIR_COMPMASK_HOQLIFE ) { if ((p_comp_pi->vl_stall_life & 0x1F) != (p_pi->vl_stall_life & 0x1F) ) @@ -452,6 +482,18 @@ __osm_sa_pir_check_physp( if ((p_comp_pi->resp_time_value & 0x1F) != (p_pi->resp_time_value &0x1F) ) goto Exit; } + if (comp_mask & IB_PIR_COMPMASK_LOCALPHYERR ) + { + if( ib_port_info_get_local_phy_err_thd( p_comp_pi ) != + ib_port_info_get_local_phy_err_thd( p_pi ) ) + goto Exit; + } + if (comp_mask & IB_PIR_COMPMASK_OVERRUNERR) + { + if( 
ib_port_info_get_overrun_err_thd( p_comp_pi ) !=
+        ib_port_info_get_overrun_err_thd( p_pi ) )
+      goto Exit;
+  }
   __osm_sa_pir_create( p_rcv, p_physp, p_ctxt );

From eitan at mellanox.co.il Mon Jan 9 06:23:26 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 09 Jan 2006 16:23:26 +0200
Subject: [openib-general] [PATCH] osm: cl_timer.c segfault on exit
Message-ID: <86y81p8opt.fsf@mtl066.yok.mtl.com>

Hi Hal

We observed a race on closing of complib on some machines.
Tracing down the bug, we found the code on destruction
of the global complib timer to be the cause of it.

The fix goes back to using standard pthread_join and avoids the
while loop...

Thanks

Eitan

Signed-off-by: Eitan Zahavi

Index: complib/cl_timer.c
===================================================================
--- complib/cl_timer.c (revision 4706)
+++ complib/cl_timer.c (working copy)
@@ -112,30 +112,24 @@
 void
 __cl_timer_prov_destroy( void )
 {
-	cl_timer_prov_t *tmp_gp_timer_prov = gp_timer_prov;
+	pthread_t tid;
 	if( !gp_timer_prov )
 		return;
-	/* keep sending cond events untill it exited */
-	while (tmp_gp_timer_prov->thread != 0)
-	{
-		/* signal the thread to exit. */
-		pthread_mutex_lock( &tmp_gp_timer_prov->mutex );
-		tmp_gp_timer_prov->exit = TRUE;
-
-		pthread_cond_broadcast( &tmp_gp_timer_prov->cond );
-
-		/* Broadcast might be prefered on pthread_cond_signal */
-		pthread_mutex_unlock( &tmp_gp_timer_prov->mutex );
-	}
+	tid = gp_timer_prov->thread;
+	pthread_mutex_lock( &gp_timer_prov->mutex );
+	gp_timer_prov->exit = TRUE;
+	pthread_cond_broadcast( &gp_timer_prov->cond );
+	pthread_mutex_unlock( &gp_timer_prov->mutex );
+	pthread_join( tid , NULL );
 	/* Destroy the mutex and condition variable. */
-	pthread_mutex_destroy( &tmp_gp_timer_prov->mutex );
-	pthread_cond_destroy( &tmp_gp_timer_prov->cond );
+	pthread_mutex_destroy( &gp_timer_prov->mutex );
+	pthread_cond_destroy( &gp_timer_prov->cond );
 	/* Free the memory and reset the global pointer. */
-	cl_free( tmp_gp_timer_prov );
+	cl_free( gp_timer_prov );
 	gp_timer_prov = NULL;
 }

From mst at mellanox.co.il Mon Jan 9 06:54:33 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 9 Jan 2006 16:54:33 +0200
Subject: [openib-general] [PATCH] mthca: cosmetic change in mthca_qp
Message-ID: <20060109145433.GL16938@mellanox.co.il>

Roland, how does the following look?

---

Cosmetic change in mthca_qp.

Signed-off-by: Michael S. Tsirkin

Index: openib/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- openib/drivers/infiniband/hw/mthca/mthca_qp.c (revision 4505)
+++ openib/drivers/infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -754,13 +754,10 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
-		if (attr->max_rd_atomic) {
-			qp_context->params1 |=
-				cpu_to_be32(MTHCA_QP_BIT_SRE |
-					    MTHCA_QP_BIT_SAE);
-			qp_context->params1 |=
-				cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21);
-		}
+		if (attr->max_rd_atomic)
+			qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SRE |
+				MTHCA_QP_BIT_SAE |
+				(fls(attr->max_rd_atomic - 1) << 21));
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX);
 	}

--
MST

From mst at mellanox.co.il Mon Jan 9 06:56:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 9 Jan 2006 16:56:43 +0200
Subject: [openib-general] [PATCH repost] ipoib: cosmetics
Message-ID: <20060109145643.GM16938@mellanox.co.il>

Roland, I think we discussed this patch once, and I think you agreed
it makes the code more readable.

---

Cosmetic change in ipoib: make alignment explicit.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-16 02:15:55.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-16 02:39:42.000000000 +0200
@@ -219,8 +219,8 @@ struct ipoib_neigh {
 static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 {
-	return (struct ipoib_neigh **) (neigh->ha + 24 -
-		(offsetof(struct neighbour, ha) & 4));
+	return (void*)neigh + ALIGN(offsetof(struct neighbour, ha) +
+		INFINIBAND_ALEN, sizeof(void *));
 }
 extern struct workqueue_struct *ipoib_workqueue;

--
MST

From mst at mellanox.co.il Mon Jan 9 07:05:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 9 Jan 2006 17:05:47 +0200
Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun
Message-ID: <20060109150547.GN16938@mellanox.co.il>

I am seeing EQ overruns in SDP stress tests: if a CQ completion handler
arms a CQ, this could generate more EQEs, so that the EQ will never get
empty and the consumer index will never get updated.

This is similar to what we have with the command interface:

	/*
	 * cmd_event() may add more commands.
	 * The card will think the queue has overflowed if
	 * we don't tell it we've been processing events.
	 */

However, for completion events, we *don't* want to update the consumer
index on each event. So, perform EQ doorbell coalescing: allocate an EQ
with some spare EQEs, and update once we run out of them.

The value 0x80 was selected to avoid any performance impact.

---

Fix EQ overrun for completion events: increase EQ size by 0x80, and
tell the card that we have been processing events, at least every 0x80
EQEs.

Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.14.3/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- linux-2.6.14.3.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-05 18:11:04.000000000 +0200 +++ linux-2.6.14.3/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-08 09:55:28.000000000 +0200 @@ -45,6 +45,7 @@ enum { MTHCA_NUM_ASYNC_EQE = 0x80, MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_NUM_SPARE_EQE = 0x80, MTHCA_EQ_ENTRY_SIZE = 0x20 }; @@ -277,11 +278,10 @@ static int mthca_eq_int(struct mthca_dev { struct mthca_eqe *eqe; int disarm_cqn; - int eqes_found = 0; + int eqes_found = 0; + int set_ci = 0; while ((eqe = next_eqe_sw(eq))) { - int set_ci = 0; - /* * Make sure we read EQ entry contents after we've * checked the ownership bit. @@ -345,12 +345,6 @@ static int mthca_eq_int(struct mthca_dev be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - /* - * cmd_event() may add more commands. - * The card will think the queue has overflowed if - * we don't tell it we've been processing events. - */ - set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -385,8 +379,14 @@ static int mthca_eq_int(struct mthca_dev set_eqe_hw(eqe); ++eq->cons_index; eqes_found = 1; + ++set_ci; - if (unlikely(set_ci)) { + /* + * The card will think the queue has overflowed if + * we don't tell it we've been processing events, + * now and then. + */ + if (unlikely(set_ci >= MTHCA_NUM_SPARE_EQE)) { /* * Conditional on hca_type is OK here because * this is a rare case, not the fast path. @@ -862,19 +862,19 @@ int __devinit mthca_init_eq_table(struct intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? 128 : dev->eq_table.inta_pin; - err = mthca_create_eq(dev, dev->limits.num_cqs, + err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
128 : intr, &dev->eq_table.eq[MTHCA_EQ_COMP]); if (err) goto err_out_unmap; - err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); if (err) goto err_out_comp; - err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, &dev->eq_table.eq[MTHCA_EQ_CMD]); if (err) -- MST From halr at voltaire.com Mon Jan 9 06:55:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 09:55:34 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - clean exit on ^C In-Reply-To: <5zzmmag99r.fsf@mtl066.yok.mtl.com> References: <5zzmmag99r.fsf@mtl066.yok.mtl.com> Message-ID: <1136818534.4339.211.camel@hal.voltaire.com> Hi Yael, On Thu, 2006-01-05 at 07:16, Yael Kalka wrote: > Hi Hal, > > I've noticed that sometimes when killing OpenSM using ^C not all > threads are killed. > The reason for that is that there are threads that mask the > signalling, and when removing the ^C handling from OpenSM, these > threads still mask the signalling and stay alive as a result. > The following patch fixes this. Is there one other piece to this ? Doesn't osm_opensm.c need to be modified to handle SIGINT for OSM_VENDOR_INTF_OPENIB ? Thanks. 
void osm_reg_sig_handler( IN osm_opensm_t * const p_osm ) { __p_osm_to_signal = p_osm; #ifndef OSM_VENDOR_INTF_OPENIB cl_reg_sig_hdl( SIGINT, __sig_handler ); #endif -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: include/complib/cl_signal_osd.h > =================================================================== > --- include/complib/cl_signal_osd.h (revision 4760) > +++ include/complib/cl_signal_osd.h (working copy) > @@ -148,12 +148,14 @@ cl_sig_mask_sigint(void) > #ifdef __WIN__ > /* we do not mask kill */ > #else > +#ifndef OSM_VENDOR_INTF_OPENIB > sigset_t sigs; > > sigemptyset(&sigs); > sigaddset(&sigs, SIGINT); > pthread_sigmask(SIG_BLOCK, &sigs, NULL); > - #endif > +#endif /* OSM_VENDOR_INTF_OPENIB */ > +#endif /* __WIN__ */ > } > /* > *********/ > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4760) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -244,10 +244,6 @@ umad_receiver(void *p_ptr) > > OSM_LOG_ENTER( p_ur->p_log, umad_receiver ); > > - sigemptyset(&sigs); > - sigaddset(&sigs, SIGINT); > - pthread_sigmask(SIG_BLOCK, &sigs, NULL); > - > for (;;) { > if (!umad && > !(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) { > From mst at mellanox.co.il Mon Jan 9 07:08:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 17:08:48 +0200 Subject: [openib-general] [PATCH] ipoib: flush task race Message-ID: <20060109150848.GO16938@mellanox.co.il> Fix a race in IPoIB: flush task has started runnning and the user does ifconfig ib0 down. This results in race conditions, e.g. ipoib_mcast_stop_thread might be waiting for the same completion in parallel on two CPUs. Solve this by moving flush_task to ipoib_wq. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-09 12:06:31.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-09 13:47:21.000000000 +0200 @@ -252,7 +252,7 @@ void ipoib_ib_dev_cleanup(struct net_dev int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); -int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev, int flush); int ipoib_ib_dev_stop(struct net_device *dev); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); Index: openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-02 11:56:50.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-09 13:47:21.000000000 +0200 @@ -434,7 +434,7 @@ int ipoib_ib_dev_up(struct net_device *d return ipoib_mcast_start_thread(dev); } -int ipoib_ib_dev_down(struct net_device *dev) +int ipoib_ib_dev_down(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -449,10 +449,11 @@ int ipoib_ib_dev_down(struct net_device set_bit(IPOIB_PKEY_STOP, &priv->flags); cancel_delayed_work(&priv->pkey_task); up(&pkey_sem); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); } - ipoib_mcast_stop_thread(dev, 1); + ipoib_mcast_stop_thread(dev, flush); /* * Flush the multicast groups first so we stop any multicast joins. 
The @@ -599,7 +600,7 @@ void ipoib_ib_dev_flush(void *_dev) ipoib_dbg(priv, "flushing\n"); - ipoib_ib_dev_down(dev); + ipoib_ib_dev_down(dev, 0); /* * The device could have been brought down between the start and when Index: openib/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-08 11:41:20.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-09 13:47:22.000000000 +0200 @@ -133,7 +133,7 @@ static int ipoib_stop(struct net_device netif_stop_queue(dev); - ipoib_ib_dev_down(dev); + ipoib_ib_dev_down(dev, 1); ipoib_ib_dev_stop(dev); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { Index: openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2005-12-22 16:52:33.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2006-01-09 13:47:22.000000000 +0200 @@ -255,6 +255,6 @@ void ipoib_event(struct ib_event_handler record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE) { ipoib_dbg(priv, "Port active event\n"); - schedule_work(&priv->flush_task); + queue_work(ipoib_workqueue, &priv->flush_task); } } -- MST From mst at mellanox.co.il Mon Jan 9 07:09:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 17:09:57 +0200 Subject: [openib-general] [PATCH] ipoib: protect mc_list access Message-ID: <20060109150957.GP16938@mellanox.co.il> mc_list accesses must be protected by xmit_lock. Found by Eli Cohen. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4812) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -824,6 +824,7 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_mcast_stop_thread(dev, 0); + spin_lock_bh(&dev->xmit_lock); spin_lock_irqsave(&priv->lock, flags); /* @@ -896,7 +897,9 @@ void ipoib_mcast_restart_task(void *dev_ list_add_tail(&mcast->list, &remove_list); } } + spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_bh(&dev->xmit_lock); /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { -- MST From mst at mellanox.co.il Mon Jan 9 07:17:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 17:17:11 +0200 Subject: [openib-general] [PATCH] ipoib: stop_thread/join_complete race condition fix Message-ID: <20060109151711.GQ16938@mellanox.co.il> IPoIB is open to the following race: ipoib_mcast_join_complete sets mcast->query to NULL, ipoib_mcast_stop_thread tests query, and sees that it is NULL. We then destroy the mcast group. ipoib_mcast_join_complete then calls complete on a non-existing group. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4743) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -413,9 +413,11 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + down(&mcast_mutex); + + spin_lock_irq(&priv->lock); mcast->query = NULL; - down(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) queue_work(ipoib_workqueue, &priv->mcast_task); @@ -424,6 +426,7 @@ static void ipoib_mcast_join_complete(in mcast->backoff * HZ); } else complete(&mcast->done); + spin_unlock_irq(&priv->lock); up(&mcast_mutex); return; @@ -600,21 +603,27 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); + spin_lock_irq(&priv->lock); if (priv->broadcast && priv->broadcast->query) { ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); priv->broadcast->query = NULL; + spin_unlock_irq(&priv->lock); ipoib_dbg_mcast(priv, "waiting for bcast\n"); wait_for_completion(&priv->broadcast->done); - } + } else + spin_unlock_irq(&priv->lock); list_for_each_entry(mcast, &priv->multicast_list, list) { + spin_lock_irq(&priv->lock); if (mcast->query) { ib_sa_cancel_query(mcast->query_id, mcast->query); mcast->query = NULL; + spin_unlock_irq(&priv->lock); ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); wait_for_completion(&mcast->done); - } + } else + spin_unlock_irq(&priv->lock); } return 0; -- MST From halr at voltaire.com Mon Jan 9 07:38:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 10:38:47 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - fix for osm_sa_portinfo_record.c In-Reply-To: <5zwth9fuuf.fsf@mtl066.yok.mtl.com> References: 
<5zwth9fuuf.fsf@mtl066.yok.mtl.com>
Message-ID: <1136821126.4339.651.camel@hal.voltaire.com>

Hi Yael,

On Mon, 2006-01-09 at 07:29, Yael Kalka wrote:
> Hi Hal,
>
> During some tests we've noticed that not all compmask fields are
> properly checked and compared in the portInfo record query.
> Attached is a patch with the missing checks, and addition of some
> set/get relevant functions added to the ib_types.h as well.

Just a couple of minor (nit) comments below.

-- Hal

> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
>
> Index: include/iba/ib_types.h
> ===================================================================
> --- include/iba/ib_types.h (revision 4809)
> +++ include/iba/ib_types.h (working copy)
> @@ -3960,6 +3960,33 @@ ib_port_info_get_vl_cap(
>  *
>  * SEE ALSO
>  *********/
> +/****f* IBA Base: Types/ib_port_info_get_init_type
> +* NAME
> +*    ib_port_info_get_init_type
> +*
> +* DESCRIPTION
> +*    Gets the VL Capability of a port.
                 ^^^^^^^^^^^^^ init type
> +*
> +* SYNOPSIS
> +*/
> +static inline uint8_t
> +ib_port_info_get_init_type(
> +    IN const ib_port_info_t* const p_pi)
> +{
> +    return(p_pi->vl_cap & 0x0F);

Should this be:
    return (uint8_t) (p_pi->vl_cap & 0x0F);

> +}
> +/*
> +* PARAMETERS
> +*    p_pi
> +*        [in] Pointer to a PortInfo attribute.
> +*
> +* RETURN VALUES
> +*    InitType field
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/

[snip...]

-- Hal

From halr at voltaire.com Mon Jan 9 07:57:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jan 2006 10:57:58 -0500
Subject: [openib-general] Re: [PATCH] osm: cl_timer.c segfault on exit
In-Reply-To: <86y81p8opt.fsf@mtl066.yok.mtl.com>
References: <86y81p8opt.fsf@mtl066.yok.mtl.com>
Message-ID: <1136822277.4339.847.camel@hal.voltaire.com>

On Mon, 2006-01-09 at 09:23, Eitan Zahavi wrote:
> Hi Hal
>
> We observed a race on closing of complib on some machines.
> Tracing down the bug we have found the code on destruction
> of the global complib timer - to be the cause of it.
> > The fix goes back to use standard pthread_join and avoid the > while loop... Thanks. Applied. -- Hal > Thanks > > Eitan From mst at mellanox.co.il Mon Jan 9 08:17:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 18:17:13 +0200 Subject: [openib-general] [PATCH] ipoib: count dropped multicast patckets Message-ID: <20060109161713.GS16938@mellanox.co.il> Count dropped multicast packets. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4839) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -122,8 +122,12 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } kfree(mcast); } @@ -309,8 +313,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -686,6 +694,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -699,8 +708,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) 
skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " -- MST From mst at mellanox.co.il Mon Jan 9 08:44:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 18:44:00 +0200 Subject: [openib-general] Re: [PATCH updated] mthca: fix page shift calculation In-Reply-To: References: Message-ID: <20060109164400.GV16938@mellanox.co.il> Quoting Roland Dreier : > Yes, you're right again. It seems like we can get rid of some special > casing and just do things like this: Yes, this makes total sense. -- MST From jlentini at netapp.com Mon Jan 9 08:56:44 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 9 Jan 2006 11:56:44 -0500 (EST) Subject: [openib-general] Re: [PATCH] uDAPL openib_cma disconnect processing fix In-Reply-To: References: Message-ID: arlin> Here is a patch to fix up the disconnect event processing and a arlin> change to dtest to validate. Tested with dtest and dapltest. Thanks Arlin. Commited in revision 4852. From jlentini at netapp.com Mon Jan 9 09:07:10 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 9 Jan 2006 12:07:10 -0500 (EST) Subject: [openib-general] dat_evd_wait & dat_cno_wait not exercised in dtest.c In-Reply-To: References: Message-ID: > Initializing "polling" to "0" would fix the problem. Do you agree with this Arlin? Index: test/dtest/dtest.c =================================================================== --- test/dtest/dtest.c (revision 4852) +++ test/dtest/dtest.c (working copy) @@ -138,7 +138,7 @@ static int connected=0; static int burst=10; static int server=1; static int verbose=0; -static int polling=1; +static int polling=0; static int poll_count=0; static int rdma_wr_poll_count=0; static int rdma_rd_poll_count[MAX_RDMA_RD]={0}; From mst at mellanox.co.il Mon Jan 9 09:28:07 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Mon, 9 Jan 2006 19:28:07 +0200
Subject: [openib-general] get_lock.h on ia64 support
In-Reply-To: <20051231071250.GE32607@esmail.cup.hp.com>
References: <20051231071250.GE32607@esmail.cup.hp.com>
Message-ID: <20060109172807.GC16938@mellanox.co.il>

Hi!
I was trying to build perftests on SLES9 SP1 (RC5).
I was getting compilation errors: apparently on this distribution
asm/timex.h includes all kinds of kernel headers and
so is not fit to be used to build userspace apps.

I plan to check in the following patch to rid us of this dependency
on asm/timex.h for PPC64 and IA64.
Note that the get_cycles kernel function that we use just happens
to be inline; it could move out of line at any point, so
it might be a good idea to do this regardless.

---

Multiple distributions have an asm/timex.h that can't be used in userspace.
Let's just insert the inline assembly implementation as is.

Signed-off-by: Michael S. Tsirkin

Index: openib/src/userspace/perftest/get_clock.h
===================================================================
--- openib/src/userspace/perftest/get_clock.h (revision 4692)
+++ openib/src/userspace/perftest/get_clock.h (working copy)
@@ -47,8 +47,9 @@ static inline cycles_t get_cycles()
     val = (val << 32) | low;
     return val;
 }
-#elif defined(__PPC__)
+#elif defined(__PPC__) || defined(__PPC64__)
 /* Note: only PPC CPUs which have mftb instruction are supported. */
+/* PPC64 has mftb */
 typedef unsigned long long cycles_t;
 static inline cycles_t get_cycles()
 {
@@ -57,10 +58,17 @@ static inline cycles_t get_cycles()
     asm volatile ("mftb %0" : "=r" (ret) : );
     return ret;
 }
-#elif defined(__ia64__) || defined(__PPC64__)
+#elif defined(__ia64__)
 /* Itanium2 and up has ar.itc (Itanium1 has errata) */
-/* PPC64 has mftb */
-#include <asm/timex.h>
+typedef unsigned long cycles_t;
+static inline cycles_t get_cycles()
+{
+    cycles_t ret;
+
+    asm volatile ("mov %0=ar.itc" : "=r" (ret) ::);
+    return ret;
+}
+
 #else
 #warning get_cycles not implemented for this architecture: attempt asm/timex.h
 #include <asm/timex.h>

--
MST

From robert.j.woodruff at intel.com Mon Jan 9 09:57:49 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 9 Jan 2006 09:57:49 -0800
Subject: [openib-general] RE: We are ready with a Gen2 version of MVAPICH2
In-Reply-To: <200601081410.k08EA53K002400@xi.cse.ohio-state.edu>
Message-ID:

DK wrote,
>We plan to upload a stripped down version of this new release to the
>OpenIB SVN at the following location:
>https://openib.org/svn/gen2/trunk/src/userspace/mpi/

Sounds good to me.

woody

From mst at mellanox.co.il Mon Jan 9 10:07:12 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 9 Jan 2006 20:07:12 +0200
Subject: [openib-general] Re: [PATCH] ipoib: count dropped multicast patckets
In-Reply-To: <20060109161713.GS16938@mellanox.co.il>
References: <20060109161713.GS16938@mellanox.co.il>
Message-ID: <20060109180712.GE16938@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: [PATCH] ipoib: count dropped multicast patckets
>
> Count dropped multicast packets.
>
> Signed-off-by: Michael S. Tsirkin

And here's a patch that actually works :).
Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4839) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -122,8 +122,12 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } kfree(mcast); } @@ -299,6 +303,7 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; + unsigned long flags; if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -309,8 +314,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -686,6 +695,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -699,8 +709,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " -- MST From mshefty at ichips.intel.com Mon Jan 9 10:11:13 2006 From: mshefty at 
ichips.intel.com (Sean Hefty) Date: Mon, 09 Jan 2006 10:11:13 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: References: <43BEB4CF.1020103@ichips.intel.com> Message-ID: <43C2A741.3070700@ichips.intel.com> Roland Dreier wrote: > We should generate a patch (or series of patches depending on how big > it ends up being) against Linus's latest tree and post it to > linux-kernel and openib-general for review. It would be fine if you > post or, or I can do it if you don't feel like it. I installed git, pulled the latest tree, and am working on a set of patches for this. I think that a patch series could be broken up as follows: Address translation service. Marshalling parameters between userspace and the kernel. CM comparing private data. Kernel CMA. Userspace CMA kernel agent. Hopefully I'm not missing anything else. - Sean From iod00d at hp.com Mon Jan 9 10:18:14 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 9 Jan 2006 10:18:14 -0800 Subject: [openib-general] Re: get_lock.h on ia64 support In-Reply-To: <20060109172807.GC16938@mellanox.co.il> References: <20051231071250.GE32607@esmail.cup.hp.com> <20060109172807.GC16938@mellanox.co.il> Message-ID: <20060109181814.GA8258@esmail.cup.hp.com> On Mon, Jan 09, 2006 at 07:28:07PM +0200, Michael S. Tsirkin wrote: > Hi! > I was trying to build perftests on SLES9 SP1 (RC5). Michael, Why use SP1? Isn't SP2 the latest version? > I was getting compilation errors: apparently on this distribution > asm/timex.h includes all kind of kernel headers and > so is not fit to be used to build userspace apps. SLES9 doesn't support openib drivers - I'm not surprised there are compilation problems. Suse expects to ship OpenIB support in SLES10. Are you building perf tests to run with mellanox drivers/libs ? > I plan to check in the following patch to get us rid of this dependency > on asm/timex.h for PPC64 and IA64. 
> Note that get_cycles kernel function that we use just happens > to be incline, it could move out of line at any point, so > it might be a good idea to do this, regardless. Yeah, I've been real comfortable including kernel asm files. But it was the most portable way of getting the cycle counts. If you feel strongly about it, I don't mind getting rid of this dependency. thanks, grant From mst at mellanox.co.il Mon Jan 9 10:34:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 9 Jan 2006 20:34:35 +0200 Subject: [openib-general] Re: get_lock.h on ia64 support In-Reply-To: <20060109181814.GA8258@esmail.cup.hp.com> References: <20060109181814.GA8258@esmail.cup.hp.com> Message-ID: <20060109183435.GF16938@mellanox.co.il> Quoting Grant Grundler : > > Hi! > > I was trying to build perftests on SLES9 SP1 (RC5). > > Michael, > Why use SP1? Isn't SP2 the latest version? Donnu :) Is this working on SP2? > > I was getting compilation errors: apparently on this distribution > > asm/timex.h includes all kind of kernel headers and > > so is not fit to be used to build userspace apps. > > SLES9 doesn't support openib drivers - I'm not surprised there > are compilation problems. Naturally, I've replaced the kernel with 2.6.15. I haven't run into problems, yet. > Suse expects to ship OpenIB support in SLES10. > Are you building perf tests to run with mellanox drivers/libs ? no, with openib. > Yeah, I've been real comfortable including kernel asm files. > But it was the most portable way of getting the cycle counts. > If you feel strongly about it, I don't mind getting rid of > this dependency. Yeah. 
FWIW I still plan to keep #else #warning get_cycles not implemented for this architecture: attempt asm/timex.h #include #endif -- MST From bill.boas at gmail.com Mon Jan 9 10:43:33 2006 From: bill.boas at gmail.com (Bill Boas) Date: Mon, 9 Jan 2006 10:43:33 -0800 Subject: [openib-general] Sonoma Workshop draft agenda, PLEASE REGISTER ASAP Message-ID: <19a929370601091043y3f3ed5f4q4771a130e5f383e6@mail.gmail.com> Dear Developers and Promoters, Please check the draft strawman agenda at http ://openib.org/tiki/tiki-list_file_gallery.php?galleryId=12 Please let us all know the changes you suggest we make as soon as possible. My apologies to "Wall Street", I think I should have called the keynote "HSIR Requirements" but maybe HSIR would like to schedule sessions into this agenda also. There are meeting rooms available and other tracks can be added if that is agreeable to the community. PLEASE have your webmasters post the invitation http://www.openib.org/conference.html on your events page on your corporate web site and also send out the invitation to your sales and marketing teams to send to their customers and partners to generate attendance. Also REGISTER YOURSELVES and other company members attending. Finally temporary contact info for me bill.boas at gmail.com, cell 510-375-8840. Questions and comments please!!!!! Bill. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Jan 9 11:12:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 14:12:23 -0500 Subject: [openib-general] osm_pkey_mgr question Message-ID: <1136833942.4339.2955.camel@hal.voltaire.com> Hi Ofer, With the new default PKey manager support, I am seeing the following error messages in the osm log: Jan 09 10:31:04 894576 [B771FC40] -> osm_pkey_mgr_process: ERR 0502: Invalid physical port for node 0x005442b100004900 port 0 Jan 09 10:31:04 894740 [B771FC40] -> osm_pkey_mgr_process: ERR 0502: Invalid physical port for node 0x005442b100004900 port 2 On port 0, switch port 0 is not a physical port. I seem to get these for all switch port 0s. Is that right ? Also, any idea why I would get this on some of the switch external (physical) ports like port 2 ? Is this an error which should be put in the log ? How can they be discerned from the ones which appear not to be real errors ? Thanks. -- Hal From iod00d at hp.com Mon Jan 9 12:23:44 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 9 Jan 2006 12:23:44 -0800 Subject: [openib-general] Re: get_lock.h on ia64 support In-Reply-To: <20060109183435.GF16938@mellanox.co.il> References: <20060109181814.GA8258@esmail.cup.hp.com> <20060109183435.GF16938@mellanox.co.il> Message-ID: <20060109202344.GB8870@esmail.cup.hp.com> On Mon, Jan 09, 2006 at 08:34:35PM +0200, Michael S. Tsirkin wrote: ... > Donnu :) > Is this working on SP2? I'm pretty sure SP2 is the current SLES9 release. Likely it will have compile issues in a similar/same way. I haven't tried. I've not had time yet to integrated perftests into my regular (montly about) netperf runs. > > SLES9 doesn't support openib drivers - I'm not surprised there > > are compilation problems. > > Naturally, I've replaced the kernel with 2.6.15. > I haven't run into problems, yet. OIC. I'm doing the same thing with Debian. In the past, I've just redirected /usr/include/asm to point at the "raw" linux kernel header files. 
I haven't needed to do that in quite a while though.

> > Yeah, I've been real comfortable including kernel asm files.

sorry - should read "I've NOT been...".

> > But it was the most portable way of getting the cycle counts.
> > If you feel strongly about it, I don't mind getting rid of
> > this dependency.
>
> Yeah. FWIW I still plan to keep
>
> #else
> #warning get_cycles not implemented for this architecture: attempt asm/timex.h
> #include <asm/timex.h>
> #endif

Ok. That'll work as well as it works now then.
And this is open source. Folks can still add new arches if
asm/timex.h doesn't work for them.

thanks,
grant

From ralphc at pathscale.com Mon Jan 9 12:38:47 2006
From: ralphc at pathscale.com (Ralph Campbell)
Date: Mon, 09 Jan 2006 12:38:47 -0800
Subject: [openib-general] [PATCH] Add get Pkey table to smpquery
Message-ID: <1136839127.4520.6.camel@brick.internal.keyresearch.com>

Here is a patch to add the "pkeys" option to smpquery
to display the Pkey table.

Signed-off-by: Ralph Campbell

Index: diags/src/smpquery.c
===================================================================
--- diags/src/smpquery.c (revision 4808)
+++ diags/src/smpquery.c (working copy)
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -61,13 +62,14 @@
     op_fn_t *fn;
 } match_rec_t;

-static op_fn_t node_desc, node_info, port_info, switch_info;
+static op_fn_t node_desc, node_info, port_info, switch_info, pkey_table;

 static match_rec_t match_tbl[] = {
     { "nodeinfo", node_info },
     { "nodedesc", node_desc },
     { "portinfo", port_info },
     { "switchinfo", switch_info },
+    { "pkeys", pkey_table },
     {0}
 };

@@ -157,6 +159,35 @@
     return 0;
 }

+static char *
+pkey_table(ib_portid_t *dest, char **argv, int argc)
+{
+    char data[IB_SMP_DATA_SIZE];
+    uint32_t i, j, n;
+    uint16_t *p;
+
+    /* Get the partition capacity */
+    if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0))
+        return "node_info query failed";
+    n = _get_field(data, 0, ib_mad_f + IB_NODE_PARTITION_CAP_F);
+
+    for (i = 0; i < (n + 31) / 32; i++) {
+        if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, i, 0))
+            return "pkey table query failed";
+        p = (uint16_t *) data;
+        for (j = 0; j < 32; j += 8, p += 8) {
+            printf("%4u: 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x \n",
+                (i * 32) + j,
+                ntohs(p[0]), ntohs(p[1]),
+                ntohs(p[2]), ntohs(p[3]),
+                ntohs(p[4]), ntohs(p[5]),
+                ntohs(p[6]), ntohs(p[7]));
+        }
+    }
+
+    return 0;
+}
+
 op_fn_t *
 match_op(char *name)
 {

--
Ralph Campbell

From eitan at mellanox.co.il Mon Jan 9 13:08:22 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 9 Jan 2006 23:08:22 +0200
Subject: [openib-general] OpenSM Directions - PKey manager
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B496@mtlexch01.mtl.com>

Hi All,

Hal and I worked on a functional definition for future OpenSM
Partitions Manager.

Please review and provide feedback.

Thanks

Hal and Eitan

OpenSM Partition Management
---------------------------

Roadmap:
Phase 1 - provide partition management at the EndPort (HCA, Router and
          Switch Port 0) level with no routing effects.
Phase 2 - routing engine should take partitions into account.

Phase 1 functionality:

Supported Policy:

1. EndPort partition groups are to be defined by listing the
   PortGUIDs as full and partial members.

2. Each partition group might be assigned an explicit P_Key (only the 15
   LSB bits are valid) or the SM will assign it randomly.

3. A flag should control the generation of IPoIB broadcast group for
   that partition. Extra optional MGIDs can be provided to be set up
   (on top of the IPoIB broadcast group).

4. A global flag "Disconnect Unconfigured EndPorts": If TRUE, prevents
   EndPorts that are not explicitly defined as part of any partition
   (thus "unconfigured") from communicating with any other EndPort.
   Otherwise, it will let these EndPorts send packets to all other
   EndPorts.

Functionality:

1. The policy should be updated:
   - during SM bring-up
   - after kill -HUP
   - through SNMP (once it is supported)

2. Partition tables will be updated on full sweep (new port/trap etc).
   As a first step, the policy feasibility should be verified.
   Feasibility could be limited by the EndPorts' support for the number
   of partitions, etc. Unrealizable policy should be reported and extra
   rules ignored after providing error messages.

3. Each EndPort will be assigned P_Keys as follows:

   a. Default partition group partial membership as defined by rule #4
      below. (only the SM port will get 0xffff).

   b. P_Keys for all partition groups it is part of as defined in
      the policy.

   c. P_Key update will preserve index for the existing P_Keys on the
      port. If the port has limited resources that would require reuse
      of an index, a message will be provided and some of the settings
      will be omitted. P_Key indexes will not change under any
      circumstances.

4. Each Switch Leaf Port (a switch port that is connected to an
   EndPort) should be configured according to the same rules that
   apply to the EndPort connected to that switch port.
   This actually enables isolation of unauthorized ports (with future
   usage of M_Key and ProtectBits).

5. Policy entries matching a non EndPort will be flagged as
   erroneous in the log file and ignored.

6. At the end of the P_Key setting phase, a check for successful
   setting should be made.
   Errors should be clearly logged and cause a new sweep.

7. Each partition that is marked to support IPoIB should define a
   broadcast MGRP. If the partition does not support IPoIB, it should
   define a dummy MGRP with parameters blocking IPoIB drivers from
   registering to it.

Phase 2 functionality:

The partition policy should be considered during the routing such that
links are associated with a particular partition or a set of
partitions. Policy should be enhanced to provide hints for how to do
that (correlates to QoS too). The exact algorithm is TBD.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: OpenSM_PKey_Mgr_1_6.txt URL: From halr at voltaire.com Mon Jan 9 13:07:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 16:07:45 -0500 Subject: [openib-general] Re: OpenSM Directions - PKey manager In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B496@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B496@mtlexch01.mtl.com> Message-ID: <1136840865.4339.3543.camel@hal.voltaire.com> On Mon, 2006-01-09 at 16:08, Eitan Zahavi wrote: > Hi All, > > Hal and I worked on a functional definition for future OpenSM > Partitions Manager. > > Please review and provide feedback. This is also checked into the OpenSM tree as: https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_PKey_Mgr.txt -- Hal > Thanks > > Hal and Eitan From halr at voltaire.com Mon Jan 9 13:33:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 16:33:22 -0500 Subject: [openib-general] [PATCH] Add get Pkey table to smpquery In-Reply-To: <1136839127.4520.6.camel@brick.internal.keyresearch.com> References: <1136839127.4520.6.camel@brick.internal.keyresearch.com> Message-ID: <1136842401.4339.3667.camel@hal.voltaire.com> Hi Ralph, On Mon, 2006-01-09 at 15:38, Ralph Campbell wrote: > Here is a patch to add the "pkeys" option to smpquery > to display the Pkey table. Thanks! Applied with some fixups (see below) to this patch which appear to be caused by your mailer. Can you try to get this sorted out ? Also, I will supply a subsequent patch to handle port number in these requests. 
-- Hal > Signed-off-by: Ralph Campbell > > Index: diags/src/smpquery.c > =================================================================== > --- diags/src/smpquery.c (revision 4808) > +++ diags/src/smpquery.c (working copy) > @@ -42,6 +42,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -61,13 +62,14 @@ > op_fn_t *fn; > } match_rec_t; > > -static op_fn_t node_desc, node_info, port_info, switch_info; > +static op_fn_t node_desc, node_info, port_info, switch_info, > pkey_table; Your mailer appears to be wrapping at 80 columns :-( > static match_rec_t match_tbl[] = { > { "nodeinfo", node_info }, > { "nodedesc", node_desc }, > { "portinfo", port_info }, > { "switchinfo", switch_info }, > + { "pkeys", pkey_table }, > {0} > }; > > @@ -157,6 +159,35 @@ > return 0; > } > > +static char * > +pkey_table(ib_portid_t *dest, char **argv, int argc) > +{ > + char data[IB_SMP_DATA_SIZE]; > + uint32_t i, j, n; > + uint16_t *p; > + > + /* Get the partition capacity */ > + if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) > + return "node_info query failed"; > + n = _get_field(data, 0, ib_mad_f + IB_NODE_PARTITION_CAP_F); > + > + for (i = 0; i < (n + 31) / 32; i++) { > + if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, i, 0)) > + return "pkey table query failed"; > + p = (uint16_t *) data; > + for (j = 0; j < 32; j += 8, p += 8) { > + printf("%4u: 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x > \n", Same here... 
> + (i * 32) + j, > + ntohs(p[0]), ntohs(p[1]), > + ntohs(p[2]), ntohs(p[3]), > + ntohs(p[4]), ntohs(p[5]), > + ntohs(p[6]), ntohs(p[7])); > + } > + } > + > + return 0; > +} > + > op_fn_t * > match_op(char *name) > { From ralphc at pathscale.com Mon Jan 9 13:46:52 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 09 Jan 2006 13:46:52 -0800 Subject: [openib-general] [PATCH] bad free() in libibumad Message-ID: <1136843212.4520.17.camel@brick.internal.keyresearch.com> Here is a patch to fix the following backtrace: (gdb) bt #0 0x000000355642f280 in raise () from /lib64/libc.so.6 #1 0x0000003556430750 in abort () from /lib64/libc.so.6 #2 0x0000003556464a7f in __libc_message () from /lib64/libc.so.6 #3 0x000000355646a71e in _int_free () from /lib64/libc.so.6 #4 0x000000355646ac4e in free () from /lib64/libc.so.6 #5 0x0000003d98f01568 in get_port (ca_name=Variable "ca_name" is not available.) at src/umad.c:191 #6 0x0000003d98f025f5 in umad_get_port ( ca_name=0x7fffff9d3e70 "/sys/class/infiniband/mthca0/ports", portnum=1, port=0x7fffff9d44f0) at src/umad.c:617 #7 0x0000003d9910249a in osm_vendor_get_all_port_attr (p_vend=0x658220, p_attr_array=0x7fffff9d4580, p_num_ports=0x7fffff9d6a08) at osm_vendor_ibumad.c:624 #8 0x00000000004044e9 in get_port_guid (p_osm=Variable "p_osm" is not available. ) at main.c:274 Signed-off-by: Ralph Campbell Index: libibumad/src/umad.c =================================================================== --- libibumad/src/umad.c (revision 4808) +++ libibumad/src/umad.c (working copy) @@ -188,7 +188,6 @@ return 0; clean: - free(port); return -EIO; } -- Ralph Campbell From rdreier at cisco.com Mon Jan 9 13:51:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jan 2006 13:51:12 -0800 Subject: [openib-general] Re: [PATCH updated] mthca: fix page shift calculation In-Reply-To: <20060104124829.GY2790@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 4 Jan 2006 14:48:29 +0200") References: <20060104124829.GY2790@mellanox.co.il> Message-ID: OK, I applied my final version. From rdreier at cisco.com Mon Jan 9 13:52:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jan 2006 13:52:29 -0800 Subject: [openib-general] Re: [PATCH] mthca: cosmetic change in mthca_qp In-Reply-To: <20060109145433.GL16938@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 9 Jan 2006 16:54:33 +0200") References: <20060109145433.GL16938@mellanox.co.il> Message-ID: > + (fls(attr->max_rd_atomic - 1) << 21)); I'm not very strict about 80 columns, and the modify QP function is especially bad, but this line is a little too long even for me. - R. From rdreier at cisco.com Mon Jan 9 14:07:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jan 2006 14:07:48 -0800 Subject: [openib-general] Re: [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <20060109150547.GN16938@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 9 Jan 2006 17:05:47 +0200") References: <20060109150547.GN16938@mellanox.co.il> Message-ID: Thanks, applied. 
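
[Note on the `fls(attr->max_rd_atomic - 1) << 21` line quoted in the mthca_qp
review above.] In the Linux kernel, `fls()` returns the 1-based position of
the most significant set bit, and 0 for an input of 0, so `fls(n - 1)` is the
smallest exponent e with 2^e >= n. Presumably the driver uses this to store
the requested limit as a log2 value in a bit field starting at bit 21; that
reading is an assumption, not something stated in the thread. The following
is a minimal userspace sketch of the arithmetic only. `fls_sketch` and
`encode_rd_atomic` are hypothetical names and are not part of mthca.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's fls(): 1-based index of the
 * most significant set bit, or 0 when x == 0. */
static int fls_sketch(uint32_t x)
{
    int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Hypothetical helper (assumption, not driver code): round n (n >= 1)
 * up to a power of two, take its log2, and place the result in a field
 * starting at bit 21, mirroring the quoted expression. */
static uint32_t encode_rd_atomic(uint32_t n)
{
    return (uint32_t)fls_sketch(n - 1) << 21;
}
```

With this sketch, requested values of 1, 2, 4 and 5 map to exponents 0, 1, 2
and 3 respectively, so any value that is not a power of two is rounded up.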
From halr at voltaire.com Mon Jan 9 14:00:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jan 2006 17:00:27 -0500
Subject: [openib-general] [PATCH] OpenSM: Increase default MAXSMPs from 1 to 4
Message-ID: <1136844026.4339.3788.camel@hal.voltaire.com>

OpenSM: Increase default MAXSMPs from 1 to 4

Signed-off-by: Eitan Zahavi
Signed-off-by: Hal Rosenstock

Index: include/opensm/osm_base.h
===================================================================
--- include/opensm/osm_base.h (revision 4859)
+++ include/opensm/osm_base.h (working copy)
@@ -359,7 +359,7 @@
 *
 * SYNOPSIS
 */
-#define OSM_DEFAULT_SMP_MAX_ON_WIRE 1
+#define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
 /***********/

 /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE

From rdreier at cisco.com Mon Jan 9 14:48:26 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jan 2006 14:48:26 -0800
Subject: [openib-general] Re: [PATCH] ipoib: mcast allocation error handling
In-Reply-To: <20051218120725.GE4241@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 18 Dec 2005 14:07:25 +0200")
References: <20051218120725.GE4241@mellanox.co.il>
Message-ID:

Thanks, applied

From rdreier at cisco.com Mon Jan 9 14:51:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jan 2006 14:51:20 -0800
Subject: [openib-general] Re: [PATCH] address handle refrences in ipoib_multicast.c
In-Reply-To: <20051220162932.GH2366@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 20 Dec 2005 18:29:32 +0200")
References: <20051220162932.GH2366@mellanox.co.il>
Message-ID:

Thanks, applied.
From mshefty at ichips.intel.com Mon Jan 9 15:13:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 09 Jan 2006 15:13:22 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43C2A741.3070700@ichips.intel.com> References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> Message-ID: <43C2EE12.10207@ichips.intel.com> Sean Hefty wrote: > I installed git, pulled the latest tree, and am working on a set of > patches for this. I think that a patch series could be broken up as > follows: > > Address translation service. > Marshalling parameters between userspace and the kernel. > CM comparing private data. > Kernel CMA. > Userspace CMA kernel agent. > > Hopefully I'm not missing anything else. I'm missing one other piece: adding node_guid to ib_device. Roland, could you generate a patch (or push the changes upstream) for the node_guid that will work with the mthca version that is upstream? - Sean From halr at voltaire.com Mon Jan 9 15:11:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 18:11:05 -0500 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43C2EE12.10207@ichips.intel.com> References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> Message-ID: <1136848265.4339.4079.camel@hal.voltaire.com> On Mon, 2006-01-09 at 18:13, Sean Hefty wrote: > Sean Hefty wrote: > > I installed git, pulled the latest tree, and am working on a set of > > patches for this. I think that a patch series could be broken up as > > follows: > > > > Address translation service. > > Marshalling parameters between userspace and the kernel. > > CM comparing private data. > > Kernel CMA. > > Userspace CMA kernel agent. > > > > Hopefully I'm not missing anything else. > > I'm missing one other piece: adding node_guid to ib_device. 
> > Roland, could you generate a patch (or push the changes upstream) for the > node_guid that will work with the mthca version that is upstream? Also, isn't the fib_frontend.c patch needed too ? -- Hal From mshefty at ichips.intel.com Mon Jan 9 15:23:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 09 Jan 2006 15:23:19 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <1136848265.4339.4079.camel@hal.voltaire.com> References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> <1136848265.4339.4079.camel@hal.voltaire.com> Message-ID: <43C2F067.9000406@ichips.intel.com> Hal Rosenstock wrote: > Also, isn't the fib_frontend.c patch needed too ? It is, but I included that with my patch that adds ib_addr to the infiniband tree. - Sean From rdreier at cisco.com Mon Jan 9 15:40:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jan 2006 15:40:21 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43C2EE12.10207@ichips.intel.com> (Sean Hefty's message of "Mon, 09 Jan 2006 15:13:22 -0800") References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> Message-ID: Sean> Roland, could you generate a patch (or push the changes Sean> upstream) for the node_guid that will work with the mthca Sean> version that is upstream? Yes, I'll push that stuff upstream shortly. I'd like to completely kill the node_guid field in struct ib_device_attr at the same time. Do you have a patch handy that does that? I can generate it myself pretty easily but I'd rather be lazy. In svn, ehca still needs to be fixed up, but I sent email to the ehca team asking them to initialize the node_guid field at initialization time. Once that happens we can kill the field in struct ib_device_attr in svn too. - R. 
From halr at voltaire.com Mon Jan 9 15:40:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jan 2006 18:40:30 -0500 Subject: [openib-general] [PATCH] bad free() in libibumad In-Reply-To: <1136843212.4520.17.camel@brick.internal.keyresearch.com> References: <1136843212.4520.17.camel@brick.internal.keyresearch.com> Message-ID: <1136850030.4339.4197.camel@hal.voltaire.com> Hi Ralph, On Mon, 2006-01-09 at 16:46, Ralph Campbell wrote: > Here is a patch to fix the following backtrace: > > (gdb) bt > #0 0x000000355642f280 in raise () from /lib64/libc.so.6 > #1 0x0000003556430750 in abort () from /lib64/libc.so.6 > #2 0x0000003556464a7f in __libc_message () from /lib64/libc.so.6 > #3 0x000000355646a71e in _int_free () from /lib64/libc.so.6 > #4 0x000000355646ac4e in free () from /lib64/libc.so.6 > #5 0x0000003d98f01568 in get_port (ca_name=Variable "ca_name" is not > available.) at src/umad.c:191 > #6 0x0000003d98f025f5 in umad_get_port ( > ca_name=0x7fffff9d3e70 "/sys/class/infiniband/mthca0/ports", > portnum=1, > port=0x7fffff9d44f0) at src/umad.c:617 > #7 0x0000003d9910249a in osm_vendor_get_all_port_attr (p_vend=0x658220, > p_attr_array=0x7fffff9d4580, p_num_ports=0x7fffff9d6a08) > at osm_vendor_ibumad.c:624 > #8 0x00000000004044e9 in get_port_guid (p_osm=Variable "p_osm" is not > available. > ) at main.c:274 > > > Signed-off-by: Ralph Campbell > > Index: libibumad/src/umad.c > =================================================================== > --- libibumad/src/umad.c (revision 4808) > +++ libibumad/src/umad.c (working copy) > @@ -188,7 +188,6 @@ > return 0; > > clean: > - free(port); > return -EIO; > } I think there is more to it than this but thanks for pointing this out. 
-- Hal From mshefty at ichips.intel.com Mon Jan 9 15:58:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 09 Jan 2006 15:58:02 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> Message-ID: <43C2F88A.7000607@ichips.intel.com> Roland Dreier wrote: > Sean> Roland, could you generate a patch (or push the changes > Sean> upstream) for the node_guid that will work with the mthca > Sean> version that is upstream? > > Yes, I'll push that stuff upstream shortly. I'd like to completely > kill the node_guid field in struct ib_device_attr at the same time. > Do you have a patch handy that does that? I can generate it myself > pretty easily but I'd rather be lazy. I'll generate a patch to do that against Linus' git tree and forward to you. - Sean From danb at voltaire.com Mon Jan 9 17:30:39 2006 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 10 Jan 2006 03:30:39 +0200 Subject: [openib-general] RE: [Ips] iSER API's Message-ID: Eddy hi, OpenIB maintains a WIKI at https://openib.org/tiki/tiki-index.php However I'm not sure it contains any ib_verbs documentation. I'm CCing the openib mailing list, maybe someone there knows where you can find documentation. Dan ________________________________ From: Eddy Quicksall [mailto:eddy_quicksall_iVivity_iSCSI at Comcast.net] Sent: Monday, January 09, 2006 10:55 PM To: Dan Bar Dov Subject: Re: [Ips] iSER API's I assume ib_verbs is for Infiniband. Am I correct? Where can I get the ib_verbs API documentation? Eddy ----- Original Message ----- From: Dan Bar Dov To: John Hufferd ; Eddy Quicksall ; ips at ietf.org Sent: Sunday, January 08, 2006 8:50 PM Subject: RE: [Ips] iSER API's Indeed on Linux the direction is towards a generic RDMA interface. CMA provides a generic CM abstraction, and the ib_verbs API is planned to extend over iWARP one way or another. 
The DM was deemed unnecessary, the API between iSCSI & SCSI and the underlying iSER enforced redesign of the iSER API to conform with iSCSI and SCSI rather then implement yet another layer between iSER and iSCSI/SCSI (namely DM). Dan ________________________________ From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of John Hufferd Sent: Friday, January 06, 2006 11:18 PM To: Eddy Quicksall; ips at ietf.org Subject: RE: [Ips] iSER API's Eddy, What APIs are you asking about? The SCSI CLASS Driver to the Device Driver (Mini Port Driver) should have the same interfaces for iSER as it is available for iSCSI. If you mean between the Device Driver (Mini Port Driver) and the RNIC, that will probably be the RNIC vendors interface if they have implemented all or part of the iSER or Data Mover in the RNIC itself, or the RNIC vendor's interfaces to their version of the verbs (at least until the OS implements its own RDMA interfaces that implement the RDMA verbs). I believe that, over time, most OSs will have generic RDMA interfaces, which can be use by all certified RNIC hardware, and any application (user space or kernel space); in that case the iSER module will probably interface to that OS's RDMA interfaces. . . . John L Hufferd Sr. Executive Director of Technology Brocade Communications Systems, Inc jhufferd at brocade.com Office Phone: (408) 333-5244; eFAX: (408) 904-4688 Alt Office Phone: (408) 997-6136; Cell: (408) 627-9606 ________________________________ From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of Eddy Quicksall Sent: Friday, January 06, 2006 11:42 AM To: ips at ietf.org Subject: [Ips] iSER API's Is there any work afoot to design some standard iSER API's? Eddy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ogerlitz at voltaire.com Mon Jan 9 23:34:36 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 10 Jan 2006 09:34:36 +0200 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43C2F88A.7000607@ichips.intel.com> References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> <43C2F88A.7000607@ichips.intel.com> Message-ID: <43C3638C.70802@voltaire.com> Sean Hefty wrote: > Roland Dreier wrote: >> Sean> Roland, could you generate a patch (or push the changes >> Sean> upstream) for the node_guid that will work with the mthca >> Sean> version that is upstream? >> >> Yes, I'll push that stuff upstream shortly. I'd like to completely >> kill the node_guid field in struct ib_device_attr at the same time. >> Do you have a patch handy that does that? I can generate it myself >> pretty easily but I'd rather be lazy. Sean, Just to make sure, would the __be64 node_guid field of struct ib_device have the exact semantics of the __be64 node_guid field of struct ib_device_attr ? iSER reads it from the attr struct, and we can certainly switch to reading it from the device struct. Or. From ogerlitz at voltaire.com Mon Jan 9 23:55:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 10 Jan 2006 09:55:19 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <20060109150547.GN16938@mellanox.co.il> References: <20060109150547.GN16938@mellanox.co.il> Message-ID: <43C36867.2000405@voltaire.com> Michael, Roland Michael S. Tsirkin wrote: > I am seeing EQ overruns in SDP stress tests: if CQ completion > handler arms a CQ, this could generate more EQEs, so that > EQ will never get empty and consumer index will never get updated. There's something regarding CQ arming which I'd like to bring up. I see that the mad, ipoib and srp CQ handlers work as follows: first - arm the CQ, second - poll the CQ in a loop till it is empty.
What is the reasoning behind this approach? Does it mean that completions occurring while the handler is running cause interrupts which could be saved? Is there any problem with first emptying the CQ and only then arming it? The latter approach is taken by the iSER code. As far as I understand, it cannot cause the IB consumer to miss interrupts. Am I wrong? Or. From mst at mellanox.co.il Tue Jan 10 00:24:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 10:24:32 +0200 Subject: [openib-general] Re: [PATCH] ipoib: count dropped multicast patckets In-Reply-To: <20060109180712.GE16938@mellanox.co.il> References: <20060109180712.GE16938@mellanox.co.il> Message-ID: <20060110082432.GA3186@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: [PATCH] ipoib: count dropped multicast patckets > > Quoting r. Michael S. Tsirkin : > > Subject: [PATCH] ipoib: count dropped multicast patckets > > > > Count dropped multicast packets. > > > > Signed-off-by: Michael S. Tsirkin > > And here's a patch that actually works :). And here's a version that really actually works :). Count dropped multicast packets. Signed-off-by: Michael S.
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-08 22:42:58.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-10 10:17:41.000000000 +0200 @@ -122,8 +122,12 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } kfree(mcast); } @@ -299,6 +303,8 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -309,8 +315,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -686,6 +696,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -699,8 +710,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no 
address vector, " -- MST From mst at mellanox.co.il Tue Jan 10 00:32:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 10:32:55 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <43C36867.2000405@voltaire.com> References: <43C36867.2000405@voltaire.com> Message-ID: <20060110083255.GG16938@mellanox.co.il> Quoting Or Gerlitz : > > I am seeing EQ overruns in SDP stress tests: if CQ completion > > handler arms a CQ, this could generate more EQEs, so that > > EQ will never get empty and consumer index will never get updated. > > There's something re CQ arming which i'd like to bring up. > > I see that the mad, ipoib and srp CQ handlers work as follows: first - > arm the CQ, second - poll the CQ in a loop till it is emtpy. What is the > reasoning behind this approach? Thats what IB spec says. > does it means that completions occurring > while the handler is running cause interrupts which could be saved? Handlers are running out of the interrupt context, so an interrupt would have to be generated in the window while CQ is being armed. Profiling I've done on ipoib shows that this is quite unlikely. > is there any problem with first empty-ing the CQ and only then arming it? > > The latter approach is taken by iser code. As far as i understand it can > not cause the ib consumer to miss interrupts, am i wrong? On Mellanox hardware you wont miss interrupts in this case if you always poll CQ and then arm it as a result of an interrupt. -- MST
From eitan at mellanox.co.il Tue Jan 10 03:08:44 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 10 Jan 2006 13:08:44 +0200 Subject: [openib-general] [PATCH] osm: Add TODO light sweep through LID Message-ID: <86wth88hmr.fsf@mtl066.yok.mtl.com> Hi Hal One TODO we have found is that in order to guarantee traps can be delievered to the SM - it would have been nice if light sweep will use LID routing and not direct routing. This way the LFTs could be verified or else a heavy sweep conducted. This patch only changes the TODO file so we do not forget... (I also fixed a typo: issuesi -> issues) Eitan Signed-off-by: Eitan Zahavi Index: doc/todo =================================================================== --- doc/todo (revision 4876) +++ doc/todo (working copy) @@ -1,4 +1,4 @@ -# OSM List of todo, open issuesi, and futures: +# OSM List of todo, open issues, and futures: 1 041228 - Handle local events (local lid change, port state change, etc.) 2 041228 - Port fail over to next port upon request @@ -12,6 +12,9 @@ 7 051207 - Add dumping of SA records to supported SA records which do not currently do this (SMInfo, VLArb, SLVL, PKey, LFT) +8 060109 - Use LID routing for light sweep to guarantee trap + delivery path to the SM. + Futures From halr at voltaire.com Tue Jan 10 03:17:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 06:17:22 -0500 Subject: [openib-general] Re: [PATCH] osm: Add TODO light sweep through LID In-Reply-To: <86wth88hmr.fsf@mtl066.yok.mtl.com> References: <86wth88hmr.fsf@mtl066.yok.mtl.com> Message-ID: <1136891841.4339.6272.camel@hal.voltaire.com> On Tue, 2006-01-10 at 06:08, Eitan Zahavi wrote: > Hi Hal > > One TODO we have found is that in order to guarantee traps can > be delievered to the SM - it would have been nice if light sweep > will use LID routing and not direct routing.
This way the LFTs could > be verified or else a heavy sweep conducted. > > This patch only changes the TODO file so we do not forget... Thanks. Applied. > (I also fixed a typo: issuesi -> issues) > > Eitan From halr at voltaire.com Tue Jan 10 03:49:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 06:49:05 -0500 Subject: [openib-general] [PATCH] bad free() in libibumad In-Reply-To: <1136850030.4339.4197.camel@hal.voltaire.com> References: <1136843212.4520.17.camel@brick.internal.keyresearch.com> <1136850030.4339.4197.camel@hal.voltaire.com> Message-ID: <1136893551.4339.6393.camel@hal.voltaire.com> On Mon, 2006-01-09 at 18:40, Hal Rosenstock wrote: > Hi Ralph, > > On Mon, 2006-01-09 at 16:46, Ralph Campbell wrote: > > Here is a patch to fix the following backtrace: > > > > (gdb) bt > > #0 0x000000355642f280 in raise () from /lib64/libc.so.6 > > #1 0x0000003556430750 in abort () from /lib64/libc.so.6 > > #2 0x0000003556464a7f in __libc_message () from /lib64/libc.so.6 > > #3 0x000000355646a71e in _int_free () from /lib64/libc.so.6 > > #4 0x000000355646ac4e in free () from /lib64/libc.so.6 > > #5 0x0000003d98f01568 in get_port (ca_name=Variable "ca_name" is not > > available.) at src/umad.c:191 > > #6 0x0000003d98f025f5 in umad_get_port ( > > ca_name=0x7fffff9d3e70 "/sys/class/infiniband/mthca0/ports", > > portnum=1, > > port=0x7fffff9d44f0) at src/umad.c:617 > > #7 0x0000003d9910249a in osm_vendor_get_all_port_attr (p_vend=0x658220, > > p_attr_array=0x7fffff9d4580, p_num_ports=0x7fffff9d6a08) > > at osm_vendor_ibumad.c:624 > > #8 0x00000000004044e9 in get_port_guid (p_osm=Variable "p_osm" is not > > available. 
> > ) at main.c:274 > > > > > > Signed-off-by: Ralph Campbell > > > > Index: libibumad/src/umad.c > > =================================================================== > > --- libibumad/src/umad.c (revision 4808) > > +++ libibumad/src/umad.c (working copy) > > @@ -188,7 +188,6 @@ > > return 0; > > > > clean: > > - free(port); > > return -EIO; > > } > > I think there is more to it than this but thanks for pointing this out. Thanks. I applied your "better" version of this patch. -- Hal From ogerlitz at voltaire.com Tue Jan 10 04:31:01 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 10 Jan 2006 14:31:01 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <20060110083255.GG16938@mellanox.co.il> References: <43C36867.2000405@voltaire.com> <20060110083255.GG16938@mellanox.co.il> Message-ID: <43C3A905.7070404@voltaire.com> Michael >>> I am seeing EQ overruns in SDP stress tests: if CQ completion Michael >>> handler arms a CQ, this could generate more EQEs, Or >> I see that the mad, ipoib and srp CQ handlers work as follows: first - Or >> arm the CQ, second - poll the CQ in a loop till it is emtpy. Or >> does it means that completions occurring while the handler is running cause Or >> interrupts which could be saved? Michael > Handlers are running out of the interrupt context, so Michael > an interrupt would have to be generated in the window while Michael > CQ is being armed. Profiling I've done on ipoib shows that Michael > this is quite unlikely. Sorry, i dont follow. If you arm the CQ when there is one WC in it and then start polling, and during your polling a "second" WC is generated by the HCA, wouldn't an interrupt related to the 2nd completion be generated - why the case you are mentioning is the arming window? From mst at mellanox.co.il Tue Jan 10 04:39:18 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Jan 2006 14:39:18 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <43C3A905.7070404@voltaire.com> References: <43C3A905.7070404@voltaire.com> Message-ID: <20060110123917.GM16938@mellanox.co.il> Quoting Or Gerlitz : > Sorry, i dont follow. If you arm the CQ when there is one WC in it and > then start polling, and during your polling a "second" WC is generated > by the HCA, wouldn't an interrupt related to the 2nd completion be > generated - why the case you are mentioning is the arming window? I'm polling in the interrupt handler, so interrupts from the same IRQ are disabled. No? What am I missing? -- MST From ogerlitz at voltaire.com Tue Jan 10 04:41:09 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 10 Jan 2006 14:41:09 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <20060110123917.GM16938@mellanox.co.il> References: <43C3A905.7070404@voltaire.com> <20060110123917.GM16938@mellanox.co.il> Message-ID: <43C3AB65.5070008@voltaire.com> Michael S. Tsirkin wrote: >> Sorry, i dont follow. If you arm the CQ when there is one WC in it and >> then start polling, and during your polling a "second" WC is generated >> by the HCA, wouldn't an interrupt related to the 2nd completion be >> generated - why the case you are mentioning is the arming window? > I'm polling in the interrupt handler, so interrupts from the same IRQ are > disabled. No? What am I missing? OK, I may be something of a newbie in this land (working in hard IRQ context). Does disabling the HCA IRQ mean that no interrupt will be generated later, when the handler is done? I was thinking it would just be deferred. The iSER hard IRQ CQ handler just does a context jump to a soft IRQ handler (tasklet), so the rule you mention does not apply to it. From mst at mellanox.co.il Tue Jan 10 04:56:21 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 10 Jan 2006 14:56:21 +0200 Subject: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun In-Reply-To: <43C3AB65.5070008@voltaire.com> References: <43C3AB65.5070008@voltaire.com> Message-ID: <20060110125621.GN16938@mellanox.co.il> Quoting Or Gerlitz : > Subject: Re: [openib-general] [PATCH] mthca: eq doorbell coalescing + prevent even queque overrun > > Michael S. Tsirkin wrote: > >> Sorry, i dont follow. If you arm the CQ when there is one WC in it and > >> then start polling, and during your polling a "second" WC is generated > >> by the HCA, wouldn't an interrupt related to the 2nd completion be > >> generated - why the case you are mentioning is the arming window? > > > I'm polling in the interrupt handler, so interrupts from the same IRQ are > > disabled. No? What am I missing? > > OK, i might be somehow newbee around this (working in hard irq context) > land. Does disabling the HCA IRQ means no interrupt would be generated > later when the handler is done? i was think it would be just deffered. AFAIK, you get another interrupt only if hardware continues asserting the interrupt. > iSER hard irq CQ handler just does a context jump to soft irq handler > (tasklet) so the rule you mention does not apply to it. > -- MST From mst at mellanox.co.il Tue Jan 10 05:23:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 15:23:11 +0200 Subject: [openib-general] [PATCH applied] perftest: fix device lookup by name Message-ID: <20060110132311.GP16938@mellanox.co.il> The following is already applied on trunk. --- Fix device lookup by name. Signed-off-by: Michael S. 
Tsirkin Index: openib/src/userspace/perftest/rdma_bw.c =================================================================== --- openib.orig/src/userspace/perftest/rdma_bw.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/rdma_bw.c 2006-01-10 15:15:49.000000000 +0200 @@ -594,7 +594,7 @@ int main(int argc, char *argv[]) return 1; } } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { Index: openib/src/userspace/perftest/read_bw.c =================================================================== --- openib.orig/src/userspace/perftest/read_bw.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/read_bw.c 2006-01-10 15:17:02.000000000 +0200 @@ -688,7 +688,7 @@ int main(int argc, char *argv[]) return 1; } } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { Index: openib/src/userspace/perftest/read_lat.c =================================================================== --- openib.orig/src/userspace/perftest/read_lat.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/read_lat.c 2006-01-10 15:16:16.000000000 +0200 @@ -124,7 +124,7 @@ static struct ibv_device *pp_find_dev(co if (!ib_dev) fprintf(stderr, "No IB devices found\n"); } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) Index: openib/src/userspace/perftest/send_bw.c =================================================================== --- openib.orig/src/userspace/perftest/send_bw.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/send_bw.c 2006-01-10 15:16:21.000000000 +0200 @@ -965,7 +965,7 @@ int main(int argc, char *argv[]) return 1; } } else { - for (ib_dev = 
*dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { Index: openib/src/userspace/perftest/send_lat.c =================================================================== --- openib.orig/src/userspace/perftest/send_lat.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/send_lat.c 2006-01-10 15:16:30.000000000 +0200 @@ -132,7 +132,7 @@ static struct ibv_device *pp_find_dev(co if (!ib_dev) fprintf(stderr, "No IB devices found\n"); } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) Index: openib/src/userspace/perftest/write_bw.c =================================================================== --- openib.orig/src/userspace/perftest/write_bw.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/write_bw.c 2006-01-10 15:17:02.000000000 +0200 @@ -740,7 +740,7 @@ int main(int argc, char *argv[]) return 1; } } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { Index: openib/src/userspace/perftest/write_lat.c =================================================================== --- openib.orig/src/userspace/perftest/write_lat.c 2006-01-10 15:02:10.000000000 +0200 +++ openib/src/userspace/perftest/write_lat.c 2006-01-10 15:17:02.000000000 +0200 @@ -122,7 +122,7 @@ static struct ibv_device *pp_find_dev(co if (!ib_dev) fprintf(stderr, "No IB devices found\n"); } else { - for (ib_dev = *dev_list; ib_dev; ++dev_list) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) -- MST From panda at cse.ohio-state.edu Tue Jan 10 06:24:13 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 10 Jan 2006 09:24:13 -0500 (EST) Subject: 
[openib-general] Re: We are ready with a Gen2 version of MVAPICH2 In-Reply-To: from "Bob Woodruff" at Jan 09, 2006 09:57:49 AM Message-ID: <200601101424.k0AEOD2j024385@xi.cse.ohio-state.edu> Hi Woody, > DK wrote, > >We plan to upload a stripped down version of this new release to the > >OpenIB SVN at the following location: > > >https://openib.org/svn/gen2/trunk/src/userspace/mpi/ > > Sounds good to me. Thanks a lot for your feedback. We are working on putting the new version at the above location. Thanks, DK > woody > > From ianjiang.ict at gmail.com Tue Jan 10 06:31:54 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 10 Jan 2006 22:31:54 +0800 Subject: [openib-general] Error when loading ib_umad Message-ID: <7b2fa1820601100631q46ea5b30x2fbceedf0016bdbd@mail.gmail.com> I am using the linux kernel 2.6.14 and the latest IB driver in OpenIB. Is this caused by SMP? Any suggestion is appreciated! Here are the error details: dell-162:~ # lsmod | grep ib ib_mthca 120096 0 ib_mad 38820 1 ib_mthca ib_core 51200 2 ib_mthca,ib_mad libata 54032 1 ata_piix dell-162:~ # modprobe ib_umad Killed dell-162:~ # Message from syslogd at dell-162 at Wed Jan 11 06:08:42 2006 ... dell-162 kernel: Oops: 0000 [1] SMP Message from syslogd at dell-162 at Wed Jan 11 06:08:42 2006 ... 
dell-162 kernel: CR2: 000000000e70010c dell-162:~# tail /var/log/messages Jan 11 05:25:13 dell-162 kernel: Unable to handle kernel paging request at 000000000e70010c RIP: Jan 11 05:25:13 dell-162 kernel: {kref_get+1} Jan 11 05:25:13 dell-162 kernel: PGD 1316a067 PUD 1f81e067 PMD 0 Jan 11 05:25:13 dell-162 kernel: Oops: 0000 [1] SMP Jan 11 05:25:13 dell-162 kernel: CPU 1 Jan 11 05:25:13 dell-162 kernel: Modules linked in: ib_umad evdev joydev sg sr_mod floppy thermal processor fan button battery ac ib_mthca ib_mad ib_core ehci_hcd uhci_hcd ipv6 i2c_i801 i2c_core hw_random e1000 usbcore dm_mod ext3 jbd ata_piix libata Jan 11 05:25:13 dell-162 kernel: Pid: 12113, comm: modprobe Not tainted 2.6.14 #1 Jan 11 05:25:13 dell-162 kernel: RIP: 0010:[] {kref_get+1} Jan 11 05:25:13 dell-162 kernel: RSP: 0000:ffff81001ad53b68 EFLAGS: 00010206 Jan 11 05:25:13 dell-162 kernel: RAX: ffff810010503aa0 RBX: 000000000e7000f0 RCX: 0000000000000000 Jan 11 05:25:13 dell-162 kernel: RDX: ffff810010503aa0 RSI: ffffffff8039d0db RDI: 000000000e70010c Jan 11 05:25:13 dell-162 kernel: RBP: ffffffff8039d0d4 R08: ffff8100017f1d50 R09: 0000000000000000 Jan 11 05:25:13 dell-162 kernel: R10: ffff810013f69300 R11: 0000000000000048 R12: ffff81000bcbc318 Jan 11 05:25:13 dell-162 kernel: R13: 00000000fffffff4 R14: ffff810013f69300 R15: 000000000e7000f0 Jan 11 05:25:13 dell-162 kernel: FS: 00002aaaaade36e0(0000) GS:ffffffff804eb880(0000) knlGS:0000000000000000 Jan 11 05:25:13 dell-162 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 11 05:25:13 dell-162 kernel: CR2: 000000000e70010c CR3: 000000001c6e6000 CR4: 00000000000006e0 Jan 11 05:25:13 dell-162 kernel: Process modprobe (pid: 12113, threadinfo ffff81001ad52000, task ffff81001aaf5040) Jan 11 05:25:13 dell-162 kernel: Stack: 000000000e7000f0 ffffffff80214f62 ffff810010503ac0 ffffffff801c5a11 Jan 11 05:25:13 dell-162 kernel: ffff81001c49c190 ffff81001c49c180 ffff81001c49c180 ffff8100193f1200 Jan 11 05:25:13 dell-162 kernel: 
0000000000000000 ffff8100193f1200
Jan 11 05:25:13 dell-162 kernel: Call Trace:{kobject_get+18} {sysfs_create_link+193}
Jan 11 05:25:13 dell-162 kernel: {class_device_add+436} {class_device_create+276}
Jan 11 05:25:13 dell-162 kernel: {d_instantiate+136} {dput+33}
Jan 11 05:25:13 dell-162 kernel: {create_dir+405} {kobj_map+102}
Jan 11 05:25:13 dell-162 kernel: {exact_lock+0} {exact_match+0}
Jan 11 05:25:13 dell-162 kernel: {:ib_umad:ib_umad_add_one+432} {:ib_core:ib_register_client+124}
Jan 11 05:25:13 dell-162 kernel: {:ib_umad:ib_umad_init+144} {sys_init_module+6553}
Jan 11 05:25:13 dell-162 kernel: {:ib_umad:ib_umad_init+0} {__up_write+49}
Jan 11 05:25:13 dell-162 kernel: {system_call+126}
Jan 11 05:25:13 dell-162 kernel:
Jan 11 05:25:13 dell-162 kernel: Code: 8b 07 48 89 fb 85 c0 75 26 b9 20 00 00 00 48 c7 c2 bb c9 39
Jan 11 05:25:13 dell-162 kernel: RIP {kref_get+1} RSP
Jan 11 05:25:13 dell-162 kernel: CR2: 000000000e70010c
Jan 11 05:34:19 dell-162 sshd[12947]: Accepted keyboard-interactive/pam for root from ::ffff:192.168.1.61 port 3821 ssh2

--
Ian Jiang
ianjiang.ict at gmail.com

Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences

From oferg at mellanox.co.il Tue Jan 10 06:51:47 2006
From: oferg at mellanox.co.il (Ofer Gigi)
Date: Tue, 10 Jan 2006 16:51:47 +0200
Subject: [openib-general] [PATCH] osm: support for trivial PKey manager
Message-ID: <07mzi4yw3g.fsf@swlab25.yok.mtl.com>

Hi Hal,

Removing redundant error in the log.
If the physical port is not valid, nothing needs to be done here.

Thanks

Ofer G.
Signed-off-by: Ofer Gigi

Index: osm_pkey_mgr.c
===================================================================
--- osm_pkey_mgr.c (revision 4867)
+++ osm_pkey_mgr.c (working copy)
@@ -171,7 +171,7 @@ __osm_pkey_mgr_process_physical_port(
   {
     osm_log( p_mgr->p_log, OSM_LOG_ERROR,
              "__osm_pkey_mgr_process_physical_port: ERR 0501: "
-             "No empty block entry was found to insert IB_DEFAULT_PKEY for node "
+             "No empty entry was found to insert IB_DEFAULT_PKEY for node "
              "0x%016" PRIx64 " and port %u\n",
              cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num );
   }
@@ -275,15 +275,6 @@ osm_pkey_mgr_process(
           result = OSM_SIGNAL_DONE_PENDING;
         }
       }
-      else
-      {
-        osm_log( p_mgr->p_log, OSM_LOG_ERROR,
-                 "osm_pkey_mgr_process: ERR 0502: "
-                 "Invalid physical port for node 0x%016" PRIx64
-                 " port %u\n",
-                 cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-                 port_num );
-      }
     }
   }

From rdreier at cisco.com Tue Jan 10 07:07:44 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jan 2006 07:07:44 -0800
Subject: [openib-general] Error when loading ib_umad
In-Reply-To: <7b2fa1820601100631q46ea5b30x2fbceedf0016bdbd@mail.gmail.com> (Ian Jiang's message of "Tue, 10 Jan 2006 22:31:54 +0800")
References: <7b2fa1820601100631q46ea5b30x2fbceedf0016bdbd@mail.gmail.com>
Message-ID:

Are you ignoring compile warnings about class_device_create()?

Since 2.6.15 is out, the OpenIB svn does not support 2.6.14 any more,
so you may run into problems like this. You will probably need to
restore the compatibility hack removed in r4784 by adding something
like

#include
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15)
#define class_device_create(cls, parent, devt, device, fmt, arg...) \
	class_device_create(cls, devt, device, fmt, ## arg)
#endif

to .

 - R.
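
The compatibility hack in the message above relies on a function-like
macro that lets callers keep the newer call signature while the extra
argument is dropped when building against an older API. The following is
only a minimal userspace sketch of that idea, not the code removed in
r4784 and not kernel code. Every demo_* name and the DEMO_VERSION values
are hypothetical, and the sketch uses standard C99 __VA_ARGS__ where the
mail uses the GNU "arg..." form.

```c
/*
 * Userspace sketch (NOT kernel code): a version-gated macro adapts calls
 * written for a newer signature (with a "parent" argument) to an older
 * function that does not take it. All names here are hypothetical.
 */
#include <stdarg.h>
#include <stdio.h>

/* Stand-ins for a version check; the values are chosen for the demo. */
#define DEMO_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#define DEMO_VERSION_CODE DEMO_VERSION(2, 6, 14)

/* "Old" API: no parent argument. Formats a device name into out. */
static int demo_device_create(char *out, size_t len, int devt,
                              const char *fmt, ...)
{
    va_list ap;
    int n;

    (void)devt; /* unused in this sketch */
    va_start(ap, fmt);
    n = vsnprintf(out, len, fmt, ap);
    va_end(ap);
    return n;
}

#if DEMO_VERSION_CODE < DEMO_VERSION(2, 6, 15)
/*
 * Callers pass (out, len, parent, devt, fmt, ...); the macro forwards
 * everything except parent. A macro is not re-expanded inside its own
 * expansion, so the inner name refers to the real function.
 */
#define demo_device_create(out, len, parent, devt, ...) \
    demo_device_create(out, len, devt, __VA_ARGS__)
#endif

/* Example caller written against the newer signature. */
static int demo_make_name(char *out, size_t len, int index)
{
    return demo_device_create(out, len, NULL, 0, "umad%d", index);
}
```

With DEMO_VERSION_CODE set as above the macro is active, so
demo_make_name() compiles against the older function even though it
passes a parent argument. Folding the format string into the variadic
part keeps the macro valid in standard C when no extra arguments follow
the format.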
From rdreier at cisco.com Tue Jan 10 07:20:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 07:20:21 -0800 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: <43C3638C.70802@voltaire.com> (Or Gerlitz's message of "Tue, 10 Jan 2006 09:34:36 +0200") References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> <43C2F88A.7000607@ichips.intel.com> <43C3638C.70802@voltaire.com> Message-ID: Or> Just to make sure, would the __be64 node_guid field of struct Or> ib_device have the exact semantics of the __be64 node_guid Or> field of struct ib_device_attr ? iser uses it from the attr Or> struct and we can sure move to use it from the device struct. Yes, that's right. Something like the patch below (compile tested only) is what is required. - R. --- Move iSER from getting node_guid via ib_query_device() to using the node_guid field in struct ib_device, since ib_query_device() will stop returning the node_guid soon. Signed-off-by: Roland Dreier --- infiniband/ulp/iser/iser_verbs.c (revision 4866) +++ infiniband/ulp/iser/iser_verbs.c (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -73,8 +73,6 @@ int iser_create_adaptor_ib_res(struct is struct ib_device *device = p_iser_adaptor->device; struct ib_fmr_pool_param params; - ib_query_device(device, &(p_iser_adaptor->device_attr)); - strcpy(p_iser_adaptor->name, device->name); iser_dbg("setting device name %s as adatptor name\n", device->name); @@ -234,23 +232,16 @@ int iser_free_qp_and_id(struct iser_conn struct iser_adaptor *iser_adaptor_find_by_device(struct rdma_cm_id *cma_id) { - struct ib_device_attr *p_device_attr = NULL; struct list_head *p_list; struct iser_adaptor *p_adaptor = NULL; - p_device_attr = kmalloc(sizeof *p_device_attr, GFP_KERNEL); - if(p_device_attr == NULL) - goto end; - - ib_query_device(cma_id->device, p_device_attr); - spin_lock(&ig.adaptor_list_lock); p_list = ig.adaptor_list.next; while (p_list != &ig.adaptor_list) { p_adaptor = list_entry(p_list, struct iser_adaptor, ig_list); /* find if there's a match using the device GUID */ - if (p_adaptor->device_attr.node_guid == p_device_attr->node_guid) + if (p_adaptor->device->node_guid == cma_id->device->node_guid) break; } @@ -268,7 +259,6 @@ struct iser_adaptor *iser_adaptor_find_b } end: spin_unlock(&ig.adaptor_list_lock); - kfree(p_device_attr); return p_adaptor; } --- infiniband/ulp/iser/iser.h (revision 4866) +++ infiniband/ulp/iser/iser.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU
@@ -105,7 +105,6 @@ struct iser_adaptor {
 	struct list_head ig_list; /* entry in ig adaptors list */
 	struct ib_device *device;
-	struct ib_device_attr device_attr;
 	struct ib_pd *pd;
 	struct ib_cq *cq;

From halr at voltaire.com Tue Jan 10 07:36:29 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jan 2006 10:36:29 -0500
Subject: [openib-general] Re: [PATCH] osm: support for trivial PKey manager
In-Reply-To: <07mzi4yw3g.fsf@swlab25.yok.mtl.com>
References: <07mzi4yw3g.fsf@swlab25.yok.mtl.com>
Message-ID: <1136907195.4322.28.camel@hal.voltaire.com>

On Tue, 2006-01-10 at 09:51, Ofer Gigi wrote:
> Hi Hal,
>
> Removing redundant error in the log.
> If the physical port is not valid, nothing needs to be done here.

Thanks. Applied.

From ardavis at ichips.intel.com Tue Jan 10 09:20:26 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 10 Jan 2006 09:20:26 -0800
Subject: [openib-general] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
In-Reply-To:
References:
Message-ID: <43C3ECDA.8000405@ichips.intel.com>

Kanevsky, Arkady wrote:

>comments inline.
>
>>>
>>As mentioned on the con-call, there are two separate items to
>>consider while looking at the proposal. The first is the
>>ability to extend DAT for specific provider value-add and the
>>second is to validate the need for general atomic and
>>immediate data functionality in the basic set of API's for
>>all providers. I included atomics and immediate data as
>>examples since it is specific to one provider (IB), it
>>includes operations that require new ops, events, and event
>>data types, and it also provides a working model to validate
>>the extension model from request to completion events. I
>>would like to concentrate on getting consensus on the
>>extension proposal first if possible. Just try to think of
>>the actual operations as some opaque dat_ext_foobar_op().
>>
>>
>
>The thing that bothers me is that we already have several APIs
>that are transport specific. While some are possible to implement
>on other transports, the others, like Socket CM, cannot.
>So I view both of your specific extensions as transport specific
>and hence prefer to add them as normal APIs not extensions.
>
>
That would work for me.

>The secondary goal is that Provider can add extensions without requiring
>changes to DAT. These fall into 3 categories.
>1. New memory types including privileges and protection attributes.
>We can add "extension" entry to these structures. We need to check
>if this is sufficient. Think of shared memory for example.
>I am assuming no changes to PZ.
>2. New DTOs. The main issue is not DTOs but their completions and
>async errors. This is why Immediate data is better handled by
>incorporating into DAT spec while atomic can be handled by extensions.
>That is, completion will return "extension" and Consumer will do the
>secondary switch on the extension type.
>Extension should not impact backwards compatibility.
>We had not looked at errors. But assuming a simple model that async
>errors break connection and we can return "extension error" with
>extensions defining new reason. Again details need to be polished.
>3. new connection types or CM models... New connections seem to have
>little impact on existing API assuming that EP type can be extended.
>The new connection can even restrict which DTO they can handle.
>CM model is more problematic.
>
>
Nice summary. Yes, we need to be thorough when fleshing out all the
requirements for extensions in general. I am not sure how much I can
share at this point regarding any "other extensions" but if we think in
general terms we should cover all the necessary requirements. Do you
want to update the proposal based on your statements above? I would be
happy to work it into a real patch for feasibility and to provide
feedback based on future extensibility.
>Arlin, it would be nice to consider some of your other extensions that
>are not transport specific to see how it will fit before we make the
>final decision.
>This should give us idea how extensible DAT "extension" model is.
>
>
>
>
>>>
>>>In general, extension route was intended for RNIC|HCA providers to
>>>expose HW capabilities beyond IBTA, iWARP and VIA standards. The
>>>standard RDMA functionality is best handled via spec addition.
>>>DAT 2.0 does it for FMR, remote and local memory invalidation as well
>>>as others.
>>>
>>>
>>True, but the extension route is not fully defined,
>>documented, nor implemented. This is what I would like to
>>work on getting completed in time for 2.0 if possible.
>>
>>BTW: The existing implementation actually uses
>>dapl_provider->extension to store the hca_ptr but the
>>specification states that it is reserved for the providers
>>private use (8.2.1 in DAPL1.2 spec). This is why I had to
>>define another extension_func in the patch.
>>
>>
>>
>>>
>>>I had posted a complete list of changes/addition to DAT 2.0 about a
>>>month ago.
>>>But we had not discussed yet version change from 1.3 to 2.0 nor how
>>>much backwards compatibility spec will provide.
>>>
>>>2. What is IMMED_EVENT? is it just immediate data without any payload one?
>>>I suggest changing the name so it will not use "EVENT". Just call it
>>>NO_PAYLOAD.
>>>Do you want to support 2 different ways to deliver immediate data?
>>>One in event and one in data payload?
>>>Why? I would think that just an event way will do.
>>>
>>>
>>This was modeled after the immediate data discussions on the
>>DAT reflector based on iWARP requirements.
>>
>>http://groups.yahoo.com/group/dat-discussions/message/3285
>>
>>
>>
>
>I recall it now.
>I want to consider a few usage cases.
>1. Existing app running on the Provider with extensions.
>Want to make sure we do not require any App changes beyond recompile
>due to extensions.
>
>
agree

>2.
App wants to be modified to use Immediate data. How big impact it
>has on existing code. For example buffer size allocation and completion
>handling
>
>
It really depends on transport capabilities. Our current thinking has
two delivery mechanisms for the two transports (event and payload) which
is not optimal. If we can come to a consensus on delivering one way
(events) and simply add a new DTO post option it would reduce the
complexity considerably.

>for immediate data over existing connection.
>2a. Can application take advantage if it knows that Provider will return
>immediate data in event?
>
>
Yes, and it requires no additional buffer management on the consumers if
providers support this model.

>2b. Immediate data inline only?
>
>
That requires more buffer management on the application's part.

>3. Ditto for atomic operations over existing connection.
>
>
>
>>>
>>>3. I suggest beefing up DAT_DTO_COMPLETION_EVENT_DATA and DAT_DTOS to
>>>convey which operation completed and return Immediate data if complete
>>>operation had immediate data.
>>>Since we already modified these 2 struct as part of DAT 2.0 change
>>>let's add your proposal to the change.
>>>This will allow Consumers to use single approach to deal with
>>>completions, extension to the current one but not a structural one. No
>>>need for DAT_EXTENSION_DATA, DAT_EXT_EVENT_TYPE, DAT_EXT_OP nor the
>>>whole mechanism for extended ops.
>>>
>>>
>>You still need extension types for the "other" value-add
>>operations/events that will not be accepted as standard and
>>are vendor specific.
>>
>>I would like to defer the rest of the questions for now since
>>they touch on actual operations and not the extension
>>mechanism. Although, I do need to think about how to extend
>>memory registration privileges. Any suggestions?
>>
>>
>
>Going with your generic extension design
>we add "extension" entry to relevant data structures.
>And then outside DAT define the structure for its values which can be >extensible. This imply that adding extension by Provider >will force apps to be recompiled. I hope this is enough. >I am assuming that apps use values not position the structures. > > Yes, a new build against the extented definitions would suffice, along with some validation at load and open times. Can you and James send out an extension patch based on some of your ideas if they differ from the original patch. Or, like I said before, if you update the proposal I would be happy to work on a new patch. thanks, -arlin From ralphc at pathscale.com Tue Jan 10 10:31:18 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 10 Jan 2006 10:31:18 -0800 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. Message-ID: <1136917878.4520.53.camel@brick.internal.keyresearch.com> If opensm is started with no arguments, the default algorithm for finding a port to bind to will skip ports which are present but the link is DOWN. If there is only one port in the system, no port is selected and opensm tries the default HCA name "mthca0" which, if not present, confuses opensm and it exits. The following patch changes the port selection so that the first active port is selected, and if none, the first non-disabled port. Signed-off-by: Ralph Campbell Index: umad.c =================================================================== --- umad.c (revision 4900) +++ umad.c (working copy) @@ -207,9 +207,9 @@ } /* - * if *port > 0 checks ca[port] state. Otherwise set *port to + * if *port > 0, check ca[port] state. Otherwise set *port to * the first port that is active, and if such is not found, to - * the first port that is (physically) up. Otherwise return -1; + * the first port that is not disabled. 
Otherwise return -1; */ static int resolve_ca_port(char *ca_name, int *port) @@ -228,14 +228,14 @@ return 1; } - if (*port > 0) { /* user wants user gets */ + if (*port > 0) { /* check only the port the user wants */ if (*port > ca.numports) return -1; if (!ca.ports[*port]) return -1; if (ca.ports[*port]->state == 4) return 1; - if (ca.ports[*port]->phys_state == 5) + if (ca.ports[*port]->phys_state != 3) return 0; return -1; } @@ -244,7 +244,7 @@ DEBUG("checking port %d", i); if (!ca.ports[i]) continue; - if (up < 0 && ca.ports[i]->phys_state == 5) + if (up < 0 && ca.ports[i]->phys_state != 3) up = *port = i; if (ca.ports[i]->state == 4) { active = *port = i; @@ -278,10 +278,11 @@ return ca_name; } - /* find first existing HCA with Active port */ + /* Get the list of CA names. */ if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) return 0; + /* Find the first existing CA with an active port. */ for (caidx = 0; caidx < n; caidx++) { TRACE("checking ca '%s'", names[caidx]); -- Ralph Campbell From mst at mellanox.co.il Tue Jan 10 10:54:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 20:54:03 +0200 Subject: [openib-general] [PATCH] ipoib: flush workqueue after clearing ADMIN_UP Message-ID: <20060110185403.GA16696@mellanox.co.il> Flush workqueue after clearing IPOIB_FLAG_ADMIN_UP, to prevent a job running from the workqueue from bringing the device back up. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: gen2/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- gen2.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-10 10:33:32.101242000 +0200 +++ gen2/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-10 10:43:02.254713000 +0200 @@ -134,6 +134,8 @@ static int ipoib_stop(struct net_device netif_stop_queue(dev); + flush_workqueue(ipoib_workqueue); + ipoib_ib_dev_down(dev, 1); ipoib_ib_dev_stop(dev); -- MST From mst at mellanox.co.il Tue Jan 10 10:54:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 20:54:32 +0200 Subject: [openib-general] [PATCH] ipoib: mcast->ah race Message-ID: <20060110185432.GA16702@mellanox.co.il> ipoib_mcast_send tests mcast->ah twice. If this value is changed between these two points, we leak an skb. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4872) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -206,6 +206,7 @@ static int ipoib_mcast_join_finish(struc { struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; int ret; mcast->mcmember = *mcmember; @@ -262,6 +263,7 @@ static int ipoib_mcast_join_finish(struc av.static_rate, priv->local_rate, ib_sa_rate_enum_to_int(mcast->mcmember.rate)); + spin_lock_irqsave(&priv->lock, flags); mcast->ah = ipoib_create_ah(dev, priv->pd, &av); if (!mcast->ah) { ipoib_warn(priv, "ib_address_create failed\n"); @@ -273,6 +275,7 @@ static int ipoib_mcast_join_finish(struc be16_to_cpu(mcast->mcmember.mlid), mcast->mcmember.sl); } + spin_unlock_irqrestore(&priv->lock, flags); } /* actually send any queued packets */ -- MST From mst at mellanox.co.il Tue Jan 10 10:55:15 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Jan 2006 20:55:15 +0200 Subject: [openib-general] [PATCH] ipoib: tx ring overrun Message-ID: <20060110185515.GA16708@mellanox.co.il> Dont try to post more send work requests if the TX ring is full. Setting netif_stop_queue is insufficient: linux can still land a tx packet on us. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-10 15:38:36.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-10 18:49:33.000000000 +0200 @@ -344,6 +344,13 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). */ + if (unlikely(priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE)) { + ipoib_dbg(priv, "TX ring full, dropping packet\n"); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; tx_req->skb = skb; addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, -- MST From mshefty at ichips.intel.com Tue Jan 10 10:55:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 10 Jan 2006 10:55:36 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43BB1A0F.2080305@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> Message-ID: <43C40328.7060201@ichips.intel.com> Sean Hefty wrote: > To keep the design as flexible as possible, my plan is to implement the > cache in userspace. The interface to the cache would be via MADs. > Clients would send their queries to the sa_cache instead of the SA > itself. The format of the MADs would be essentially identical to those > used to query the SA itself. Response MADs would contain any requested > information. 
If the cache could not satisfy a request, the sa_cache
> would query the SA, update its cache, then return a reply.

What I think I really want is a distributed relational database
management system with an SQL interface and triggers that maintains the
SA data...

(select * from path_rec where sgid=x and dgid=y and pkey=z)

But without making any assumptions about the SA, a local cache could
still use an RDMS to store and retrieve the data records. Would
requiring an RDMS on each system be acceptable? If not, then writing a
small, dumb pseudo-database as part of the sa_cache could provide a lot
of flexibility.

- Sean

From iod00d at hp.com Tue Jan 10 11:04:54 2006
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 10 Jan 2006 11:04:54 -0800
Subject: [openib-general] [PATCH] ipoib: tx ring overrun
In-Reply-To: <20060110185515.GA16708@mellanox.co.il>
References: <20060110185515.GA16708@mellanox.co.il>
Message-ID: <20060110190454.GC13156@esmail.cup.hp.com>

On Tue, Jan 10, 2006 at 08:55:15PM +0200, Michael S. Tsirkin wrote:
> Dont try to post more send work requests if the TX ring is full.
> Setting netif_stop_queue is insufficient: linux can still land
> a tx packet on us.
>
> Signed-off-by: Michael S. Tsirkin
>
> Index: openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-10 15:38:36.000000000 +0200
> +++ openib/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2006-01-10 18:49:33.000000000 +0200
> @@ -344,6 +344,13 @@ void ipoib_send(struct net_device *dev,
>  	 * means we have to make sure everything is properly recorded and
>  	 * our state is consistent before we call post_send().
>  	 */
> +	if (unlikely(priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE)) {
> +		ipoib_dbg(priv, "TX ring full, dropping packet\n");
> +		++priv->stats.tx_errors;

Could this be tx_dropped?
I'm looking at ifconfig output and assuming tx_dropped is used:

grundler at gsyprf3:~$ /sbin/ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.0.0.51  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:1020109972 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1549932074 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:160480185298 (149.4 GiB)  TX bytes:1854376582766 (1.6 TiB)

grant

> +		dev_kfree_skb_any(skb);
> +		return;
> +	}
> +
>  	tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)];
>  	tx_req->skb = skb;
>  	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
>
> --
> MST

From mst at mellanox.co.il Tue Jan 10 11:15:02 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jan 2006 21:15:02 +0200
Subject: [openib-general] Re: SA cache design
In-Reply-To: <43C40328.7060201@ichips.intel.com>
References: <43C40328.7060201@ichips.intel.com>
Message-ID: <20060110191501.GB16913@mellanox.co.il>

Quoting Sean Hefty :
> > To keep the design as flexible as possible, my plan is to implement the
> > cache in userspace. The interface to the cache would be via MADs.
> > Clients would send their queries to the sa_cache instead of the SA
> > itself. The format of the MADs would be essentially identical to those
> > used to query the SA itself. Response MADs would contain any requested
> > information. If the cache could not satisfy a request, the sa_cache
> > would query the SA, update its cache, then return a reply.

Bouncing queries required for e.g. IPoIB or SDP
to userspace seems like a deadlock scenario in case userspace
is located e.g. on an NFS share.
--
MST

From mshefty at ichips.intel.com Tue Jan 10 11:14:44 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 10 Jan 2006 11:14:44 -0800
Subject: [openib-general] Re: SA cache design
In-Reply-To: <20060110191501.GB16913@mellanox.co.il>
References: <43C40328.7060201@ichips.intel.com> <20060110191501.GB16913@mellanox.co.il>
Message-ID: <43C407A4.5090008@ichips.intel.com>

Michael S. Tsirkin wrote:
>>>To keep the design as flexible as possible, my plan is to implement the
>>>cache in userspace. The interface to the cache would be via MADs.
>>>Clients would send their queries to the sa_cache instead of the SA
>>>itself. The format of the MADs would be essentially identical to those
>>>used to query the SA itself. Response MADs would contain any requested
>>>information. If the cache could not satisfy a request, the sa_cache
>>>would query the SA, update its cache, then return a reply.
>
>
> Bouncing queries required for e.g. IPoIB or SDP
> to userspace seems like a deadlock scenario in case userspace
> is located e.g. on an NFS share.

The SA itself is implemented in userspace. I don't see any issue here.

- Sean

From mst at mellanox.co.il Tue Jan 10 11:28:40 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jan 2006 21:28:40 +0200
Subject: [openib-general] Re: SA cache design
In-Reply-To: <43C407A4.5090008@ichips.intel.com>
References: <43C407A4.5090008@ichips.intel.com>
Message-ID: <20060110192840.GC16913@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: SA cache design
>
> Michael S. Tsirkin wrote:
> >>>To keep the design as flexible as possible, my plan is to implement the
> >>>cache in userspace. The interface to the cache would be via MADs.
> >>>Clients would send their queries to the sa_cache instead of the SA
> >>>itself. The format of the MADs would be essentially identical to those
> >>>used to query the SA itself. Response MADs would contain any requested
> >>>information.
If the cache could not satisfy a request, the sa_cache
> >>>would query the SA, update its cache, then return a reply.
> >
> >
> > Bouncing queries required for e.g. IPoIB or SDP
> > to userspace seems like a deadlock scenario in case userspace
> > is located e.g. on an NFS share.
>
> The SA itself is implemented in userspace. I don't see any issue here.
>
> - Sean

Sure, but it's not running on the local (possibly diskless) node.

--
MST

From mst at mellanox.co.il Tue Jan 10 11:29:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jan 2006 21:29:57 +0200
Subject: [openib-general] [PATCH] ipoib: tx ring overrun
In-Reply-To: <20060110190454.GC13156@esmail.cup.hp.com>
References: <20060110190454.GC13156@esmail.cup.hp.com>
Message-ID: <20060110192957.GD16913@mellanox.co.il>

Quoting Grant Grundler :
> On Tue, Jan 10, 2006 at 08:55:15PM +0200, Michael S. Tsirkin wrote:
> > Dont try to post more send work requests if the TX ring is full.
> > Setting netif_stop_queue is insufficient: linux can still land
> > a tx packet on us.
...
> > +	if (unlikely(priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE)) {
> > +		ipoib_dbg(priv, "TX ring full, dropping packet\n");
> > +		++priv->stats.tx_errors;
>
> Could this be tx_dropped?

I don't know. Roland?

--
MST

From rdreier at cisco.com Tue Jan 10 11:30:43 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jan 2006 11:30:43 -0800
Subject: [openib-general] Re: [PATCH] ipoib: tx ring overrun
In-Reply-To: <20060110185515.GA16708@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jan 2006 20:55:15 +0200")
References: <20060110185515.GA16708@mellanox.co.il>
Message-ID:

    Michael> Dont try to post more send work requests if the TX ring
    Michael> is full. Setting netif_stop_queue is insufficient: linux
    Michael> can still land a tx packet on us.

I'm confused -- is the code in ipoib_start_xmit()

	/*
	 * Check if our queue is stopped.
Since we have the LLTX bit * set, we can't rely on netif_stop_queue() preventing our * xmit function from being called with a full queue. */ if (unlikely(netif_queue_stopped(dev))) { spin_unlock_irqrestore(&priv->tx_lock, flags); return NETDEV_TX_BUSY; } not enough to prevent us from trying to queue a TX packet after stopping the queue? BTW, I've lost track of the pending IPoIB patches a little bit. I have a lot of patches queued for review, and I'm not sure which have been replaced by new versions, which are critical, etc. Could you send me a list of which patches still need to be applied, and which ones fix problems you hit in testing (vs cosmetic changes, memory leaks and so on). - R. From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 1/7] IB/mthca: fix page shift calculation in mthca_reg_phys_mr() Message-ID: <1136921483290-436a68c58a7111c6@cisco.com> For all pages except possibly the last one, the byte beyond the buffer end must be page aligned. Therefore, when computing the page shift to use, OR the end addresses of the buffers as well as the start addresses into the mask we check. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_provider.c | 18 +++++++----------- 1 files changed, 7 insertions(+), 11 deletions(-) 6627fa662e86c400284b64c13661fdf6bff05983 diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 4cc7e28..30b67c2 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -783,24 +783,20 @@ static struct ib_mr *mthca_reg_phys_mr(s if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) return ERR_PTR(-EINVAL); - if (num_phys_buf > 1 && - ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) - return ERR_PTR(-EINVAL); - mask = 0; total_size = 0; for (i = 0; i < num_phys_buf; ++i) { - if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) - return ERR_PTR(-EINVAL); - if (i != 0 && i != num_phys_buf - 1 && - (buffer_list[i].size & ~PAGE_MASK)) - return ERR_PTR(-EINVAL); + if (i != 0) + mask |= buffer_list[i].addr; + if (i != num_phys_buf - 1) + mask |= buffer_list[i].addr + buffer_list[i].size; total_size += buffer_list[i].size; - if (i > 0) - mask |= buffer_list[i].addr; } + if (mask & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + /* Find largest page shift we can use to cover buffers */ for (shift = PAGE_SHIFT; shift < 31; ++shift) if (num_phys_buf > 1) { -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 2/7] IB/mthca: prevent event queue overrun In-Reply-To: <1136921483290-436a68c58a7111c6@cisco.com> Message-ID: <1136921483290-79d0774f48f3f587@cisco.com> I am seeing EQ overruns in SDP stress tests: if the CQ completion handler arms a CQ, this could generate more EQEs, so that EQ will never get empty and consumer index will never get updated. This is similiar to what we have with command interface: /* * cmd_event() may add more commands. 
* The card will think the queue has overflowed if * we don't tell it we've been processing events. */ However, for completion events, we *don't* want to update the consumer index on each event. So, perform EQ doorbell coalescing: allocate EQs with some spare EQEs, and update once we run out of them. The value 0x80 was selected to avoid any performance impact. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_eq.c | 28 +++++++++++++++------------- 1 files changed, 15 insertions(+), 13 deletions(-) 92898522e3ee1a0ba54140aad1974d9e868f74ae diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index e8a948f..2eabb27 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -45,6 +45,7 @@ enum { MTHCA_NUM_ASYNC_EQE = 0x80, MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_NUM_SPARE_EQE = 0x80, MTHCA_EQ_ENTRY_SIZE = 0x20 }; @@ -277,11 +278,10 @@ static int mthca_eq_int(struct mthca_dev { struct mthca_eqe *eqe; int disarm_cqn; - int eqes_found = 0; + int eqes_found = 0; + int set_ci = 0; while ((eqe = next_eqe_sw(eq))) { - int set_ci = 0; - /* * Make sure we read EQ entry contents after we've * checked the ownership bit. @@ -345,12 +345,6 @@ static int mthca_eq_int(struct mthca_dev be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - /* - * cmd_event() may add more commands. - * The card will think the queue has overflowed if - * we don't tell it we've been processing events. - */ - set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -385,8 +379,16 @@ static int mthca_eq_int(struct mthca_dev set_eqe_hw(eqe); ++eq->cons_index; eqes_found = 1; + ++set_ci; - if (unlikely(set_ci)) { + /* + * The HCA will think the queue has overflowed if we + * don't tell it we've been processing events. 
We + * create our EQs with MTHCA_NUM_SPARE_EQE extra + * entries, so we must update our consumer index at + * least that often. + */ + if (unlikely(set_ci >= MTHCA_NUM_SPARE_EQE)) { /* * Conditional on hca_type is OK here because * this is a rare case, not the fast path. @@ -862,19 +864,19 @@ int __devinit mthca_init_eq_table(struct intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? 128 : dev->eq_table.inta_pin; - err = mthca_create_eq(dev, dev->limits.num_cqs, + err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, &dev->eq_table.eq[MTHCA_EQ_COMP]); if (err) goto err_out_unmap; - err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); if (err) goto err_out_comp; - err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, &dev->eq_table.eq[MTHCA_EQ_CMD]); if (err) -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 4/7] IB/mthca: Factor common MAD initialization code In-Reply-To: <1136921483290-3d1a8ae2f0b61cbf@cisco.com> Message-ID: <1136921483290-850b093ba9fe8fda@cisco.com> Factor out common code for initializing MAD packets, which is shared by many query routines in mthca_provider.c, into init_query_mad(). 
add/remove: 1/0 grow/shrink: 0/4 up/down: 16/-44 (-28) function old new delta init_query_mad - 16 +16 mthca_query_port 521 518 -3 mthca_query_pkey 301 294 -7 mthca_query_device 648 641 -7 mthca_query_gid 453 426 -27 Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_provider.c | 52 +++++++++++--------------- 1 files changed, 22 insertions(+), 30 deletions(-) 87635b71b544563f29050a9cecaa12b5d2a3e34a diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 0ae27fa..4887577 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -45,6 +45,14 @@ #include "mthca_user.h" #include "mthca_memfree.h" +static void init_query_mad(struct ib_smp *mad) +{ + mad->base_version = 1; + mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + mad->class_version = 1; + mad->method = IB_MGMT_METHOD_GET; +} + static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { @@ -64,11 +72,8 @@ static int mthca_query_device(struct ib_ props->fw_ver = mdev->fw_ver; - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(mdev, 1, 1, 1, NULL, NULL, in_mad, out_mad, @@ -134,12 +139,9 @@ static int mthca_query_port(struct ib_de memset(props, 0, sizeof *props); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -223,12 +225,9 @@ static int mthca_query_pkey(struct ib_de 
if (!in_mad || !out_mad) goto out; - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; - in_mad->attr_mod = cpu_to_be32(index / 32); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -261,12 +260,9 @@ static int mthca_query_gid(struct ib_dev if (!in_mad || !out_mad) goto out; - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -280,13 +276,9 @@ static int mthca_query_gid(struct ib_dev memcpy(gid->raw, out_mad->data + 8, 8); - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; - in_mad->attr_mod = cpu_to_be32(index / 8); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 6/7] IPoIB: Fix error path in ipoib_mcast_dev_flush() In-Reply-To: <1136921483291-1d87adb85e116682@cisco.com> Message-ID: <1136921483291-3de733b4b68e8e4a@cisco.com> Don't leak memory on allocation failure for broadcast mcast group. 
Also, print a warning to match handling for other mcast groups. Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) 70b4c8cdc168bb5d18e23fd205c4ede1b756a8b2 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ed0c2ea..6c6db75 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -780,9 +780,11 @@ void ipoib_mcast_dev_flush(struct net_de &priv->multicast_tree); list_add_tail(&priv->broadcast->list, &remove_list); - } - - priv->broadcast = nmcast; + priv->broadcast = nmcast; + } else + ipoib_warn(priv, "could not reallocate broadcast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(priv->broadcast->mcmember.mgid)); } spin_unlock_irqrestore(&priv->lock, flags); -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 7/7] IPoIB: Fix address handle refcounting for multicast groups In-Reply-To: <1136921483291-3de733b4b68e8e4a@cisco.com> Message-ID: <1136921483291-0b87fc4bec2544c5@cisco.com> Multiple ipoib_neigh structures on mcast->neigh_list may point to the same ah. This means that ipoib_mcast_free() can't just make a list of ah structs to free, since this might end up trying to add the same ah to the list more than once. Handle this in ipoib_multicast.c in the same way as it is handled in ipoib_main.c for struct ipoib_path. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 13 +++++++------ 1 files changed, 7 insertions(+), 6 deletions(-) 97460df37ea3335ca11562568932c9f9facfecdb diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 6c6db75..03b2ca6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -97,8 +97,6 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; - LIST_HEAD(ah_list); - struct ipoib_ah *ah, *tah; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -107,8 +105,14 @@ static void ipoib_mcast_free(struct ipoi spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + /* + * It's safe to call ipoib_put_ah() inside priv->lock + * here, because we know that mcast->ah will always + * hold one more reference, so ipoib_put_ah() will + * never do more than decrement the ref count. + */ if (neigh->ah) - list_add_tail(&neigh->ah->list, &ah_list); + ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; neigh->neighbour->ops->destructor = NULL; kfree(neigh); @@ -116,9 +120,6 @@ static void ipoib_mcast_free(struct ipoi spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry_safe(ah, tah, &ah_list, list) - ipoib_put_ah(ah); - if (mcast->ah) ipoib_put_ah(mcast->ah); -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 5/7] IB: Add node_guid to struct ib_device In-Reply-To: <1136921483290-850b093ba9fe8fda@cisco.com> Message-ID: <1136921483291-1d87adb85e116682@cisco.com> Add a node_guid field to struct ib_device. 
It is the responsibility of the low-level driver to initialize this field before registering a device with the midlayer. Convert everyone to looking at this field instead of calling ib_query_device() when all they want is the node GUID, and remove the node_guid field from struct ib_device_attr. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- drivers/infiniband/core/cm.c | 29 +++---------------- drivers/infiniband/core/sysfs.c | 22 +++----------- drivers/infiniband/core/uverbs_cmd.c | 2 + drivers/infiniband/hw/mthca/mthca_provider.c | 40 +++++++++++++++++++++++++- drivers/infiniband/ulp/srp/ib_srp.c | 23 +++------------ include/rdma/ib_verbs.h | 2 + 6 files changed, 54 insertions(+), 64 deletions(-) cf311cd49a78f1e431787068cc31d29d06a415e6 diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 3a611fe..c06b181 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3163,22 +3163,6 @@ int ib_cm_init_qp_attr(struct ib_cm_id * } EXPORT_SYMBOL(ib_cm_init_qp_attr); -static __be64 cm_get_ca_guid(struct ib_device *device) -{ - struct ib_device_attr *device_attr; - __be64 guid; - int ret; - - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); - if (!device_attr) - return 0; - - ret = ib_query_device(device, device_attr); - guid = ret ? 
0 : device_attr->node_guid; - kfree(device_attr); - return guid; -} - static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; @@ -3200,9 +3184,7 @@ static void cm_add_one(struct ib_device return; cm_dev->device = device; - cm_dev->ca_guid = cm_get_ca_guid(device); - if (!cm_dev->ca_guid) - goto error1; + cm_dev->ca_guid = device->node_guid; set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= device->phys_port_cnt; i++) { @@ -3217,11 +3199,11 @@ static void cm_add_one(struct ib_device cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error2; + goto error1; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error3; + goto error2; } ib_set_client_data(device, &cm_client, cm_dev); @@ -3230,9 +3212,9 @@ static void cm_add_one(struct ib_device write_unlock_irqrestore(&cm.device_lock, flags); return; -error3: - ib_unregister_mad_agent(port->mad_agent); error2: + ib_unregister_mad_agent(port->mad_agent); +error1: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { @@ -3240,7 +3222,6 @@ error2: ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } -error1: kfree(cm_dev); } diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 1f1743c..5982d68 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -445,13 +445,7 @@ static int ib_device_uevent(struct class return -ENOMEM; /* - * It might be nice to pass the node GUID with the event, but - * right now the only way to get it is to query the device - * provider, and this can crash during device removal because - * we are will be running after driver removal has started. - * We could add a node_guid field to struct ib_device, or we - * could just let userspace read the node GUID from sysfs when - * devices are added. + * It would be nice to pass the node GUID with the event... 
*/ envp[i] = NULL; @@ -623,21 +617,15 @@ static ssize_t show_sys_image_guid(struc static ssize_t show_node_guid(struct class_device *cdev, char *buf) { struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); - struct ib_device_attr attr; - ssize_t ret; if (!ibdev_is_alive(dev)) return -ENODEV; - ret = ib_query_device(dev, &attr); - if (ret) - return ret; - return sprintf(buf, "%04x:%04x:%04x:%04x\n", - be16_to_cpu(((__be16 *) &attr.node_guid)[0]), - be16_to_cpu(((__be16 *) &attr.node_guid)[1]), - be16_to_cpu(((__be16 *) &attr.node_guid)[2]), - be16_to_cpu(((__be16 *) &attr.node_guid)[3])); + be16_to_cpu(((__be16 *) &dev->node_guid)[0]), + be16_to_cpu(((__be16 *) &dev->node_guid)[1]), + be16_to_cpu(((__be16 *) &dev->node_guid)[2]), + be16_to_cpu(((__be16 *) &dev->node_guid)[3])); } static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index a02c5a0..554c205 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -157,7 +157,7 @@ ssize_t ib_uverbs_query_device(struct ib memset(&resp, 0, sizeof resp); resp.fw_ver = attr.fw_ver; - resp.node_guid = attr.node_guid; + resp.node_guid = file->device->ib_dev->node_guid; resp.sys_image_guid = attr.sys_image_guid; resp.max_mr_size = attr.max_mr_size; resp.page_size_cap = attr.page_size_cap; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 4887577..db35690 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -33,7 +33,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mthca_provider.c 1397 2004-12-28 05:09:00Z roland $ + * $Id: mthca_provider.c 4859 2006-01-09 21:55:10Z roland $ */ #include @@ -91,7 +91,6 @@ static int mthca_query_device(struct ib_ props->vendor_part_id = be16_to_cpup((__be16 *) (out_mad->data + 30)); props->hw_ver = be32_to_cpup((__be32 *) (out_mad->data + 32)); memcpy(&props->sys_image_guid, out_mad->data + 4, 8); - memcpy(&props->node_guid, out_mad->data + 12, 8); props->max_mr_size = ~0ull; props->page_size_cap = mdev->limits.page_size_cap; @@ -1054,11 +1053,48 @@ static struct class_device_attribute *mt &class_device_attr_board_id }; +static int mthca_init_node_data(struct mthca_dev *dev) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(dev, 1, 1, + 1, NULL, NULL, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(&dev->ib_dev.node_guid, out_mad->data + 12, 8); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + int mthca_register_device(struct mthca_dev *dev) { int ret; int i; + ret = mthca_init_node_data(dev); + if (ret) + return ret; + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.owner = THIS_MODULE; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index dd488d3..31207e6 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1516,8 +1516,7 @@ static ssize_t show_port(struct class_de static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); -static struct srp_host *srp_add_port(struct ib_device *device, - __be64 node_guid, u8 port) +static struct srp_host *srp_add_port(struct ib_device *device, u8 port) { struct srp_host *host; @@ -1532,7 
+1531,7 @@ static struct srp_host *srp_add_port(str host->port = port; host->initiator_port_id[7] = port; - memcpy(host->initiator_port_id + 8, &node_guid, 8); + memcpy(host->initiator_port_id + 8, &device->node_guid, 8); host->pd = ib_alloc_pd(device); if (IS_ERR(host->pd)) @@ -1580,22 +1579,11 @@ static void srp_add_one(struct ib_device { struct list_head *dev_list; struct srp_host *host; - struct ib_device_attr *dev_attr; int s, e, p; - dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); - if (!dev_attr) - return; - - if (ib_query_device(device, dev_attr)) { - printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", - device->name); - goto out; - } - dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) - goto out; + return; INIT_LIST_HEAD(dev_list); @@ -1608,15 +1596,12 @@ static void srp_add_one(struct ib_device } for (p = s; p <= e; ++p) { - host = srp_add_port(device, dev_attr->node_guid, p); + host = srp_add_port(device, p); if (host) list_add_tail(&host->list, dev_list); } ib_set_client_data(device, &srp_client, dev_list); - -out: - kfree(dev_attr); } static void srp_remove_one(struct ib_device *device) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index a7f4c35..22fc886 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -88,7 +88,6 @@ enum ib_atomic_cap { struct ib_device_attr { u64 fw_ver; - __be64 node_guid; __be64 sys_image_guid; u64 max_mr_size; u64 page_size_cap; @@ -951,6 +950,7 @@ struct ib_device { u64 uverbs_cmd_mask; int uverbs_abi_ver; + __be64 node_guid; u8 node_type; u8 phys_port_cnt; }; -- 1.0.7 From rolandd at cisco.com Tue Jan 10 11:31:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 10 Jan 2006 19:31:23 +0000 Subject: [openib-general] [git patch review 3/7] IB/mthca: kzalloc conversions In-Reply-To: <1136921483290-79d0774f48f3f587@cisco.com> Message-ID: <1136921483290-3d1a8ae2f0b61cbf@cisco.com> Convert kmalloc()/memset(,0,) pairs to kzalloc(). 
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_provider.c | 12 ++++-------- 1 files changed, 4 insertions(+), 8 deletions(-) 105e50a5e8f184af31daffce4d7bfd7771fe213f diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 30b67c2..0ae27fa 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -55,7 +55,7 @@ static int mthca_query_device(struct ib_ u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; @@ -64,7 +64,6 @@ static int mthca_query_device(struct ib_ props->fw_ver = mdev->fw_ver; - memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->class_version = 1; @@ -128,14 +127,13 @@ static int mthca_query_port(struct ib_de int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; memset(props, 0, sizeof *props); - memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->class_version = 1; @@ -220,12 +218,11 @@ static int mthca_query_pkey(struct ib_de int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->class_version = 1; @@ -259,12 +256,11 @@ static int mthca_query_gid(struct ib_dev int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, 
GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->class_version = 1; -- 1.0.7 From mst at mellanox.co.il Tue Jan 10 11:44:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jan 2006 21:44:15 +0200 Subject: [openib-general] Re: [PATCH] ipoib: tx ring overrun In-Reply-To: References: Message-ID: <20060110194415.GE16913@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: tx ring overrun > > Michael> Dont try to post more send work requests if the TX ring > Michael> is full. Setting netif_stop_queue is insufficient: linux > Michael> can still land a tx packet on us. > > I'm confused -- is the code in ipoib_start_xmit() > > /* > * Check if our queue is stopped. Since we have the LLTX bit > * set, we can't rely on netif_stop_queue() preventing our > * xmit function from being called with a full queue. > */ > if (unlikely(netif_queue_stopped(dev))) { > spin_unlock_irqrestore(&priv->tx_lock, flags); > return NETDEV_TX_BUSY; > } > > not enough to prevent us from trying to queue a TX packet after > stopping the queue? good point. I'll look again. > BTW, I've lost track of the pending IPoIB patches a little bit. I > have a lot of patches queued for review, and I'm not sure which have > been replaced by new versions, which are critical, etc. Could you > send me a list of which patches still need to be applied, and which > ones fix problems you hit in testing (vs cosmetic changes, memory > leaks and so on). > > - R. 
>
>

Look here
https://openib.org/svn/trunk/contrib/mellanox/patches
these are updated outstanding patches

--
MST

From iod00d at hp.com Tue Jan 10 11:42:49 2006
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 10 Jan 2006 11:42:49 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <43C40328.7060201@ichips.intel.com>
References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com>
Message-ID: <20060110194249.GD13156@esmail.cup.hp.com>

On Tue, Jan 10, 2006 at 10:55:36AM -0800, Sean Hefty wrote:
> What I think I really want is a distributed relational database management
> system with an SQL interface and triggers that maintains the SA data...
> (select * from path_rec where sgid=x and dgid=y and pkey=z)
>
> But without making any assumptions about the SA, a local cache could still
> use an RDMS to store and retrieve the data records. Would requiring an
> RDMS on each system be acceptable?

We already have several databases for different things:
    makedb (primarily for NSS)
    updatedb (fast lookup of local files)
    mandb (man pages)
    rpmdb (yes, even on debian boxes)
    sasldbconverter2 (for SASL - linux security/login stuff)
    *db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg,
        Apache, python, libns-db, postfix, etc)

In fact, it looks like a debian "testing" box would be dysfunctional
without Berkeley Database. Would that work?

sleepycat.org gives more examples of open source use:
    OpenLDAP, Kerberos, Subversion, Sendmail, Postfix,
    SquidGuard, NetaTalk, Movable Type, SpamAssassin,
    Mail Avenger, Bogofilter

hth,
grant

> If not, then writing a small, dumb
> pseudo-database as part of the sa_cache could provide a lot of flexibility.
>
> - Sean
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From rdreier at cisco.com Tue Jan 10 11:42:53 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jan 2006 11:42:53 -0800
Subject: [openib-general] Re: [PATCH] ipoib: tx ring overrun
In-Reply-To: <20060110194415.GE16913@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jan 2006 21:44:15 +0200")
References: <20060110194415.GE16913@mellanox.co.il>
Message-ID:

    Michael> Look here
    Michael> https://openib.org/svn/trunk/contrib/mellanox/patches
    Michael> these are updated outstanding patches

OK, cool. Is there any ordering info available about dependencies
between the patches?

 - R.

From ralphc at pathscale.com Tue Jan 10 12:33:16 2006
From: ralphc at pathscale.com (Ralph Campbell)
Date: Tue, 10 Jan 2006 12:33:16 -0800
Subject: [openib-general] probable reference count bug in core/mad.c
Message-ID: <1136925196.4520.64.camel@brick.internal.keyresearch.com>

I have been looking at the code for core/mad.c and in timeout_sends(),
the mad_send_wr is removed from the list of pending sends and
then retry_send() is called. In retry_send(), if the MAD is resent,
mad_send_wr->refcount is incremented and the WR is put back on
the list of pending sends.

This seems wrong to me. Either there should be no increment, or
there should be a decrement when the WR is removed from the list.
Also, I think there may be a dependency on whether
mad_send_wr->timeout is zero or not.

Someone who knows this code better may want to check this out.

BTW, I also don't particularly like mad_send_wr->retries being
an int instead of unsigned int and the statement in retry_send():
    if (!mad_send_wr->retries--)
which could end up resending the MAD 2^32-1 times if requeued.
--
Ralph Campbell

From mshefty at ichips.intel.com Tue Jan 10 12:40:54 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 10 Jan 2006 12:40:54 -0800
Subject: [openib-general] probable reference count bug in core/mad.c
In-Reply-To: <1136925196.4520.64.camel@brick.internal.keyresearch.com>
References: <1136925196.4520.64.camel@brick.internal.keyresearch.com>
Message-ID: <43C41BD6.6040706@ichips.intel.com>

Ralph Campbell wrote:
> I have been looking at the code for core/mad.c and in timeout_sends(),
> the mad_send_wr is removed from the list of pending sends and
> then retry_send() is called. In retry_send(), if the MAD is resent,
> mad_send_wr->refcount is incremented and the WR is put back on
> the list of pending sends.
>
> This seems wrong to me. Either there should be no increment, or
> there should be a decrement when the WR is removed from the list.
> Also, I think there may be a dependency on whether
> mad_send_wr->timeout is zero or not.

The increment is done because the MAD has been reposted to the QP and will be
referenced by a CQ entry. The decrement happens once the completion occurs.
This should be correct.

- Sean

From halr at voltaire.com Tue Jan 10 14:12:37 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jan 2006 17:12:37 -0500
Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down.
In-Reply-To: <1136917878.4520.53.camel@brick.internal.keyresearch.com>
References: <1136917878.4520.53.camel@brick.internal.keyresearch.com>
Message-ID: <1136931156.4322.79.camel@hal.voltaire.com>

Hi Ralph,

On Tue, 2006-01-10 at 13:31, Ralph Campbell wrote:
> If opensm is started with no arguments, the default algorithm
> for finding a port to bind to will skip ports which are present
> but the link is DOWN. If there is only one port in the system,
> no port is selected and opensm tries the default HCA name "mthca0"
> which, if not present, confuses opensm and it exits.
> > The following patch changes the port selection so that the first > active port is selected, and if none, the first non-disabled port. This is close and headed in the right direction but has one property I'm not too fond of: when there are no active ports, it does not prefer a port whose physical state is link up over one in (say) polling (anything other than link up) so the subnet may not come up when it could in that case. -- Hal > Signed-off-by: Ralph Campbell > > Index: umad.c > =================================================================== > --- umad.c (revision 4900) > +++ umad.c (working copy) > @@ -207,9 +207,9 @@ > } > > /* > - * if *port > 0 checks ca[port] state. Otherwise set *port to > + * if *port > 0, check ca[port] state. Otherwise set *port to > * the first port that is active, and if such is not found, to > - * the first port that is (physically) up. Otherwise return -1; > + * the first port that is not disabled. Otherwise return -1; > */ > static int > resolve_ca_port(char *ca_name, int *port) > @@ -228,14 +228,14 @@ > return 1; > } > > - if (*port > 0) { /* user wants user gets */ > + if (*port > 0) { /* check only the port the user wants */ > if (*port > ca.numports) > return -1; > if (!ca.ports[*port]) > return -1; > if (ca.ports[*port]->state == 4) > return 1; > - if (ca.ports[*port]->phys_state == 5) > + if (ca.ports[*port]->phys_state != 3) > return 0; > return -1; > } > @@ -244,7 +244,7 @@ > DEBUG("checking port %d", i); > if (!ca.ports[i]) > continue; > - if (up < 0 && ca.ports[i]->phys_state == 5) > + if (up < 0 && ca.ports[i]->phys_state != 3) > up = *port = i; > if (ca.ports[i]->state == 4) { > active = *port = i; > @@ -278,10 +278,11 @@ > return ca_name; > } > > - /* find first existing HCA with Active port */ > + /* Get the list of CA names. */ > if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) > return 0; > > + /* Find the first existing CA with an active port. 
*/ > for (caidx = 0; caidx < n; caidx++) { > TRACE("checking ca '%s'", names[caidx]); > From mst at mellanox.co.il Tue Jan 10 14:32:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Jan 2006 00:32:49 +0200 Subject: [openib-general] Re: [PATCH] ipoib: tx ring overrun In-Reply-To: References: Message-ID: <20060110223248.GH16913@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: tx ring overrun > > Michael> Look here > > Michael> https://openib.org/svn/trunk/contrib/mellanox/patches > > Michael> these are updated outstanding patches > > OK, cool. Is there any ordering info available about dependencies > between the patches? ipoib_flush_wq_2.patch needs to be applied on top of ipoib_flush_wq_1.patch Others are independent of each other. -- MST From ralphc at pathscale.com Tue Jan 10 14:47:39 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 10 Jan 2006 14:47:39 -0800 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. In-Reply-To: <1136931156.4322.79.camel@hal.voltaire.com> References: <1136917878.4520.53.camel@brick.internal.keyresearch.com> <1136931156.4322.79.camel@hal.voltaire.com> Message-ID: <1136933259.4520.67.camel@brick.internal.keyresearch.com> I understand. Maybe it should be the first active, if none, then the first UP, and if none, the first !disabled. Mostly I was trying to get something that picked ipath0 port 1 when it was the only port in the system even if the link is down. On Tue, 2006-01-10 at 17:12 -0500, Hal Rosenstock wrote: > Hi Ralph, > > On Tue, 2006-01-10 at 13:31, Ralph Campbell wrote: > > If opensm is started with no arguments, the default algorithm > > for finding a port to bind to will skip ports which are present > > but the link is DOWN. If there is only one port in the system, > > no port is selected and opensm tries the default HCA name "mthca0" > > which, if not present, confuses opensm and it exits. 
> > > > The following patch changes the port selection so that the first > > active port is selected, and if none, the first non-disabled port. > > This is close and headed in the right direction but has one property I'm > not too fond of: when there are no active ports, it does not prefer a > port whose physical state is link up over one in (say) polling (anything > other than link up) so the subnet may not come up when it could in that > case. > > -- Hal > > > Signed-off-by: Ralph Campbell > > > > Index: umad.c > > =================================================================== > > --- umad.c (revision 4900) > > +++ umad.c (working copy) > > @@ -207,9 +207,9 @@ > > } > > > > /* > > - * if *port > 0 checks ca[port] state. Otherwise set *port to > > + * if *port > 0, check ca[port] state. Otherwise set *port to > > * the first port that is active, and if such is not found, to > > - * the first port that is (physically) up. Otherwise return -1; > > + * the first port that is not disabled. Otherwise return -1; > > */ > > static int > > resolve_ca_port(char *ca_name, int *port) > > @@ -228,14 +228,14 @@ > > return 1; > > } > > > > - if (*port > 0) { /* user wants user gets */ > > + if (*port > 0) { /* check only the port the user wants */ > > if (*port > ca.numports) > > return -1; > > if (!ca.ports[*port]) > > return -1; > > if (ca.ports[*port]->state == 4) > > return 1; > > - if (ca.ports[*port]->phys_state == 5) > > + if (ca.ports[*port]->phys_state != 3) > > return 0; > > return -1; > > } > > @@ -244,7 +244,7 @@ > > DEBUG("checking port %d", i); > > if (!ca.ports[i]) > > continue; > > - if (up < 0 && ca.ports[i]->phys_state == 5) > > + if (up < 0 && ca.ports[i]->phys_state != 3) > > up = *port = i; > > if (ca.ports[i]->state == 4) { > > active = *port = i; > > @@ -278,10 +278,11 @@ > > return ca_name; > > } > > > > - /* find first existing HCA with Active port */ > > + /* Get the list of CA names. 
*/ > > if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) > > return 0; > > > > + /* Find the first existing CA with an active port. */ > > for (caidx = 0; caidx < n; caidx++) { > > TRACE("checking ca '%s'", names[caidx]); > > > -- Ralph Campbell From halr at voltaire.com Tue Jan 10 14:51:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 17:51:08 -0500 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. In-Reply-To: <1136933259.4520.67.camel@brick.internal.keyresearch.com> References: <1136917878.4520.53.camel@brick.internal.keyresearch.com> <1136931156.4322.79.camel@hal.voltaire.com> <1136933259.4520.67.camel@brick.internal.keyresearch.com> Message-ID: <1136933468.4322.14.camel@hal.voltaire.com> Hi Ralph, On Tue, 2006-01-10 at 17:47, Ralph Campbell wrote: > I understand. Maybe it should be the first active, if none, then the > first UP, and if none, the first !disabled. Exactly. I think one more loop (first checking physical state for linkup and then checking for not disabled) will take care of it. I'm working on it now as a patch on your patch which I will post. > Mostly I was trying to get something that picked ipath0 port 1 > when it was the only port in the system even if the link > is down. Understood. -- Hal From mst at mellanox.co.il Tue Jan 10 15:02:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Jan 2006 01:02:54 +0200 Subject: [openib-general] Re: [PATCH] ipoib: tx ring overrun In-Reply-To: <20060110223248.GH16913@mellanox.co.il> References: <20060110223248.GH16913@mellanox.co.il> Message-ID: <20060110230254.GJ16913@mellanox.co.il> > Quoting r. Roland Dreier : > > Subject: Re: [PATCH] ipoib: tx ring overrun > > > > Michael> Look here > > > > Michael> https://openib.org/svn/trunk/contrib/mellanox/patches > > > > Michael> these are updated outstanding patches > > > > OK, cool. Is there any ordering info available about dependencies > > between the patches? 
>
> ipoib_flush_wq_2.patch needs to be applied on top of ipoib_flush_wq_1.patch
>
> Others are independent of each other.

Here's a list of patches with some explanations:

Fixes for oopses that we saw in testing:
ipoib_up_flag_race.patch
ipoib_all_neigh_issues_2.patch
ipoib_mcast_send.patch
ipoib_mc_list.patch
ipoib_flush_wq_1.patch
ipoib_flush_wq_2.patch

Memory leak that we saw in testing
ipoib_multicast_leak.patch

Drop counter fix (not sure whether this is from testing or code review)
ipoib_multicast_drop_counter.patch

Code review: cosmetic
ipoib_cosmetics.patch

Code review: error handling
ipoib_init_qp.patch
ipoib_post_receives_err.patch

Code review: races
ipoib_qprst_protect.patch
ipoib_multicast_ah.patch
ipoib_multicast_race.patch

There's also a patch for libibverbs
devinfo_board_id.patch

and a patch for core/sysfs.c
node_desc_clear.patch

-- MST

From mshefty at ichips.intel.com Tue Jan 10 15:00:46 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 10 Jan 2006 15:00:46 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <20060110194249.GD13156@esmail.cup.hp.com>
References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <20060110194249.GD13156@esmail.cup.hp.com>
Message-ID: <43C43C9E.6020405@ichips.intel.com>

Grant Grundler wrote:
> We already have several databases for different things:
> makedb (primarily for NSS)
> updatedb (fast lookup of local files)
> mandb (man pages)
> rpmdb (yes, even on debian boxes)
> sasldbconverter2 (for SASL - linux security/login stuff)
> *db4.3* (Berkeley v4.3 Database - used by apt-get/dpkg, Apache,
> python, libns-db, postfix, etc)
>
> In fact, looks like a debian "testing" box would be dysfunctional
> without Berkeley Database. Would that work?

Thanks for pointing these out.

I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a libodbc was on my SuSE system.
Libdb-4.2 would help manage some of the SA objects to a file, but is limited in its data storage and retrieval capabilities. If a true relational database couldn't be used, libdb would definitely be useful. - Sean From halr at voltaire.com Tue Jan 10 15:57:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 18:57:14 -0500 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. In-Reply-To: <1136917878.4520.53.camel@brick.internal.keyresearch.com> References: <1136917878.4520.53.camel@brick.internal.keyresearch.com> Message-ID: <1136937434.4322.170.camel@hal.voltaire.com> On Tue, 2006-01-10 at 13:31, Ralph Campbell wrote: > If opensm is started with no arguments, the default algorithm > for finding a port to bind to will skip ports which are present > but the link is DOWN. If there is only one port in the system, > no port is selected and opensm tries the default HCA name "mthca0" > which, if not present, confuses opensm and it exits. > > The following patch changes the port selection so that the first > active port is selected, and if none, the first non-disabled port. > > Signed-off-by: Ralph Campbell Thanks. Applied. Subsequent patch to follow shortly. From halr at voltaire.com Tue Jan 10 16:02:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 19:02:34 -0500 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. In-Reply-To: <1136933468.4322.14.camel@hal.voltaire.com> References: <1136917878.4520.53.camel@brick.internal.keyresearch.com> <1136931156.4322.79.camel@hal.voltaire.com> <1136933259.4520.67.camel@brick.internal.keyresearch.com> <1136933468.4322.14.camel@hal.voltaire.com> Message-ID: <1136937754.4322.177.camel@hal.voltaire.com> On Tue, 2006-01-10 at 17:51, Hal Rosenstock wrote: > Hi Ralph, > > On Tue, 2006-01-10 at 17:47, Ralph Campbell wrote: > > I understand. Maybe it should be the first active, if none, then the > > first UP, and if none, the first !disabled. > > Exactly. 
I think one more loop (first checking physical state for linkup
> and then checking for not disabled) will take care of it. I'm working on
> it now as a patch on your patch which I will post.

In libibumad/umad.c::resolve_ca_port, default algorithm is to prefer
ports which are active, then those whose physical state is link up,
and finally those ports whose physical state is not disabled.

Signed-off-by: Hal Rosenstock

Index: umad.c
===================================================================
--- umad.c (revision 4904)
+++ umad.c (working copy)
@@ -244,7 +244,7 @@ resolve_ca_port(char *ca_name, int *port
         DEBUG("checking port %d", i);
         if (!ca.ports[i])
             continue;
-        if (up < 0 && ca.ports[i]->phys_state != 3)
+        if (up < 0 && ca.ports[i]->phys_state == 5)
             up = *port = i;
         if (ca.ports[i]->state == 4) {
             active = *port = i;
@@ -253,6 +253,18 @@ resolve_ca_port(char *ca_name, int *port
         }
     }
 
+    if (active == -1 && up == -1) { /* no active or linkup port found */
+        for (i = 0; i <= ca.numports; i++) {
+            DEBUG("checking port %d", i);
+            if (!ca.ports[i])
+                continue;
+            if (ca.ports[i]->phys_state != 3) {
+                up = *port = i;
+                break;
+            }
+        }
+    }
+
     release_ca(&ca);
 
     if (active >= 0)

From iod00d at hp.com Tue Jan 10 16:19:38 2006
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 10 Jan 2006 16:19:38 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <43C43C9E.6020405@ichips.intel.com>
References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <20060110194249.GD13156@esmail.cup.hp.com> <43C43C9E.6020405@ichips.intel.com>
Message-ID: <20060111001938.GD14203@esmail.cup.hp.com>

On Tue, Jan 10, 2006 at 03:00:46PM -0800, Sean Hefty wrote:
> I did find that libdb-4.2 was installed on SuSE and RedHat systems, and a
> libodbc was on my SuSE system. Libdb-4.2 would help manage some of the SA
> objects to a file, but is limited in its data storage and retrieval
> capabilities.
If a true relational database couldn't be used, libdb would > definitely be useful. I forgot to point out postgres: http://www.postgresql.org/about/ Several packages (e.g. postfix, ldap) offer different backends so the admin can decide how sophisticated the data storage and retrieval needs to be. With roughly 150K employees, HP has a rather sophisticated LDAP/postfix setup to manage logins. But I don't need that for the 10 boxes I manage outside the firewall. Same is probably true for SA cache. grant
From mshefty at ichips.intel.com Tue Jan 10 17:05:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 10 Jan 2006 17:05:27 -0800 Subject: [openib-general] SA cache design In-Reply-To: <20060111001938.GD14203@esmail.cup.hp.com> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <20060110194249.GD13156@esmail.cup.hp.com> <43C43C9E.6020405@ichips.intel.com> <20060111001938.GD14203@esmail.cup.hp.com> Message-ID: <43C459D7.6000207@ichips.intel.com> Grant Grundler wrote: > I forgot to point out postgres: > http://www.postgresql.org/about/ This looks like it would work well. The question that I have for users is: Is it acceptable for the cache to make use of a relational database system? The disadvantage is that a RDMS would need to be installed and configured on several, or all systems. (It's not clear to me yet how much of that could be automated.) The advantage is that the cache would gain the benefits of having a database backend - notably support for more complex queries, persistent storage, and indexing to increase query performance. To provide some additional context, path record queries can be fairly complex, involving a number of fields. (All queries today are limited to sgid, dgid, and pkey.) Trying to efficiently retrieve a path record based on a dgid and pkey is non-trivial, and support for queries with additional restrictions or for other SA records complicates this issue.
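To make the lookup problem a little more concrete, here is a minimal sketch of the simplest case only: an in-memory table keyed on exactly (dgid, pkey). It is not the proposed cache design. It swaps in a plain hash table (FNV-1a hashing, linear probing) for whatever backend is eventually chosen, and every name in it (cached_path, cache_insert, cache_lookup, CACHE_BUCKETS) is hypothetical. The record is trimmed to a single payload field, and entries are never removed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_BUCKETS 64 /* hypothetical table size */

/* Hypothetical, heavily trimmed path record: the two key fields
 * plus one payload field (the destination LID). */
struct cached_path {
	uint8_t  dgid[16];
	uint16_t pkey;
	uint16_t dlid;
	int      in_use;
};

static struct cached_path cache[CACHE_BUCKETS];

/* FNV-1a over the 16 GID bytes followed by the two P_Key bytes. */
static uint32_t key_hash(const uint8_t dgid[16], uint16_t pkey)
{
	uint32_t h = 2166136261u;
	int i;

	for (i = 0; i < 16; i++)
		h = (h ^ dgid[i]) * 16777619u;
	h = (h ^ (uint32_t)(pkey & 0xff)) * 16777619u;
	h = (h ^ (uint32_t)(pkey >> 8)) * 16777619u;
	return h;
}

/* Insert or update using linear probing.
 * Returns 0 on success, -1 if the table is full. */
static int cache_insert(const uint8_t dgid[16], uint16_t pkey, uint16_t dlid)
{
	uint32_t start, n;

	assert(dgid != NULL);
	start = key_hash(dgid, pkey) % CACHE_BUCKETS;
	for (n = 0; n < CACHE_BUCKETS; n++) {
		struct cached_path *p = &cache[(start + n) % CACHE_BUCKETS];

		if (!p->in_use ||
		    (p->pkey == pkey && memcmp(p->dgid, dgid, 16) == 0)) {
			memcpy(p->dgid, dgid, 16);
			p->pkey = pkey;
			p->dlid = dlid;
			p->in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Exact-match lookup by (dgid, pkey); NULL if nothing is cached. */
static const struct cached_path *cache_lookup(const uint8_t dgid[16],
					      uint16_t pkey)
{
	uint32_t start, n;

	assert(dgid != NULL);
	start = key_hash(dgid, pkey) % CACHE_BUCKETS;
	for (n = 0; n < CACHE_BUCKETS; n++) {
		const struct cached_path *p = &cache[(start + n) % CACHE_BUCKETS];

		if (!p->in_use)
			return NULL; /* empty slot ends the probe sequence */
		if (p->pkey == pkey && memcmp(p->dgid, dgid, 16) == 0)
			return p;
	}
	return NULL;
}
```

This only answers "give me the record for this dgid and pkey". A query that adds restrictions on other fields, or leaves the dgid unspecified, would need either a scan or a second index, which is the kind of work a database backend would take over.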
- Sean

From iod00d at hp.com Tue Jan 10 17:56:23 2006
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 10 Jan 2006 17:56:23 -0800
Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output
In-Reply-To: <20051211175341.GA12176@esmail.cup.hp.com>
References: <20051210212140.GA30971@mellanox.co.il> <20051211175341.GA12176@esmail.cup.hp.com>
Message-ID: <20060111015623.GE14203@esmail.cup.hp.com>

On Sun, Dec 11, 2005 at 09:53:41AM -0800, Grant Grundler wrote:
...
> I might have spoken too soon...I just started getting "ERR" output
> from ib_sdp running netperf TCP_STREAM over SDP on the IA64 rx2600's.
> I killed and restarted the "sdpstream" script. It seems to be working.
>
> I've not yet seen this type of error running r4344 on a different box.
> If it's not obvious what's wrong, I can try r4344 on the rx2600's as well.
...
> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197>
> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384>
> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152>

I'm still seeing similar errors with 2.6.15 + svn 4800 and have
another bit of data. Main problem is impact to performance:
http://gsyprf3.external.hp.com/openib/rx2600-r4800/sdpstream.png

I've parked the dmesg output here:
http://gsyprf3.external.hp.com/openib/rx2600-r4800/sdp-errors

After loading the drivers, iteratively running netperf to generate
the data points (with LD_PRELOAD), I tried to unload all of IB modules
but end up with:

gsyprf3:~# lsmod
Module                  Size  Used by
ib_sdp                227136  9
ib_cm                  93964  1 ib_sdp
ib_sa                  25324  1 ib_sdp
ib_mad                 85952  2 ib_cm,ib_sa
ib_core                93096  4 ib_sdp,ib_cm,ib_sa,ib_mad

I'm not sure who is holding the reference counts to ib_sdp.
At this point no netperf processes are running.
But some wq still have references (as root, "lsof | fgrep sdp"):

sdp_wq/0  3893  root  cwd  DIR      8,3  4096  2 /
sdp_wq/0  3893  root  rtd  DIR      8,3  4096  2 /
sdp_wq/0  3893  root  txt  unknown             /proc/3893/exe
sdp_wq/1  3894  root  cwd  DIR      8,3  4096  2 /
sdp_wq/1  3894  root  rtd  DIR      8,3  4096  2 /
sdp_wq/1  3894  root  txt  unknown             /proc/3894/exe

grundler at gsyprf3:~$ ps -ef | grep sdp
root  3893  11  0 Jan08 ?  00:00:00 [sdp_wq/0]
root  3894  11  0 Jan08 ?  00:00:00 [sdp_wq/1]

It's likely the userspace openib libs are out of sync.
But I don't expect that's relevant to SDP or IPoIB (kernel drivers).

This is in contrast to another box running identical kernel + modules:

iowa:~# lsmod
Module                  Size  Used by
ib_uverbs              93096  0
ib_sdp                227136  0
ib_cm                  93964  1 ib_sdp
ib_ipoib               95992  0
ib_sa                  25324  2 ib_sdp,ib_ipoib
ib_mthca              275136  0
ib_mad                 85952  3 ib_cm,ib_sa,ib_mthca
ib_core                93096  7 ib_uverbs,ib_sdp,ib_cm,ib_ipoib,ib_sa,ib_mthca,ib_mad

"iowa" was the target of netperf on gsyprf3 (ie iowa was running
netserver with LD_PRELOAD as well).

Given the number of recent bug fixes since 4800, I will update
and try again later this week.

thanks,
grant

From bardov at gmail.com Tue Jan 10 19:03:51 2006
From: bardov at gmail.com (Dan Bar Dov)
Date: Wed, 11 Jan 2006 05:03:51 +0200
Subject: [openib-general] RE: [Ips] iSER API's
In-Reply-To:
References:
Message-ID:

Eddy hi,

As you can see, there are no responses. I guess that leaves only the
option of reading the header files/code.

Dan

On 1/10/06, Dan Bar Dov wrote:
>
>
> Eddy hi,
>
>
>
> OpenIB maintains a WIKI at
> https://openib.org/tiki/tiki-index.php
>
> However I'm not sure it contains any ib_verbs documentation.
>
>
>
> I'm CCing the openib mailing list, maybe someone there knows where you can
> find documentation.
> > > > Dan > > > ________________________________ > > > From: Eddy Quicksall > [mailto:eddy_quicksall_iVivity_iSCSI at Comcast.net] > Sent: Monday, January 09, 2006 10:55 PM > To: Dan Bar Dov > Subject: Re: [Ips] iSER API's > > > > > I assume ib_verbs is for Infiniband. Am I correct? > > > > > > Where can I get the ib_verbs API documentation? > > > > > > Eddy > > > ----- Original Message ----- > > > From: Dan Bar Dov > > > To: John Hufferd ; Eddy Quicksall ; ips at ietf.org > > > Sent: Sunday, January 08, 2006 8:50 PM > > > Subject: RE: [Ips] iSER API's > > > > > Indeed on Linux the direction is towards a generic RDMA interface. CMA > provides a generic CM abstraction, and the ib_verbs API is planned to extend > over iWARP one way or another. > > The DM was deemed unnecessary, the API between iSCSI & SCSI and the > underlying iSER enforced redesign of the iSER API to conform with iSCSI and > SCSI rather then implement yet another layer between iSER and iSCSI/SCSI > (namely DM). > > > > Dan > > > > ________________________________ > > > From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of John > Hufferd > Sent: Friday, January 06, 2006 11:18 PM > To: Eddy Quicksall; ips at ietf.org > Subject: RE: [Ips] iSER API's > > Eddy, > > What APIs are you asking about? The SCSI CLASS Driver to the Device Driver > (Mini Port Driver) should have the same interfaces for iSER as it is > available for iSCSI. If you mean between the Device Driver (Mini Port > Driver) and the RNIC, that will probably be the RNIC vendors interface if > they have implemented all or part of the iSER or Data Mover in the RNIC > itself, or the RNIC vendor's interfaces to their version of the verbs (at > least until the OS implements its own RDMA interfaces that implement the > RDMA verbs). 
> > > > I believe that, over time, most OSs will have generic RDMA interfaces, which > can be use by all certified RNIC hardware, and any application (user space > or kernel space); in that case the iSER module will probably interface to > that OS's RDMA interfaces. > > > > > . > > . > > . > > John L Hufferd > > Sr. Executive Director of Technology > > Brocade Communications Systems, Inc > > jhufferd at brocade.com > > Office Phone: (408) 333-5244; eFAX: (408) 904-4688 > > Alt Office Phone: (408) 997-6136; Cell: (408) 627-9606 > > > ________________________________ > > > From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of Eddy > Quicksall > Sent: Friday, January 06, 2006 11:42 AM > To: ips at ietf.org > Subject: [Ips] iSER API's > > > > > Is there any work afoot to design some standard iSER API's? > > > > > > Eddy > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > >
From sean.hefty at intel.com Tue Jan 10 20:26:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 10 Jan 2006 20:26:22 -0800 Subject: [openib-general] RE: [Ips] iSER API's In-Reply-To: Message-ID: >As you can see, there are no responses. I guess that leaves only the >option of reading the header files/code. For verbs, you can also refer to the IB spec. - Sean From halr at voltaire.com Tue Jan 10 20:49:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jan 2006 23:49:35 -0500 Subject: [openib-general] RE: [Ips] iSER API's In-Reply-To: References: Message-ID: <1136954779.4322.208.camel@hal.voltaire.com> Hi, On Tue, 2006-01-10 at 22:03, Dan Bar Dov wrote: > Eddy hi, > > As you can see, there are no responses. I guess that leaves only the > option of reading the header files/code. ib_verbs is not just for InfiniBand. It also (will) accomodate RDMA over ethernet as well. The best documentation is indeed the header file. You can direct specific questions to this list. -- Hal > > Dan > > On 1/10/06, Dan Bar Dov wrote: > > > > > > Eddy hi, > > > > > > > > OpenIB maintains a WIKI at > > https://openib.org/tiki/tiki-index.php > > > > However I'm not sure it contains any ib_verbs documentation. > > > > > > > > I'm CCing the openib mailing list, maybe someone there knows where you can > > find documentation.
> > > > > > > > Dan > > > > > > ________________________________ > > > > > > From: Eddy Quicksall > > [mailto:eddy_quicksall_iVivity_iSCSI at Comcast.net] > > Sent: Monday, January 09, 2006 10:55 PM > > To: Dan Bar Dov > > Subject: Re: [Ips] iSER API's > > > > > > > > > > I assume ib_verbs is for Infiniband. Am I correct? > > > > > > > > > > > > Where can I get the ib_verbs API documentation? > > > > > > > > > > > > Eddy > > > > > > ----- Original Message ----- > > > > > > From: Dan Bar Dov > > > > > > To: John Hufferd ; Eddy Quicksall ; ips at ietf.org > > > > > > Sent: Sunday, January 08, 2006 8:50 PM > > > > > > Subject: RE: [Ips] iSER API's > > > > > > > > > > Indeed on Linux the direction is towards a generic RDMA interface. CMA > > provides a generic CM abstraction, and the ib_verbs API is planned to extend > > over iWARP one way or another. > > > > The DM was deemed unnecessary, the API between iSCSI & SCSI and the > > underlying iSER enforced redesign of the iSER API to conform with iSCSI and > > SCSI rather then implement yet another layer between iSER and iSCSI/SCSI > > (namely DM). > > > > > > > > Dan > > > > > > > > ________________________________ > > > > > > From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of John > > Hufferd > > Sent: Friday, January 06, 2006 11:18 PM > > To: Eddy Quicksall; ips at ietf.org > > Subject: RE: [Ips] iSER API's > > > > Eddy, > > > > What APIs are you asking about? The SCSI CLASS Driver to the Device Driver > > (Mini Port Driver) should have the same interfaces for iSER as it is > > available for iSCSI. If you mean between the Device Driver (Mini Port > > Driver) and the RNIC, that will probably be the RNIC vendors interface if > > they have implemented all or part of the iSER or Data Mover in the RNIC > > itself, or the RNIC vendor's interfaces to their version of the verbs (at > > least until the OS implements its own RDMA interfaces that implement the > > RDMA verbs). 
> > > > > > > > I believe that, over time, most OSs will have generic RDMA interfaces, which > > can be use by all certified RNIC hardware, and any application (user space > > or kernel space); in that case the iSER module will probably interface to > > that OS's RDMA interfaces. > > > > > > > > > > . > > > > . > > > > . > > > > John L Hufferd > > > > Sr. Executive Director of Technology > > > > Brocade Communications Systems, Inc > > > > jhufferd at brocade.com > > > > Office Phone: (408) 333-5244; eFAX: (408) 904-4688 > > > > Alt Office Phone: (408) 997-6136; Cell: (408) 627-9606 > > > > > > ________________________________ > > > > > > From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of Eddy > > Quicksall > > Sent: Friday, January 06, 2006 11:42 AM > > To: ips at ietf.org > > Subject: [Ips] iSER API's > > > > > > > > > > Is there any work afoot to design some standard iSER API's? > > > > > > > > > > > > Eddy > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yael at mellanox.co.il Tue Jan 10 23:01:51 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 11 Jan 2006 09:01:51 +0200 Subject: [openib-general] Re: Re[PATCH] Opensm - fix forosm_sa_portinfo_record.c Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0C@mtlexch01.mtl.com> Hi Hal, You are correct. 
Thanks,
Yael

-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org]On Behalf Of Hal Rosenstock
Sent: Monday, January 09, 2006 5:39 PM
To: Yael Kalka
Cc: openib-general at openib.org
Subject: [openib-general] Re: Re[PATCH] Opensm - fix
forosm_sa_portinfo_record.c

Hi Yael,

On Mon, 2006-01-09 at 07:29, Yael Kalka wrote:
> Hi Hal,
>
> During some tests we've noticed that not all compmask fields are
> properly checked and compared in the portInfo record query.
> Attached is a patch with the missing checks, and addition of some
> set/get relevant functions added to the ib_types.h as well.

Just a couple of minor (nit) comments below.

-- Hal

> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
>
> Index: include/iba/ib_types.h
> ===================================================================
> --- include/iba/ib_types.h (revision 4809)
> +++ include/iba/ib_types.h (working copy)
> @@ -3960,6 +3960,33 @@ ib_port_info_get_vl_cap(
>  *
>  * SEE ALSO
>  *********/
> +/****f* IBA Base: Types/ib_port_info_get_init_type
> +* NAME
> +*	ib_port_info_get_init_type
> +*
> +* DESCRIPTION
> +*	Gets the VL Capability of a port.
                 ^^^^^^^^^^^^^
                 init type
> +*
> +* SYNOPSIS
> +*/
> +static inline uint8_t
> +ib_port_info_get_init_type(
> +	IN const ib_port_info_t* const p_pi)
> +{
> +	return(p_pi->vl_cap & 0x0F);

Should this be:
	return (uint8_t) (p_pi->vl_cap & 0x0F);

> +}
> +/*
> +* PARAMETERS
> +*	p_pi
> +*		[in] Pointer to a PortInfo attribute.
> +*
> +* RETURN VALUES
> +*	InitType field
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/

[snip...]
-- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yael at mellanox.co.il Tue Jan 10 23:17:17 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 11 Jan 2006 09:17:17 +0200 Subject: [openib-general] RE: Re[PATCH] Opensm - clean exit on ^C Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0E@mtlexch01.mtl.com> Hi Hal, In revision 3526 you've added a patch that opensm will not handle SIGINT. This patch is an addition to that patch. If OpenSM doesn't handle SIGINT, then none of its threads should handle it either. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, January 09, 2006 4:56 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: Re[PATCH] Opensm - clean exit on ^C Hi Yael, On Thu, 2006-01-05 at 07:16, Yael Kalka wrote: > Hi Hal, > > I've noticed that sometimes when killing OpenSM using ^C not all > threads are killed. > The reason for that is that there are threads that mask the > signalling, and when removing the ^C handling from OpenSM, these > threads still mask the signalling and stay alive as a result. > The following patch fixes this. Is there one other piece to this ? Doesn't osm_opensm.c need to be modified to handle SIGINT for OSM_VENDOR_INTF_OPENIB ? Thanks. 
void osm_reg_sig_handler( IN osm_opensm_t * const p_osm ) { __p_osm_to_signal = p_osm; #ifndef OSM_VENDOR_INTF_OPENIB cl_reg_sig_hdl( SIGINT, __sig_handler ); #endif -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: include/complib/cl_signal_osd.h > =================================================================== > --- include/complib/cl_signal_osd.h (revision 4760) > +++ include/complib/cl_signal_osd.h (working copy) > @@ -148,12 +148,14 @@ cl_sig_mask_sigint(void) > #ifdef __WIN__ > /* we do not mask kill */ > #else > +#ifndef OSM_VENDOR_INTF_OPENIB > sigset_t sigs; > > sigemptyset(&sigs); > sigaddset(&sigs, SIGINT); > pthread_sigmask(SIG_BLOCK, &sigs, NULL); > - #endif > +#endif /* OSM_VENDOR_INTF_OPENIB */ > +#endif /* __WIN__ */ > } > /* > *********/ > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4760) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -244,10 +244,6 @@ umad_receiver(void *p_ptr) > > OSM_LOG_ENTER( p_ur->p_log, umad_receiver ); > > - sigemptyset(&sigs); > - sigaddset(&sigs, SIGINT); > - pthread_sigmask(SIG_BLOCK, &sigs, NULL); > - > for (;;) { > if (!umad && > !(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) { > From mst at mellanox.co.il Wed Jan 11 00:06:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Jan 2006 10:06:31 +0200 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060111015623.GE14203@esmail.cup.hp.com> References: <20060111015623.GE14203@esmail.cup.hp.com> Message-ID: <20060111080631.GW16938@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: ib_sdp ERR: IOCB dmesg output > > On Sun, Dec 11, 2005 at 09:53:41AM -0800, Grant Grundler wrote: > ... > > I might have spoken too soon...I just started getting "ERR" output > > from ib_sdp running netperf TCP_STREAM over SDP on the IA64 rx2600's. > > I killed and restarted the "sdpstream" script. 
It seems to be working. > > > > I've not yet seen this type of error running r4344 on a different box. > > If it's not obvious what's wrong, I can try r4344 on the rx2600's as well. > ... > > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> > > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> > > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> > > I'm still seeing similar errors with 2.6.15 + svn 4800 and have another > bit of data. Main problem is impact to performance: > http://gsyprf3.external.hp.com/openib/rx2600-r4800/sdpstream.png > > I've parked the dmesg output here: > http://gsyprf3.external.hp.com/openib/rx2600-r4800/sdp-errors > > After loading the drivers, iteratively running netperf to generate > the data points (with LD_PRELOAD), I tried to unload all of IB modules > but end up with: > gsyprf3:~# lsmod > Module Size Used by > ib_sdp 227136 9 > ib_cm 93964 1 ib_sdp > ib_sa 25324 1 ib_sdp > ib_mad 85952 2 ib_cm,ib_sa > ib_core 93096 4 ib_sdp,ib_cm,ib_sa,ib_mad > > I'm not sure who is holding the reference counts to ib_sdp. > At this point no netperf processes are running. But some wq still > have references (as root, "lsof | fgrep sdp"): > sdp_wq/0 3893 root cwd DIR 8,3 4096 2 / > sdp_wq/0 3893 root rtd DIR 8,3 4096 2 / > sdp_wq/0 3893 root txt unknown /proc/3893/exe > sdp_wq/1 3894 root cwd DIR 8,3 4096 2 / > sdp_wq/1 3894 root rtd DIR 8,3 4096 2 / > sdp_wq/1 3894 root txt unknown /proc/3894/exe > > grundler at gsyprf3:~$ ps -ef | grep sdp > root 3893 11 0 Jan08 ? 00:00:00 [sdp_wq/0] > root 3894 11 0 Jan08 ? 00:00:00 [sdp_wq/1] > > > It's likely the userspace openib libs are out of sync. > But I don't expect that's relevant to SDP or IPoIB (kernel drivers). No. 
> This is in contrast to another box running identical kernel + modules: > iowa:~# lsmod > Module Size Used by > ib_uverbs 93096 0 > ib_sdp 227136 0 > ib_cm 93964 1 ib_sdp > ib_ipoib 95992 0 > ib_sa 25324 2 ib_sdp,ib_ipoib > ib_mthca 275136 0 > ib_mad 85952 3 ib_cm,ib_sa,ib_mthca > ib_core 93096 7 ib_uverbs,ib_sdp,ib_cm,ib_ipoib,ib_sa,ib_mthca,ib_mad > > "iota" was the target of netperf on gsyprf3 (ie iowa was running netserver > with LD_PRELOAD as well). > > Given the number of recent bug fixes since 4800, I will update and > try again later this week. > > thanks, > grant > Could you please try sdp patches from https://openib.org/svn/trunk/contrib/mellanox/patches -- MST From halr at voltaire.com Wed Jan 11 04:01:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jan 2006 07:01:27 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - fix forosm_sa_portinfo_record.c In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0C@mtlexch01.mtl.com> Message-ID: <1136980886.4322.295.camel@hal.voltaire.com> On Wed, 2006-01-11 at 02:01, Yael Kalka wrote: > Hi Hal, > You are correct. Thanks. Applied. > Thanks, > Yael > > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org]On Behalf Of Hal Rosenstock > Sent: Monday, January 09, 2006 5:39 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [openib-general] Re: Re[PATCH] Opensm - fix > forosm_sa_portinfo_record.c > > > Hi Yael, > > On Mon, 2006-01-09 at 07:29, Yael Kalka wrote: > > Hi Hal, > > > > During some tests we've notices that not all compmask fields are > > properly checked and compared in the portInfo record query. > > Attached is a patch with the missing checks, and addition of some > > set/get relevant functions added to the ib_types.h as well. > > Just a couple of minor (nit) comments below. 
>
> -- Hal
>
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
> >
> > Index: include/iba/ib_types.h
> > ===================================================================
> > --- include/iba/ib_types.h (revision 4809)
> > +++ include/iba/ib_types.h (working copy)
> > @@ -3960,6 +3960,33 @@ ib_port_info_get_vl_cap(
> >  *
> >  * SEE ALSO
> >  *********/
> > +/****f* IBA Base: Types/ib_port_info_get_init_type
> > +* NAME
> > +*	ib_port_info_get_init_type
> > +*
> > +* DESCRIPTION
> > +*	Gets the VL Capability of a port.
>                  ^^^^^^^^^^^^^
>                  init type
> > +*
> > +* SYNOPSIS
> > +*/
> > +static inline uint8_t
> > +ib_port_info_get_init_type(
> > +	IN const ib_port_info_t* const p_pi)
> > +{
> > +	return(p_pi->vl_cap & 0x0F);
>
> Should this be:
>	return (uint8_t) (p_pi->vl_cap & 0x0F);
>
> > +}
> > +/*
> > +* PARAMETERS
> > +*	p_pi
> > +*		[in] Pointer to a PortInfo attribute.
> > +*
> > +* RETURN VALUES
> > +*	InitType field
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
>
> [snip...]
>
> -- Hal

From ogerlitz at voltaire.com Wed Jan 11 05:53:59 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 11 Jan 2006 15:53:59 +0200
Subject: [openib-general] [git patch review 5/7] IB: Add node_guid to struct ib_device
In-Reply-To: <1136921483291-1d87adb85e116682@cisco.com>
References: <1136921483291-1d87adb85e116682@cisco.com>
Message-ID: <43C50DF7.6070606@voltaire.com>

Roland,

It does not seem that you have applied the patch to ib_verbs.h. Was it forgotten?

Or.

Roland Dreier wrote:
> Add a node_guid field to struct ib_device. It is the responsibility
> of the low-level driver to initialize this field before registering a
> device with the midlayer.
Convert everyone to looking at this field > instead of calling ib_query_device() when all they want is the node > GUID, and remove the node_guid field from struct ib_device_attr. > > Signed-off-by: Sean Hefty > Signed-off-by: Roland Dreier > > --- > > drivers/infiniband/core/cm.c | 29 +++---------------- > drivers/infiniband/core/sysfs.c | 22 +++----------- > drivers/infiniband/core/uverbs_cmd.c | 2 + > drivers/infiniband/hw/mthca/mthca_provider.c | 40 +++++++++++++++++++++++++- > drivers/infiniband/ulp/srp/ib_srp.c | 23 +++------------ > include/rdma/ib_verbs.h | 2 + > 6 files changed, 54 insertions(+), 64 deletions(-) > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index a7f4c35..22fc886 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -88,7 +88,6 @@ enum ib_atomic_cap { > > struct ib_device_attr { > u64 fw_ver; > - __be64 node_guid; > __be64 sys_image_guid; > u64 max_mr_size; > u64 page_size_cap; > @@ -951,6 +950,7 @@ struct ib_device { > u64 uverbs_cmd_mask; > int uverbs_abi_ver; > > + __be64 node_guid; > u8 node_type; > u8 phys_port_cnt; > }; From halr at voltaire.com Wed Jan 11 06:07:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jan 2006 09:07:34 -0500 Subject: [openib-general] RE: Re[PATCH] Opensm - clean exit on ^C In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD0E@mtlexch01.mtl.com> Message-ID: <1136988453.4322.806.camel@hal.voltaire.com> On Wed, 2006-01-11 at 02:17, Yael Kalka wrote: > Hi Hal, > In revision 3526 you've added a patch that opensm will not handle SIGINT. > This patch is an addition to that patch. If OpenSM doesn't handle SIGINT, > then none of its threads should handle it either. You're right. Thanks. Applied. 
-- Hal > Yael > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, January 09, 2006 4:56 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: Re[PATCH] Opensm - clean exit on ^C > > > Hi Yael, > > On Thu, 2006-01-05 at 07:16, Yael Kalka wrote: > > Hi Hal, > > > > I've noticed that sometimes when killing OpenSM using ^C not all > > threads are killed. > > The reason for that is that there are threads that mask the > > signalling, and when removing the ^C handling from OpenSM, these > > threads still mask the signalling and stay alive as a result. > > The following patch fixes this. > > Is there one other piece to this ? Doesn't osm_opensm.c need to be > modified to handle SIGINT for OSM_VENDOR_INTF_OPENIB ? Thanks. > > void > osm_reg_sig_handler( > IN osm_opensm_t * const p_osm ) > { > __p_osm_to_signal = p_osm; > #ifndef OSM_VENDOR_INTF_OPENIB > cl_reg_sig_hdl( SIGINT, __sig_handler ); > #endif > > -- Hal > > > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: include/complib/cl_signal_osd.h > > =================================================================== > > --- include/complib/cl_signal_osd.h (revision 4760) > > +++ include/complib/cl_signal_osd.h (working copy) > > @@ -148,12 +148,14 @@ cl_sig_mask_sigint(void) > > #ifdef __WIN__ > > /* we do not mask kill */ > > #else > > +#ifndef OSM_VENDOR_INTF_OPENIB > > sigset_t sigs; > > > > sigemptyset(&sigs); > > sigaddset(&sigs, SIGINT); > > pthread_sigmask(SIG_BLOCK, &sigs, NULL); > > - #endif > > +#endif /* OSM_VENDOR_INTF_OPENIB */ > > +#endif /* __WIN__ */ > > } > > /* > > *********/ > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- libvendor/osm_vendor_ibumad.c (revision 4760) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -244,10 +244,6 @@ umad_receiver(void *p_ptr) > > > > OSM_LOG_ENTER( p_ur->p_log, umad_receiver ); > > > > - 
sigemptyset(&sigs); > > - sigaddset(&sigs, SIGINT); > > - pthread_sigmask(SIG_BLOCK, &sigs, NULL); > > - > > for (;;) { > > if (!umad && > > !(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) { > > From jlentini at netapp.com Wed Jan 11 06:34:30 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 11 Jan 2006 09:34:30 -0500 (EST) Subject: [openib-general] SA cache design In-Reply-To: <43C459D7.6000207@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <20060110194249.GD13156@esmail.cup.hp.com> <43C43C9E.6020405@ichips.intel.com> <20060111001938.GD14203@esmail.cup.hp.com> <43C459D7.6000207@ichips.intel.com> Message-ID: On Tue, 10 Jan 2006, Sean Hefty wrote: > Grant Grundler wrote: > > I forgot to point out postgres: > > http://www.postgresql.org/about/ > > This looks like it would work well. > > The question that I have for users is: Is it acceptable for the > cache to make use of a relational database system? Will it be possible to use the OpenIB stack without setting up the SA cache? From ogerlitz at voltaire.com Wed Jan 11 07:01:15 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 11 Jan 2006 17:01:15 +0200 Subject: [openib-general] Re: merge rdma_cm and ib_addr upstream In-Reply-To: References: <43BEB4CF.1020103@ichips.intel.com> <43C2A741.3070700@ichips.intel.com> <43C2EE12.10207@ichips.intel.com> <43C2F88A.7000607@ichips.intel.com> <43C3638C.70802@voltaire.com> Message-ID: <43C51DBB.2020808@voltaire.com> Roland Dreier wrote: > Yes, that's right. Something like the patch below (compile tested > only) is what is required. good. I have committed the patch below and tested it to work fine. Or. 
use device->node_guid instead of device_attr->node_guid to match devices corresponding to the same IP subnet Signed-off-by: Or Gerlitz Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 4911) +++ iser_verbs.c (working copy) @@ -1,6 +1,6 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -73,8 +73,6 @@ int iser_create_adaptor_ib_res(struct is struct ib_device *device = p_iser_adaptor->device; struct ib_fmr_pool_param params; - ib_query_device(device, &(p_iser_adaptor->device_attr)); - strcpy(p_iser_adaptor->name, device->name); iser_dbg("setting device name %s as adatptor name\n", device->name); @@ -234,23 +232,16 @@ int iser_free_qp_and_id(struct iser_conn struct iser_adaptor *iser_adaptor_find_by_device(struct rdma_cm_id *cma_id) { - struct ib_device_attr *p_device_attr = NULL; struct list_head *p_list; struct iser_adaptor *p_adaptor = NULL; - p_device_attr = kmalloc(sizeof *p_device_attr, GFP_KERNEL); - if(p_device_attr == NULL) - goto end; - - ib_query_device(cma_id->device, p_device_attr); - spin_lock(&ig.adaptor_list_lock); p_list = ig.adaptor_list.next; while (p_list != &ig.adaptor_list) { p_adaptor = list_entry(p_list, struct iser_adaptor, ig_list); /* find if there's a match using the device GUID */ - if (p_adaptor->device_attr.node_guid == p_device_attr->node_guid) + if (p_adaptor->device->node_guid == cma_id->device->node_guid) break; } @@ -268,7 +259,6 @@ struct iser_adaptor *iser_adaptor_find_b } end: spin_unlock(&ig.adaptor_list_lock); - kfree(p_device_attr); return p_adaptor; } Index: iser.h 
=================================================================== --- iser.h (revision 4911) +++ iser.h (working copy) @@ -1,6 +1,6 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -105,7 +105,6 @@ struct iser_adaptor { struct list_head ig_list; /* entry in ig adaptors list */ struct ib_device *device; - struct ib_device_attr device_attr; struct ib_pd *pd; struct ib_cq *cq; From eitan at mellanox.co.il Wed Jan 11 07:41:46 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 11 Jan 2006 17:41:46 +0200 Subject: [openib-general] SA cache design In-Reply-To: <43C40328.7060201@ichips.intel.com> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> Message-ID: <43C5273A.40605@mellanox.co.il> Hi Sean, Now I really lost you: Is the intention to speed up SA queries? Or is it to have persistent storage of them? I think we should focus on the kind of data to cache, how it is made transparently available to any OpenIB client and how/when is it invalidated by the SM. We should only keep the cache data in memory not on disk. Later if we want to make it persistent or even stored in LDAP/SQL... I do not care. But the first implementation should be in memory. BTW: most of the databases referred by these mails are not supporting distributed shadow copies of a centrally controlled tables. Eitan Sean Hefty wrote: > Sean Hefty wrote: > >> To keep the design as flexible as possible, my plan is to implement >> the cache in userspace. The interface to the cache would be via >> MADs. Clients would send their queries to the sa_cache instead of the >> SA itself. 
>> The format of the MADs would be essentially identical to
>> those used to query the SA itself. Response MADs would contain any
>> requested information. If the cache could not satisfy a request, the
>> sa_cache would query the SA, update its cache, then return a reply.
>
> What I think I really want is a distributed relational database
> management system with an SQL interface and triggers that maintains the
> SA data... (select * from path_rec where sgid=x and dgid=y and pkey=z)
>
> But without making any assumptions about the SA, a local cache could
> still use an RDMS to store and retrieve the data records. Would
> requiring an RDMS on each system be acceptable? If not, then writing a
> small, dumb pseudo-database as part of the sa_cache could provide a lot
> of flexibility.
>
> - Sean

From jackm at mellanox.co.il Wed Jan 11 07:49:38 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Wed, 11 Jan 2006 17:49:38 +0200
Subject: [openib-general] [PATCH] libmthca: fix user-level pd leak
Message-ID: <20060111154938.GA3338@mellanox.co.il>

Fixes user-level PD leak.

Signed-off-by: Jack Morgenstein

Index: gen2/src/userspace/libmthca/src/mthca.c
===================================================================
--- gen2.orig/src/userspace/libmthca/src/mthca.c	2005-11-24 10:43:02.536896000 +0200
+++ gen2/src/userspace/libmthca/src/mthca.c	2006-01-11 15:28:59.107661000 +0200
@@ -202,6 +202,7 @@ static void mthca_free_context(struct ib
 {
 	struct mthca_context *context = to_mctx(ibctx);
 
+	mthca_free_pd(context->pd);
 	munmap(context->uar, to_mdev(ibctx->device)->page_size);
 	mthca_free_db_tab(context->db_tab);
 	free(context);

From jackm at mellanox.co.il Wed Jan 11 07:53:19 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Wed, 11 Jan 2006 17:53:19 +0200
Subject: [openib-general] [PATCH] libmthca: mthca_free_pd frees pointer to wrong structure
Message-ID: <20060111155319.GB3338@mellanox.co.il>

Freed pd pointer points to incorrect structure (by chance, they coincided).

Signed-off-by: Jack Morgenstein

Index: gen2/src/userspace/libmthca/src/verbs.c
===================================================================
--- gen2.orig/src/userspace/libmthca/src/verbs.c	2006-01-11 11:57:11.070034000 +0200
+++ gen2/src/userspace/libmthca/src/verbs.c	2006-01-11 15:26:19.853269000 +0200
@@ -112,7 +112,7 @@ int mthca_free_pd(struct ibv_pd *pd)
 	if (ret)
 		return ret;
 
-	free(pd);
+	free(to_mpd(pd));
 	return 0;
 }

From rdreier at cisco.com Wed Jan 11 08:09:11 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 08:09:11 -0800
Subject: [openib-general] [git patch review 5/7] IB: Add node_guid to struct ib_device
In-Reply-To: <43C50DF7.6070606@voltaire.com> (Or Gerlitz's message of "Wed, 11 Jan 2006 15:53:59 +0200")
References: <1136921483291-1d87adb85e116682@cisco.com> <43C50DF7.6070606@voltaire.com>
Message-ID:

    Or> Roland, It does not seem that you have applied the patch to
    Or> ib_verbs.h. Was it forgotten?

This is a patch from my git tree, which will go to Linus, not a svn change.
I have not applied this to svn yet because I am waiting for everyone to get in sync. But now I see that you applied my iSER patch, so I'm not going to wait for the ehca guys.

 - R.

From trimmer at silverstorm.com Wed Jan 11 08:17:26 2006
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Wed, 11 Jan 2006 11:17:26 -0500
Subject: [openib-general] SA cache design
Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com>

> On Tue, 10 Jan 2006, Sean Hefty wrote:
>
> > Grant Grundler wrote:
> > > I forgot to point out postgres:
> > > http://www.postgresql.org/about/
> >
> > This looks like it would work well.
> >
> > The question that I have for users is: Is it acceptable for the
> > cache to make use of a relational database system?

A relational database is overkill for this function. It will also likely be more complex for end users to set up and debug. The cache setup should be simple. The solution should be such that just an on/off switch needs to be configured (with a default of on) for most users to get started.

Todd Rimmer

From mshefty at ichips.intel.com Wed Jan 11 09:28:10 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 11 Jan 2006 09:28:10 -0800
Subject: [openib-general] SA cache design
In-Reply-To:
References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <20060110194249.GD13156@esmail.cup.hp.com> <43C43C9E.6020405@ichips.intel.com> <20060111001938.GD14203@esmail.cup.hp.com> <43C459D7.6000207@ichips.intel.com>
Message-ID: <43C5402A.7030205@ichips.intel.com>

James Lentini wrote:
> Will it be possible to use the OpenIB stack without setting up the SA
> cache?

Yes.
- Sean From iod00d at hp.com Wed Jan 11 09:34:09 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 11 Jan 2006 09:34:09 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060111080631.GW16938@mellanox.co.il> References: <20060111015623.GE14203@esmail.cup.hp.com> <20060111080631.GW16938@mellanox.co.il> Message-ID: <20060111173409.GA17844@esmail.cup.hp.com> On Wed, Jan 11, 2006 at 10:06:31AM +0200, Michael S. Tsirkin wrote: ... > > It's likely the userspace openib libs are out of sync. > > But I don't expect that's relevant to SDP or IPoIB (kernel drivers). > > No. "no" to what? You agree userspace isn't relevant in this case? > > Given the number of recent bug fixes since 4800, I will update and > > try again later this week. > Could you please try sdp patches from > https://openib.org/svn/trunk/contrib/mellanox/patches grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_conn_cache.patch patching file drivers/infiniband/ulp/sdp/sdp_conn.c Hunk #2 succeeded at 1976 (offset -54 lines). grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_link_cancel.patch patching file drivers/infiniband/ulp/sdp/sdp_link.c Hunk #10 succeeded at 637 (offset 1 line). Hunk #11 succeeded at 654 (offset 1 line). Hunk #12 succeeded at 666 (offset 1 line). Hunk #13 succeeded at 681 (offset 1 line). Hunk #14 succeeded at 848 (offset 1 line). Hunk #15 succeeded at 858 (offset 1 line). grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_remove_link_reference.patch patching file drivers/infiniband/ulp/sdp/sdp_inet.c patching file drivers/infiniband/ulp/sdp/sdp_actv.c grundler at gsyprf3:/usr/src/linux-2.6.15$ And then I "svn up"'d to r4829 (from r4800). Rinse and repeat on the second box. Tests are running now. BTW, why aren't the SDP patches in SVN mainline? I'm asking because I thought you are (one of?) the maintainer(s). 
thanks, grant From mshefty at ichips.intel.com Wed Jan 11 09:31:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 11 Jan 2006 09:31:45 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43C5273A.40605@mellanox.co.il> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <43C5273A.40605@mellanox.co.il> Message-ID: <43C54101.3010608@ichips.intel.com> Eitan Zahavi wrote: > Is the intention to speed up SA queries? > Or is it to have persistent storage of them? I want both. :) > I think we should focus on the kind of data to cache, > how it is made transparently available to any OpenIB client > and how/when is it invalidated by the SM. > We should only keep the cache data in memory not on disk. In order to support advanced queries efficiently, some sort of indexing scheme would be needed. This is what a database system would provide, saving us from having to implement that part. The fact that the database could also provide persistent storage and triggers are just additional advantages. > Later if we want to make it persistent or even stored in LDAP/SQL... > I do not care. But the first implementation should be in memory. I think that you're assuming that an initial implementation that is done just in memory would be quicker to complete. I'm not really wanting to write a complete throw-away solution capable of supporting only one or two very simple queries efficiently. > BTW: most of the databases referred by these mails are not supporting > distributed shadow copies of a centrally controlled tables. Personally, I'd be happy with a simple database that provided nothing more than indexing and query support. - Sean From mst at mellanox.co.il Wed Jan 11 09:39:25 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 11 Jan 2006 19:39:25 +0200 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060111173409.GA17844@esmail.cup.hp.com> References: <20060111173409.GA17844@esmail.cup.hp.com> Message-ID: <20060111173925.GA28383@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: ib_sdp ERR: IOCB dmesg output > > On Wed, Jan 11, 2006 at 10:06:31AM +0200, Michael S. Tsirkin wrote: > ... > > > It's likely the userspace openib libs are out of sync. > > > But I don't expect that's relevant to SDP or IPoIB (kernel drivers). > > > > No. > > "no" to what? > You agree userspace isn't relevant in this case? Sorry. Yes, I agree. > > > Given the number of recent bug fixes since 4800, I will update and > > > try again later this week. > > > Could you please try sdp patches from > > https://openib.org/svn/trunk/contrib/mellanox/patches > > grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_conn_cache.patch > patching file drivers/infiniband/ulp/sdp/sdp_conn.c > Hunk #2 succeeded at 1976 (offset -54 lines). > grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_link_cancel.patch > patching file drivers/infiniband/ulp/sdp/sdp_link.c > Hunk #10 succeeded at 637 (offset 1 line). > Hunk #11 succeeded at 654 (offset 1 line). > Hunk #12 succeeded at 666 (offset 1 line). > Hunk #13 succeeded at 681 (offset 1 line). > Hunk #14 succeeded at 848 (offset 1 line). > Hunk #15 succeeded at 858 (offset 1 line). > grundler at gsyprf3:/usr/src/linux-2.6.15$ patch -p1 < ../sdp_remove_link_reference.patch > patching file drivers/infiniband/ulp/sdp/sdp_inet.c > patching file drivers/infiniband/ulp/sdp/sdp_actv.c > grundler at gsyprf3:/usr/src/linux-2.6.15$ > > And then I "svn up"'d to r4829 (from r4800). > Rinse and repeat on the second box. > Tests are running now. > > BTW, why aren't the SDP patches in SVN mainline? > I'm asking because I thought you are (one of?) the maintainer(s). I'm still testing these. 

-- 
MST

From trimmer at silverstorm.com Wed Jan 11 09:50:26 2006
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Wed, 11 Jan 2006 12:50:26 -0500
Subject: [openib-general] SA cache design
Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12CA6@mercury.infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
>
> Eitan Zahavi wrote:
> > Is the intention to speed up SA queries?
> > Or is it to have persistent storage of them?
>
> I want both. :)

I would clarify that the best bang for the effort will be to focus on the queries which the ULPs themselves will use most often. For example, the resolution from a node name or Node Guid to a path record. While a general purpose replica would be nice, it could overcomplicate the initial design. The goal is not to optimize all the queries an end user might desire, but rather to help avoid the O(N^2) load which things like start up of an MPI or SDP application could cause on the SA.

> and how/when is it invalidated by the SM.

There are a variety of notices already available from the SM which should be used for the triggering or invalidation, such as:
	GID In/Out of Service
	Client Reregistration

It may also be desirable to have the CM, upon a failed connect to a given remote node, trigger the local replica to invalidate and requery for information about the remote node.

Todd R.

From mst at mellanox.co.il Wed Jan 11 09:54:29 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 11 Jan 2006 19:54:29 +0200
Subject: [openib-general] Re: SA cache design
In-Reply-To:
References:
Message-ID: <20060111175429.GB28406@mellanox.co.il>

Quoting Sean Hefty :
> I can envision how
> the sa_cache could eventually build towards a distributed SA.

I've been thinking about distributed SA myself for a while now.
This seems like an elegant way to improve the SA scalability.

How would distributed copies sync with the central SA?
I guess we could simply use sockets on top of IPoIB - the sync is probably a rare occurrence.
-- MST From ralphc at pathscale.com Wed Jan 11 10:33:06 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 11 Jan 2006 10:33:06 -0800 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. Message-ID: <1137004386.4520.97.camel@brick.internal.keyresearch.com> I tested this and it works for me. > On Tue, 2006-01-10 at 17:51, Hal Rosenstock wrote: > > Hi Ralph, > > > > On Tue, 2006-01-10 at 17:47, Ralph Campbell wrote: > > > I understand. Maybe it should be the first active, if none, then the > > > first UP, and if none, the first !disabled. > > > > Exactly. I think one more loop (first checking physical state for linkup > > and then checking for not disabled) will take care of it. I'm working on > > it now as a patch on your patch which I will post. > > In libibumad/umad.c::resolve_ca_port, default algorithm is to prefer ports which are active, then those whose physical state is link up, and finally those ports whose physical state is not disabled. Signed-off-by: Hal Rosenstock Signed-off-by: Ralph Campbell Index: umad.c =================================================================== --- umad.c (revision 4933) +++ umad.c (working copy) @@ -244,7 +244,7 @@ DEBUG("checking port %d", i); if (!ca.ports[i]) continue; - if (up < 0 && ca.ports[i]->phys_state != 3) + if (up < 0 && ca.ports[i]->phys_state == 5) up = *port = i; if (ca.ports[i]->state == 4) { active = *port = i; @@ -253,6 +253,18 @@ } } + if (active == -1 && up == -1) { /* no active or linkup port found */ + for (i = 0; i <= ca.numports; i++) { + DEBUG("checking port %d", i); + if (!ca.ports[i]) + continue; + if (ca.ports[i]->phys_state != 3) { + up = *port = i; + break; + } + } + } + release_ca(&ca); if (active >= 0) -- Ralph Campbell From mshefty at ichips.intel.com Wed Jan 11 10:53:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 11 Jan 2006 10:53:12 -0800 Subject: [openib-general] Re: SA cache design In-Reply-To: 
<20060111175429.GB28406@mellanox.co.il> References: <20060111175429.GB28406@mellanox.co.il> Message-ID: <43C55418.3050509@ichips.intel.com> Michael S. Tsirkin wrote: > I've been thinking about distributed SA myself for a while now. > This seems like an elegant way to improve the SA scalability. > > How would distributed copies sync with the central SA? > I guess we could simply use sockets on top of IPoIB - the sync > it probably a rare occurrence. I think this moves beyond what the cache is trying to accomplish at this point. Personally, if the SA were implemented using a database, I'd let the database worry about synchronizing the data. From halr at voltaire.com Wed Jan 11 10:53:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jan 2006 13:53:54 -0500 Subject: [openib-general] [PATCH] opensm fails to find HCA if port is down. In-Reply-To: <1137004386.4520.97.camel@brick.internal.keyresearch.com> References: <1137004386.4520.97.camel@brick.internal.keyresearch.com> Message-ID: <1137005454.4322.1658.camel@hal.voltaire.com> On Wed, 2006-01-11 at 13:33, Ralph Campbell wrote: > I tested this and it works for me. Thanks. Applied. -- Hal From rdreier at cisco.com Wed Jan 11 11:00:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 11:00:15 -0800 Subject: [openib-general] [git patch review 5/7] IB: Add node_guid to struct ib_device In-Reply-To: (Roland Dreier's message of "Wed, 11 Jan 2006 08:09:11 -0800") References: <1136921483291-1d87adb85e116682@cisco.com> <43C50DF7.6070606@voltaire.com> Message-ID: I just applied the patch below to svn: --- Add a node_guid field to struct ib_device. It is the responsibility of the low-level driver to initialize this field before registering a device with the midlayer. Convert everyone to looking at this field instead of calling ib_query_device() when all they want is the node GUID, and remove the node_guid field from struct ib_device_attr. 
Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- infiniband/ulp/srp/ib_srp.c (revision 4929) +++ infiniband/ulp/srp/ib_srp.c (working copy) @@ -1516,8 +1516,7 @@ static ssize_t show_port(struct class_de static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); -static struct srp_host *srp_add_port(struct ib_device *device, - __be64 node_guid, u8 port) +static struct srp_host *srp_add_port(struct ib_device *device, u8 port) { struct srp_host *host; @@ -1532,7 +1531,7 @@ static struct srp_host *srp_add_port(str host->port = port; host->initiator_port_id[7] = port; - memcpy(host->initiator_port_id + 8, &node_guid, 8); + memcpy(host->initiator_port_id + 8, &device->node_guid, 8); host->pd = ib_alloc_pd(device); if (IS_ERR(host->pd)) @@ -1580,22 +1579,11 @@ static void srp_add_one(struct ib_device { struct list_head *dev_list; struct srp_host *host; - struct ib_device_attr *dev_attr; int s, e, p; - dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); - if (!dev_attr) - return; - - if (ib_query_device(device, dev_attr)) { - printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", - device->name); - goto out; - } - dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) - goto out; + return; INIT_LIST_HEAD(dev_list); @@ -1608,15 +1596,12 @@ static void srp_add_one(struct ib_device } for (p = s; p <= e; ++p) { - host = srp_add_port(device, dev_attr->node_guid, p); + host = srp_add_port(device, p); if (host) list_add_tail(&host->list, dev_list); } ib_set_client_data(device, &srp_client, dev_list); - -out: - kfree(dev_attr); } static void srp_remove_one(struct ib_device *device) --- infiniband/include/rdma/ib_verbs.h (revision 4929) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -88,7 +88,6 @@ enum ib_atomic_cap { struct ib_device_attr { u64 fw_ver; - __be64 node_guid; __be64 sys_image_guid; u64 max_mr_size; u64 page_size_cap; --- infiniband/core/cm.c (revision 4929) +++ infiniband/core/cm.c (working copy) @@ -3230,22 +3230,6 @@ int 
ib_cm_init_qp_attr(struct ib_cm_id * } EXPORT_SYMBOL(ib_cm_init_qp_attr); -static __be64 cm_get_ca_guid(struct ib_device *device) -{ - struct ib_device_attr *device_attr; - __be64 guid; - int ret; - - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); - if (!device_attr) - return 0; - - ret = ib_query_device(device, device_attr); - guid = ret ? 0 : device_attr->node_guid; - kfree(device_attr); - return guid; -} - static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; @@ -3267,9 +3251,7 @@ static void cm_add_one(struct ib_device return; cm_dev->device = device; - cm_dev->ca_guid = cm_get_ca_guid(device); - if (!cm_dev->ca_guid) - goto error1; + cm_dev->ca_guid = device->node_guid; set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= device->phys_port_cnt; i++) { @@ -3284,11 +3266,11 @@ static void cm_add_one(struct ib_device cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error2; + goto error1; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error3; + goto error2; } ib_set_client_data(device, &cm_client, cm_dev); @@ -3297,9 +3279,9 @@ static void cm_add_one(struct ib_device write_unlock_irqrestore(&cm.device_lock, flags); return; -error3: - ib_unregister_mad_agent(port->mad_agent); error2: + ib_unregister_mad_agent(port->mad_agent); +error1: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { @@ -3307,7 +3289,6 @@ error2: ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } -error1: kfree(cm_dev); } --- infiniband/core/sysfs.c (revision 4929) +++ infiniband/core/sysfs.c (working copy) @@ -449,13 +449,7 @@ static int ib_device_uevent(struct class return -ENOMEM; /* - * It might be nice to pass the node GUID with the event, but - * right now the only way to get it is to query the device - * provider, and this can crash during device removal because - * we are will be running after driver removal has 
started. - * We could add a node_guid field to struct ib_device, or we - * could just let userspace read the node GUID from sysfs when - * devices are added. + * It would be nice to pass the node GUID with the event... */ envp[i] = NULL; @@ -627,21 +621,15 @@ static ssize_t show_sys_image_guid(struc static ssize_t show_node_guid(struct class_device *cdev, char *buf) { struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); - struct ib_device_attr attr; - ssize_t ret; if (!ibdev_is_alive(dev)) return -ENODEV; - ret = ib_query_device(dev, &attr); - if (ret) - return ret; - return sprintf(buf, "%04x:%04x:%04x:%04x\n", - be16_to_cpu(((__be16 *) &attr.node_guid)[0]), - be16_to_cpu(((__be16 *) &attr.node_guid)[1]), - be16_to_cpu(((__be16 *) &attr.node_guid)[2]), - be16_to_cpu(((__be16 *) &attr.node_guid)[3])); + be16_to_cpu(((__be16 *) &dev->node_guid)[0]), + be16_to_cpu(((__be16 *) &dev->node_guid)[1]), + be16_to_cpu(((__be16 *) &dev->node_guid)[2]), + be16_to_cpu(((__be16 *) &dev->node_guid)[3])); } static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); --- infiniband/core/uverbs_cmd.c (revision 4929) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -157,7 +157,7 @@ ssize_t ib_uverbs_query_device(struct ib memset(&resp, 0, sizeof resp); resp.fw_ver = attr.fw_ver; - resp.node_guid = attr.node_guid; + resp.node_guid = file->device->ib_dev->node_guid; resp.sys_image_guid = attr.sys_image_guid; resp.max_mr_size = attr.max_mr_size; resp.page_size_cap = attr.page_size_cap; --- infiniband/hw/ipath/ipath_verbs.c (revision 4929) +++ infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -5368,7 +5368,6 @@ static int ipath_query_device(struct ib_ props->hw_ver = boardrev << 16 | majrev << 8 | minrev; props->sys_image_guid = dev->sys_image_guid; - props->node_guid = ipath_layer_get_guid(dev->ib_unit); props->max_mr_size = ~0ull; props->max_qp = 0xffff; --- infiniband/hw/mthca/mthca_provider.c (revision 4929) +++ infiniband/hw/mthca/mthca_provider.c 
(working copy) @@ -91,7 +91,6 @@ static int mthca_query_device(struct ib_ props->vendor_part_id = be16_to_cpup((__be16 *) (out_mad->data + 30)); props->hw_ver = be32_to_cpup((__be32 *) (out_mad->data + 32)); memcpy(&props->sys_image_guid, out_mad->data + 4, 8); - memcpy(&props->node_guid, out_mad->data + 12, 8); props->max_mr_size = ~0ull; props->page_size_cap = mdev->limits.page_size_cap; From mst at mellanox.co.il Wed Jan 11 11:05:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Jan 2006 21:05:56 +0200 Subject: [openib-general] ipoib: outstanding patches In-Reply-To: <20060110230254.GJ16913@mellanox.co.il> References: <20060110230254.GJ16913@mellanox.co.il> Message-ID: <20060111190556.GC28406@mellanox.co.il> I just went over the patches again in detail. Here's the list of patches from https://openib.org/svn/trunk/contrib/mellanox/patches Quoting Michael S. Tsirkin : > Fixes for oopses that we saw in testing: > ipoib_up_flag_race.patch ipoib_up_flag_race.patch is removed. It is replaced by ipoib_flush_wq_1.patch The following are all very small: > ipoib_mc_list.patch > ipoib_flush_wq_1.patch > ipoib_flush_wq_2.patch The following are trickier: > ipoib_mcast_send.patch > ipoib_all_neigh_issues_2.patch The rest of them are quite small: > Memory leak that we saw in testing > ipoib_multicast_leak.patch > > Drop counter fix (not sure whether this is from testing or code review) > ipoib_multicast_drop_counter.patch > > Code review: cosmetic > ipoib_cosmetics.patch > > Code review: error handling > ipoib_init_qp.patch > ipoib_post_receives_err.patch > > Code review: races > ipoib_qprst_protect.patch > ipoib_multicast_ah.patch I have updated ipoib_multicast_ah.patch > ipoib_multicast_race.patch > > > There's also a patch for libibverbs > devinfo_board_id.patch > > and a patch for core/sysfs.c > node_desc_clear.patch ipoib_up_flag_race.patch is replaced by ipoib_flush_wq_1.patch I removed it from svn. 
ipoib_multicast_ah.patch could sleep under spinlock. I fixed it. -- MST From jice at pantasys.com Wed Jan 11 11:16:31 2006 From: jice at pantasys.com (Jean-Christophe Hugly) Date: Wed, 11 Jan 2006 11:16:31 -0800 Subject: [openib-general] stable/recommended revision ? Message-ID: <1137006991.24043.114.camel@jhugly.pantasys.com> Hi guys, Given that what ships with the Linux kernel (at least up to 2.6.13) is stripped to its bare essentials, I end up having to get my own src tree. I have not seen any reference to a so-called "stable" release, or any other recommendation than checking out from svn. That's fine with me, but considering that it changes by the minute, I was wondering if there was one particular revision of the recent past (or a tag, or a branch), that would be better suited for non-developers - yet :-). I bet it is a frequently asked question, but not one that I found answered so far. May be just posting a "safe-bet" revision number on the main page would be good enough. (or is always going to be "HEAD" :-) ?) Thanks for your help. -- Jean-Christophe Hugly PANTA From rdreier at cisco.com Wed Jan 11 11:22:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 11:22:20 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: <1137006991.24043.114.camel@jhugly.pantasys.com> (Jean-Christophe Hugly's message of "Wed, 11 Jan 2006 11:16:31 -0800") References: <1137006991.24043.114.camel@jhugly.pantasys.com> Message-ID: Jean-Christophe> I bet it is a frequently asked question, but not Jean-Christophe> one that I found answered so far. May be just Jean-Christophe> posting a "safe-bet" revision number on the main Jean-Christophe> page would be good enough. (or is always going to Jean-Christophe> be "HEAD" :-) ?) In general, our goal is that the head of the svn tree is the best subversion revision.
The version of the kernel drivers shipped in Linux kernels represents a pretty good stable snapshot at the time the kernel was shipped. 2.6.15 is not so stripped down, and 2.6.16 will be even better. Also, I am trying to drive to a real 1.0 release of libibverbs and libmthca packages for userspace. - R. From xma at us.ibm.com Wed Jan 11 11:27:31 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 11 Jan 2006 11:27:31 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: Message-ID: > Also, I am trying to drive to a real 1.0 release of libibverbs and > libmthca packages for userspace. When it is going to be available? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From rdreier at cisco.com Wed Jan 11 11:27:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 11:27:46 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: (Shirley Ma's message of "Wed, 11 Jan 2006 11:27:31 -0800") References: Message-ID: Shirley> When it is going to be available? When it's ready ;) - R. From xma at us.ibm.com Wed Jan 11 11:34:17 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 11 Jan 2006 11:34:17 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: Message-ID: Whether current SVN libibverbs and libmthca packages work well on linux-2.6.15 or linux-2.6.16 mainline stream IB stack? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From rdreier at cisco.com Wed Jan 11 11:40:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 11:40:05 -0800 Subject: [openib-general] stable/recommended revision ? 
In-Reply-To: (Shirley Ma's message of "Wed, 11 Jan 2006 11:34:17 -0800") References: Message-ID: Shirley> Whether current SVN libibverbs and libmthca packages work Shirley> well on linux-2.6.15 or linux-2.6.16 mainline stream IB Shirley> stack? As far as I know they should. Certainly my goal is to make libibverbs/libmthca packages completely backwards compatible -- using the newest userspace should work fine even against the oldest kernel. - R. From rdreier at cisco.com Wed Jan 11 11:45:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 11:45:39 -0800 Subject: [openib-general] [PATCH] SDP: add Message-ID: In upstream kernels, it seems that networking headers have been rejiggered so that no longer gets included implicitly into sdp_link.c. Since sdp_link.c uses the ARPHRD_INFINIBAND symbol, we need to include it explicitly. Signed-off-by: Roland Dreier --- infiniband/ulp/sdp/sdp_link.c (revision 4929) +++ infiniband/ulp/sdp/sdp_link.c (working copy) @@ -33,6 +33,8 @@ * $Id$ */ +#include + #include "ipoib.h" #include "sdp_main.h" From robert.j.woodruff at intel.com Wed Jan 11 11:55:31 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 11 Jan 2006 11:55:31 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: <1137006991.24043.114.camel@jhugly.pantasys.com> Message-ID: Jean wrote, >I bet it is a frequently asked question, but not one that I found >answered so far. May be just posting a "safe-bet" revision number on the >main page would be good enough. (or is always going to be "HEAD" :-) ?) >Thanks for your help. Very good question. Right now, for the kernel components, what is released with kernel.org is the stable version. Unfortunately, not all of the ULPs have been pushed upstream yet. There are also some backport patches/RPMS in branches/backport-to-2.6.9 that I test as a group before pushing out to SVN. 
These are generally stable versions, but they are in no way a "sanctioned" stable release from OpenIB. I think that there is going to be a discussion at the upcoming OpenIB workshop in Sonoma on this subject to see if we can agree on an approach for indicating stable versions. woody From lindahl at pathscale.com Wed Jan 11 12:35:37 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed, 11 Jan 2006 12:35:37 -0800 Subject: [openib-general] SA cache design In-Reply-To: <43C5273A.40605@mellanox.co.il> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <43C5273A.40605@mellanox.co.il> Message-ID: <20060111203537.GE2434@greglaptop.internal.keyresearch.com> Since no one's really answered this yet: Many sysadmins are not going to want to install a relational database to run an SA cache. So I'd stick to Berkeley DB if I were you. -- greg From jice at pantasys.com Wed Jan 11 12:46:28 2006 From: jice at pantasys.com (Jean-Christophe Hugly) Date: Wed, 11 Jan 2006 12:46:28 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: References: <1137006991.24043.114.camel@jhugly.pantasys.com> Message-ID: <1137012388.24043.124.camel@jhugly.pantasys.com> On Wed, 2006-01-11 at 11:22 -0800, Roland Dreier wrote: > Jean-Christophe> I bet it is a frequently asked question, but not > Jean-Christophe> one that I found answered so far. May be just > Jean-Christophe> posting a "safe-bet" revision number on the main > Jean-Christophe> page would be good enough. (or is always going to > Jean-Christophe> be "HEAD" :-) ?) > > In general, our goal is that the head of the svn tree is the best > subversion revision. > Thanks. That's the way I like it too :-) However, such a policy is not without its own set of issues. 
The one that caused me to ask the question in the first place is this: The folks at openmpi did publish a stable release which compiled splendidly against the libs installed by openib-userspace-svn3640-1.x86_64.rpm (thanks to the contributor, btw). But as soon as I installed openib's latest and greatest, a change of API caused openmpi to fail to compile. :-( Oh well, hopefully I'll get around that by grabbing the bleeding edge from openmpi as well. Or is it reasonable to try and leave the userspace stuff alone and update only the k code ? -- Jean-Christophe Hugly PANTA From mst at mellanox.co.il Wed Jan 11 13:21:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 11 Jan 2006 23:21:32 +0200 Subject: [openib-general] Re: [PATCH] SDP: add In-Reply-To: References: Message-ID: <20060111212132.GA29704@mellanox.co.il> Quoting r. Roland Dreier : > In upstream kernels, it seems that networking headers have been > rejiggered so that no longer gets included implicitly > into sdp_link.c. Since sdp_link.c uses the ARPHRD_INFINIBAND symbol, > we need to include it explicitly. Thanks, applied. -- MST From mshefty at ichips.intel.com Wed Jan 11 13:51:11 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 11 Jan 2006 13:51:11 -0800 Subject: [openib-general] SA cache design In-Reply-To: <20060111203537.GE2434@greglaptop.internal.keyresearch.com> References: <43BB1A0F.2080305@ichips.intel.com> <43C40328.7060201@ichips.intel.com> <43C5273A.40605@mellanox.co.il> <20060111203537.GE2434@greglaptop.internal.keyresearch.com> Message-ID: <43C57DCF.4040402@ichips.intel.com> Greg Lindahl wrote: > Since no one's really answered this yet: > > Many sysadmins are not going to want to install a relational database > to run an SA cache. So I'd stick to Berkeley DB if I were you. Thanks for the response. To be clear, the cache would be an optional component, and likely only needed for larger configurations. 
From what I can tell PostgreSQL and MySQL both ship with RedHat and SuSE. MySQL claims that it can be built as a small library that can then be integrated with an application. It may be possible to have the application do everything for the user except install the necessary libraries... ? The installation and configuration of a database is what I see as the biggest drawback to going this route. Unfortunately, I need to play with this idea more to see how much of an impact that would be to an actual user. - Sean From rdreier at cisco.com Wed Jan 11 13:59:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 13:59:59 -0800 Subject: [openib-general] Re: ipoib: outstanding patches In-Reply-To: <20060111190556.GC28406@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 11 Jan 2006 21:05:56 +0200") References: <20060110230254.GJ16913@mellanox.co.il> <20060111190556.GC28406@mellanox.co.il> Message-ID: OK, I've started reviewing and applying these. > ipoib_mc_list.patch Applied, except spin_lock_bh() followed by spin_lock_irqsave() looked silly to me, so I changed it to spin_lock_irqsave() followed by spin_lock(). > ipoib_flush_wq_1.patch > ipoib_flush_wq_2.patch Still trying to decide if I like this approach. Right now the ipoib workqueue is only doing multicast stuff, so it's easier for me to see what's going on. These patches lose that so I'm trying to see if there's a better approach. > ipoib_mcast_send.patch Could we reuse the IPOIB_MCAST_RUN bit rather than adding a new bit? It seems that we could kill mcast_mutex and replace uses with priv->lock instead -- I don't see anything that sleeps inside mcast_mutex. > ipoib_all_neigh_issues_2.patch Could we do this without a linear search through a list of neighbours? It seems this might become a scalability issue. > ipoib_multicast_leak.patch Why does this change only handle send-only multicast groups? Where are other multicast groups getting freed now that misses send-only groups? - R. 
From rdreier at cisco.com Wed Jan 11 14:06:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 14:06:43 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix user-level pd leak In-Reply-To: <20060111154938.GA3338@mellanox.co.il> (Jack Morgenstein's message of "Wed, 11 Jan 2006 17:49:38 +0200") References: <20060111154938.GA3338@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Wed Jan 11 14:09:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 14:09:30 -0800 Subject: [openib-general] Re: [PATCH] libmthca: mthca_free_pd frees pointer to wrong structure In-Reply-To: <20060111155319.GB3338@mellanox.co.il> (Jack Morgenstein's message of "Wed, 11 Jan 2006 17:53:19 +0200") References: <20060111155319.GB3338@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Wed Jan 11 14:11:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 14:11:53 -0800 Subject: [openib-general] stable/recommended revision ? In-Reply-To: <1137012388.24043.124.camel@jhugly.pantasys.com> (Jean-Christophe Hugly's message of "Wed, 11 Jan 2006 12:46:28 -0800") References: <1137006991.24043.114.camel@jhugly.pantasys.com> <1137012388.24043.124.camel@jhugly.pantasys.com> Message-ID: Jean-Christophe> Oh well, hopefully I'll get around that by Jean-Christophe> grabbing the bleeding edge from openmpi as Jean-Christophe> well. Or is it reasonable to try and leave the Jean-Christophe> userspace stuff alone and update only the k code Jean-Christophe> ? There was one API change in libibverbs that I wanted to make before freezing the API for a 1.0 release series. You can definitely try to upgrade the kernel only -- the userspace libraries will tell you if the kernel is too new for them to understand. I don't remember when the last kernel/user ABI breakage was, but there shouldn't be any silent problems anyway. - R. 
From mshefty at ichips.intel.com Wed Jan 11 14:21:17 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 11 Jan 2006 14:21:17 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com>
Message-ID: <43C584DD.70503@ichips.intel.com>

Rimmer, Todd wrote:
> A relational database is overkill for this function.
> It will also likely be more complex for end users to setup and debug.
> The cache setup should be simple. The solution should be such that
> just an on/off switch needs to be configured (with a default of on)
> for most users to get started.

My take is a little different. I view the SA as a database that maintains
related attributes. By supporting relationships between different attributes,
we can provide a more powerful, higher performing, and more user-friendly
interface to the user.

For example, a single SQL query could return path records given only a node
description or service name. Today, we generate multiple SA queries, their
responses, and associated RMPP MADs to obtain the same data.

I'm not sold on the idea of using a relational database, because of the
additional complexity for end-users. However, I believe it can offer
significant advantages over what we could code ourselves.

- Sean

From mst at mellanox.co.il Wed Jan 11 14:23:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 00:23:49 +0200
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To:
References:
Message-ID: <20060111222349.GC27348@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: [openib-general] Re: ipoib: outstanding patches
>
> OK, I've started reviewing and applying these.
>
> > ipoib_mc_list.patch
>
> Applied, except spin_lock_bh() followed by spin_lock_irqsave() looked
> silly to me, so I changed it to spin_lock_irqsave() followed by spin_lock().

That's fine too.

> > ipoib_flush_wq_1.patch
> > ipoib_flush_wq_2.patch
>
> Still trying to decide if I like this approach. Right now the ipoib
> workqueue is only doing multicast stuff, so it's easier for me to see
> what's going on. These patches lose that so I'm trying to see if
> there's a better approach.
>
> > ipoib_mcast_send.patch
>
> Could we reuse the IPOIB_MCAST_RUN bit rather than adding a new bit?
> It seems that we could kill mcast_mutex and replace uses with
> priv->lock instead -- I don't see anything that sleeps inside mcast_mutex.

I'll need to think about this.

> > ipoib_all_neigh_issues_2.patch
>
> Could we do this without a linear search through a list of neighbours?
> It seems this might become a scalability issue.

We could have a list of distinct ops pointers. Would that be better?

> > ipoib_multicast_leak.patch
>
> Why does this change only handle send-only multicast groups? Where
> are other multicast groups getting freed now that misses send-only groups?

Linux will clean other groups from mc_list so restart will kill them.

--
MST

From eitan at mellanox.co.il Wed Jan 11 14:20:51 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 12 Jan 2006 00:20:51 +0200
Subject: [openib-general] Anounce: Advanced Diagnostic Tools
Message-ID: <43C584C3.5010104@mellanox.co.il>

Hi,

With the great help from Danny Zarko and Ariel Libman I was able to upload
into https://openib.org/svn/gen2/utils/src/linux-user the first of several
integrated IB diagnostic tools: ibdiagnet (diagnose network).

The tool depends on ibis and ibdm (available in the same directory).

Its main differences from the diag tools available under the trunk are:
1. Performs a complete diagnostic procedure, including:
   * discovery,
   * PM counters check,
   * duplicate LID/GUID
   * ALL to ALL connectivity check (based on LFT data extracted from the fabric)
   * Multicast connectivity and report
   * Credit loop analysis
   * and various other fabric statistics
2.
If a topology file is provided - all reports are given using system names
(rather than LID, GUID or directed paths).

###############################################################################################

Here are some stdout examples
------------------------------

1. BAD LIDS

-E- Device(s) with LID = 0x0000 found in the fabric:
    path="1 1 3 5" H-12/U1 PN=2
    path="1 1 3 4" H-11/U1 PN=1
    path="1 4" H-3/U1 PN=1

2. DUPLICATED PORT GUIDS

-E- Devices with identical PortGUID = 0x0002c90000000006 found in the fabric:
    path="1 1" GNU1/main/U2
    path="1 1 5 6" H-9/U1 PN=1
    path="1 1 5 5" H-10/U1 PN=2

3. BAD LINKS

-I- Errors have occurred on the following links (for errors details, look in log file /tmp/ibmgtsim.31602/ibdiagnet.log):
    Cable: GNU1/M/P7(GNU1/main/U4/P4) =---= H-7/P2(H-7/U1/P2)
    Cable: GNU1/M/P5(GNU1/main/U4/P6) =---= H-5/P2(H-5/U1/P2)

4. TOPOLOGY MATCH

-I- Note that "bad" links and the part of the fabric to which they led (in the
    BFS discovery of the fabric, starting at the local node) are not discovered
    and therefore will be reported as "missing".
Missing System:H-7(Cougar)
    Should be connected by cable from port: P2(H-7/U1/P2) to:GNU1/M/P7(GNU1/main/U4/P4)
Missing System:H-5(Cougar)
    Should be connected by cable from port: P2(H-5/U1/P2) to:GNU1/M/P5(GNU1/main/U4/P6)

5. MULTICAST ROUTING

-I- Scanning all multicast groups for loops and connectivity...
-I- Multicast Group:0xC000 has:2 switches and:2 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC000
-E- Extra switch:GNU1/main/U4 in group:0xC000
-I- Multicast Group:0xC001 has:4 switches and:4 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC001
-I- Multicast Group:0xC002 has:5 switches and:5 HCAs
-E- 3 multicast group checks failed
-I---------------------------------------------------
-I- mgid-mlid-HCAs matching table
-I---------------------------------------------------
mgid                                  | mlid   | HCAs
--------------------------------------------------------------------------------
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | H-11/U1,H-12/U1
0xff12401bffff0000:0x0000000000000001 | 0xc001 | H-15/U1,H-3/U1,H-2/U1,H-7/U1
0xff12401bffff0000:0x0000000000000002 | 0xc002 | H-10/U1,H-16/U1,H-4/U1,H-6/U1

6. UNICAST ROUTING:

-I- Verifying all CA to CA paths ...
-E- Unassigned LFT for lid:10 Dead end at:GNU1/main/U1
-E- Fail to find a path from:H-1/U1/1 to:H-12/U1/2
-E- Unassigned LFT for lid:18 Dead end at:GNU1/main/U3
-E- Fail to find a path from:H-1/U1/1 to:H-5/U1/2
[snip]
-E- Found 19 missing paths out of:240 paths

7. CREDIT LOOPS

-I- Tracing all CA to CA paths for Credit Loops potential ...
-E- Potential Credit Loop on Path from:H-1/U1/1 to:H-13/U1/1
    Going:Down from:GNU1/main/U1 to:GNU1/main/U3
    Going:Up from:GNU1/main/U3 to:GNU1/main/U1
    Going:Down from:GNU1/main/U1 to:GNU1/leaf1/U1

NOTE: All the above cases were simulated on top of ibmgtsim.
Errors were injected by simulation flows.

######################################################################################

A full man page:
====================

NAME
  ibdiagnet

SYNOPSIS
  ibdiagnet [-c ] [-v] [-r] [-t ] [-s ] [-i ] [-p ] [-o ]

DESCRIPTION
  ibdiagnet scans the fabric using directed route packets and extracts all the
  available information regarding its connectivity and devices.
  It then produces the following files in the output directory defined by the
  -o option (see below):
    ibdiagnet.lst    - List of all the nodes, ports and links in the fabric
    ibdiagnet.fdbs   - A dump of the unicast forwarding tables of the fabric switches
    ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric switches
  In addition to generating the files above, the discovery phase also checks for
  duplicate node GUIDs in the IB fabric. If such an error is detected, it is
  displayed on the standard output.
  After the discovery phase is completed, directed route packets are sent
  multiple times (according to the -c option) to detect possible problematic
  paths on which packets may be lost. Such paths are explored, and a report of
  the suspected bad links is displayed on the standard output.
  After scanning the fabric, if the -r option is provided, a full report of the
  fabric qualities is displayed. This report includes:
    Number of nodes and systems
    Hop-count information:
      maximal hop-count, an example path, and a hop-count histogram
    All CA-to-CA paths traced
  Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not
  reported.
  Furthermore, if a topology file is provided, ibdiagnet uses the names defined
  in it for the output reports.

OPTIONS
  -c : The minimal number of packets to be sent across each link (default = 10)
  -v : Instructs the tool to run in verbose mode
  -r : Provides a report of the fabric qualities
  -t : Specifies the topology file name
  -s : Specifies the local system name. Meaningful only if a topology file is
       specified
  -i : Specifies the index of the device of the port used to connect to the IB
       fabric (in case of multiple devices on the local system)
  -p : Specifies the local device's port number used to connect to the IB fabric
  -o : Specifies the directory where the output files will be placed
       (default = /tmp/ez)
  -h|--help : Prints this help information
  -V|--version : Prints the version of the tool
  --vars : Prints the tool's environment variables and their values

ERROR CODES
  1 - Failed to fully discover the fabric
  2 - Failed to parse command line options
  3 - Some packet drop observed
  4 - Mismatch with provided topology

From rdreier at cisco.com Wed Jan 11 16:25:25 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 16:25:25 -0800
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To: <20060111222349.GC27348@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 00:23:49 +0200")
References: <20060111222349.GC27348@mellanox.co.il>
Message-ID:

 > > > ipoib_all_neigh_issues_2.patch
 > > Could we do this without a linear search through a list of neighbours?
 > > It seems this might become a scalability issue.
 > We could have a list of distinct ops pointers. Would that be better?

Somewhat better. Let me think about this too.

BTW my goal is to get (at least) all the IPoIB crash fixes and leak fixes
merged by the end of the week so that they're all in 2.6.16-rc1. Once that
happens, if you or someone else at Mellanox wants to help push critical fixes
to the -stable tree for 2.6.15.1, that would be great.

 - R.

From rdreier at cisco.com Wed Jan 11 18:20:22 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 18:20:22 -0800
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To: <20060111222349.GC27348@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 12 Jan 2006 00:23:49 +0200") References: <20060111222349.GC27348@mellanox.co.il> Message-ID: > ipoib_all_neigh_issues_2.patch Crazy idea: can we just get away with never clearing ops->destructor? ipoib_neigh_destructor() checks if the neighbour structure has an IPoIB structure attached and does nothing if it doesn't. So does it hurt to leave ops->destructor set to ipoib_neigh_destructor() forever? - R. From rdreier at cisco.com Wed Jan 11 18:22:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jan 2006 18:22:37 -0800 Subject: [openib-general] Re: ipoib: outstanding patches In-Reply-To: (Roland Dreier's message of "Wed, 11 Jan 2006 18:20:22 -0800") References: <20060111222349.GC27348@mellanox.co.il> Message-ID: Roland> Crazy idea: can we just get away with never clearing Roland> ops->destructor? ipoib_neigh_destructor() checks if the Roland> neighbour structure has an IPoIB structure attached and Roland> does nothing if it doesn't. So does it hurt to leave Roland> ops->destructor set to ipoib_neigh_destructor() forever? ie just do the following... (It seems to me that if this approach has a problem, we're leaking IPoIB neighbour structures already...) 
--- infiniband/ulp/ipoib/ipoib_main.c (revision 4929) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 4936) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -114,7 +114,6 @@ static void ipoib_mcast_free(struct ipoi if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } From mikeknox at lcse.umn.edu Tue Jan 10 20:32:29 2006 From: mikeknox at lcse.umn.edu (Michael Knox) Date: Tue, 10 Jan 2006 22:32:29 -0600 Subject: [openib-general] Errors when loading ib_umad Message-ID: <43C48A5D.2090309@lcse.umn.edu> I'm trying to load the OpenIB driver under Fedora Core 4 x86_64. I downloaded the 2.6.14 smp kernel and successfully replaced the included driver with the latest from openib.org and build the modules. I also upgraded to the 4.7 firmware. Unfortunately modprobe returns some interesting errors when loading ib_umad: [root at l13 ~]# modprobe ib_umad Killed [root at l13 ~]# Message from syslogd at l13 at Wed Jan 11 22:10:41 2006 ... l13 kernel: Oops: 0000 [1] SMP Message from syslogd at l13 at Wed Jan 11 22:10:41 2006 ... 
l13 kernel: CR2: 000000000e70014c ----------------- Dmesg reports: Unable to handle kernel paging request at 000000000e70014c RIP: {kref_get+4} PGD 2449db067 PUD 24980a067 PMD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: ib_umad parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc pcmcia yenta_socket rsrc_nonstatic pcmcia_core video button battery ac ipv6 ohci1394 ieee1394 uhci_hcd ehci_hcd nvidia hw_random ib_mthca ib_mad ib_core e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata 3w_9xxx sd_mod scsi_mod Pid: 2958, comm: modprobe Tainted: P 2.6.14 #1 RIP: 0010:[] {kref_get+4} RSP: 0000:ffff810241a31d18 EFLAGS: 00010206 RAX: ffff810247ae5560 RBX: 000000000e70014c RCX: 0000000000000007 RDX: ffff810247ae5567 RSI: ffffffff8036370c RDI: 000000000e70014c RBP: ffff81024770aca0 R08: 0000000000000002 R09: 0000000000000000 R10: ffff81024492a738 R11: ffffffff801da5ca R12: 00000000fffffff4 R13: ffffffff80363705 R14: ffff81024492a738 R15: 000000000e700130 FS: 00002aaaaaac03c0(0000) GS:ffffffff80503880(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000e70014c CR3: 0000000245f2d000 CR4: 00000000000006e0 Process modprobe (pid: 2958, threadinfo ffff810241a30000, task ffff8100afc41180) Stack: 000000000e700130 ffffffff801f08f4 ffff810247ae52c0 ffffffff801be06d 0000000000000000 0000000000000000 ffff810242c8d3c0 ffff810244067e00 ffff810244067e00 000000000e700000 Call Trace:{kobject_get+18} {sysfs_create_link+187} {class_device_add+393} {class_device_create+168} {d_instantiate+98} {dput+55} {kobj_map+226} {exact_lock+0} {exact_match+0} {:ib_umad:ib_umad_add_one+498} {:ib_core:ib_register_client+134} {:ib_umad:ib_umad_init+149} {sys_init_module+272} {tracesys+209} Code: 8b 07 85 c0 75 24 b9 20 00 00 00 48 c7 c2 45 2b 36 80 48 c7 RIP {kref_get+4} RSP CR2: 000000000e70014c Could anyone give me some advice on what I should be investigating to resolve these problems? 
Thanks,
Mike Knox

From benoit.morin at ieee.org Wed Jan 11 21:02:06 2006
From: benoit.morin at ieee.org (Benoit Morin)
Date: Thu, 12 Jan 2006 00:02:06 -0500
Subject: [openib-general] [mstflint] firmware upgrade error
Message-ID: <43C5E2CE.8010708@ieee.org>

Hi,

I tried upgrading the firmware on a MHX-CE128-T card (firmware 3.2.0) using the firmware fw-23108-3_3_5-MHX-CE128-T.bin found on Mellanox's website. I used mstflint found in revision 4464 of the openib svn.

I used the -nofs flag following what I read in the "[mstflint] firmware upgrade instructions" thread.

Now, after a reboot, I get the following:

ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
ib_mthca: Initializing 0000:02:00.0
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 28 (level, low) -> IRQ 19
ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD source=DIMM
ib_mthca 0000:02:00.0: SYS_EN returned status 0x07, aborting.
ACPI: PCI interrupt for device 0000:02:00.0 disabled
ib_mthca: probe of 0000:02:00.0 failed with error -22

I no longer have access to anything under /sys/class/infiniband.

Is the card toast? Where did I go wrong?

Thanks all,
Benoit Morin

From iod00d at hp.com Wed Jan 11 22:09:05 2006
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 11 Jan 2006 22:09:05 -0800
Subject: [openib-general] [mstflint] firmware upgrade error
In-Reply-To: <43C5E2CE.8010708@ieee.org>
References: <43C5E2CE.8010708@ieee.org>
Message-ID: <20060112060905.GB29168@esmail.cup.hp.com>

On Thu, Jan 12, 2006 at 12:02:06AM -0500, Benoit Morin wrote:
> Hi,
>
> I tried upgrading the firmware on a MHX-CE128-T card (firmware 3.2.0)
> using the firmware fw-23108-3_3_5-MHX-CE128-T.bin found on Mellanox's
> website. I used mstflint found in revision 4464 of the openib svn.

I just went through this exercise with a PCI-e card.

> I used the -nofs flag following what I read in the "[mstflint] firmware
> upgrade instructions" thread.
>
> Now, after a reboot, I get the following :
>
> ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
> ib_mthca: Initializing 0000:02:00.0
> ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 28 (level, low) -> IRQ 19
> ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=0, sock=0, sladdr=0, SPD
> source=DIMM
> ib_mthca 0000:02:00.0: SYS_EN returned status 0x07, aborting.
> ACPI: PCI interrupt for device 0000:02:00.0 disabled
> ib_mthca: probe of 0000:02:00.0 failed with error -22
>
> I no longer have access to anything under /sys/class/infiniband.

That's because the driver didn't successfully initialize.

> Is the card toast? Where did I go wrong?

It sounds like you grabbed the wrong firmware. Maybe one with more or less memory on it. I'm not sure though.

I've used tvflash regularly to update firmware on PCI-X cards. Other reports to this mailing list indicate mstflint works fine too.

I'll point out that neither tvflash nor mstflint worked for updating PCI-e cards on ia64 as I posted a few weeks ago. Tvflash will fully trash the eeprom. I successfully resurrected the PCI-e card using "mst" and "flint" (mft-0.5.0 from Mellanox firmware support web site) on a debian Intel P4 box.

However, your card is likely not toast. Try another firmware. Worst case you get to learn about the "J5" jumper ("EEPROM NOT PRESENT") and use "mst" and "flint" tools like I did.

hth,
grant

From rdreier at cisco.com Wed Jan 11 22:29:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 22:29:28 -0800
Subject: [openib-general] Errors when loading ib_umad
In-Reply-To: <43C48A5D.2090309@lcse.umn.edu> (Michael Knox's message of "Tue, 10 Jan 2006 22:32:29 -0600")
References: <43C48A5D.2090309@lcse.umn.edu>
Message-ID:

http://openib.org/pipermail/openib-general/2006-January/015216.html

From mst at mellanox.co.il Wed Jan 11 22:43:41 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 12 Jan 2006 08:43:41 +0200
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To:
References:
Message-ID: <20060112064341.GA5284@mellanox.co.il>

Quoting Roland Dreier :
> BTW my goal is to get (at least) all the IPoIB crash fixes and leak
> fixes merged by the end of the week so that they're all in
> 2.6.16-rc1.
>
> Once that happens, if you or someone else at Mellanox wants to help
> push critical fixes to the -stable tree for 2.6.15.1, that would be great.

What kind of help is needed?

--
MST

From rdreier at cisco.com Wed Jan 11 22:48:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 22:48:15 -0800
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To: <20060112064341.GA5284@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 08:43:41 +0200")
References: <20060112064341.GA5284@mellanox.co.il>
Message-ID:

>> Once that happens, if you or someone else at Mellanox wants to
>> help push critical fixes to the -stable tree for 2.6.15.1, that
>> would be great.

> What kind of help is needed?

Basically just pulling patches out of the 2.6.16-rc tree, making sure they apply and work against 2.6.15, and sending them to stable at kernel.org.

- R.

From mst at mellanox.co.il Wed Jan 11 22:48:24 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 08:48:24 +0200
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To:
References:
Message-ID: <20060112064824.GB5284@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] Re: ipoib: outstanding patches
>
> > ipoib_all_neigh_issues_2.patch
>
> Crazy idea: can we just get away with never clearing ops->destructor?
> ipoib_neigh_destructor() checks if the neighbour structure has an
> IPoIB structure attached and does nothing if it doesn't. So does it
> hurt to leave ops->destructor set to ipoib_neigh_destructor() forever?

We can't leave the destructor set after unloading the module.
This is because ops structure is not per neighbour and not per device. So while kernel destroys all neighbours it created for our device, the ops structure is shared with neighbours for other devices.

--
MST

From mst at mellanox.co.il Wed Jan 11 22:51:34 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 08:51:34 +0200
Subject: [openib-general] Re: Re: ipoib: outstanding patches
In-Reply-To:
References:
Message-ID: <20060112065134.GC5284@mellanox.co.il>

Quoting r. Roland Dreier :
> > > > ipoib_all_neigh_issues_2.patch
>
> > > Could we do this without a linear search through a list of neighbours?
> > > It seems this might become a scalability issue.
>
> > We could have a list of distinct ops pointers. Would that be better?
>
> Somewhat better. Let me think about this too.

The right thing is to move the destructor pointer out of the ops structure to the neighbour structure, and let the driver set it in the setup function.

This will also solve the ugly problem that only one driver in the whole kernel can ever use the ops destructor pointer trick.

--
MST

From rdreier at cisco.com Wed Jan 11 22:53:47 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jan 2006 22:53:47 -0800
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To: <20060112064824.GB5284@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 08:48:24 +0200")
References: <20060112064824.GB5284@mellanox.co.il>
Message-ID:

Michael> We can't leave the destructor set after unloading the
Michael> module.

Michael> This is because ops structure is not per neighbour and
Michael> not per device. So while kernel destroys all neighbours
Michael> it created for our device, the ops structure is shared
Michael> with neighbours for other devices.

Ugh. So we're relying on the fact that no one else is sticking something in the ->ha member where we stash the ipoib_neigh.

Basically we're getting lucky that this works at all.
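
[Editorial illustration, not part of the original message.] The hazard described in this exchange can be sketched in plain userspace C. This is only a toy model under stated assumptions: the types and functions below (neigh_ops_sketch, net_device_sketch, neighbour_sketch, ipoib_hook, destroy_via_ops, destroy_via_dev) are hypothetical stand-ins and are not the kernel's structures or API. The sketch shows that a hook stored in an ops table shared by several devices also fires for another device's neighbours, while a hook stored per device does not.

```c
#include <stddef.h>

struct neighbour_sketch;

/* Hypothetical stand-in for an ops table shared by many devices. */
struct neigh_ops_sketch {
	void (*destructor)(struct neighbour_sketch *);
};

/* Hypothetical stand-in for a device carrying its own destructor hook. */
struct net_device_sketch {
	const char *name;
	void (*neigh_destructor)(struct neighbour_sketch *);
};

struct neighbour_sketch {
	struct net_device_sketch *dev;
	struct neigh_ops_sketch *ops;
};

static int ipoib_hook_calls;

/* Stand-in for a driver-specific destructor; it only counts invocations. */
static void ipoib_hook(struct neighbour_sketch *n)
{
	(void)n;
	ipoib_hook_calls++;
}

/* Scheme 1: the hook is looked up in the shared ops table. */
static void destroy_via_ops(struct neighbour_sketch *n)
{
	if (n->ops && n->ops->destructor)
		n->ops->destructor(n);
}

/* Scheme 2: the hook is looked up in the neighbour's own device. */
static void destroy_via_dev(struct neighbour_sketch *n)
{
	if (n->dev->neigh_destructor)
		n->dev->neigh_destructor(n);
}

/* How often the hook fires for a neighbour of an unrelated device
 * when the hook lives in the shared ops table. */
int foreign_hook_calls_shared_ops(void)
{
	struct neigh_ops_sketch shared_ops = { NULL };
	struct net_device_sketch eth0 = { "eth0", NULL };
	struct neighbour_sketch eth_neigh = { &eth0, &shared_ops };

	shared_ops.destructor = ipoib_hook; /* one driver installs its hook */
	ipoib_hook_calls = 0;
	destroy_via_ops(&eth_neigh);        /* unrelated neighbour is torn down */
	return ipoib_hook_calls;
}

/* Same question when the hook lives in the device instead. */
int foreign_hook_calls_per_device(void)
{
	struct neigh_ops_sketch shared_ops = { NULL };
	struct net_device_sketch ib0 = { "ib0", ipoib_hook };
	struct net_device_sketch eth0 = { "eth0", NULL };
	struct neighbour_sketch ib_neigh = { &ib0, &shared_ops };
	struct neighbour_sketch eth_neigh = { &eth0, &shared_ops };
	int foreign;

	ipoib_hook_calls = 0;
	destroy_via_dev(&eth_neigh);        /* eth0 has no hook installed */
	foreign = ipoib_hook_calls;
	destroy_via_dev(&ib_neigh);         /* ib0's own neighbour still gets the hook */
	return foreign;
}
```

Calling foreign_hook_calls_shared_ops() and foreign_hook_calls_per_device() from a small main() and comparing the two counts shows the difference between the two placements of the hook.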
As you said in your other mail, it seems like a more fundamental reorganization of this neighbour destructor stuff is required. - R. From iod00d at hp.com Wed Jan 11 22:55:23 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 11 Jan 2006 22:55:23 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060111080631.GW16938@mellanox.co.il> References: <20060111015623.GE14203@esmail.cup.hp.com> <20060111080631.GW16938@mellanox.co.il> Message-ID: <20060112065523.GD29168@esmail.cup.hp.com> On Wed, Jan 11, 2006 at 10:06:31AM +0200, Michael S. Tsirkin wrote: > Could you please try sdp patches from > https://openib.org/svn/trunk/contrib/mellanox/patches As noted earlier, netperf TCP_RR over SDP ran to completion with no problems. netperf TCP_STREAM over SDP started spewing the same errors despite the patches. :( On the "netserver" side: ib_sdp ERR: IOCB <-1> cancel <0> flag <03c0> size <32768:24591:8177> ib_sdp ERR: IOCB <-1> cancel <0> flag <03c0> size <32768:16384:16384> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <32768:0:32768> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <32768:16384:16384> ib_sdp WARN: Unexpected abort. conn <1483> state <4701> (last line is from hitting ^C to kill the remote netperf test) On the netperf side: ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8192:0:8192> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> Any clue what I might look for to help track this down? thanks, grant From mst at mellanox.co.il Wed Jan 11 22:59:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 08:59:04 +0200 Subject: [openib-general] Re: Re: ipoib: outstanding patches In-Reply-To: References: Message-ID: <20060112065903.GE5284@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: Re: ipoib: outstanding patches > > Michael> We cant leave the destructor set after unloading the > Michael> module. > > Michael> This is because ops structure is not per neighbour and > Michael> not per device. So while kernel destroys all neighbours > Michael> it created for our device, the ops structure is shared > Michael> with neighbours for other devices. > > Ugh. So we're relying on the fact that no one else is sticking > something in the ->ha member where we stash the ipoib_neigh. Ugh. Right. With the all neighbour list we can do a lookup there so make it safe. > Basically we're getting lucky that this works at all. > > As you said in your other mail, it seems like a more fundamental > reorganization of this neighbour destructor stuff is required. Right. [Sound of busy typing ] -- MST From mst at mellanox.co.il Wed Jan 11 23:00:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 09:00:36 +0200 Subject: [openib-general] Re: [mstflint] firmware upgrade error In-Reply-To: <20060112060905.GB29168@esmail.cup.hp.com> References: <20060112060905.GB29168@esmail.cup.hp.com> Message-ID: <20060112070036.GF5284@mellanox.co.il> Quoting Grant Grundler : > I'll point out that neither tvflash nor mstflint worked for > updating PCI-e cards on ia64 as I posted a few weeks ago. > Tvflash will fully trash the eeprom. I successfully resurrected > the PCI-e card using "mst" and "flint" (mft-0.5.0 from Mellanox > firmware support web site) on a debian Intel P4 box. I missed that one, sorry. I'll take a look at the archives. -- MST From iod00d at hp.com Wed Jan 11 23:14:31 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 11 Jan 2006 23:14:31 -0800 Subject: [openib-general] SDP perf drop with 2.6.15 Message-ID: <20060112071431.GE29168@esmail.cup.hp.com> Michael, The last couple of SDP_RR perf data are alarming low. Worse than 10GigE on the same HW. 
$ fgrep "1 1 60.0" openib-perf-200*/rx260*/sdprr-tx1.out openib-perf-2005/rx2600-r3984/sdprr-tx1.out:16384 87380 1 1 60.00 18483.92 7.14 6.93 7.721 7.503 openib-perf-2005/rx2600-r4371/sdprr-tx1.out:16384 87380 1 1 60.01 18466.79 4.70 4.63 5.088 5.010 openib-perf-2006/rx2600-r4371/sdprr-tx1.out:16384 87380 1 1 60.01 18466.79 4.70 4.63 5.088 5.010 openib-perf-2006/rx2600-r4800/sdprr-tx1.out:16384 87380 1 1 60.00 13898.76 53.74 7.58 77.327 10.904 openib-perf-2006/rx2600-r4929/sdprr-tx1.out:16384 87380 1 1 60.00 13902.33 6.29 8.84 9.043 12.718 r4800 was the first time I'd used 2.6.15. I believe the other numbers are 2.6.14. CPU utilization numbers are just alot higher. Makes me think it's a change in the interrupt code path. Any clues on where I might start with this? thanks, grant From mst at mellanox.co.il Wed Jan 11 23:17:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 09:17:02 +0200 Subject: [openib-general] Re: Re: ipoib: outstanding patches In-Reply-To: <20060112065903.GE5284@mellanox.co.il> References: <20060112065903.GE5284@mellanox.co.il> Message-ID: <20060112071702.GB5850@mellanox.co.il> Quoting r. Michael S. Tsirkin : > > Basically we're getting lucky that this works at all. > > > > As you said in your other mail, it seems like a more fundamental > > reorganization of this neighbour destructor stuff is required. > > Right. [Sound of busy typing ] How does the following look? Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.15/include/linux/netdevice.h =================================================================== --- linux-2.6.15.orig/include/linux/netdevice.h 2006-01-03 05:21:10.000000000 +0200 +++ linux-2.6.15/include/linux/netdevice.h 2006-01-12 11:55:49.000000000 +0200 @@ -485,6 +485,7 @@ struct net_device int (*hard_header_parse)(struct sk_buff *skb, unsigned char *haddr); int (*neigh_setup)(struct net_device *dev, struct neigh_parms *); + void (*neigh_destructor)(struct neighbour *); #ifdef CONFIG_NETPOLL struct netpoll_info *npinfo; #endif Index: linux-2.6.15/net/core/neighbour.c =================================================================== --- linux-2.6.15.orig/net/core/neighbour.c 2006-01-12 11:58:15.000000000 +0200 +++ linux-2.6.15/net/core/neighbour.c 2006-01-12 11:58:45.000000000 +0200 @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->dev->neigh_destructor) + (neigh->dev->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); Index: linux-2.6.15/include/net/neighbour.h =================================================================== --- linux-2.6.15.orig/include/net/neighbour.h 2006-01-03 05:21:10.000000000 +0200 +++ linux-2.6.15/include/net/neighbour.h 2006-01-12 11:54:26.000000000 +0200 @@ -145,7 +145,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); -- MST From mst at mellanox.co.il Wed Jan 11 23:18:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 09:18:03 +0200 Subject: [openib-general] Re: SDP perf drop with 2.6.15 In-Reply-To: <20060112071431.GE29168@esmail.cup.hp.com> References: <20060112071431.GE29168@esmail.cup.hp.com> Message-ID: <20060112071803.GC5850@mellanox.co.il> Quoting r. 
Grant Grundler : > Subject: SDP perf drop with 2.6.15 > > Michael, > The last couple of SDP_RR perf data are alarming low. > Worse than 10GigE on the same HW. > > > $ fgrep "1 1 60.0" openib-perf-200*/rx260*/sdprr-tx1.out > openib-perf-2005/rx2600-r3984/sdprr-tx1.out:16384 87380 1 1 60.00 18483.92 7.14 6.93 7.721 7.503 > openib-perf-2005/rx2600-r4371/sdprr-tx1.out:16384 87380 1 1 60.01 18466.79 4.70 4.63 5.088 5.010 > openib-perf-2006/rx2600-r4371/sdprr-tx1.out:16384 87380 1 1 60.01 18466.79 4.70 4.63 5.088 5.010 > openib-perf-2006/rx2600-r4800/sdprr-tx1.out:16384 87380 1 1 60.00 13898.76 53.74 7.58 77.327 10.904 > openib-perf-2006/rx2600-r4929/sdprr-tx1.out:16384 87380 1 1 60.00 13902.33 6.29 8.84 9.043 12.718 > > r4800 was the first time I'd used 2.6.15. > I believe the other numbers are 2.6.14. > > CPU utilization numbers are just alot higher. > Makes me think it's a change in the interrupt code path. > Any clues on where I might start with this? > > thanks, > grant oprofile should show where we are spending the time. -- MST From mst at mellanox.co.il Wed Jan 11 23:42:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 09:42:13 +0200 Subject: [openib-general] Re: Re: ipoib: outstanding patches In-Reply-To: <20060112071702.GB5850@mellanox.co.il> References: <20060112071702.GB5850@mellanox.co.il> Message-ID: <20060112074213.GD5850@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: Re: ipoib: outstanding patches > > Quoting r. Michael S. Tsirkin : > > > Basically we're getting lucky that this works at all. > > > > > > As you said in your other mail, it seems like a more fundamental > > > reorganization of this neighbour destructor stuff is required. > > > > Right. [Sound of busy typing ] > > How does the following look? > > Signed-off-by: Michael S. Tsirkin And then in ipoib we can do this: Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 12:05:17.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 12:23:16.000000000 +0200 @@ -246,9 +246,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -476,7 +475,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -484,8 +483,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -528,11 +525,8 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); - + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -759,8 +753,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -769,23 +762,24 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) { - /* - * Is this kosher? 
I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. - */ - neigh->ops->destructor = ipoib_neigh_destructor; + struct ipoib_neigh *neigh; - return 0; + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; } -static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +void ipoib_neigh_free(struct ipoib_neigh *neigh) { - parms->neigh_setup = ipoib_neigh_setup; - - return 0; + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); } int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) @@ -861,7 +855,7 @@ static void ipoib_setup(struct net_devic dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; + dev->neigh_destructor = ipoib_neigh_destructor; dev->watchdog_timeo = HZ; Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 12:05:17.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 12:19:36.000000000 +0200 @@ -113,9 +113,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -721,13 +719,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - 
*to_ipoib_neigh(skb->dst->neighbour) = neigh;
 list_add_tail(&neigh->list, &mcast->neigh_list);
 }
 }
Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 12:05:17.000000000 +0200
+++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 12:25:51.000000000 +0200
@@ -222,6 +222,9 @@ static inline struct ipoib_neigh **to_ip
 (offsetof(struct neighbour, ha) & 4));
 }

+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh);
+void ipoib_neigh_free(struct ipoib_neigh *neigh);
+
 extern struct workqueue_struct *ipoib_workqueue;

 /* functions */

--
MST

From mst at mellanox.co.il Wed Jan 11 23:52:45 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 09:52:45 +0200
Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output
In-Reply-To: <20060112065523.GD29168@esmail.cup.hp.com>
References: <20060112065523.GD29168@esmail.cup.hp.com>
Message-ID: <20060112075245.GE5850@mellanox.co.il>

Quoting r. Grant Grundler :
> Subject: Re: ib_sdp ERR: IOCB dmesg output
>
> On Wed, Jan 11, 2006 at 10:06:31AM +0200, Michael S. Tsirkin wrote:
> > Could you please try sdp patches from
> > https://openib.org/svn/trunk/contrib/mellanox/patches
>
> As noted earlier, netperf TCP_RR over SDP ran to completion
> with no problems. netperf TCP_STREAM over SDP started spewing
> the same errors despite the patches. :(

OK, but can you unload the module now?

--
MST

From ogerlitz at voltaire.com Thu Jan 12 00:52:16 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jan 2006 10:52:16 +0200 (IST)
Subject: [openib-general] [PATCH] iser: moved iSCSI controls to be sent in zero-copy + kzalloc-tions
Message-ID:

I just committed the patch below as r4957

moved iSCSI controls to be sent in zero-copy, removed iscsi_iser_conn->send_cache and associated fields & references. Moved kmalloc/memset-zero to kzalloc.
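
[Editorial illustration, not part of the original message.] The kmalloc/memset-to-kzalloc change mentioned above folds "allocate, then zero" into a single zeroing allocation. A rough userspace analogue, assuming calloc() as a stand-in for kzalloc() and malloc() plus memset() as a stand-in for the older two-step pattern, might look like the sketch below. The struct and function names (adaptor_sketch, alloc_two_step, alloc_zeroed, allocators_agree) are hypothetical and do not come from the iser sources.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a driver object that must start out zeroed. */
struct adaptor_sketch {
	int   num_ports;
	void *device;
};

/* Older pattern: allocate, check for failure, then zero explicitly. */
struct adaptor_sketch *alloc_two_step(void)
{
	struct adaptor_sketch *p = malloc(sizeof *p);

	if (p == NULL)
		return NULL;
	memset(p, 0, sizeof *p);
	return p;
}

/* Newer pattern: one call that returns zeroed memory (or NULL). */
struct adaptor_sketch *alloc_zeroed(void)
{
	return calloc(1, sizeof(struct adaptor_sketch));
}

/* Returns 1 if both helpers produce byte-identical (all-zero) objects,
 * 0 if they differ, and -1 if an allocation failed. */
int allocators_agree(void)
{
	struct adaptor_sketch *a = alloc_two_step();
	struct adaptor_sketch *b = alloc_zeroed();
	int same = -1;

	if (a != NULL && b != NULL)
		same = (memcmp(a, b, sizeof *a) == 0);
	free(a);
	free(b);
	return same;
}
```

The single-call form leaves no window in which the object exists but is not yet zeroed, and it removes one line that can be forgotten when the code is later edited.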
Signed-off-by: Or Gerlitz Index: iser_memory.h =================================================================== --- iser_memory.h (revision 4956) +++ iser_memory.h (revision 4957) @@ -38,7 +38,7 @@ #include "iser.h" /* regd_buf */ -struct iser_regd_buf *iser_regd_buf_alloc(struct iser_adaptor *p_iser_adaptor); +struct iser_regd_buf *iser_regd_buf_alloc(void); struct iser_regd_buf *iser_regd_mem_alloc(struct iser_adaptor *p_iser_adaptor, kmem_cache_t *cache, @@ -66,11 +66,6 @@ void iser_finalize_rdma_unaligned_sg(str unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data, int skip); - -void iser_data_buf_memcpy(unsigned char *p_dst_buf, - struct iser_data_buf *p_src_data, - unsigned long *p_total_copied_sz); - void iser_data_buf_dump(struct iser_data_buf *p_data); /* iser_page_vec */ Index: iscsi_iser.h =================================================================== --- iscsi_iser.h (revision 4956) +++ iscsi_iser.h (revision 4957) @@ -141,7 +141,6 @@ struct iser_dto { enum iser_op_param_default { defaultInitiatorRecvDataSegmentLength = 128, - defaultTargetRecvDataSegmentLength = 8 * 1024 }; struct iser_conn @@ -177,10 +176,6 @@ struct iscsi_iser_conn unsigned int postrecv_bsize; char postrecv_cn[32]; - kmem_cache_t *send_cache; - unsigned int send_bsize; - char send_cn[32]; - atomic_t post_recv_buf_count; atomic_t post_send_buf_count; Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 4956) +++ iser_verbs.c (revision 4957) @@ -246,10 +246,9 @@ struct iser_adaptor *iser_adaptor_find_b } if (p_adaptor == NULL) { - p_adaptor = kmalloc(sizeof *p_adaptor, GFP_ATOMIC); + p_adaptor = kzalloc(sizeof *p_adaptor, GFP_ATOMIC); if (p_adaptor == NULL) goto end; - memset(p_adaptor, 0, sizeof *p_adaptor); ig.num_adaptors++; /* assign this device to the adaptor */ p_adaptor->device = cma_id->device; Index: iser_dto.c =================================================================== --- iser_dto.c 
(revision 4956) +++ iser_dto.c (revision 4957) @@ -212,31 +212,3 @@ struct iser_dto *iser_dto_send_create(st return p_send_dto; } -/** - * iser_dto_copy_send_data - Allocates a send-data buffer and copies - * there a PDU's data segment - */ -int iser_dto_copy_send_data(struct iser_dto *p_send_dto, - struct iser_data_buf *p_data) -{ - struct iscsi_iser_conn *p_iser_conn = p_send_dto->p_conn; - struct iser_regd_buf *p_regd_buf; - unsigned long total_data_sz; - - p_regd_buf = iser_regd_mem_alloc(p_iser_conn->ib_conn->p_adaptor, - p_iser_conn->send_cache, - p_iser_conn->send_bsize); - if (p_regd_buf == NULL) { - iser_err("Failed to alloc send buffer\n"); - return -ENOMEM; - } - iser_data_buf_memcpy(p_regd_buf->virt_addr, p_data, &total_data_sz); - - /* DMA_MAP: safe to dma_map now - map and flush the cache */ - iser_reg_single(p_iser_conn->ib_conn->p_adaptor,p_regd_buf, DMA_TO_DEVICE); - - iser_dto_add_regd_buff(p_send_dto, p_regd_buf, - USE_NO_OFFSET, - USE_SIZE(total_data_sz)); - return 0; -} Index: iser_dto.h =================================================================== --- iser_dto.h (revision 4956) +++ iser_dto.h (revision 4957) @@ -61,7 +61,4 @@ struct iser_dto *iser_dto_send_create(st struct iscsi_hdr *p_hdr, unsigned char **p_header); -int iser_dto_copy_send_data(struct iser_dto *p_send_dto, - struct iser_data_buf *p_data); - #endif /* __ISER_DTO_H__ */ Index: iser_conn.c =================================================================== --- iser_conn.c (revision 4956) +++ iser_conn.c (revision 4957) @@ -196,11 +196,9 @@ int iser_conn_bind(struct iscsi_iser_con iscsi_conn->ib_conn = p_iser_conn; /* MERGE_ADDED_CHANGE moved here from ic_establish, before LOGIN sent */ - iser_dbg("postrecv_cache = send_cache = ig.login_cache\n"); + iser_dbg("postrecv_cache = ig.login_cache\n"); iscsi_conn->postrecv_cache = ig.login_cache; iscsi_conn->postrecv_bsize = ISER_LOGIN_PHASE_PDU_DATA_LEN; - iscsi_conn->send_cache = ig.login_cache; - iscsi_conn->send_bsize = 
ISER_LOGIN_PHASE_PDU_DATA_LEN; sprintf(iscsi_conn->name,"%d.%d.%d.%d", NIPQUAD(iscsi_conn->ib_conn->dst_addr)); @@ -220,17 +218,13 @@ int iser_conn_set_full_featured_mode(str int initial_post_recv_bufs_num = ISER_INITIAL_POST_RECV + 2; p_iser_conn->postrecv_cache = NULL; - p_iser_conn->send_cache = NULL; iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); sprintf(p_iser_conn->postrecv_cn,"prcv_%d.%d.%d.%d:%d", NIPQUAD(p_iser_conn->ib_conn->dst_addr),p_iser_conn->ib_conn->dst_port); - sprintf(p_iser_conn->send_cn,"snd_%d.%d.%d.%d:%d", - NIPQUAD(p_iser_conn->ib_conn->dst_addr),p_iser_conn->ib_conn->dst_port); - - /* Allocate recv & send-data buffers for the full-featured phase */ + /* Allocate recv buffers for the full-featured phase */ /* FIXME should be a param eg p_iser_conn->initiator_max_recv_dsl; */ p_iser_conn->postrecv_bsize = defaultInitiatorRecvDataSegmentLength; @@ -245,17 +239,7 @@ int iser_conn_set_full_featured_mode(str goto ffeatured_mode_failure; } - /* FIXME should be a param eg p_iser_conn->target_max_recv_dsl; */ - p_iser_conn->send_bsize = (unsigned int)defaultTargetRecvDataSegmentLength; - p_iser_conn->send_cache = - kmem_cache_create(p_iser_conn->send_cn, - p_iser_conn->send_bsize, - 0,SLAB_HWCACHE_ALIGN, NULL, NULL); - if (p_iser_conn->send_cache == NULL) { - iser_err("Failed to allocate send cache\n"); - err = -ENOMEM; - goto ffeatured_mode_failure; - } + /* Check that there is no posted recv or send buffers left - */ /* they must be consumed during the login phase */ if (atomic_read(&p_iser_conn->post_recv_buf_count) != 0) @@ -278,10 +262,6 @@ int iser_conn_set_full_featured_mode(str return 0; ffeatured_mode_failure: - if(p_iser_conn->send_cache) { - kmem_cache_destroy(p_iser_conn->send_cache); - p_iser_conn->send_cache = NULL; - } if(p_iser_conn->postrecv_cache) { kmem_cache_destroy(p_iser_conn->postrecv_cache); p_iser_conn->postrecv_cache = NULL; @@ -398,9 +378,6 @@ void iser_conn_release(struct iser_conn p_iscsi_conn = 
p_iser_conn->p_iscsi_conn; if(p_iscsi_conn != NULL && p_iscsi_conn->ff_mode_enabled) { - if(kmem_cache_destroy(p_iscsi_conn->send_cache) != 0) - iser_err("send cache %s not empty, leak!\n", - p_iscsi_conn->send_cn); if(kmem_cache_destroy(p_iscsi_conn->postrecv_cache) != 0) iser_err("postrecv cache %s not empty, leak!\n", p_iscsi_conn->postrecv_cn); Index: iser_initiator.c =================================================================== --- iser_initiator.c (revision 4956) +++ iser_initiator.c (revision 4957) @@ -72,7 +72,7 @@ static int iser_reg_rdma_mem(struct iscs priv_flags |= IB_ACCESS_REMOTE_READ; p_iser_task->rdma_regd[cmd_dir] = NULL; - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); + p_regd_buf = iser_regd_buf_alloc(); if (p_regd_buf == NULL) return -ENOMEM; @@ -459,7 +459,8 @@ int iser_send_control(struct iscsi_iser_ unsigned long data_seg_len; int err = 0; unsigned char opcode; - struct iser_data_buf data_buf; + struct iser_regd_buf *p_regd_buf; + struct iser_adaptor *p_iser_adaptor; if (atomic_read(&p_iser_conn->ib_conn->state) != ISER_CONN_UP) { iser_err("Failed to send, conn: 0x%p is not up\n", p_iser_conn->ib_conn); @@ -473,13 +474,16 @@ int iser_send_control(struct iscsi_iser_ err = -ENOMEM; goto send_control_error; } + p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; /* DMA_MAP: safe to dma_map now - map and flush the cache */ - iser_reg_single(p_iser_conn->ib_conn->p_adaptor, - p_send_dto->regd[0], DMA_TO_DEVICE); + iser_reg_single(p_iser_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); itt = ntohl(p_mtask->hdr.itt); opcode = p_mtask->hdr.opcode & ISCSI_OPCODE_MASK; + + /* no need to copy when there's data b/c the mtask is not reallocated * + * till the response related to this ITT is received */ switch (opcode) { case ISCSI_OP_SCSI_TMFUNC: @@ -490,11 +494,20 @@ int iser_send_control(struct iscsi_iser_ case ISCSI_OP_LOGOUT: data_seg_len = ntoh24(p_mtask->hdr.dlength); if (data_seg_len > 0) { - data_buf.p_buf = p_mtask->data; - data_buf.size = 
p_mtask->data_count; - data_buf.type = ISER_BUF_TYPE_SINGLE; - /* Allocate data regd buffer and copy the user data */ - iser_dto_copy_send_data(p_send_dto, &data_buf); + p_regd_buf = iser_regd_buf_alloc(); + if (p_regd_buf == NULL) { + iser_err("Failed to alloc regd buffer\n"); + err = -ENOMEM; + goto send_control_error; + } + p_regd_buf->p_adaptor = p_iser_adaptor; + p_regd_buf->virt_addr = p_mtask->data; + p_regd_buf->data_size = p_mtask->data_count; + iser_reg_single(p_iser_adaptor, p_regd_buf, + DMA_TO_DEVICE); + iser_dto_add_regd_buff(p_send_dto, p_regd_buf, + USE_NO_OFFSET, + USE_SIZE(data_seg_len)); } break; @@ -537,7 +550,7 @@ void iser_rcv_dto_completion(struct iser struct iscsi_iser_cmd_task *p_iser_task = NULL; struct iscsi_hdr *p_hdr; char *rx_data; - int rx_data_size,rc; + int rc, rx_data_size = 0; unsigned int itt; unsigned char opcode; int no_more_task_sends = 0; Index: iser_memory.c =================================================================== --- iser_memory.c (revision 4956) +++ iser_memory.c (revision 4957) @@ -59,7 +59,7 @@ iser_page_to_virt(struct page *page) * * returns the registered buffer descriptor */ -struct iser_regd_buf *iser_regd_buf_alloc(struct iser_adaptor *p_iser_adaptor) +struct iser_regd_buf *iser_regd_buf_alloc(void) { struct iser_regd_buf *p_regd_buf; @@ -84,7 +84,7 @@ struct iser_regd_buf *iser_regd_mem_allo struct iser_regd_buf *p_regd_buf; void *data; - p_regd_buf = iser_regd_buf_alloc(p_iser_adaptor); + p_regd_buf = iser_regd_buf_alloc(); if (p_regd_buf != NULL) { data = (void *) kmem_cache_alloc(cache, GFP_KERNEL | __GFP_NOFAIL); @@ -505,44 +505,6 @@ unsigned int iser_data_buf_aligned_len(s return ret_len; } -/** - * iser_data_buf_memcpy - Copies arbitrary data buffer to a - * contiguous memory region - */ -void iser_data_buf_memcpy(unsigned char *p_dst_buf, - struct iser_data_buf *p_src_data, - unsigned long *p_total_copied_sz) -{ - if (p_src_data->type == ISER_BUF_TYPE_SINGLE) { - iser_dbg( - "copy SINGLE virt: 
0x%p -> 0x%p, " "sz: %d\n", - p_src_data->p_buf, p_dst_buf, p_src_data->size); - memcpy(p_dst_buf, p_src_data->p_buf, p_src_data->size); - if (p_total_copied_sz != NULL) - *p_total_copied_sz = p_src_data->size; - } else { - unsigned char *chunk_addr = 0; - unsigned int chunk_size = 0; - unsigned long total_sz = 0; - int i; - - for (i = 0; i < p_src_data->size; i++) { - chunk_addr = (unsigned char *) - iser_sg_entry_to_virt(p_src_data->p_buf, i); - chunk_size = - iser_sg_entry_len(p_src_data->p_buf, i); - iser_dbg( - "copy SG[%d]: 0x%p -> 0x%p, sz: %d\n", - i, chunk_addr, p_dst_buf, chunk_size); - memcpy(p_dst_buf, chunk_addr, chunk_size); - p_dst_buf += chunk_size; - total_sz += chunk_size; - } - if (p_total_copied_sz != NULL) - *p_total_copied_sz = total_sz; - } -} - void iser_data_buf_dump(struct iser_data_buf *p_data) { if (p_data->type == ISER_BUF_TYPE_SINGLE) Index: iscsi_iser.c =================================================================== --- iscsi_iser.c (revision 4956) +++ iscsi_iser.c (revision 4957) @@ -1101,11 +1101,10 @@ static iscsi_connh_t iscsi_iser_conn_cre struct iscsi_iser_session *session = iscsi_ptr(sessionh); struct iscsi_iser_conn *conn = NULL; - conn = kmalloc(sizeof *conn, GFP_KERNEL); + conn = kzalloc(sizeof *conn, GFP_KERNEL); if (conn == NULL) { goto conn_alloc_fail; } - memset(conn, 0, sizeof(struct iscsi_iser_conn)); /* Init the connection */ conn->c_stage = ISCSI_CONN_INITIAL_STAGE; From mst at mellanox.co.il Thu Jan 12 01:14:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 11:14:18 +0200 Subject: [openib-general] [PATCH] mthca: uninitialized variable Message-ID: <20060112091418.GB16938@mellanox.co.il> mthca was using sqp->ud_header.grh_present before it was initialized by mthca_read_ah. Further, header->grh_present is set by ib_ud_header_init. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-11 15:40:20.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-11 19:35:58.000000000 +0200 @@ -1432,7 +1432,7 @@ static int build_mlx_header(struct mthca u16 pkey; ib_ud_header_init(256, /* assume a MAD */ - sqp->ud_header.grh_present, + mthca_ah_grh_present(to_mah(wr->wr.ud.ah)), &sqp->ud_header); err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_av.c 2005-12-23 08:02:56.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-11 19:34:48.000000000 +0200 @@ -161,6 +161,11 @@ int mthca_destroy_ah(struct mthca_dev *d return 0; } +int mthca_ah_grh_present(struct mthca_ah *ah) +{ + return !!(ah->av->g_slid & 0x80); +} + int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) { @@ -170,8 +175,7 @@ int mthca_read_ah(struct mthca_dev *dev, header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; header->lrh.destination_lid = ah->av->dlid; header->lrh.source_lid = cpu_to_be16(ah->av->g_slid & 0x7f); - if (ah->av->g_slid & 0x80) { - header->grh_present = 1; + if (mthca_ah_grh_present(ah)) { header->grh.traffic_class = (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; header->grh.flow_label = @@ -182,8 +186,6 @@ int mthca_read_ah(struct mthca_dev *dev, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); - } else { - header->grh_present = 0; } return 0; -- MST From dotanb at mellanox.co.il Thu Jan 12 01:49:56 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 12 Jan 2006 11:49:56 +0200 Subject: [openib-general] RE: srq_pingpong 
with many QPs and events may never ends Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DD346@mtlexch01.mtl.com> > Thanks, I applied a slightly different version of this. I preferred > not to rename wc -> wc_arr, and I added an error if there are not > enough receives for all the QPs. > > Please let me know if I made any mistake here. > I reviewed this patch and the patch looks good, thanks. Dotan From mst at mellanox.co.il Thu Jan 12 01:54:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 11:54:54 +0200 Subject: [openib-general] [PATCH] mthca_cmd small cleanup Message-ID: <20060112095454.GE16938@mellanox.co.il> The following results in a slight code size reduction. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S.
Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-09 17:02:57.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-11 16:36:39.000000000 +0200 @@ -606,7 +606,7 @@ err = -EINVAL; goto out; } - for (i = 0; i < mthca_icm_size(&iter) / (1 << lg); ++i) { + for (i = 0; i < mthca_icm_size(&iter) >> lg; ++i) { if (virt != -1) { pages[nent * 2] = cpu_to_be64(virt); virt += 1 << lg; -- MST From mst at mellanox.co.il Thu Jan 12 02:56:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 12:56:31 +0200 Subject: [openib-general] [PATCH] mthca: cosmetics: use ALIGN macro Message-ID: <20060112105631.GF16938@mellanox.co.il> mthca: cosmetics - use the ALIGN macro Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 15:38:37.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 15:39:08.000000000 +0200 @@ -727,8 +727,8 @@ int mthca_QUERY_FW(struct mthca_dev *dev * system pages needed. */ dev->fw.arbel.fw_pages = - (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> - (PAGE_SHIFT - 12); + ALIGN(dev->fw.arbel.fw_pages, PAGE_SIZE >> 12) >> + (PAGE_SHIFT - 12); mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", (unsigned long long) dev->fw.arbel.clr_int_base, @@ -1444,7 +1444,7 @@ int mthca_SET_ICM_SIZE(struct mthca_dev * Arbel page size is always 4 KB; round up number of system * pages needed. 
*/ - *aux_pages = (*aux_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> (PAGE_SHIFT - 12); + *aux_pages = ALIGN(*aux_pages, PAGE_SIZE >> 12) >> (PAGE_SHIFT - 12); return 0; } -- MST From devesh28 at gmail.com Thu Jan 12 05:57:39 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 12 Jan 2006 19:27:39 +0530 Subject: [openib-general] Functioning of ib_register_mad_agent() Message-ID: <309a667c0601120557h2bcec18fu6aa13d8930ccba4c@mail.gmail.com> Hello all, I have some queries regarding the significance of the function ib_register_mad_agent(): A) What does this function do? B) What is the significance of this function in implementing an HCA driver? Thanks, Devesh. From jackm at mellanox.co.il Thu Jan 12 06:23:59 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 12 Jan 2006 16:23:59 +0200 Subject: [openib-general] [PATCH] mthca: fix mem leaks in mthca_provider error handling Message-ID: <20060112142359.GA6536@mellanox.co.il> Fixes memory leaks in mthca_create_qp and mthca_create_srq error handling.
Signed-off-by: Jack Morgenstein Index: gen2/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2006-01-12 15:21:24.935512000 +0200 +++ gen2/drivers/infiniband/hw/mthca/mthca_provider.c 2006-01-12 15:23:06.839657000 +0200 @@ -463,8 +463,10 @@ static struct ib_srq *mthca_create_srq(s if (pd->uobject) { context = to_mucontext(pd->uobject->context); - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) - return ERR_PTR(-EFAULT); + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err_free; + } err = mthca_map_user_db(to_mdev(pd->device), &context->uar, context->db_tab, ucmd.db_index, @@ -540,8 +542,10 @@ static struct ib_qp *mthca_create_qp(str if (pd->uobject) { context = to_mucontext(pd->uobject->context); - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + kfree(qp); return ERR_PTR(-EFAULT); + } err = mthca_map_user_db(to_mdev(pd->device), &context->uar, context->db_tab, From halr at voltaire.com Thu Jan 12 07:01:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Jan 2006 10:01:10 -0500 Subject: [openib-general] Anounce: Advanced Diagnostic Tools In-Reply-To: <43C584C3.5010104@mellanox.co.il> References: <43C584C3.5010104@mellanox.co.il> Message-ID: <1137077427.4322.4265.camel@hal.voltaire.com> On Wed, 2006-01-11 at 17:20, Eitan Zahavi wrote: > Hi, > > With the great help from Danny Zarko and Ariel Libman I was able to upload into > https://openib.org/svn/gen2/utils/src/linux-user the first of several integrated IB > diagnostic tools: ibdiagnet (diagnose network). > The tool depends on ibis, ibdm (available in the same directory). > Its main differences from the diag tools available under the trunk are: Just to set the record straight, some but not all of the below are supported in the current OpenIB diag tools. -- Hal > 1.
Performs a complete diagnostic procedure, including: > * discovery, > * PM counters check, > * duplicate LID/GUID > * ALL to ALL connectivity check (based on LFT data extracted from the fabric) > * Multicast connectivity and report > * Credit loop analysis > * and various other fabric statistics > 2. If a topology file is provided - all reports are given using system names (rather than > LID, GUID or directed paths). > > ############################################################################################### > Here are some stdout examples > ------------------------------ > 1. BAD LIDS > -E- Device(s) with LID = 0x0000 found in the fabric: > path="1 1 3 5" H-12/U1 PN=2 > path="1 1 3 4" H-11/U1 PN=1 > path="1 4" H-3/U1 PN=1 > > 2. DUPLICATED PORT GUIDS > -E- Devices with identical PortGUID = 0x0002c90000000006 found in the fabric: > path="1 1" GNU1/main/U2 > path="1 1 5 6" H-9/U1 PN=1 > path="1 1 5 5" H-10/U1 PN=2 > > 3. BAD LINKS > -I- Errors have occurred on the following links (for errors details, look in log > file /tmp/ibmgtsim.31602/ibdiagnet.log): > Cable: GNU1/M/P7(GNU1/main/U4/P4) =---= H-7/P2(H-7/U1/P2) > Cable: GNU1/M/P5(GNU1/main/U4/P6) =---= H-5/P2(H-5/U1/P2) > > 4. TOPOLOGY MATCH > -I- Note that "bad" links and the part of the fabric to which they led (in the > BFS discovery of the fabric, starting at the local node) are not discovered > and therefore will be reported as "missing". > > Missing System:H-7(Cougar) > Should be connected by cable from port: P2(H-7/U1/P2) > to:GNU1/M/P7(GNU1/main/U4/P4) > > Missing System:H-5(Cougar) > Should be connected by cable from port: P2(H-5/U1/P2) > to:GNU1/M/P5(GNU1/main/U4/P6) > > 5. MULTICAST ROUTING > -I- Scanning all multicast groups for loops and connectivity...
> -I- Multicast Group:0xC000 has:2 switches and:2 HCAs > -E- Extra switch:GNU1/leaf1/U1 in group:0xC000 > -E- Extra switch:GNU1/main/U4 in group:0xC000 > -I- Multicast Group:0xC001 has:4 switches and:4 HCAs > -E- Extra switch:GNU1/leaf1/U1 in group:0xC001 > -I- Multicast Group:0xC002 has:5 switches and:5 HCAs > -E- 3 multicast group checks failed > > -I--------------------------------------------------- > -I- mgid-mlid-HCAs matching table > -I--------------------------------------------------- > mgid | mlid | HCAs > -------------------------------------------------------------------------------- > 0xff12401bffff0000:0x00000000ffffffff | 0xc000 | H-11/U1,H-12/U1 > 0xff12401bffff0000:0x0000000000000001 | 0xc001 | H-15/U1,H-3/U1,H-2/U1,H-7/U1 > 0xff12401bffff0000:0x0000000000000002 | 0xc002 | H-10/U1,H-16/U1,H-4/U1,H-6/U1 > > 6. UNICAST ROUTING: > -I- Verifying all CA to CA paths ... > -E- Unassigned LFT for lid:10 Dead end at:GNU1/main/U1 > -E- Fail to find a path from:H-1/U1/1 to:H-12/U1/2 > -E- Unassigned LFT for lid:18 Dead end at:GNU1/main/U3 > -E- Fail to find a path from:H-1/U1/1 to:H-5/U1/2 > [snip] > -E- Found 19 missing paths out of:240 paths > > 7. CREDIT LOOPS > -I- Tracing all CA to CA paths for Credit Loops potential ... > -E- Potential Credit Loop on Path from:H-1/U1/1 to:H-13/U1/1 > Going:Down from:GNU1/main/U1 to:GNU1/main/U3 > Going:Up from:GNU1/main/U3 to:GNU1/main/U1 > Going:Down from:GNU1/main/U1 to:GNU1/leaf1/U1 > > > NOTE: All the above cases simulated on top of ibmgtsim. > Errors injected by simulation flows. > ###################################################################################### > A full man page: > ==================== > NAME > ibdiagnet > > SYNOPSIS > ibdiagnet [-c ] [-v] [-r] [-t ] [-s ] > [-i ] [-p ] [-o ] > > DESCRIPTION > ibdiagnet scans the fabric using directed route packets and extracts all the > available information regarding its connectivity and devices.
> It then produces the following files in the output directory defined by the > -o option (see below): > ibdiagnet.lst - List of all the nodes, ports and links in the fabric > ibdiagnet.fdbs - A dump of the unicast forwarding tables of the fabric > switches > ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric > switches > In addition to generating the files above, the discovery phase also checks for > duplicate node GUIDs in the IB fabric. If such an error is detected, it is > displayed on the standard output. > After the discovery phase is completed, directed route packets are sent > multiple times (according to the -c option) to detect possible problematic > paths on which packets may be lost. Such paths are explored, and a report of > the suspected bad links is displayed on the standard output. > After scanning the fabric, if the -r option is provided, a full report of the > fabric qualities is displayed. > This report includes: > Number of nodes and systems > Hop-count information: > maximal hop-count, an example path, and a hop-count histogram > All CA-to-CA paths traced > Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not > reported. > Furthermore, if a topology file is provided, ibdiagnet uses the names defined > in it for the output reports. > > OPTIONS > -c : The minimal number of packets to be sent across each link > (default = 10) > -v : Instructs the tool to run in verbose mode > -r : Provides a report of the fabric qualities > -t : Specifies the topology file name > -s : Specifies the local system name. 
Meaningful only if a topology > file is specified > -i : Specifies the index of the device of the port used to connect > to the IB fabric (in case of multiple devices on the local > system) > -p : Specifies the local device's port number used to connect to > the IB fabric > -o : Specifies the directory where the output files will be placed > (default = /tmp/ez) > > -h|--help : Prints this help information > -V|--version : Prints the version of the tool > --vars : Prints the tool's environment variables and their values > > ERROR CODES > 1 - Failed to fully discover the fabric > 2 - Failed to parse command line options > 3 - Some packet drop observed > 4 - Mismatch with provided topology > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Jan 12 07:13:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 17:13:22 +0200 Subject: [openib-general] patches status update Message-ID: <20060112151322.GK16938@mellanox.co.il> Hello, Roland! I have done the following changes to contrib/mellanox/patches: Added new patches: seen in testing: 4971 4966 mst mthca_mlx_grh.patch from code review: 4971 4962 mst mthca_provider_err_leak.patch cosmetic, from code review: 4971 4959 mst mthca_cosmetic_shift.patch 4971 4965 mst mthca_cosmetic_align.patch Updated the patch: 4971 4964 mst ipoib_all_neigh_issues_2.patch This still reflects the approach of using a global neigh list to check the neighbour and the ops pointer. This does not require kernel patches. I plan to later send a patchset for another approach: moving destructor from neighbour ops to neighbour params. -- MST From mst at mellanox.co.il Thu Jan 12 07:15:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 12 Jan 2006 17:15:02 +0200 Subject: [openib-general] Re: [PATCH] mthca: cosmetics: use ALIGN macro In-Reply-To: <20060112105631.GF16938@mellanox.co.il> References: <20060112105631.GF16938@mellanox.co.il> Message-ID: <20060112151502.GL16938@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] mthca: cosmetics: use ALIGN macro > > mthca: cosmetics - use the ALIGN macro > > Signed-off-by: Michael S. Tsirkin This patch was mangled. Here's an update; I've put it under mthca_cosmetic_align.patch. Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 16:54:43.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 16:56:10.000000000 +0200 @@ -727,8 +727,8 @@ int mthca_QUERY_FW(struct mthca_dev *dev * system pages needed. */ dev->fw.arbel.fw_pages = - (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> - (PAGE_SHIFT - 12); + ALIGN(dev->fw.arbel.fw_pages, PAGE_SIZE >> 12) >> + (PAGE_SHIFT - 12); mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", (unsigned long long) dev->fw.arbel.clr_int_base, @@ -1445,6 +1445,6 @@ int mthca_SET_ICM_SIZE(struct mthca_dev * pages needed. */ - *aux_pages = (*aux_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> (PAGE_SHIFT - 12); + *aux_pages = ALIGN(*aux_pages, PAGE_SIZE >> 12) >> (PAGE_SHIFT - 12); return 0; } -- MST From mst at mellanox.co.il Thu Jan 12 07:16:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 17:16:44 +0200 Subject: [openib-general] Re: [PATCH] mthca: uninitialized variable In-Reply-To: <20060112091418.GB16938@mellanox.co.il> References: <20060112091418.GB16938@mellanox.co.il> Message-ID: <20060112151644.GM16938@mellanox.co.il> Quoting r. Michael S.
Tsirkin : > Subject: [PATCH] mthca: uninitialized variable > > mthca was using sqp->ud_header.grh_present before it was initialized > by mthca_read_ah. Further, header->grh_present is set by ib_ud_header_init. > > Signed-off-by: Michael S. Tsirkin This patch was mangled, sorry about that. I've put an updated copy here: mthca_mlx_grh.patch Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-12 16:54:43.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c 2006-01-12 16:54:43.000000000 +0200 @@ -1429,7 +1429,7 @@ static int build_mlx_header(struct mthca u16 pkey; ib_ud_header_init(256, /* assume a MAD */ - sqp->ud_header.grh_present, + mthca_ah_grh_present(to_mah(wr->wr.ud.ah)), &sqp->ud_header); err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-12 16:54:31.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-12 16:54:43.000000000 +0200 @@ -161,6 +161,11 @@ int mthca_destroy_ah(struct mthca_dev *d return 0; } +int mthca_ah_grh_present(struct mthca_ah *ah) +{ + return !!(ah->av->g_slid & 0x80); +} + int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) { @@ -170,8 +175,7 @@ int mthca_read_ah(struct mthca_dev *dev, header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; header->lrh.destination_lid = ah->av->dlid; header->lrh.source_lid = cpu_to_be16(ah->av->g_slid & 0x7f); - if (ah->av->g_slid & 0x80) { - header->grh_present = 1; + if (mthca_ah_grh_present(ah)) { header->grh.traffic_class = (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; header->grh.flow_label = @@ -182,8 +186,6 @@ int mthca_read_ah(struct 
mthca_dev *dev, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); - } else { - header->grh_present = 0; } return 0; Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-01-12 16:52:25.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_dev.h 2006-01-12 17:00:08.000000000 +0200 @@ -520,6 +520,7 @@ int mthca_create_ah(struct mthca_dev *de int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header); +int mthca_ah_grh_present(struct mthca_ah *ah); int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); -- MST From mst at mellanox.co.il Thu Jan 12 08:24:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 18:24:27 +0200 Subject: [openib-general] Patch series: neigh destructor cleanup Message-ID: <20060112162427.GN16938@mellanox.co.il> Roland, what follows is a patch series to clean up the destructor infrastructure in kernel - this is an alternative to ipoib_all_neigh_issues_2.patch Works for me, although I only had time to do basic testing on this. -- MST From mst at mellanox.co.il Thu Jan 12 08:24:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 18:24:38 +0200 Subject: [openib-general] [PATCH 1 of 3] move destructor to struct neigh_parms Message-ID: <20060112162438.GO16938@mellanox.co.il> This is an alternative approach to the one presented in ipoib_all_neigh_issues_2.patch. --- Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.15/net/core/neighbour.c =================================================================== --- linux-2.6.15.orig/net/core/neighbour.c 2006-01-12 11:58:15.000000000 +0200 +++ linux-2.6.15/net/core/neighbour.c 2006-01-12 20:10:00.000000000 +0200 @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); Index: linux-2.6.15/include/net/neighbour.h =================================================================== --- linux-2.6.15.orig/include/net/neighbour.h 2006-01-03 05:21:10.000000000 +0200 +++ linux-2.6.15/include/net/neighbour.h 2006-01-12 20:09:27.000000000 +0200 @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); -- MST From mst at mellanox.co.il Thu Jan 12 08:24:45 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 18:24:45 +0200 Subject: [openib-general] [PATCH 2 of 3] ipoib: move destructor to struct neigh_parms Message-ID: <20060112162445.GP16938@mellanox.co.il> This is an alternative approach to the one presented in ipoib_all_neigh_issues_2.patch. --- Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:30:52.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:31:26.000000000 +0200 @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. - */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } -- MST From mst at mellanox.co.il Thu Jan 12 08:25:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 18:25:00 +0200 Subject: [openib-general] [PATCH 3 of 3] ipoib: fix error handling Message-ID: <20060112162500.GQ16938@mellanox.co.il> Fix error handling in neigh_add_path. Reduce code duplication by implementing alloc/free functions for ipoib_neigh. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:06.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:43.000000000 +0200 @@ -246,8 +246,7 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -475,7 +474,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -483,8 +482,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -497,7 +494,7 @@ static void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -527,10 +524,9 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - kfree(neigh); - +err_path: + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +753,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -766,6 +761,26 @@ static void ipoib_neigh_destructor(struc if (ah) ipoib_put_ah(ah); } + +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +{ + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof 
*neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; +} + +void ipoib_neigh_free(struct ipoib_neigh *neigh) +{ + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); +} static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:32:08.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:48:43.000000000 +0200 @@ -113,8 +113,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -720,13 +719,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:27:47.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:48:43.000000000 +0200 @@ -222,6 +222,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From brilong at 
cisco.com Thu Jan 12 08:49:32 2006 From: brilong at cisco.com (Brian Long) Date: Thu, 12 Jan 2006 11:49:32 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43C584DD.70503@ichips.intel.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com> <43C584DD.70503@ichips.intel.com> Message-ID: <1137084572.4466.40.camel@brilong-lnx> On Wed, 2006-01-11 at 14:21 -0800, Sean Hefty wrote: > Rimmer, Todd wrote: > > A relational database is overkill for this function. > > It will also likely be more complex for end users to setup and debug. > > The cache setup should be simple. The solution should be such that > > just an on/off switch needs to be configured (with a default of on) > > for most users to get started. > > My take is a little different. I view the SA as a database that maintains > related attributes. > > By supporting relationships between different attributes, we can provide a more > powerful, higher performing, and more user-friendly interface to the user. For > example, a single SQL query could return path records given only a node > description or service name. Today, we generate multiple SA queries, their > responses, and associated RMPP MADs to obtain the same data. > > I'm not sold on the idea of using a relational database, because of the > additional complexity for end-users. However, I believe it can offer > significant advantages over what we could code ourselves. How much overhead is going to be incurred by using a standard RDBMS instead of not caching anything? I'm not completely familiar with the IB configurations that would benefit from the proposed SA cache, but it seems to me, adding a RDBMS to anything as fast as IB would actually slow things down considerably. Can an RDBMS + SA cache actually be faster than no cache at all? /Brian/ -- Brian Long | | | IT Data Center Systems | .|||. .|||. Cisco Linux Developer | ..:|||||||:...:|||||||:.. 
Phone: (919) 392-7363 | C i s c o S y s t e m s From iod00d at hp.com Thu Jan 12 09:29:38 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 12 Jan 2006 09:29:38 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060112075245.GE5850@mellanox.co.il> References: <20060112065523.GD29168@esmail.cup.hp.com> <20060112075245.GE5850@mellanox.co.il> Message-ID: <20060112172938.GA3106@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 09:52:45AM +0200, Michael S. Tsirkin wrote: > > As noted earlier, netperf TCP_RR over SDP ran to completion > > with no problems. netperf TCP_STREAM over SDP started spewing > > the same errors despite the patches. :( > > OK, but can you unload the module now? Yes. That seems to be fixed. Sorry for not mentioning it in the previous mail. thanks, grant From iod00d at hp.com Thu Jan 12 09:59:22 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 12 Jan 2006 09:59:22 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060112075245.GE5850@mellanox.co.il> References: <20060112065523.GD29168@esmail.cup.hp.com> <20060112075245.GE5850@mellanox.co.il> Message-ID: <20060112175922.GF3106@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 09:52:45AM +0200, Michael S. Tsirkin wrote: > > As noted earlier, netperf TCP_RR over SDP ran to completion > > with no problems. netperf TCP_STREAM over SDP started spewing > > the same errors despite the patches. :( > > OK, but can you unload the module now? Sorry - I just realized I checked the "netserver" machine and not the "netperf" ("client"). client side still fails.
:( iota:~# reload_ib + IPoIB=30 + ifconfig ib0 down + ifconfig ib1 down + rmmod ib_umad ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core ERROR: Module ib_sdp is in use ERROR: Module ib_cm is in use by ib_sdp ERROR: Module ib_sa is in use by ib_sdp ACPI: PCI interrupt for device 0000:81:00.0 disabled GSI 60 (level, low) -> CPU 1 (0x0100) vector 60 unregistered ERROR: Module ib_mad is in use by ib_cm,ib_sa ERROR: Module ib_core is in use by ib_sdp,ib_cm,ib_sa,ib_mad + modprobe ib_mthca msi_x=1 ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:81:00.0 GSI 60 (level, low) -> CPU 0 (0x0000) vector 60 ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 60 ib_mthca 0000:81:00.0: HCA FW version 3.3.2 is old (3.3.3 is current). ib_mthca 0000:81:00.0: If you have problems, try updating your HCA FW. + modprobe ib_ipoib + modprobe ib_sdp + modprobe ib_uverbs + modprobe ib_umad + ifconfig ib0 10.0.0.30 netmask 255.255.255.0 broadcast 10.0.0.255 + ifconfig ib1 10.0.1.30 netmask 255.255.255.0 broadcast 10.0.1.255 iota:~# lsmod Module Size Used by ib_umad 33648 0 ib_uverbs 93096 0 ib_ipoib 96128 0 ib_mthca 274728 0 ib_sdp 230480 3 ib_cm 93964 1 ib_sdp ib_sa 25324 2 ib_ipoib,ib_sdp ib_mad 85952 4 ib_umad,ib_mthca,ib_cm,ib_sa ib_core 93096 8 ib_umad,ib_uverbs,ib_ipoib,ib_mthca,ib_sdp,ib_cm,ib_sa,ib_mad tulip 118064 0 e1000 233420 0 tg3 227280 0 e100 83592 0 iota:~# Looks like the error messages and sdp refcnt might be related. (IIRC, 4 error msgs and SDP ref cnt is 3) Since netperf is terminated by a timer signal, it's possible traffic is still outstanding when it exits. Could that be a cause of the "ERR: IOCB <-1> cancel" error messages? 
thanks, grant From mshefty at ichips.intel.com Thu Jan 12 10:16:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jan 2006 10:16:23 -0800 Subject: [openib-general] SA cache design In-Reply-To: <1137084572.4466.40.camel@brilong-lnx> References: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com> <43C584DD.70503@ichips.intel.com> <1137084572.4466.40.camel@brilong-lnx> Message-ID: <43C69CF7.309@ichips.intel.com> Brian Long wrote: > How much overhead is going to be incurred by using a standard RDBMS > instead of not caching anything? I'm not completely familiar with the > IB configurations that would benefit from the proposed SA cache, but it > seems to me, adding a RDBMS to anything as fast as IB would actually > slow things down considerably. Can an RDBMS + SA cache actually be > faster than no cache at all? I'm not sure what the speed-up of any cache will be. The SA maintains a database of various related records - node records, path records, service records, etc. and responds to queries. This need doesn't go away. The SA itself is perfect candidate to be implemented using a DBMS. (And if one had been implemented over a DBMS, I'm not even sure that we'd be talking about scalability issues for only a few thousand nodes. Is the perceived lack of scalability of the SA a result of the architecture or the existing implementations?) My belief is that a DBMS will outperform anything that I could write to store and retrieve these records. Consider that a 4000 node cluster will have about 8,000,000 path records. Local caches can reduce this considerably (to about 4000), and if we greatly restrict the type of queries that are supported, then we can manage the retrieval of those records ourselves. I do not want end-users to have to administer a database. However, if the user only needs to install a library, then this approach seems worth pursuing. 
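The path-record counts quoted above follow from simple pair counting. As a minimal sketch (not code from this thread; the function names are hypothetical, and it assumes a single path per pair of end nodes with one record per unordered pair), the arithmetic in C is:

```c
#include <stdint.h>

/*
 * Number of unordered pairs among n end nodes: n(n-1)/2.
 * Assumption: one path record per pair, a single path per pair.
 */
uint64_t sa_pair_count(uint64_t n)
{
	return n * (n - 1) / 2;
}

/* Number of ordered (source, destination) pairs: n(n-1). */
uint64_t sa_ordered_pair_count(uint64_t n)
{
	return n * (n - 1);
}

/* Entries a single node needs in a local cache: one per remote node. */
uint64_t sa_local_cache_entries(uint64_t n)
{
	return n ? n - 1 : 0;
}
```

With n = 4000 this gives 7,998,000 records fabric-wide (the "about 8,000,000" above) and 3,999 entries for a per-node cache (the "about 4000").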
- Sean From trimmer at silverstorm.com Thu Jan 12 10:46:26 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 12 Jan 2006 13:46:26 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0967@mercury.infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > I'm not sure what the speed-up of any cache will be. The SA > maintains a > database of various related records - node records, path > records, service > records, etc. and responds to queries. This need doesn't go > away. The SA > itself is perfect candidate to be implemented using a DBMS. > (And if one had > been implemented over a DBMS, I'm not even sure that we'd be > talking about > scalability issues for only a few thousand nodes. Is the > perceived lack of > scalability of the SA a result of the architecture or the > existing implementations?) The scalability problem occurs during things like MPI job startup. At startup, you will have N processes which each need N-1 path records to establish connections. Those queries require both Node Record and Path Record queries. This means at job startup, the SA must process O(N^2) SA queries. If the lookup algorithm in the SA is O(logM) (M = number of SA records, which is O(N^2)), then the SA will have O(N^2 log(N^2)) operations to perform and O(N^2) packets to send and receive. For a 4000 CPU cluster (1000 nodes with 2 dual core CPUs each), that is over 16 million SA queries at job startup against a 1 million entry SA database. It would take quite a good SA database implementation to handle that in a timely manner. In contrast the replica on each node only needs to handle O(N) entries. And its lookup time could be O(logN). You'll note I spoke of processes, not nodes. In multi-CPU nodes, each process will need similar information. This is one area where a replica can greatly help: why ask the SA the same question multiple times in a row?
If only a cache is considered, then the startup is still O(N^2) SA queries; it's just that we have 1/(CPUs per node) as many queries. Todd Rimmer From brilong at cisco.com Thu Jan 12 10:56:53 2006 From: brilong at cisco.com (Brian Long) Date: Thu, 12 Jan 2006 13:56:53 -0500 Subject: [openib-general] SA cache design In-Reply-To: <43C69CF7.309@ichips.intel.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com> <43C584DD.70503@ichips.intel.com> <1137084572.4466.40.camel@brilong-lnx> <43C69CF7.309@ichips.intel.com> Message-ID: <1137092213.4466.59.camel@brilong-lnx> On Thu, 2006-01-12 at 10:16 -0800, Sean Hefty wrote: > Brian Long wrote: > > How much overhead is going to be incurred by using a standard RDBMS > > instead of not caching anything? I'm not completely familiar with the > > IB configurations that would benefit from the proposed SA cache, but it > > seems to me, adding a RDBMS to anything as fast as IB would actually > > slow things down considerably. Can an RDBMS + SA cache actually be > > faster than no cache at all? > > I'm not sure what the speed-up of any cache will be. The SA maintains a > database of various related records - node records, path records, service > records, etc. and responds to queries. This need doesn't go away. The SA > itself is perfect candidate to be implemented using a DBMS. (And if one had > been implemented over a DBMS, I'm not even sure that we'd be talking about > scalability issues for only a few thousand nodes. Is the perceived lack of > scalability of the SA a result of the architecture or the existing implementations?) > > My belief is that a DBMS will outperform anything that I could write to store > and retrieve these records. Consider that a 4000 node cluster will have about > 8,000,000 path records. Local caches can reduce this considerably (to about > 4000), and if we greatly restrict the type of queries that are supported, then > we can manage the retrieval of those records ourselves.
> > I do not want end-users to have to administer a database. However, if the user > only needs to install a library, then this approach seems worth pursuing. What about SQLite (http://www.sqlite.org/)? This is used by yum 2.4 in Fedora Core and other distributions. "SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine." /Brian/ -- Brian Long | | | IT Data Center Systems | .|||. .|||. Cisco Linux Developer | ..:|||||||:...:|||||||:.. Phone: (919) 392-7363 | C i s c o S y s t e m s From mshefty at ichips.intel.com Thu Jan 12 11:08:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jan 2006 11:08:17 -0800 Subject: [openib-general] SA cache design In-Reply-To: <1137092213.4466.59.camel@brilong-lnx> References: <5D78D28F88822E4D8702BB9EEF1A4367D12CA2@mercury.infiniconsys.com> <43C584DD.70503@ichips.intel.com> <1137084572.4466.40.camel@brilong-lnx> <43C69CF7.309@ichips.intel.com> <1137092213.4466.59.camel@brilong-lnx> Message-ID: <43C6A921.9090903@ichips.intel.com> Brian Long wrote: > What about SQLite (http://www.sqlite.org/)? This is used by yum 2.4 in > Fedora Core and other distributions. > > "SQLite is a small C library that implements a self-contained, > embeddable, zero-configuration SQL database engine." Someone else sent me a link to this same site, and it looks promising. Thanks. - Sean From eitan at mellanox.co.il Thu Jan 12 11:15:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 12 Jan 2006 21:15:57 +0200 Subject: [openib-general] SA cache design Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4F8@mtlexch01.mtl.com> Hi Sean, The issue is the number of queries grow by N^2. Only a very small subset of queries is used: * PathRecord by SRC-GUID,DST-GUID * PortInfo by capability mask Not to say the current implementations are perfect. But RDBMS are optimized for other requirements not a simple single key lookup. 
Also, PathRecord implementation requires traversing the fabric. One could store the result after enumerating the entire N^2*Nsl*Np-key*... But then lookup is a simple hash lookup. Eitan > > Brian Long wrote: > > How much overhead is going to be incurred by using a standard RDBMS > > instead of not caching anything? I'm not completely familiar with the > > IB configurations that would benefit from the proposed SA cache, but it > > seems to me, adding a RDBMS to anything as fast as IB would actually > > slow things down considerably. Can an RDBMS + SA cache actually be > > faster than no cache at all? > > I'm not sure what the speed-up of any cache will be. The SA maintains a > database of various related records - node records, path records, service > records, etc. and responds to queries. This need doesn't go away. The SA > itself is perfect candidate to be implemented using a DBMS. (And if one had > been implemented over a DBMS, I'm not even sure that we'd be talking about > scalability issues for only a few thousand nodes. Is the perceived lack of > scalability of the SA a result of the architecture or the existing implementations?) > > My belief is that a DBMS will outperform anything that I could write to store > and retrieve these records. Consider that a 4000 node cluster will have about > 8,000,000 path records. Local caches can reduce this considerably (to about > 4000), and if we greatly restrict the type of queries that are supported, then > we can manage the retrieval of those records ourselves. > > I do not want end-users to have to administer a database. However, if the user > only needs to install a library, then this approach seems worth pursuing. 
> > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Jan 12 11:30:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jan 2006 11:30:34 -0800 Subject: [openib-general] SA cache design In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4F8@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4F8@mtlexch01.mtl.com> Message-ID: <43C6AE5A.1040502@ichips.intel.com> Eitan Zahavi wrote: > The issue is the number of queries grow by N^2. I understand. On a related note, why does every instance of the application need to query for every other instance? To establish all-to-all communication, couldn't instance X only initiate connections to instances > X? (I.e. 1 connects to 2 and 3, 2 connects to 3.) > Only a very small subset of queries is used: > * PathRecord by SRC-GUID,DST-GUID > * PortInfo by capability mask I did look at the code to see what queries were actually being used today. And yes, we can implement for only those cases. I wanted to allow the flexibility to support other queries efficiently. - Sean From mst at mellanox.co.il Thu Jan 12 11:35:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jan 2006 21:35:00 +0200 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060112175922.GF3106@esmail.cup.hp.com> References: <20060112175922.GF3106@esmail.cup.hp.com> Message-ID: <20060112193500.GC9256@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: ib_sdp ERR: IOCB dmesg output > > On Thu, Jan 12, 2006 at 09:52:45AM +0200, Michael S. Tsirkin wrote: > > > As noted earlier, netperf TCP_RR over SDP ran to completion > > > with no problems. netperf TCP_STREAM over SDP started spewing > > > the same errors despite the patches. 
:( > > > > OK, but can you unload the module now? > > Sorry - I just realized I checked the "netserver" machine > and not the "netperf" ("client"). client side still fails. :( > > iota:~# reload_ib > + IPoIB=30 > + ifconfig ib0 down > + ifconfig ib1 down > + rmmod ib_umad ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core > ERROR: Module ib_sdp is in use > ERROR: Module ib_cm is in use by ib_sdp > ERROR: Module ib_sa is in use by ib_sdp > ACPI: PCI interrupt for device 0000:81:00.0 disabled > GSI 60 (level, low) -> CPU 1 (0x0100) vector 60 unregistered > ERROR: Module ib_mad is in use by ib_cm,ib_sa > ERROR: Module ib_core is in use by ib_sdp,ib_cm,ib_sa,ib_mad > + modprobe ib_mthca msi_x=1 > ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) > ib_mthca: Initializing 0000:81:00.0 > GSI 60 (level, low) -> CPU 0 (0x0000) vector 60 > ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 60 > ib_mthca 0000:81:00.0: HCA FW version 3.3.2 is old (3.3.3 is current). > ib_mthca 0000:81:00.0: If you have problems, try updating your HCA FW. > + modprobe ib_ipoib > + modprobe ib_sdp > + modprobe ib_uverbs > + modprobe ib_umad > + ifconfig ib0 10.0.0.30 netmask 255.255.255.0 broadcast 10.0.0.255 > + ifconfig ib1 10.0.1.30 netmask 255.255.255.0 broadcast 10.0.1.255 > iota:~# lsmod > Module Size Used by > ib_umad 33648 0 > ib_uverbs 93096 0 > ib_ipoib 96128 0 > ib_mthca 274728 0 > ib_sdp 230480 3 > ib_cm 93964 1 ib_sdp > ib_sa 25324 2 ib_ipoib,ib_sdp > ib_mad 85952 4 ib_umad,ib_mthca,ib_cm,ib_sa > ib_core 93096 8 ib_umad,ib_uverbs,ib_ipoib,ib_mthca,ib_sdp,ib_cm,ib_sa,ib_mad > tulip 118064 0 > e1000 233420 0 > tg3 227280 0 > e100 83592 0 > iota:~# > > Looks like the error messages and sdp refcnt might be related. > (IIRC, 4 error msgs and SDP ref cnt is 3) > Since netperf is terminated by a timer signal, it's possible > traffic is still outstanding when it exits. Could that be a > cause of the "ERR: IOCB <-1> cancel" error messages? 
> > thanks, > grant > Yes. By the way, this is with zcopy set, isn't it? Could you try testing with zcopy off? -- MST From eitan at mellanox.co.il Thu Jan 12 11:51:02 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 12 Jan 2006 21:51:02 +0200 Subject: [openib-general] SA cache design Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4FA@mtlexch01.mtl.com> > > > On a related note, why does every instance of the application need to query for > every other instance? To establish all-to-all communication, couldn't instance > X only initiate connections to instances > X? (I.e. 1 connects to 2 and 3, 2 > connects to 3.) [EZ] MPI opens a connection from each node to every other node. Actually even from every CPU to every other CPU. So this is why we have N^2 connections. > > > Only a very small subset of queries is used: > > * PathRecord by SRC-GUID,DST-GUID > > * PortInfo by capability mask > > I did look at the code to see what queries were actually being used today. And > yes, we can implement for only those cases. I wanted to allow the flexibility > to support other queries efficiently. [EZ] The scalability issues we see today are what I most worry about. > > - Sean From iod00d at hp.com Thu Jan 12 11:50:04 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 12 Jan 2006 11:50:04 -0800 Subject: [openib-general] Re: SDP perf drop with 2.6.15 In-Reply-To: <20060112071803.GC5850@mellanox.co.il> References: <20060112071431.GE29168@esmail.cup.hp.com> <20060112071803.GC5850@mellanox.co.il> Message-ID: <20060112195004.GK3106@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 09:18:03AM +0200, Michael S. Tsirkin wrote: > > Makes me think it's a change in the interrupt code path. > > Any clues on where I might start with this? > > oprofile should show where we are spending the time. Executive summary: Silly me. Interrupt bindings changed and I didn't check after rolling to 2.6.15 (from 2.6.14). That explains the increased CPU utilization with lower performance.
Getting similar performance with -T 0,0 (2.6.15) vs -T 1,1 (2.6.14). I can't explain why q-syscollect *improves* perf by ~11 to 17%. I cc'd linux-ia64 hoping someone might explain it. Details: Since I prefer q-syscollect: grundler at gsyprf3:~/openib-perf-2006/rx2600-r4929$ LD_PRELOAD=/usr/local/lib/libsdp.so q-syscollect /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_RR -T 1,1 -c -C -- -r 1,1 -s 0 -S 0 libsdp.so: $LIBSDP_CONFIG_FILE not set. Using /usr/local/etc/libsdp.conf libsdp.so: $LIBSDP_CONFIG_FILE not set. Using /usr/local/etc/libsdp.conf bind_to_specific_processor: enter TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 16300.11 10.53 8.26 12.925 10.137 Weird. Performance jumps from 13900 to 16300 (+2400 or +17%). Hrm...something got me to look at /proc/interrupts and I see that mthca is interrupting on CPU0 now: 70: 644084899 0 PCI-MSI-X ib_mthca (comp) 71: 8 0 PCI-MSI-X ib_mthca (async) 72: 27247 0 PCI-MSI-X ib_mthca (cmd) Retest with -T 0,1 : 16384 87380 1 1 60.00 17557.94 6.06 10.88 6.909 12.390 And again -T 0,1 but without q-syscollect: 16384 87380 1 1 60.00 15891.41 6.13 7.61 7.713 9.571 Now with -T 0,0: 16384 87380 1 1 60.00 20719.03 5.93 5.26 5.724 5.076 with -T 0,0 and without q-syscollect: 16384 87380 1 1 60.00 18553.61 5.73 5.36 6.181 5.782 That's +11% on the last set. I'm stumped why q-syscollect would *improve* performance. Maybe someone on linux-ia64 has a theory?
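The percentages above can be reproduced from the transaction rates in the netperf output. A minimal sketch, using a hypothetical helper name and assuming the rates are compared as (new - old) / old:

```c
/*
 * Relative change from old_rate to new_rate, in percent.
 * Hypothetical helper, not part of netperf or q-syscollect.
 * The caller must pass a non-zero old_rate.
 */
double pct_change(double old_rate, double new_rate)
{
	return (new_rate - old_rate) / old_rate * 100.0;
}
```

Going from 13900 to 16300.11 transactions per second is roughly +17%, and going from 18553.61 (without q-syscollect) to 20719.03 (with it) on the -T 0,0 runs is roughly +11.7%.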
.q/ output from the runs below is available on: http://gsyprf3.external.hp.com/~grundler/openib-perf-2006/rx2600-r4929/.q The "insteresting" output files from the last q-syscollect run: netperf-pid5693-cpu0.info#0 q-syscollect-pid5691-cpu0.info#0 unknown-cpu0.info#2 q-view output for last run of netperf/SDP is: Command: /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_RR -T 0 0 -c -C -- -r 1 1 -s 0 -S 0 Flat profile of CPU_CYCLES in netperf-pid5693-cpu0.hist#0: Each histogram sample counts as 1.00034m seconds % time self cumul calls self/call tot/call name 27.18 1.52 1.52 21.8M 69.9n 69.9n _spin_unlock_irqrestore 16.08 0.90 2.42 1.18M 761n 846n schedule 6.24 0.35 2.77 1.29M 271n 2.56u sdp_inet_recv 4.33 0.24 3.01 2.59M 93.6n 93.6n __kernel_syscall_via_break 3.90 0.22 3.23 1.18M 185n 342n sdp_send_buff_post 3.51 0.20 3.43 1.24M 158n 933n sdp_inet_send 2.70 0.15 3.58 2.48M 60.8n 60.8n kmem_cache_free 2.11 0.12 3.69 4.82M 24.5n 24.5n kmem_cache_alloc 1.93 0.11 3.80 1.20M 90.1n 358n sdp_recv_flush 1.79 0.10 3.90 2.53M 39.5n 39.5n fget 1.75 0.10 4.00 1.22M 80.2n 2.95u sys_recv 1.65 0.09 4.09 1.22M 75.4n 414n sdp_send_data_queue_test 1.59 0.09 4.18 1.25M 71.3n 1.16u sys_send 1.54 0.09 4.27 - - - send_tcp_rr 1.41 0.08 4.35 2.33M 33.9n 33.9n sdp_buff_q_put_tail 1.22 0.07 4.41 21.7M 3.14n 3.14n _spin_lock_irqsave 1.09 0.06 4.48 1.25M 48.7n 48.7n __divsi3 1.06 0.06 4.53 1.23M 48.1n 2.86u sys_recvfrom 1.02 0.06 4.59 2.29M 24.9n 24.9n sba_map_single 1.02 0.06 4.65 2.49M 22.9n 63.0n sockfd_lookup 0.98 0.06 4.70 - - - sdp_exit 0.98 0.06 4.76 1.26M 43.7n 967n sock_sendmsg 0.97 0.05 4.81 2.38M 22.7n 72.3n sdp_buff_pool_get 0.95 0.05 4.87 2.50M 21.2n 21.2n fput 0.91 0.05 4.92 1.23M 41.4n 2.72u sock_recvmsg 0.89 0.05 4.97 1.19M 42.1n 62.9n memcpy_toiovec 0.88 0.05 5.02 2.50M 19.6n 19.6n __copy_user 0.86 0.05 5.06 1.21M 39.8n 868n schedule_timeout 0.82 0.05 5.11 1.29M 35.7n 128n send 0.79 0.04 5.15 1.25M 35.1n 35.1n sched_clock 0.64 0.04 5.19 1.29M 27.8n 123n recv 0.59 0.03 
5.22 1.23M 26.9n 26.9n __udivdi3 0.59 0.03 5.26 1.19M 27.8n 99.1n mthca_tavor_post_send .... q-view output for the q-syscollect process: Command: q-syscollect /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_ RR -T 0,0 -c -C -- -r 1,1 -s 0 -S 0 Flat profile of CPU_CYCLES in q-syscollect-pid5691-cpu0.hist#0: Each histogram sample counts as 1.00034m seconds % time self cumul calls self/call tot/call name 0.00 0.00 0.00 162 0.00 0.00 _spin_lock_irqsave 0.00 0.00 0.00 149 0.00 0.00 _spin_unlock_irqrestore 0.00 0.00 0.00 115 0.00 0.00 lock_timer_base 0.00 0.00 0.00 102 0.00 0.00 schedule 0.00 0.00 0.00 41.0 0.00 0.00 kfree ... Only notable thing here is it calls "schedule" and triggers an additional ~160 interrupts (my guess). unknown-cpu0.info#2 is probably interesting too: Flat profile of CPU_CYCLES in unknown-cpu0.hist#2: Each histogram sample counts as 1.00034m seconds % time self cumul calls self/call tot/call name 91.59 53.37 53.37 931k 57.3u 57.3u default_idle 4.53 2.64 56.00 10.8M 244n 244n _spin_unlock_irqrestore 1.78 1.04 57.04 1.27M 815n 885n schedule 0.75 0.43 57.48 1.38M 315n 2.30u mthca_eq_int 0.49 0.28 57.76 1.38M 206n 2.57u handle_IRQ_event 0.19 0.11 57.87 - - - cpu_idle 0.14 0.08 57.96 1.35M 62.1n 2.41u mthca_tavor_msi_x_interrupt 0.10 0.06 58.02 1.32M 45.4n 2.07u mthca_cq_completion ... thanks, grant From mshefty at ichips.intel.com Thu Jan 12 11:58:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jan 2006 11:58:28 -0800 Subject: [openib-general] SA cache design In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0967@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A43670A0967@mercury.infiniconsys.com> Message-ID: <43C6B4E4.7060002@ichips.intel.com> Rimmer, Todd wrote: > 1 million entry SA database. This is exactly why I think that the SA needs to be backed by a real DBMS. > In contrast the replica on each node only needs to handle O(N) entries. > And its lookup time could be O(logN). 
This is still O(NlogN) operations, which made me look at indexing schemes to
improve performance. The most obvious implementation to me was to store path
records in a binary tree sorted by dgid/pkey. But this isn't very flexible.

> why ask the SA the same question multiple times in a row?

I have no idea why the application did this. Are any of the queries in this
case actually the same?

- Sean

From mst at mellanox.co.il Thu Jan 12 11:58:28 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 21:58:28 +0200
Subject: [openib-general] Re: SDP perf drop with 2.6.15
In-Reply-To: <20060112195004.GK3106@esmail.cup.hp.com>
References: <20060112195004.GK3106@esmail.cup.hp.com>
Message-ID: <20060112195828.GD9256@mellanox.co.il>

Quoting r. Grant Grundler :
> Subject: Re: SDP perf drop with 2.6.15
>
> On Thu, Jan 12, 2006 at 09:18:03AM +0200, Michael S. Tsirkin wrote:
> > > Makes me think it's a change in the interrupt code path.
> > > Any clues on where I might start with this?
> >
> > oprofile should show where we are spending the time.
>
> Executive summary:
> Silly me. Interrupt bindings changed and I didn't check after rolling
> to 2.6.15 (from 2.6.14). That explains the increased CPU utilization
> with lower performance.
> Getting similar performance with -T 0,0 (2.6.15) vs -T 1,1 (2.6.14).
>

OK, so we are left with the ref count now. I'm seeing this myself sometimes.
I have some other stuff to attend to next week, and then I plan to look into
this.

--
MST

From mshefty at ichips.intel.com Thu Jan 12 12:02:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 12 Jan 2006 12:02:39 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4FA@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4FA@mtlexch01.mtl.com>
Message-ID: <43C6B5DF.1030403@ichips.intel.com>

Eitan Zahavi wrote:
> [EZ] MPI opens a connection from each node to every other node. Actually
> even from every CPU to every other CPU. So this is why we have N^2
> connections.

I was confusing myself. I think that there are n(n-1)/2 connections, but
that's still O(n^2).

- Sean

From trimmer at silverstorm.com Thu Jan 12 12:27:05 2006
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Thu, 12 Jan 2006 15:27:05 -0500
Subject: [openib-general] SA cache design
Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12CC8@mercury.infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> > why ask the SA the same question multiple times in a row?
>
> I have no idea why the application did this. Are any of the queries in this
> case actually the same?

Each MPI process is independent. However, they all need to get path records
for all the other processes/nodes in the system.
Hence, each process on a node will make the exact same set of queries.

Todd R.

From iod00d at hp.com Thu Jan 12 12:36:23 2006
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 12 Jan 2006 12:36:23 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <43C6B4E4.7060002@ichips.intel.com>
References: <5D78D28F88822E4D8702BB9EEF1A43670A0967@mercury.infiniconsys.com> <43C6B4E4.7060002@ichips.intel.com>
Message-ID: <20060112203623.GL3106@esmail.cup.hp.com>

On Thu, Jan 12, 2006 at 11:58:28AM -0800, Sean Hefty wrote:
> This is still O(NlogN) operations, which made me look at indexing schemes
> to improve performance.

I strongly associate "Indexing schemes" with "judy":
    http://docs.hp.com/en/B6841-90001/ix01.html

The open source project is here:
    http://judy.sourceforge.net/

> The most obvious implementation to me was to store path records in a binary
> tree sorted by dgid/pkey. But this isn't very flexible.

A "dynamic, associative array" might be overkill too. I'm not sure how many
indexes it supports, but Judy is definitely worth looking at for a "simple"
implementation. Perf data I saw 3-4 years ago indicated that Judy
scales nicely from 0 to several million entries.
grant

From jice at pantasys.com Thu Jan 12 13:12:57 2006 From: jice at pantasys.com (Jean-Christophe Hugly) Date: Thu, 12 Jan 2006 13:12:57 -0800 Subject: [openib-general] Error when loading ib_umad In-Reply-To: References: <7b2fa1820601100631q46ea5b30x2fbceedf0016bdbd@mail.gmail.com> Message-ID: <1137100377.24043.134.camel@jhugly.pantasys.com> On Tue, 2006-01-10 at 07:07 -0800, Roland Dreier wrote: > Are you ignoring compile warnings about class_device_create()? Since > 2.6.15 is out, the OpenIB svn does not support 2.6.14 any more, so you > may run into problems like this. > > You will probably need to restore the compatibility hack removed in > r4784 by adding something like > > #include > > #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15) > #define class_device_create(cls, parent, devt, device, fmt, arg...) \ > class_device_create(cls, devt, device, fmt, ## arg) > #endif > > to . Thanks a bunch. Since I'm just the 3rd guy hitting that, may be a little addition to the backport set of patches is in order. I do not mind generating it, but I am not sure of the correct procedure: the patches are grouped by reason, but they could overlap (in that case it would) should they be grouped by affected file instead ? Or is there a way to specify in which order they should be applied ? Whoever maintains these BP patches might have some recommendation to make. -- Jean-Christophe Hugly PANTA From rdreier at cisco.com Thu Jan 12 13:15:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jan 2006 13:15:18 -0800 Subject: [openib-general] ipoib_multicast_leak.patch In-Reply-To: <20060111222349.GC27348@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 00:23:49 +0200") References: <20060111222349.GC27348@mellanox.co.il> Message-ID: > ipoib_multicast_leak.patch It seems things can be simplified still further. I didn't see a reason to have both ipoib_mcast_dev_flush() and ipoib_mcast_dev_down(), so I rolled everything up into ipoib_mcast_dev_flush().
This passes light testing here (and I checked that mcast allocs are balanced by mcast frees). How does this look to you? - R. --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 4973) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -742,50 +742,23 @@ void ipoib_mcast_dev_flush(struct net_de { struct ipoib_dev_priv *priv = netdev_priv(dev); LIST_HEAD(remove_list); - struct ipoib_mcast *mcast, *tmcast, *nmcast; + struct ipoib_mcast *mcast, *tmcast; unsigned long flags; ipoib_dbg_mcast(priv, "flushing multicast list\n"); spin_lock_irqsave(&priv->lock, flags); - list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { - nmcast = ipoib_mcast_alloc(dev, 0); - if (nmcast) { - nmcast->flags = - mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); - - nmcast->mcmember.mgid = mcast->mcmember.mgid; - - /* Add the new group in before the to-be-destroyed group */ - list_add_tail(&nmcast->list, &mcast->list); - list_del_init(&mcast->list); - - rb_replace_node(&mcast->rb_node, &nmcast->rb_node, - &priv->multicast_tree); - list_add_tail(&mcast->list, &remove_list); - } else { - ipoib_warn(priv, "could not reallocate multicast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - } + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + list_del(&mcast->list); + rb_erase(&mcast->rb_node, &priv->multicast_tree); + list_add_tail(&mcast->list, &remove_list); } if (priv->broadcast) { - nmcast = ipoib_mcast_alloc(dev, 0); - if (nmcast) { - nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; - - rb_replace_node(&priv->broadcast->rb_node, - &nmcast->rb_node, - &priv->multicast_tree); - - list_add_tail(&priv->broadcast->list, &remove_list); - priv->broadcast = nmcast; - } else - ipoib_warn(priv, "could not reallocate broadcast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(priv->broadcast->mcmember.mgid)); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + list_add_tail(&priv->broadcast->list, &remove_list); + 
priv->broadcast = NULL; } spin_unlock_irqrestore(&priv->lock, flags); @@ -796,24 +769,6 @@ void ipoib_mcast_dev_flush(struct net_de } } -void ipoib_mcast_dev_down(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned long flags; - - /* Delete broadcast since it will be recreated */ - if (priv->broadcast) { - ipoib_dbg_mcast(priv, "deleting broadcast group\n"); - - spin_lock_irqsave(&priv->lock, flags); - rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_mcast_leave(dev, priv->broadcast); - ipoib_mcast_free(priv->broadcast); - priv->broadcast = NULL; - } -} - void ipoib_mcast_restart_task(void *dev_ptr) { struct net_device *dev = dev_ptr; --- infiniband/ulp/ipoib/ipoib_ib.c (revision 4973) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -453,17 +453,8 @@ int ipoib_ib_dev_down(struct net_device } ipoib_mcast_stop_thread(dev, 1); - - /* - * Flush the multicast groups first so we stop any multicast joins. The - * completion thread may have already died and we may deadlock waiting - * for the completion thread to finish some multicast joins. - */ ipoib_mcast_dev_flush(dev); - /* Delete broadcast and local addresses since they will be recreated */ - ipoib_mcast_dev_down(dev); - ipoib_flush_paths(dev); return 0; @@ -624,9 +615,7 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_dbg(priv, "cleaning up ib_dev\n"); ipoib_mcast_stop_thread(dev, 1); - - /* Delete the broadcast address and the local address */ - ipoib_mcast_dev_down(dev); + ipoib_mcast_dev_flush(dev); ipoib_transport_dev_cleanup(dev); } From mst at mellanox.co.il Thu Jan 12 13:32:49 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 12 Jan 2006 23:32:49 +0200
Subject: [openib-general] Re: ipoib: outstanding patches
In-Reply-To:
References:
Message-ID: <20060112213248.GH9256@mellanox.co.il>

Quoting Roland Dreier :
> > ipoib_mcast_send.patch
>
> Could we reuse the IPOIB_MCAST_RUN bit rather than adding a new bit?
> It seems that we could kill mcast_mutex and replace uses with
> priv->lock instead -- I don't see anything that sleeps inside mcast_mutex.

Yes, I now believe that we should be able to do it this way.

--
MST

From mst at mellanox.co.il Thu Jan 12 13:56:31 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jan 2006 23:56:31 +0200
Subject: [openib-general] ipoib_multicast_leak.patch
In-Reply-To:
References:
Message-ID: <20060112215631.GJ9256@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: [openib-general] ipoib_multicast_leak.patch
>
> > ipoib_multicast_leak.patch
>
> It seems things can be simplified still further. I didn't see a
> reason to have both ipoib_mcast_dev_flush() and ipoib_mcast_dev_down(),
> so I rolled everything up into ipoib_mcast_dev_flush(). This passes
> light testing here (and I checked that mcast allocs are balanced by
> mcast frees). How does this look to you?
>
> - R.

Looks sane. But what was the idea behind all the complexity in dev_flush?
Was there some reason for it?

--
MST

From mshefty at ichips.intel.com Thu Jan 12 13:58:40 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 12 Jan 2006 13:58:40 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A4367D12CC8@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367D12CC8@mercury.infiniconsys.com>
Message-ID: <43C6D110.2040006@ichips.intel.com>

Rimmer, Todd wrote:
> Each MPI process is independent. However they all need to get pathrecords
> for all the other processes/nodes in the system.
> Hence, each process on a node will make the exact same set of queries.

That should still only be P queries per node, with P = number of processes on a
node. Why doesn't a single query (GET_TABLE) suffice for each process?

- Sean

From mst at mellanox.co.il Thu Jan 12 14:00:55 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 13 Jan 2006 00:00:55 +0200
Subject: [openib-general] Re: Error when loading ib_umad
In-Reply-To: <1137100377.24043.134.camel@jhugly.pantasys.com>
References: <1137100377.24043.134.camel@jhugly.pantasys.com>
Message-ID: <20060112220054.GK9256@mellanox.co.il>

Quoting Jean-Christophe Hugly :
> Subject: Re: Error when loading ib_umad
>
> On Tue, 2006-01-10 at 07:07 -0800, Roland Dreier wrote:
> > Are you ignoring compile warnings about class_device_create()? Since
> > 2.6.15 is out, the OpenIB svn does not support 2.6.14 any more, so you
> > may run into problems like this.
> >
> > You will probably need to restore the compatibility hack removed in
> > r4784 by adding something like
> >
> > #include
> >
> > #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15)
> > #define class_device_create(cls, parent, devt, device, fmt, arg...) \
> >         class_device_create(cls, devt, device, fmt, ## arg)
> > #endif
> >
> > to .
>
> Thanks a bunch. Since I'm just the 3rd guy hitting that, may be a little
> addition to the backport set of patches is in order.

Yes, that's now under
https://openib.org/svn/gen2/branches/backport/2.6.14

> I do not mind
> generating it, but I am not sure of the correct procedure: the patches
> are grouped by reason, but they could overlap (in that case it would)
> should they be grouped by affected file instead ? Or is there a way to
> specify in which order they should be applied ? Whoever maintains these
> BP patches might have some recommendation to make.

If you are talking about the patches under
https://openib.org/svn/gen2/branches/backport
I'm being careful to make them independent.
You'll be able to apply them in whatever order (possibly with some fuzz).
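
For anyone wondering why the compatibility hack quoted above does not recurse:
a function-like macro is not expanded again inside its own expansion, so the
inner class_device_create refers to the real function. Below is a minimal
user-space sketch of that trick. The stand-in function, its int return value
and the argument values are all hypothetical (this is not the kernel API),
and it uses standard __VA_ARGS__ rather than the GNU "arg..." form so it
builds with any C99 compiler.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Stand-in for the older 4-argument API (hypothetical, user space only).
 * It returns devt so that the result is easy to check.
 */
static int class_device_create(const char *cls, int devt, const char *device,
                               const char *fmt, ...)
{
    char name[64];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(name, sizeof name, fmt, ap);
    va_end(ap);
    printf("%s: devt=%d device=%s name=%s\n", cls, devt, device, name);
    return devt;
}

/*
 * Callers are written against the newer 5-argument form. The macro drops
 * the "parent" argument; the inner class_device_create is not expanded a
 * second time, so it names the function defined above.
 */
#define class_device_create(cls, parent, devt, device, ...) \
    class_device_create(cls, devt, device, __VA_ARGS__)

/* Example caller using the newer calling convention. */
static int create_umad_node(int port)
{
    return class_device_create("infiniband_mad", NULL, 231, "mthca0",
                               "umad%d", port);
}
```

Calling create_umad_node(0) goes through the macro and ends up in the
4-argument function with the parent argument removed.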

--
MST

From rdreier at cisco.com Thu Jan 12 14:08:50 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jan 2006 14:08:50 -0800
Subject: [openib-general] ipoib_multicast_leak.patch
In-Reply-To: <20060112215631.GJ9256@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 23:56:31 +0200")
References: <20060112215631.GJ9256@mellanox.co.il>
Message-ID:

    Michael> Looks sane. But what was the idea behind all the
    Michael> complexity in dev_flush? Was there some reason for it?

I'm not sure. It must be historical.

 - R.

From mst at mellanox.co.il Thu Jan 12 14:27:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 13 Jan 2006 00:27:04 +0200
Subject: [openib-general] Re: ipoib_multicast_leak.patch
In-Reply-To:
References:
Message-ID: <20060112222703.GL9256@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: ipoib_multicast_leak.patch
>
> Michael> Looks sane. But what was the idea behind all the
> Michael> complexity in dev_flush? Was there some reason for it?
>
> I'm not sure. It must be historical.

OK then.

BTW, is what's going on with the patches clear now, or still confusing?
Basically there's a pile of patches in svn, and the
ipoib_all_neigh_issues_2.patch for trunk that works on 2.6.15; for 2.6.16 it
could be replaced by the patch set that I sent previously.

--
MST

From trimmer at silverstorm.com Thu Jan 12 14:27:47 2006
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Thu, 12 Jan 2006 17:27:47 -0500
Subject: [openib-general] SA cache design
Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0969@mercury.infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Rimmer, Todd wrote:
> > Each MPI process is independent. However they all need to get pathrecords
> > for all the other processes/nodes in the system.
> > Hence, each process on a node will make the exact same set of queries.
>
> That should still only be P queries per node, with P = number of processes
> on a node. Why doesn't a single query (GET_TABLE) suffice for each process?

Consider a cluster with 1000 nodes and 4 processors per node.
A given MPI run may choose to use a subset, for example 500 processes.
Each process needs path records for the other 500 processes, but not for
the other 3500 CPUs in the cluster.

While each process could do a GET_TABLE for all path records, that
would be rather inefficient and would provide 1,000,000 path records in
the RMPP response, of which only 500 are of interest.

Even if all 4000 processors were being used in a single run, each
process only needs 3999 path records (999 of which are unique).
In fact a given node will never need more than N of the N^2 path records,
because the remaining ones involve paths where this node is not involved.
So getting all 1,000,000 path records would be very inefficient.

Then multiply this by 4 processes per node making this same set of
queries. Then multiply this by multiple partitions, SLs, etc. per node
and it gets very inefficient to simply get the whole table.

Todd Rimmer

From rdreier at cisco.com Thu Jan 12 14:28:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jan 2006 14:28:15 -0800
Subject: [openib-general] Re: ipoib_multicast_leak.patch
In-Reply-To: <20060112222703.GL9256@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 13 Jan 2006 00:27:04 +0200")
References: <20060112222703.GL9256@mellanox.co.il>
Message-ID:

    Michael> BTW, is what's going on with the patches clear now, or
    Michael> still confusing?

Yeah, I think I've got a handle on it for now.

 - R.
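
The counts in Todd Rimmer's SA cache message above can be checked with a few
lines of C. This is only a sketch of the arithmetic, assuming one path record
per ordered (source node, destination node) pair, i.e. a single partition and
SL with NumPath = 1; the function names are made up for illustration.

```c
#include <stdint.h>

/* Ordered (source, destination) node pairs, self-paths included: N^2. */
static uint64_t fabric_path_records(uint64_t nodes)
{
    return nodes * nodes;
}

/* Records that have one given node as source, excluding itself: N - 1. */
static uint64_t node_path_records(uint64_t nodes)
{
    return nodes - 1;
}

/* Peers a single process must reach when every CPU runs one process. */
static uint64_t process_peers(uint64_t nodes, uint64_t procs_per_node)
{
    return nodes * procs_per_node - 1;
}

/* Unordered pairs, as for connections between n endpoints: n(n-1)/2. */
static uint64_t connection_pairs(uint64_t n)
{
    return n * (n - 1) / 2;
}
```

With 1000 nodes and 4 processes per node these give the 1,000,000, 999 and
3999 figures from the message, and connection_pairs() is the n(n-1)/2 count
mentioned earlier in the thread.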

From mshefty at ichips.intel.com Thu Jan 12 14:54:23 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 12 Jan 2006 14:54:23 -0800
Subject: [openib-general] SA cache design
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0969@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A43670A0969@mercury.infiniconsys.com>
Message-ID: <43C6DE1F.6050302@ichips.intel.com>

Rimmer, Todd wrote:
> While each process could do a GET_TABLE for all path records that
> would be rather inefficient and would provide 1,000,000 path records in
> the RMPP response, of which only 500 are of interest.

Each process could do a GET_TABLE for only those path records with the SGID
set to their local port and NumPath set to 1. This would give them only 1000
or so path records, most of which are of interest.

> Even if all 4000 processors were being used in a single run, each
> process only needs 3999 path records (999 of which are unique).
> In fact a given node will never need more than N of the N^2 path records
> because the remaining involve paths where this node is not involved.
> so getting all 1,000,000 path records would be very inefficient.

Even a local cache wouldn't get every possible path record. The application
should be no different. An application that wants to connect to every node on
the fabric should only need to issue a single path record query, whose results
are all of interest.

- Sean

From rdreier at cisco.com Thu Jan 12 15:28:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jan 2006 15:28:12 -0800
Subject: [openib-general] Re: [PATCH] mthca_cmd small cleanup
In-Reply-To: <20060112095454.GE16938@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 11:54:54 +0200")
References: <20060112095454.GE16938@mellanox.co.il>
Message-ID:

Thanks, applied.
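
Going back to Sean Hefty's SA cache reply above: the effect of restricting a
GET_TABLE to records whose SGID is the local port can be pictured as a filter
over the full table. The sketch below is plain user-space C and assumes a
heavily simplified record layout; struct path_rec and select_by_sgid() are
hypothetical names, not the SA or OpenIB API, and a real PathRecord carries
many more fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified path record (illustration only). */
struct path_rec {
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t pkey;
};

/*
 * Copy the records whose source GID equals sgid into out (at most max of
 * them) and return how many matching records the table contains. Passing
 * out == NULL only counts the matches.
 */
static size_t select_by_sgid(const struct path_rec *table, size_t n,
                             const uint8_t sgid[16],
                             struct path_rec *out, size_t max)
{
    size_t matches = 0;

    for (size_t i = 0; i < n; i++) {
        if (memcmp(table[i].sgid, sgid, 16) != 0)
            continue;
        if (out && matches < max)
            out[matches] = table[i];
        matches++;
    }
    return matches;
}
```

Out of an N^2 table this keeps only the roughly N records that start at the
local port, which is the reduction from 1,000,000 to about 1000 discussed in
the thread.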
From rdreier at cisco.com Thu Jan 12 15:35:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jan 2006 15:35:21 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix mem leaks in mthca_provider error handling In-Reply-To: <20060112142359.GA6536@mellanox.co.il> (Jack Morgenstein's message of "Thu, 12 Jan 2006 16:23:59 +0200") References: <20060112142359.GA6536@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Jan 12 15:44:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jan 2006 15:44:12 -0800 Subject: [openib-general] Re: [PATCH] mthca: cosmetics: use ALIGN macro In-Reply-To: <20060112151502.GL16938@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 17:15:02 +0200") References: <20060112105631.GF16938@mellanox.co.il> <20060112151502.GL16938@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Jan 12 15:55:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jan 2006 15:55:50 -0800 Subject: [openib-general] Re: [PATCH] mthca: uninitialized variable In-Reply-To: <20060112151644.GM16938@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 12 Jan 2006 17:16:44 +0200") References: <20060112091418.GB16938@mellanox.co.il> <20060112151644.GM16938@mellanox.co.il> Message-ID: Thanks, applied. From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 2/6] IPoIB: Fix memory leak of multicast group structures In-Reply-To: <1137111197380-341f286bd5273779@cisco.com> Message-ID: <1137111197380-f482e88c451680c0@cisco.com> The current handling of multicast groups in IPoIB ends up never freeing send-only multicast groups. It turns out the logic was much more complicated than it needed to be; we can fix this bug and completely kill ipoib_mcast_dev_down() at the same time. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 13 ----- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 61 +++--------------------- 2 files changed, 9 insertions(+), 65 deletions(-) 988bd50300ef2e2d5cb8563e2ac99453dd9acd86 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 2388580..f7e8489 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -453,17 +453,8 @@ int ipoib_ib_dev_down(struct net_device } ipoib_mcast_stop_thread(dev, 1); - - /* - * Flush the multicast groups first so we stop any multicast joins. The - * completion thread may have already died and we may deadlock waiting - * for the completion thread to finish some multicast joins. - */ ipoib_mcast_dev_flush(dev); - /* Delete broadcast and local addresses since they will be recreated */ - ipoib_mcast_dev_down(dev); - ipoib_flush_paths(dev); return 0; @@ -624,9 +615,7 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_dbg(priv, "cleaning up ib_dev\n"); ipoib_mcast_stop_thread(dev, 1); - - /* Delete the broadcast address and the local address */ - ipoib_mcast_dev_down(dev); + ipoib_mcast_dev_flush(dev); ipoib_transport_dev_cleanup(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index bf1c08c..7403bac 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -742,50 +742,23 @@ void ipoib_mcast_dev_flush(struct net_de { struct ipoib_dev_priv *priv = netdev_priv(dev); LIST_HEAD(remove_list); - struct ipoib_mcast *mcast, *tmcast, *nmcast; + struct ipoib_mcast *mcast, *tmcast; unsigned long flags; ipoib_dbg_mcast(priv, "flushing multicast list\n"); spin_lock_irqsave(&priv->lock, flags); - list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { - nmcast = ipoib_mcast_alloc(dev, 0); - if (nmcast) { - nmcast->flags = - mcast->flags & (1 << 
IPOIB_MCAST_FLAG_SENDONLY); - - nmcast->mcmember.mgid = mcast->mcmember.mgid; - - /* Add the new group in before the to-be-destroyed group */ - list_add_tail(&nmcast->list, &mcast->list); - list_del_init(&mcast->list); - - rb_replace_node(&mcast->rb_node, &nmcast->rb_node, - &priv->multicast_tree); - list_add_tail(&mcast->list, &remove_list); - } else { - ipoib_warn(priv, "could not reallocate multicast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - } + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + list_del(&mcast->list); + rb_erase(&mcast->rb_node, &priv->multicast_tree); + list_add_tail(&mcast->list, &remove_list); } if (priv->broadcast) { - nmcast = ipoib_mcast_alloc(dev, 0); - if (nmcast) { - nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; - - rb_replace_node(&priv->broadcast->rb_node, - &nmcast->rb_node, - &priv->multicast_tree); - - list_add_tail(&priv->broadcast->list, &remove_list); - priv->broadcast = nmcast; - } else - ipoib_warn(priv, "could not reallocate broadcast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(priv->broadcast->mcmember.mgid)); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + list_add_tail(&priv->broadcast->list, &remove_list); + priv->broadcast = NULL; } spin_unlock_irqrestore(&priv->lock, flags); @@ -796,24 +769,6 @@ void ipoib_mcast_dev_flush(struct net_de } } -void ipoib_mcast_dev_down(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned long flags; - - /* Delete broadcast since it will be recreated */ - if (priv->broadcast) { - ipoib_dbg_mcast(priv, "deleting broadcast group\n"); - - spin_lock_irqsave(&priv->lock, flags); - rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_mcast_leave(dev, priv->broadcast); - ipoib_mcast_free(priv->broadcast); - priv->broadcast = NULL; - } -} - void ipoib_mcast_restart_task(void *dev_ptr) { struct net_device *dev = dev_ptr; -- 
1.0.7 From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 6/6] IB/mthca: Initialize grh_present before using it In-Reply-To: <1137111197380-d647455e061ba3b8@cisco.com> Message-ID: <1137111197380-ba6aed04e3bd8864@cisco.com> build_mlx_header() was using sqp->ud_header.grh_present before it was initialized by mthca_read_ah(). Furthermore, header->grh_present is set by ib_ud_header_init, so there's no need to set it again in mthca_read_ah(). Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_av.c | 10 ++++++---- drivers/infiniband/hw/mthca/mthca_dev.h | 1 + drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- 3 files changed, 8 insertions(+), 5 deletions(-) 9eacee2ac624bfa9740d49355dbe6ee88d0cba0a diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index 22fdc44..a14eed0 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -163,6 +163,11 @@ int mthca_destroy_ah(struct mthca_dev *d return 0; } +int mthca_ah_grh_present(struct mthca_ah *ah) +{ + return !!(ah->av->g_slid & 0x80); +} + int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) { @@ -172,8 +177,7 @@ int mthca_read_ah(struct mthca_dev *dev, header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; header->lrh.destination_lid = ah->av->dlid; header->lrh.source_lid = cpu_to_be16(ah->av->g_slid & 0x7f); - if (ah->av->g_slid & 0x80) { - header->grh_present = 1; + if (mthca_ah_grh_present(ah)) { header->grh.traffic_class = (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; header->grh.flow_label = @@ -184,8 +188,6 @@ int mthca_read_ah(struct mthca_dev *dev, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); - } else { - header->grh_present = 0; } return 0; diff --git 
a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 795b379..a104ab0 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -520,6 +520,7 @@ int mthca_create_ah(struct mthca_dev *de int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header); +int mthca_ah_grh_present(struct mthca_ah *ah); int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 564b6d5..fba608e 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1434,7 +1434,7 @@ static int build_mlx_header(struct mthca u16 pkey; ib_ud_header_init(256, /* assume a MAD */ - sqp->ud_header.grh_present, + mthca_ah_grh_present(to_mah(wr->wr.ud.ah)), &sqp->ud_header); err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); -- 1.0.7 From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 5/6] IB/mthca: Cosmetic: use the ALIGN macro In-Reply-To: <1137111197380-64102fbf42547cb5@cisco.com> Message-ID: <1137111197380-d647455e061ba3b8@cisco.com> Use the ALIGN macro to simplify some rounding code. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) c063a06835d3ccfa6c039d3a3869fcf22249c862 diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index f69e489..be1791b 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -727,8 +727,8 @@ int mthca_QUERY_FW(struct mthca_dev *dev * system pages needed. */ dev->fw.arbel.fw_pages = - (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> - (PAGE_SHIFT - 12); + ALIGN(dev->fw.arbel.fw_pages, PAGE_SIZE >> 12) >> + (PAGE_SHIFT - 12); mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", (unsigned long long) dev->fw.arbel.clr_int_base, @@ -1445,6 +1445,7 @@ int mthca_SET_ICM_SIZE(struct mthca_dev * pages needed. */ *aux_pages = (*aux_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> (PAGE_SHIFT - 12); + *aux_pages = ALIGN(*aux_pages, PAGE_SIZE >> 12) >> (PAGE_SHIFT - 12); return 0; } -- 1.0.7 From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 4/6] IB/mthca: Fix memory leaks in error handling In-Reply-To: <1137111197380-7741e9b26c0a0236@cisco.com> Message-ID: <1137111197380-64102fbf42547cb5@cisco.com> Fix memory leaks in mthca_create_qp() and mthca_create_srq() error handling. 
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_provider.c | 10 +++++++--- 1 files changed, 7 insertions(+), 3 deletions(-) 17e2e819517d75f2f3407e59c5f7f6f0ef305d14 diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index db35690..484a7e6 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -445,8 +445,10 @@ static struct ib_srq *mthca_create_srq(s if (pd->uobject) { context = to_mucontext(pd->uobject->context); - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) - return ERR_PTR(-EFAULT); + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err_free; + } err = mthca_map_user_db(to_mdev(pd->device), &context->uar, context->db_tab, ucmd.db_index, @@ -522,8 +524,10 @@ static struct ib_qp *mthca_create_qp(str if (pd->uobject) { context = to_mucontext(pd->uobject->context); - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + kfree(qp); return ERR_PTR(-EFAULT); + } err = mthca_map_user_db(to_mdev(pd->device), &context->uar, context->db_tab, -- 1.0.7 From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 3/6] IB/mthca: Fix memory leak of multicast group structures In-Reply-To: <1137111197380-f482e88c451680c0@cisco.com> Message-ID: <1137111197380-7741e9b26c0a0236@cisco.com> Convert "/ (1 << lg)" to ">> lg" for a slight code size reduction. add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24) function old new delta mthca_map_cmd 613 589 -24 Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) 59f174faffd5dfee709fa0ead320cc6daf827e93 diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 22ac72b..f69e489 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -606,7 +606,7 @@ static int mthca_map_cmd(struct mthca_de err = -EINVAL; goto out; } - for (i = 0; i < mthca_icm_size(&iter) / (1 << lg); ++i) { + for (i = 0; i < mthca_icm_size(&iter) >> lg; ++i) { if (virt != -1) { pages[nent * 2] = cpu_to_be64(virt); virt += 1 << lg; -- 1.0.7 From rolandd at cisco.com Thu Jan 12 16:13:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 00:13:17 +0000 Subject: [openib-general] [git patch review 1/6] IPoIB: Take dev->xmit_lock around mc_list accesses Message-ID: <1137111197380-341f286bd5273779@cisco.com> dev->mc_list accesses must be protected by dev->xmit_lock. Found by Eli Cohen . Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) 78bfe0b5b67fe126ed98608e42e42fb6ed9aabd4 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 03b2ca6..bf1c08c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -827,7 +827,8 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_mcast_stop_thread(dev, 0); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&dev->xmit_lock, flags); + spin_lock(&priv->lock); /* * Unfortunately, the networking core only gives us a list of all of @@ -899,7 +900,9 @@ void ipoib_mcast_restart_task(void *dev_ list_add_tail(&mcast->list, &remove_list); } } - spin_unlock_irqrestore(&priv->lock, flags); + + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&dev->xmit_lock, flags); /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { -- 1.0.7 From ralphc at pathscale.com Thu Jan 12 16:22:20 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 12 Jan 2006 16:22:20 -0800 Subject: [openib-general] Problem with directed route SMPs with beginning or ending LID routed parts Message-ID: <1137111740.4520.208.camel@brick.internal.keyresearch.com> I'm trying to resolve a bug in the directed route handling code. I thought I would alert the general list in case someone had a solution. I'm still looking for a clean fix, but it's difficult. The basic problem is that smi_handle_dr_smp_send() and smi_handle_dr_smp_recv() are modifying the directed route part (inc/dec hop_ptr) when the packet is in the LID routed part of the path. Here is an example: Receive SubnGet(NodeInfo) LRH:DLID=0x0009, LRH:SLID=0x000A, hop_ptr=2, hop_cnt=1, DrSLID=0xFFFF, DrDLID=0x0009. It is processed OK through ib_mad_recv_done_handler() ... 
port_priv->device->process_mad() generates OK response ... agent_send_response() calls ib_create_ah_from_wc() which creates the correct AH (to 0x000A) ... ib_post_send_mad() calls handle_outgoing_dr_smp() which calls smi_handle_dr_smp_send() which INCORRECTLY decrements hop_ptr since this is a reply to 0x0009 not 0xFFFF. The difficulty is that at this point, the AH is opaque so you can't easily tell that the DLID isn't the permissive LID. You can see that DrDLID isn't 0xFFFF but you can't just return 1 in smi_handle_dr_smp_send() because if OpenIB received this same reply (i.e., on node with LID=0x000A), it would still think it's at the beginning of the destination LID routed part and not decrement hop_ptr. I think there is a similar problem when sending requests where the initial part of the path is LID routed. -- Ralph Campbell From halr at voltaire.com Thu Jan 12 17:43:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Jan 2006 20:43:18 -0500 Subject: [openib-general] Problem with directed route SMPs with beginning or ending LID routed parts In-Reply-To: <1137111740.4520.208.camel@brick.internal.keyresearch.com> References: <1137111740.4520.208.camel@brick.internal.keyresearch.com> Message-ID: <1137116358.4322.5606.camel@hal.voltaire.com> On Thu, 2006-01-12 at 19:22, Ralph Campbell wrote: > I'm trying to resolve a bug in the directed route handling code. > I thought I would alert the general list in case someone had > a solution. I'm continuing to find a clean fix but its > difficult. > > The basic problem is that smi_handle_dr_smp_send() and > smi_handle_dr_smp_recv() are modifying the directed > route part (inc/dec hop_ptr) when the packet is in the LID > routed part of the path. > > Here is an example: > Receive SubnGet(NodeInfo) LRH:DLID=0x0009, LRH:SLID=0x000A, > hop_ptr=2, hop_cnt=1, DrSLID=0xFFFF, DrDLID=0x0009. > It is processed OK through ib_mad_recv_done_handler() ... > port_priv->device->process_mad() generates OK response ... 
> agent_send_response() calls ib_create_ah_from_wc() > which creates the correct AH (to 0x000A) ... > ib_post_send_mad() calls handle_outgoing_dr_smp() which > calls smi_handle_dr_smp_send() which INCORRECTLY decrements > hop_ptr since this is a reply to 0x0009 not 0xFFFF. > > The difficulty is that at this point, the AH is opaque so > you can't easily tell that the DLID isn't the permissive LID. > You can see that DrDLID isn't 0xFFFF but you can't just > return 1 in smi_handle_dr_smp_send() because if OpenIB > received this same reply (i.e., on node with LID=0x000A), it > would still think its at the beginning of the destination LID > routed part and not decrement hop_ptr. > > I think there is a similar problem when sending requests > where the initial part of the path is LID routed. Yes, I've been aware of this test case for some time now but haven't had the chance to figure out a good solution either. It currently is a compliance test which is not used by any SM that I'm aware of. Maybe I'll have some cycles early next week to work on this. Thanks for the analysis of it. -- Hal From ralphc at pathscale.com Thu Jan 12 18:44:40 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 12 Jan 2006 18:44:40 -0800 Subject: [openib-general] [PATCH] Problem with directed route SMPs with beginning or ending LID routed parts Message-ID: <1137120280.4520.214.camel@brick.internal.keyresearch.com> I have only done basic testing (i.e., the link comes up and IPoIB still works) but I think this fixes the problem. 
Signed-off-by: Ralph Campbell Index: include/rdma/ib_mad.h =================================================================== --- include/rdma/ib_mad.h (revision 4978) +++ include/rdma/ib_mad.h (working copy) @@ -596,7 +596,7 @@ */ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, - int rmpp_active, + int rmpp_active, int directed_route, int hdr_len, int data_len, gfp_t gfp_mask); Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 4978) +++ core/user_mad.c (working copy) @@ -51,6 +51,7 @@ #include #include +#include MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); @@ -339,6 +340,8 @@ __be64 *tid; int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; + int directed_route; + struct ib_smp *smp; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -415,9 +418,20 @@ goto err_ah; } + /* + * Directed route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * See section 14.2.2, Vol 1 IB spec. + */ + smp = (struct ib_smp *) packet->mad.data; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + directed_route = (ib_get_smp_direction(smp) ? 
+ smp->dr_dlid : smp->dr_slid) == IB_LID_PERMISSIVE; + else + directed_route = 0; packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), - 0, rmpp_active, + 0, rmpp_active, directed_route, hdr_len, length - hdr_len, GFP_KERNEL); if (IS_ERR(packet->msg)) { Index: core/agent.c =================================================================== --- core/agent.c (revision 4978) +++ core/agent.c (working copy) @@ -103,6 +103,7 @@ struct ib_mad_send_buf *send_buf; struct ib_ah *ah; int ret; + int directed_route; port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { @@ -118,7 +119,20 @@ return ret; } + /* + * Directed route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * See section 14.2.2, Vol 1 IB spec. + */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + struct ib_smp *smp = (struct ib_smp *) mad; + + directed_route = (ib_get_smp_direction(smp) ? + smp->dr_dlid : smp->dr_slid) == IB_LID_PERMISSIVE; + } else + directed_route = 0; send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, 0, + directed_route, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_KERNEL); if (IS_ERR(send_buf)) { Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 4978) +++ core/sa_query.c (working copy) @@ -574,7 +574,7 @@ return -ENOMEM; query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, + 0, 0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, gfp_mask); if (!query->sa_query.mad_buf) { ret = -ENOMEM; @@ -695,7 +695,7 @@ return -ENOMEM; query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, + 0, 0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, gfp_mask); if (!query->sa_query.mad_buf) { ret = -ENOMEM; @@ -787,7 +787,7 @@ return -ENOMEM; query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, + 0, 0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, gfp_mask); if 
(!query->sa_query.mad_buf) { ret = -ENOMEM; Index: core/ping.c =================================================================== --- core/ping.c (revision 4978) +++ core/ping.c (working copy) @@ -127,7 +127,7 @@ } msg = ib_create_send_mad(mad_agent, mad_recv_wc->wc->src_qp, - mad_recv_wc->wc->pkey_index, 0, + mad_recv_wc->wc->pkey_index, 0, 0, IB_MGMT_VENDOR_HDR, mad_recv_wc->mad_len - IB_MGMT_VENDOR_HDR, GFP_KERNEL); Index: core/mad_rmpp.c =================================================================== --- core/mad_rmpp.c (revision 4978) +++ core/mad_rmpp.c (working copy) @@ -138,8 +138,9 @@ int ret; msg = ib_create_send_mad(&rmpp_recv->agent->agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, 1, IB_MGMT_RMPP_HDR, - IB_MGMT_RMPP_DATA, GFP_KERNEL); + recv_wc->wc->pkey_index, 1, 0, + IB_MGMT_RMPP_HDR, IB_MGMT_RMPP_DATA, + GFP_KERNEL); if (!msg) return; @@ -163,7 +164,7 @@ return (void *) ah; msg = ib_create_send_mad(agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, 1, + recv_wc->wc->pkey_index, 1, 0, IB_MGMT_RMPP_HDR, IB_MGMT_RMPP_DATA, GFP_KERNEL); if (IS_ERR(msg)) Index: core/mad.c =================================================================== --- core/mad.c (revision 4978) +++ core/mad.c (working copy) @@ -665,7 +665,8 @@ struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + if (mad_send_wr->directed_route && + !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; @@ -699,8 +700,7 @@ ret = device->process_mad(device, 0, port_num, &mad_wc, NULL, (struct ib_mad *)smp, (struct ib_mad *)&mad_priv->mad); - switch (ret) - { + switch (ret) { case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: if (response_mad(&mad_priv->mad.mad) && mad_agent_priv->agent.recv_handler) { @@ -773,7 +773,7 @@ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 
pkey_index, - int rmpp_active, + int rmpp_active, int directed_route, int hdr_len, int data_len, gfp_t gfp_mask) { @@ -818,6 +818,8 @@ mad_send_wr->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; + mad_send_wr->directed_route = directed_route; + if (rmpp_active) { struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 4978) +++ core/mad_priv.h (working copy) @@ -124,10 +124,11 @@ struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; __be64 tid; unsigned long timeout; - int retries; - int retry; - int refcount; + unsigned int retries; + unsigned int retry; + unsigned int refcount; enum ib_wc_status status; + int directed_route; /* RMPP control */ int last_ack; Index: core/cm.c =================================================================== --- core/cm.c (revision 4978) +++ core/cm.c (working copy) @@ -177,7 +177,7 @@ m = ib_create_send_mad(mad_agent, cm_id_priv->id.remote_cm_qpn, cm_id_priv->av.pkey_index, - 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, + 0, 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_ATOMIC); if (IS_ERR(m)) { ib_destroy_ah(ah); @@ -207,7 +207,7 @@ return PTR_ERR(ah); m = ib_create_send_mad(port->mad_agent, 1, mad_recv_wc->wc->pkey_index, - 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, + 0, 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_ATOMIC); if (IS_ERR(m)) { ib_destroy_ah(ah); Index: hw/mthca/mthca_mad.c =================================================================== --- hw/mthca/mthca_mad.c (revision 4978) +++ hw/mthca/mthca_mad.c (working copy) @@ -117,7 +117,8 @@ unsigned long flags; if (agent) { - send_buf = ib_create_send_mad(agent, qpn, 0, 0, IB_MGMT_MAD_HDR, + send_buf = ib_create_send_mad(agent, qpn, 0, 0, 0, + IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_ATOMIC); /* * We rely here on the fact that MLX QPs don't use the -- Ralph Campbell 
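The test that the patch above adds in core/user_mad.c and core/agent.c can be tried outside the kernel. What follows is only a minimal sketch under stated assumptions: struct demo_smp, DEMO_LID_PERMISSIVE and demo_needs_dr_send() are hypothetical stand-ins invented for illustration, not the kernel's struct ib_smp, IB_LID_PERMISSIVE or any existing function, and byte order is ignored because 0xFFFF reads the same either way. It shows the selection the patch describes: for a response the DrDLID is examined, for a request the DrSLID, and directed route handling on send applies only when that LID is the permissive LID.

```c
#include <stdint.h>

/* Hypothetical stand-in for IB_LID_PERMISSIVE (assumption, not the kernel macro). */
#define DEMO_LID_PERMISSIVE 0xFFFFu

/* Hypothetical, simplified view of the SMP fields the test looks at. */
struct demo_smp {
	int      is_response; /* direction bit: 0 = request, 1 = response */
	uint16_t dr_slid;     /* directed route source LID */
	uint16_t dr_dlid;     /* directed route destination LID */
};

/*
 * Return 1 when the send path should run the directed route hop
 * processing, 0 when the packet is still in a LID routed part.
 * A response is checked against DrDLID, a request against DrSLID.
 */
static int demo_needs_dr_send(const struct demo_smp *smp)
{
	uint16_t lid = smp->is_response ? smp->dr_dlid : smp->dr_slid;

	return lid == DEMO_LID_PERMISSIVE;
}
```

With the values from the example earlier in the thread (a response carrying DrSLID=0xFFFF and DrDLID=0x0009) the sketch returns 0, which corresponds to skipping the hop_ptr adjustment that was reported as incorrect.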
From ralphc at pathscale.com Thu Jan 12 20:35:11 2006 From: ralphc at pathscale.com (ralphc at pathscale.com) Date: Thu, 12 Jan 2006 20:35:11 -0800 (PST) Subject: [openib-general] P.S. my last patch In-Reply-To: <1136840865.4339.3543.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B496@mtlexch01.mtl.com> <1136840865.4339.3543.camel@hal.voltaire.com> Message-ID: <39576.71.131.58.111.1137126911.squirrel@71.131.58.111> Please note that the patch I sent out is more for comment. I would like to test it and think about it some more before committing to it. I may have a simpler solution but I have to think about it first. Ralph Campbell From halr at voltaire.com Fri Jan 13 04:05:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jan 2006 07:05:50 -0500 Subject: [openib-general] [PATCH] Problem with directed route SMPs with beginning or ending LID routed parts In-Reply-To: <1137120280.4520.214.camel@brick.internal.keyresearch.com> References: <1137120280.4520.214.camel@brick.internal.keyresearch.com> Message-ID: <1137153884.4322.6630.camel@hal.voltaire.com> Hi Ralph, On Thu, 2006-01-12 at 21:44, Ralph Campbell wrote: > I have only done basic testing (i.e., the link comes up and IPoIB still works) > but I think this fixes the problem. This looks good to me with the one possible improvement below. Has the SM been tested with this too? [snip...] > Index: hw/mthca/mthca_mad.c > =================================================================== > --- hw/mthca/mthca_mad.c (revision 4978) > +++ hw/mthca/mthca_mad.c (working copy) > @@ -117,7 +117,8 @@ > unsigned long flags; > > if (agent) { > - send_buf = ib_create_send_mad(agent, qpn, 0, 0, IB_MGMT_MAD_HDR, > + send_buf = ib_create_send_mad(agent, qpn, 0, 0, 0, > + IB_MGMT_MAD_HDR, > IB_MGMT_MAD_DATA, GFP_ATOMIC); > /* > * We rely here on the fact that MLX QPs don't use the Traps currently are LID routed so this works. But directed route traps are supported so similar code could be added here. 
-- Hal From iod00d at hp.com Fri Jan 13 09:46:55 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 13 Jan 2006 09:46:55 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060112193500.GC9256@mellanox.co.il> References: <20060112175922.GF3106@esmail.cup.hp.com> <20060112193500.GC9256@mellanox.co.il> Message-ID: <20060113174655.GA7930@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 09:35:00PM +0200, Michael S. Tsirkin wrote: > > Looks like the error messages and sdp refcnt might be related. > > (IIRC, 4 error msgs and SDP ref cnt is 3) > > Since netperf is terminated by a timer signal, it's possible > > traffic is still outstanding when it exits. Could that be a > > cause of the "ERR: IOCB <-1> cancel" error messages? ... > Yes. > By the way, this is with zcopy set, isnt it? I think so. I'm assuming that's the default now. > Could you try testing with zcopy off? certainly. grant From mshefty at ichips.intel.com Fri Jan 13 09:51:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 13 Jan 2006 09:51:51 -0800 Subject: [openib-general] [PATCH] Problem with directed route SMPs with beginning or ending LID routed parts In-Reply-To: <1137120280.4520.214.camel@brick.internal.keyresearch.com> References: <1137120280.4520.214.camel@brick.internal.keyresearch.com> Message-ID: <43C7E8B7.1020109@ichips.intel.com> Ralph Campbell wrote: > struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, > u32 remote_qpn, u16 pkey_index, > - int rmpp_active, > + int rmpp_active, int directed_route, > int hdr_len, int data_len, > gfp_t gfp_mask); Is there a way to do this that doesn't involve changing the ib_create_send_mad interface? 
- Sean From ralphc at pathscale.com Fri Jan 13 10:31:55 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 13 Jan 2006 10:31:55 -0800 Subject: [openib-general] [PATCH] Problem with directed route SMPs with beginning or ending LID routed parts In-Reply-To: <43C7E8B7.1020109@ichips.intel.com> References: <1137120280.4520.214.camel@brick.internal.keyresearch.com> <43C7E8B7.1020109@ichips.intel.com> Message-ID: <1137177115.4520.220.camel@brick.internal.keyresearch.com> Sure. The alternative would be to make ib_create_send_mad() initialize directed_route to zero and add an access function to set directed_route when needed. The access function is needed since directed_route is in the private part of ib_mad_send_wr_private and not accessible from ib_mad_send_buf. On Fri, 2006-01-13 at 09:51 -0800, Sean Hefty wrote: > Ralph Campbell wrote: > > struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, > > u32 remote_qpn, u16 pkey_index, > > - int rmpp_active, > > + int rmpp_active, int directed_route, > > int hdr_len, int data_len, > > gfp_t gfp_mask); > > Is there a way to do this that doesn't involve changing the ib_create_send_mad > interface? > > - Sean -- Ralph Campbell From ralphc at pathscale.com Fri Jan 13 10:32:54 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 13 Jan 2006 10:32:54 -0800 Subject: [openib-general] Another MAD processing question Message-ID: <1137177174.4520.222.camel@brick.internal.keyresearch.com> After looking at the MAD handling code some more, I am puzzled by the following. smi_check_local_dr_smp() is called only from two places in core/mad.c It returns 0 or 1. In smi_check_local_dr_smp(), it checks for a directed route SMP but this function is only called when the SMP is a directed route so this is a NOP. The following patch could be applied. 
Index: agent.c =================================================================== --- agent.c (revision 4978) +++ agent.c (working copy) @@ -81,9 +81,6 @@ { struct ib_agent_port_private *port_priv; - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) - return 1; - port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " The call to ib_get_agent_port() should not be able to fail when smi_check_local_dr_smp() is called from ib_mad_recv_done_handler(). When it is called from handle_outgoing_dr_smp(), the device and port_num come from mad_agent_priv so I assume the call to ib_get_agent_port() shouldn't fail either. In either case, smi_check_local_smp() only uses the mad_agent pointer to check that mad_agent->device->process_mad is not NULL. The device pointer would have to be the same as the one passed to smi_check_local_dr_smp() since that pointer is used later instead of the one checked in smi_check_local_smp(). 
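The condition that smi_check_local_smp() evaluates can be looked at in isolation. This is a minimal sketch, assuming simplified stand-in types: struct demo_dr_smp and demo_is_local() are hypothetical names made up for illustration and are not kernel definitions, and has_process_mad stands in for the "process_mad is not NULL" check on the device. It shows that only three things matter to the decision: a process_mad handler exists, the SMP is an outgoing request (direction bit clear), and the hop pointer is one past the hop count, i.e. the end of the directed route segment.

```c
#include <stdint.h>

/* Hypothetical, simplified view of the directed route SMP fields. */
struct demo_dr_smp {
	int     is_response; /* direction bit: 0 = request, 1 = response */
	uint8_t hop_ptr;     /* current position in the directed route */
	uint8_t hop_cnt;     /* number of hops in the directed route */
};

/*
 * Return 1 if the SMP should be handed to the local SMA/SM:
 * a handler exists, the SMP is a request, and
 * hop_ptr == hop_cnt + 1 (end of the directed route segment).
 */
static int demo_is_local(const struct demo_dr_smp *smp, int has_process_mad)
{
	return has_process_mad &&
	       !smp->is_response &&
	       smp->hop_ptr == smp->hop_cnt + 1;
}
```

Nothing in this check depends on the port number or on an agent lookup, which is the observation the paragraph above makes about the device pointer.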
All of this leads me to think that the following patch should work: Index: smi.h =================================================================== --- smi.h (revision 4978) +++ smi.h (working copy) @@ -49,19 +49,16 @@ extern int smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); -extern int smi_check_local_dr_smp(struct ib_smp *smp, - struct ib_device *device, - int port_num); /* * Return 1 if the SMP should be handled by the local SMA/SM via process_mad */ -static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, - struct ib_smp *smp) +static inline int smi_check_local_smp(struct ib_smp *smp, + struct ib_device *device) { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ - return ((mad_agent->device->process_mad && + return ((device->process_mad && !ib_get_smp_direction(smp) && (smp->hop_ptr == smp->hop_cnt + 1))); } Index: agent.c =================================================================== --- agent.c (revision 4978) +++ agent.c (working copy) @@ -75,25 +75,6 @@ return entry; } -int smi_check_local_dr_smp(struct ib_smp *smp, - struct ib_device *device, - int port_num) -{ - struct ib_agent_port_private *port_priv; - - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) - return 1; - - port_priv = ib_get_agent_port(device, port_num); - if (!port_priv) { - printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " - "not open\n", device->name, port_num); - return 1; - } - - return smi_check_local_smp(port_priv->agent[0], smp); -} - int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, struct ib_wc *wc, struct ib_device *device, int port_num, int qpn) Index: mad.c =================================================================== --- mad.c (revision 4978) +++ mad.c (working copy) @@ -671,8 +671,8 @@ goto out; } /* Check to post send on QP or process locally */ - ret = smi_check_local_dr_smp(smp, device, port_num); - if (!ret || 
!device->process_mad) + ret = smi_check_local_smp(smp, device); + if (!ret) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -1669,9 +1669,7 @@ port_priv->device->node_type, port_priv->port_num)) goto out; - if (!smi_check_local_dr_smp(&recv->mad.smp, - port_priv->device, - port_priv->port_num)) + if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) goto out; } -- Ralph Campbell From iod00d at hp.com Fri Jan 13 13:50:07 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 13 Jan 2006 13:50:07 -0800 Subject: [openib-general] Re: ib_sdp ERR: IOCB dmesg output In-Reply-To: <20060112193500.GC9256@mellanox.co.il> References: <20060112175922.GF3106@esmail.cup.hp.com> <20060112193500.GC9256@mellanox.co.il> Message-ID: <20060113215007.GA8595@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 09:35:00PM +0200, Michael S. Tsirkin wrote: > By the way, this is with zcopy set, isnt it? I confirmed it *was* set. I've disabled it for this round. BTW, it's not the default (I asserted it was in previous reply). > Could you try testing with zcopy off? With SDP_SEND_ZCOPY=n on the netperf side only (gsyprf3), I no longer see the "ib_sdp ERR: IOCB <-1> cancel" from the client. The "netserver" (iota) still has ZCOPY=y and reproduced this error. I've disabled ZCOPY on iota as well and restarted testing. I'll start a new thread if I see the same error again. 
thanks, grant
From rdreier at cisco.com Fri Jan 13 14:57:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 13 Jan 2006 14:57:03 -0800 Subject: [openib-general] [git pull] IB changes for 2.6.16-rc1 Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Eli Cohen: IPoIB: Fix error path in ipoib_mcast_dev_flush() IPoIB: Fix address handle refcounting for multicast groups 
IPoIB: Fix memory leak of multicast group structures Ingo Molnar: IB: convert from semaphores to mutexes Ishai Rabinovitz: IB/mthca: Fix memory leak of multicast group structures Jack Morgenstein: IB/mthca: Fix memory leaks in error handling Michael S. Tsirkin: IB/mthca: fix page shift calculation in mthca_reg_phys_mr() IB/mthca: prevent event queue overrun IPoIB: Take dev->xmit_lock around mc_list accesses IB/mthca: Cosmetic: use the ALIGN macro IB/mthca: Initialize grh_present before using it Roland Dreier: IB/mthca: kzalloc conversions IB/mthca: Factor common MAD initialization code Sean Hefty: IB: Add node_guid to struct ib_device drivers/infiniband/core/cm.c | 29 +---- drivers/infiniband/core/device.c | 23 ++-- drivers/infiniband/core/sysfs.c | 22 +-- drivers/infiniband/core/ucm.c | 23 ++-- drivers/infiniband/core/uverbs.h | 5 - drivers/infiniband/core/uverbs_cmd.c | 152 ++++++++++++------------ drivers/infiniband/core/uverbs_main.c | 8 + drivers/infiniband/hw/mthca/mthca_av.c | 10 +- drivers/infiniband/hw/mthca/mthca_cmd.c | 7 + drivers/infiniband/hw/mthca/mthca_dev.h | 1 drivers/infiniband/hw/mthca/mthca_eq.c | 28 ++-- drivers/infiniband/hw/mthca/mthca_provider.c | 132 ++++++++++++--------- drivers/infiniband/hw/mthca/mthca_qp.c | 2 drivers/infiniband/ulp/ipoib/ipoib.h | 6 - drivers/infiniband/ulp/ipoib/ipoib_ib.c | 31 ++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 105 +++++------------ drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 8 + drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 10 +- drivers/infiniband/ulp/srp/ib_srp.c | 23 +--- include/rdma/ib_verbs.h | 2 21 files changed, 287 insertions(+), 352 deletions(-) From halr at voltaire.com Sat Jan 14 09:38:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jan 2006 12:38:07 -0500 Subject: [openib-general] Another MAD processing question In-Reply-To: <1137177174.4520.222.camel@brick.internal.keyresearch.com> References: 
<1137177174.4520.222.camel@brick.internal.keyresearch.com> Message-ID: <1137260184.4322.10637.camel@hal.voltaire.com> On Fri, 2006-01-13 at 13:32, Ralph Campbell wrote: > After looking at the MAD handling code some more, I am puzzled by > the following. I'll respond in 2 parts to this. > smi_check_local_dr_smp() is called only from two places in core/mad.c > It returns 0 or 1. In smi_check_local_dr_smp(), it checks for > a directed route SMP but this function is only called when the SMP > is a directed route so this is a NOP. The following patch could be > applied. > > Index: agent.c > =================================================================== > --- agent.c (revision 4978) > +++ agent.c (working copy) > @@ -81,9 +81,6 @@ > { > struct ib_agent_port_private *port_priv; > > - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > - return 1; > - > port_priv = ib_get_agent_port(device, port_num); > if (!port_priv) { > printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " Yes, that check is redundant. Thanks. Applied. -- Hal > The call to ib_get_agent_port() shouldn't be possible to fail when > smi_check_local_dr_smp() is called from ib_mad_recv_done_handler(). > When it is called from handle_outgoing_dr_smp(), the device and port_num > come from mad_agent_priv so I assume the call to ib_get_agent_port() > shouldn't fail either. In either case, smi_check_local_smp() > only uses the mad_agent pointer to check that mad_agent->device->process_mad > is not NULL. The device pointer would have to be the same as the > one passed to smi_check_local_dr_smp() since that pointer is used later > instead of the one checked in smi_check_local_smp(). 
> > All of this leads me to think that the following patch should work: > > Index: smi.h > =================================================================== > --- smi.h (revision 4978) > +++ smi.h (working copy) > @@ -49,19 +49,16 @@ > extern int smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, > int port_num); > -extern int smi_check_local_dr_smp(struct ib_smp *smp, > - struct ib_device *device, > - int port_num); > > /* > * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > */ > -static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, > - struct ib_smp *smp) > +static inline int smi_check_local_smp(struct ib_smp *smp, > + struct ib_device *device) > { > /* C14-9:3 -- We're at the end of the DR segment of path */ > /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ > - return ((mad_agent->device->process_mad && > + return ((device->process_mad && > !ib_get_smp_direction(smp) && > (smp->hop_ptr == smp->hop_cnt + 1))); > } > Index: agent.c > =================================================================== > --- agent.c (revision 4978) > +++ agent.c (working copy) > @@ -75,25 +75,6 @@ > return entry; > } > > -int smi_check_local_dr_smp(struct ib_smp *smp, > - struct ib_device *device, > - int port_num) > -{ > - struct ib_agent_port_private *port_priv; > - > - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > - return 1; > - > - port_priv = ib_get_agent_port(device, port_num); > - if (!port_priv) { > - printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " > - "not open\n", device->name, port_num); > - return 1; > - } > - > - return smi_check_local_smp(port_priv->agent[0], smp); > -} > - > int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, > struct ib_wc *wc, struct ib_device *device, > int port_num, int qpn) > Index: mad.c > =================================================================== > --- mad.c (revision 4978) > +++ mad.c (working copy) > @@ -671,8 +671,8 @@ > goto 
out; > } > /* Check to post send on QP or process locally */ > - ret = smi_check_local_dr_smp(smp, device, port_num); > - if (!ret || !device->process_mad) > + ret = smi_check_local_smp(smp, device); > + if (!ret) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -1669,9 +1669,7 @@ > port_priv->device->node_type, > port_priv->port_num)) > goto out; > - if (!smi_check_local_dr_smp(&recv->mad.smp, > - port_priv->device, > - port_priv->port_num)) > + if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) > goto out; > } > From halr at voltaire.com Sat Jan 14 10:03:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jan 2006 13:03:46 -0500 Subject: [openib-general] Another MAD processing question In-Reply-To: <1137260184.4322.10637.camel@hal.voltaire.com> References: <1137177174.4520.222.camel@brick.internal.keyresearch.com> <1137260184.4322.10637.camel@hal.voltaire.com> Message-ID: <1137261825.4322.10688.camel@hal.voltaire.com> On Sat, 2006-01-14 at 12:38, Hal Rosenstock wrote: > On Fri, 2006-01-13 at 13:32, Ralph Campbell wrote: > > After looking at the MAD handling code some more, I am puzzled by > > the following. > > I'll respond in 2 parts to this. > > > smi_check_local_dr_smp() is called only from two places in core/mad.c > > It returns 0 or 1. In smi_check_local_dr_smp(), it checks for > > a directed route SMP but this function is only called when the SMP > > is a directed route so this is a NOP. The following patch could be > > applied. > > > > Index: agent.c > > =================================================================== > > --- agent.c (revision 4978) > > +++ agent.c (working copy) > > @@ -81,9 +81,6 @@ > > { > > struct ib_agent_port_private *port_priv; > > > > - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > > - return 1; > > - > > port_priv = ib_get_agent_port(device, port_num); > > if (!port_priv) { > > printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " > > > Yes, that check is redundant. 
> > Thanks. Applied. > > -- Hal > > > The call to ib_get_agent_port() shouldn't be possible to fail when > > smi_check_local_dr_smp() is called from ib_mad_recv_done_handler(). > > When it is called from handle_outgoing_dr_smp(), the device and port_num > > come from mad_agent_priv so I assume the call to ib_get_agent_port() > > shouldn't fail either. In either case, smi_check_local_smp() > > only uses the mad_agent pointer to check that mad_agent->device->process_mad > > is not NULL. The device pointer would have to be the same as the > > one passed to smi_check_local_dr_smp() since that pointer is used later > > instead of the one checked in smi_check_local_smp(). > > > > All of this leads me to think that the following patch should work: This is correct as well. Thanks! Applied. -- Hal > > > > Index: smi.h > > =================================================================== > > --- smi.h (revision 4978) > > +++ smi.h (working copy) > > @@ -49,19 +49,16 @@ > > extern int smi_handle_dr_smp_send(struct ib_smp *smp, > > u8 node_type, > > int port_num); > > -extern int smi_check_local_dr_smp(struct ib_smp *smp, > > - struct ib_device *device, > > - int port_num); > > > > /* > > * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > > */ > > -static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, > > - struct ib_smp *smp) > > +static inline int smi_check_local_smp(struct ib_smp *smp, > > + struct ib_device *device) > > { > > /* C14-9:3 -- We're at the end of the DR segment of path */ > > /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ > > - return ((mad_agent->device->process_mad && > > + return ((device->process_mad && > > !ib_get_smp_direction(smp) && > > (smp->hop_ptr == smp->hop_cnt + 1))); > > } > > Index: agent.c > > =================================================================== > > --- agent.c (revision 4978) > > +++ agent.c (working copy) > > @@ -75,25 +75,6 @@ > > return entry; > > } > > > > -int 
smi_check_local_dr_smp(struct ib_smp *smp, > > - struct ib_device *device, > > - int port_num) > > -{ > > - struct ib_agent_port_private *port_priv; > > - > > - if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > > - return 1; > > - > > - port_priv = ib_get_agent_port(device, port_num); > > - if (!port_priv) { > > - printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " > > - "not open\n", device->name, port_num); > > - return 1; > > - } > > - > > - return smi_check_local_smp(port_priv->agent[0], smp); > > -} > > - > > int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, > > struct ib_wc *wc, struct ib_device *device, > > int port_num, int qpn) > > Index: mad.c > > =================================================================== > > --- mad.c (revision 4978) > > +++ mad.c (working copy) > > @@ -671,8 +671,8 @@ > > goto out; > > } > > /* Check to post send on QP or process locally */ > > - ret = smi_check_local_dr_smp(smp, device, port_num); > > - if (!ret || !device->process_mad) > > + ret = smi_check_local_smp(smp, device); > > + if (!ret) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -1669,9 +1669,7 @@ > > port_priv->device->node_type, > > port_priv->port_num)) > > goto out; > > - if (!smi_check_local_dr_smp(&recv->mad.smp, > > - port_priv->device, > > - port_priv->port_num)) > > + if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) > > goto out; > > } > >
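
As an illustration of the simplified check discussed in the MAD processing thread, here is a minimal user-space sketch of the predicate. The types and functions prefixed with example_ are hypothetical stand-ins, not the kernel's definitions; only the three conditions (a process_mad handler exists, the direction bit is clear, and hop_ptr == hop_cnt + 1) are taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct ib_smp used by the check. */
struct example_smp {
	unsigned char hop_ptr;	/* current position in the directed route */
	unsigned char hop_cnt;	/* number of hops in the directed route */
	int direction;		/* stand-in for ib_get_smp_direction(): nonzero on the return path */
};

/* Hypothetical stand-in for struct ib_device: only the handler pointer matters here. */
struct example_device {
	int (*process_mad)(void);
};

/* Dummy handler so that a device can advertise local MAD processing. */
static int example_process_mad(void)
{
	return 0;
}

/*
 * Return 1 if the SMP should be handed to the local SMA/SM:
 * the device has a process_mad handler, the direction bit is clear,
 * and Hop Pointer == Hop Count + 1 (end of the directed-route segment).
 */
static int example_check_local_smp(const struct example_smp *smp,
				   const struct example_device *device)
{
	return device->process_mad != NULL &&
	       !smp->direction &&
	       smp->hop_ptr == smp->hop_cnt + 1;
}
```

Because the device pointer is passed in directly, no agent port lookup is needed in this sketch, which is the point of the patch: the check only ever needed the device's process_mad pointer.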

From ogerlitz at voltaire.com Sat Jan 14 22:41:30 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 15 Jan 2006 08:41:30 +0200 (IST)
Subject: [openib-general] [PATCH] iser: double kmem_cache destory bugfix
Message-ID:

Avoid a buggy kmem_cache_destroy on repeated connect/disconnect cycles,
and refine the predicate used to print an error related to the processing
of an errored WC.

Signed-off-by: Or Gerlitz

Index: iser_verbs.c
===================================================================
--- iser_verbs.c	(revision 4984)
+++ iser_verbs.c	(revision 4985)
@@ -670,7 +670,8 @@ static void iser_handle_comp_error(enum
 		iser_dbg("Conn. 0x%p is being terminated asynchronously\n",
 			 p_iser_conn);
 	}
 	/* Handle completion Error */
-	if (iser_dto_completion_error(p_dto))
+	ret_val = iser_dto_completion_error(p_dto);
+	if (ret_val && ret_val != -EAGAIN)
 		iser_err("Failed to handle ERROR DTO completion\n");
 }
Index: iser_conn.c
===================================================================
--- iser_conn.c	(revision 4984)
+++ iser_conn.c	(revision 4985)
@@ -381,6 +381,7 @@ void iser_conn_release(struct iser_conn
 		if(kmem_cache_destroy(p_iscsi_conn->postrecv_cache) != 0)
 			iser_err("postrecv cache %s not empty, leak!\n",
 				 p_iscsi_conn->postrecv_cn);
+		p_iscsi_conn->ff_mode_enabled = 0;
 	}
 	/* release socket with conn descriptor */
 	sock_release(iser_conn_to_sock(p_iser_conn));

From ogerlitz at voltaire.com Sun Jan 15 04:14:07 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 15 Jan 2006 14:14:07 +0200 (IST)
Subject: [openib-general] [PATCH] iser: align fmr pool params to scsi midlayer host template
Message-ID:

This was committed as r5002.

Correlate the FMR pool params with the template posted to the SCSI
midlayer and have an FMR pool per IB connection.
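
To make the sizing in this patch concrete: the new ISCSI_ISER_SG_TABLESIZE is defined as (0x80000 >> PAGE_SHIFT), that is, the number of pages needed to cover 512KB, and the patch uses the same value for max_pages_per_fmr. The sketch below works through that arithmetic in plain C. The example_ names are illustrative and not part of the patch, and the worked values in the note after it assume 4KB pages (a PAGE_SHIFT of 12).

```c
/* 512KB, the largest single RDMA the patch intends to support. */
#define EXAMPLE_MAX_RDMA_BYTES 0x80000u

/*
 * Pages needed to map one maximum-sized RDMA for a given page shift;
 * mirrors the (0x80000 >> PAGE_SHIFT) expression in the patch.
 */
static unsigned int example_sg_tablesize(unsigned int page_shift)
{
	return EXAMPLE_MAX_RDMA_BYTES >> page_shift;
}

/*
 * Number of 512-byte units covered by a full scatter/gather table;
 * mirrors ISCSI_ISER_SG_TABLESIZE * PAGE_SIZE / 512 from the patch.
 */
static unsigned int example_dtask_default_max(unsigned int page_shift)
{
	return example_sg_tablesize(page_shift) * (1u << page_shift) / 512u;
}
```

Under the 4KB page assumption, example_sg_tablesize(12) gives 128 pages per FMR and example_dtask_default_max(12) gives 1024. The second value does not depend on the page size, since it is always 512KB divided by 512 bytes.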
Signed-off-by: Or Gerlitz Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 4984) +++ ulp/iser/iscsi_iser.h (working copy) @@ -30,8 +30,9 @@ #define ISCSI_ISER_XMIT_CMDS_MAX 128 /* must be power of 2 */ #define ISCSI_ISER_MGMT_CMDS_MAX 32 /* must be power of 2 */ -#define ISER_SG_TABLESIZE SG_ALL -#define ISER_CMD_PER_LUN 128 + /* support upto 512KB in one RDMA */ +#define ISCSI_ISER_SG_TABLESIZE (0x80000 >> PAGE_SHIFT) +#define ISCSI_ISER_CMD_PER_LUN ISCSI_ISER_XMIT_CMDS_MAX #define ISCSI_ISER_MAX_LUN 256 #define ISCSI_ISER_MAX_CMD_LEN 16 @@ -148,6 +149,7 @@ struct iser_conn atomic_t state; /* rdma connection state */ struct rdma_cm_id *cma_id; struct ib_qp *qp; + struct ib_fmr_pool *fmr_pool; struct iser_adaptor *p_adaptor; /* adaptor context */ struct list_head adaptor_list; /* entry in the adaptor's conns list */ @@ -198,10 +200,8 @@ struct iscsi_iser_conn int id; /* iSCSI CID */ spinlock_t lock; /* MERGE_FIXME: can it be removed */ - int max_xmit_dlength; /* FIXME change it to be target_max_recv_dsl */ - int initiator_max_recv_dsl; - int target_max_recv_dsl; - + int max_recv_dlength; /* == initiator_max_recv_dsl */ + int max_xmit_dlength; /* == target_max_recv_dsl */ unsigned int max_outstand_cmds; /* MERGE_FIXME need2 review */ /* abort */ @@ -273,7 +273,7 @@ struct iscsi_iser_data_task { struct iscsi_data hdr; /* PDU */ struct list_head item; /* data queue item */ }; -#define ISCSI_DTASK_DEFAULT_MAX ISER_SG_TABLESIZE * PAGE_SIZE / 512 +#define ISCSI_DTASK_DEFAULT_MAX ISCSI_ISER_SG_TABLESIZE * PAGE_SIZE / 512 struct iscsi_iser_session { Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 4985) +++ ulp/iser/iser_verbs.c (working copy) @@ -71,7 +71,6 @@ static void iser_qp_event_callback(struc int iser_create_adaptor_ib_res(struct iser_adaptor *p_iser_adaptor) { struct ib_device *device = 
p_iser_adaptor->device; - struct ib_fmr_pool_param params; strcpy(p_iser_adaptor->name, device->name); iser_dbg("setting device name %s as adatptor name\n", device->name); @@ -100,25 +99,8 @@ int iser_create_adaptor_ib_res(struct is if (IS_ERR(p_iser_adaptor->mr)) goto dma_mr_err; - params.max_pages_per_fmr = ISER_MAX_CMD_SIZE >> PAGE_SHIFT; - params.pool_size = ISER_MAX_OUTSTAND_CMDS * ISER_MAX_CONN; - params.dirty_watermark = 32; - params.cache = 0; - params.flush_function = NULL; - params.access = (IB_ACCESS_LOCAL_WRITE | - IB_ACCESS_REMOTE_WRITE | - IB_ACCESS_REMOTE_READ); - - p_iser_adaptor->fmr_pool = ib_create_fmr_pool(p_iser_adaptor->pd, ¶ms); - if (IS_ERR(p_iser_adaptor->fmr_pool)) { - iser_err("failed to create FMR pool\n"); - goto fmr_pool_err; - } - return 0; -fmr_pool_err: - ib_dereg_mr(p_iser_adaptor->mr); dma_mr_err: tasklet_kill(&p_iser_adaptor->cq_tasklet); cq_arm_err: @@ -138,41 +120,51 @@ pd_err: */ int iser_free_adaptor_ib_res(struct iser_adaptor *p_iser_adaptor) { - /* do we need to deallocate any resource ? 
*/ - if (p_iser_adaptor->fmr_pool == NULL) - return 0; + BUG_ON(p_iser_adaptor->mr == NULL); tasklet_kill(&p_iser_adaptor->cq_tasklet); - (void)ib_destroy_fmr_pool(p_iser_adaptor->fmr_pool); (void)ib_dereg_mr(p_iser_adaptor->mr); (void)ib_destroy_cq(p_iser_adaptor->cq); (void)ib_dealloc_pd(p_iser_adaptor->pd); - p_iser_adaptor->fmr_pool = NULL; - p_iser_adaptor->mr = NULL; - p_iser_adaptor->cq = NULL; - p_iser_adaptor->pd = NULL; + p_iser_adaptor->mr = NULL; + p_iser_adaptor->cq = NULL; + p_iser_adaptor->pd = NULL; return 0; } /** - * iser_create_qp - Creates a Queue-Pair (QP) + * iser_create_ib_conn_res - Creates FMR pool and Queue-Pair (QP) * * returns 0 on success, -1 on failure */ -int iser_create_qp(struct iser_conn *p_iser_conn) +int iser_create_ib_conn_res(struct iser_conn *p_iser_conn) { struct iser_adaptor *p_iser_adaptor; struct ib_qp_init_attr init_attr; int ret; + struct ib_fmr_pool_param params; + + BUG_ON(p_iser_conn->p_adaptor == NULL); - if (p_iser_conn->p_adaptor == NULL) { - iser_err("NULL adaptor in conn, p_conn: 0x%p\n", - p_iser_conn); - return -1; - } p_iser_adaptor = p_iser_conn->p_adaptor; + + params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE; + params.pool_size = ISCSI_ISER_XMIT_CMDS_MAX; + params.dirty_watermark = 32; + params.cache = 0; + params.flush_function = NULL; + params.access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); + + p_iser_conn->fmr_pool = ib_create_fmr_pool(p_iser_adaptor->pd, ¶ms); + if (IS_ERR(p_iser_conn->fmr_pool)) { + ret = PTR_ERR(p_iser_conn->fmr_pool); + goto fmr_pool_err; + } + memset(&init_attr, 0, sizeof init_attr); init_attr.event_handler = iser_qp_event_callback; @@ -187,35 +179,42 @@ int iser_create_qp(struct iser_conn *p_i init_attr.qp_type = IB_QPT_RC; ret = rdma_create_qp(p_iser_conn->cma_id, p_iser_adaptor->pd, &init_attr); - if (ret) { - iser_err("unable to create qp: %d\n", ret); - return -1; - } + if (ret) + goto qp_err; + p_iser_conn->qp = p_iser_conn->cma_id->qp; 
- iser_err("setting conn %p qp cma_id %p qp %p\n", - p_iser_conn,p_iser_conn->cma_id, p_iser_conn->cma_id->qp); + iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n", + p_iser_conn, p_iser_conn->cma_id, + p_iser_conn->fmr_pool, p_iser_conn->cma_id->qp); + return ret; + +qp_err: + (void)ib_destroy_fmr_pool(p_iser_conn->fmr_pool); +fmr_pool_err: + iser_err("unable to create fmr pool or qp for ib_conn: %d\n", ret); return ret; } /** - * iser_free_qp_and_id - Releases the QP and CMA ID objects + * iser_free_ib_conn_res - Releases the FMR pool, QP and CMA ID objects * * Also starts the conn termination. May be called more than * once, actually initiates releasing QP/ID only for the first call. * * returns 0 on success, -1 on failure */ -int iser_free_qp_and_id(struct iser_conn *p_iser_conn) +int iser_free_ib_conn_res(struct iser_conn *p_iser_conn) { - if (p_iser_conn == NULL) { - iser_err("NULL conn to free\n"); - return -1; - } + BUG_ON(p_iser_conn == NULL); - iser_err("free-ing conn %p conn->qp %p conn->cma_id %p\n", - p_iser_conn,p_iser_conn->qp, p_iser_conn->cma_id); + iser_err("free-ing conn %p cma_id %p fmr pool %p qp %p\n", + p_iser_conn, p_iser_conn->cma_id, + p_iser_conn->fmr_pool, p_iser_conn->qp); /* qp is created only once both addr & route are resolved */ + if (p_iser_conn->fmr_pool != NULL) + ib_destroy_fmr_pool(p_iser_conn->fmr_pool); + if (p_iser_conn->qp != NULL) rdma_destroy_qp(p_iser_conn->cma_id); @@ -224,6 +223,7 @@ int iser_free_qp_and_id(struct iser_conn else iser_bug("we are not supposed to be called twice\n"); + p_iser_conn->fmr_pool = NULL; p_iser_conn->qp = NULL; p_iser_conn->cma_id = NULL; @@ -307,7 +307,7 @@ static void iser_route_handler(struct rd struct rdma_conn_param conn_param; int ret; - ret = iser_create_qp((struct iser_conn *)cma_id->context); + ret = iser_create_ib_conn_res((struct iser_conn *)cma_id->context); if (ret) goto failure; @@ -471,7 +471,7 @@ int iser_disconnect(struct iser_conn *p_ * returns: 0 on success, -1 on 
failure */ int -iser_reg_phys_mem(struct iser_adaptor *p_iser_adaptor, +iser_reg_phys_mem(struct iser_conn *p_iser_conn, struct iser_page_vec *page_vec, enum ib_access_flags access_flags, struct iser_mem_reg *mem_reg) @@ -484,7 +484,7 @@ iser_reg_phys_mem(struct iser_adaptor *p page_list = page_vec->pages; io_addr = page_list[0]; - mem = ib_fmr_pool_map_phys(p_iser_adaptor->fmr_pool, + mem = ib_fmr_pool_map_phys(p_iser_conn->fmr_pool, page_list, page_vec->length, &io_addr); Index: ulp/iser/iser.h =================================================================== --- ulp/iser/iser.h (revision 4984) +++ ulp/iser/iser.h (working copy) @@ -99,8 +99,6 @@ struct iser_regd_buf { struct iser_adaptor; -#define ISER_MAX_CONN 4 - struct iser_adaptor { struct list_head ig_list; /* entry in ig adaptors list */ @@ -109,7 +107,6 @@ struct iser_adaptor { struct ib_pd *pd; struct ib_cq *cq; struct ib_mr *mr; - struct ib_fmr_pool *fmr_pool; struct tasklet_struct cq_tasklet; Index: ulp/iser/iser_verbs.h =================================================================== --- ulp/iser/iser_verbs.h (revision 4984) +++ ulp/iser/iser_verbs.h (working copy) @@ -48,11 +48,8 @@ #define ISER_MAX_TASK_MGT_REQ 2 #define ISER_MAX_LOGOUT_REQ 1 -#define ISER_MAX_OUTSTAND_CMDS 64 #define ISER_MAX_IMMEDIATE_CMDS 2 -#define ISER_MAX_CMD_SIZE 0x80000 /* 512KB */ - #define ISER_MIN_RECV_DSL (8*1024) /* 8K */ #define ISER_MAX_FIRST_BURST (128*1024) /* 128K */ @@ -64,13 +61,14 @@ /* Maximal bounds on asynchronous PDUs received by iSER Initiator */ #define ISER_MAX_RX_MISC_PDUS (ISER_MAX_NOP_IN + \ ISER_MAX_ASYNC_EVT) + #define ISER_MAX_TX_MISC_PDUS (ISER_MAX_TEXT_REQ + \ ISER_MAX_NOP_OUT + \ ISER_MAX_TASK_MGT_REQ + \ ISER_MAX_LOGOUT_REQ) -#define ISER_MAX_RX_CMD_RESP ISER_MAX_OUTSTAND_CMDS -#define ISER_MAX_TX_CMDS (ISER_MAX_OUTSTAND_CMDS + \ - ISER_MAX_IMMEDIATE_CMDS) + +#define ISER_MAX_RX_CMD_RESP ISCSI_ISER_XMIT_CMDS_MAX + /* iSER Initiator QP settings */ #define 
ISER_AVG_TASK_RELATED_SEND(first_burst, recv_dsl,imm,max_cmds) \ @@ -90,19 +88,15 @@ ISER_MAX_FIRST_BURST, \ ISER_MIN_RECV_DSL, \ 1, \ - ISER_MAX_OUTSTAND_CMDS) + \ + ISCSI_ISER_XMIT_CMDS_MAX) + \ ISER_MAX_TX_MISC_PDUS + \ ISER_MAX_RX_MISC_PDUS) /* iSER Initiator CQ settings */ -#define ISER_CQ_MAX_RECV_DTOS (ISER_QP_MAX_RECV_DTOS * \ - ISER_MAX_CONN) -#define ISER_CQ_MAX_REQ_DTOS (ISER_QP_MAX_REQ_DTOS * \ - ISER_MAX_CONN) -#define ISER_MAX_CQ_LEN (ISER_CQ_MAX_RECV_DTOS +\ - ISER_CQ_MAX_REQ_DTOS) +#define ISCSI_ISER_MAX_CONN 8 -#define ISER_MAX_TOTAL_QLEN ISER_MAX_QLEN +#define ISER_MAX_CQ_LEN ((ISER_QP_MAX_RECV_DTOS + ISER_QP_MAX_REQ_DTOS) *\ + ISCSI_ISER_MAX_CONN) int iser_create_adaptor_ib_res(struct iser_adaptor *p_iser_adaptor); @@ -117,7 +111,9 @@ int iser_disconnect(struct iser_conn *p_ int iser_free_qp_and_id(struct iser_conn *p_iser_conn); -int iser_reg_phys_mem(struct iser_adaptor *p_iser_adaptor, +int iser_free_ib_conn_res(struct iser_conn *p_iser_conn); + +int iser_reg_phys_mem(struct iser_conn *p_iser_conn, struct iser_page_vec *page_vec, enum ib_access_flags access_flags, struct iser_mem_reg *p_mem_reg); Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 4985) +++ ulp/iser/iser_conn.c (working copy) @@ -373,7 +373,7 @@ void iser_conn_release(struct iser_conn struct iscsi_iser_conn *p_iscsi_conn; if (atomic_read(&p_iser_conn->state) == ISER_CONN_DOWN) { - iser_free_qp_and_id(p_iser_conn); /* qp/id freed only once */ + iser_free_ib_conn_res(p_iser_conn); /* qp/id freed only once */ iser_adaptor_remove_conn(p_iser_conn); p_iscsi_conn = p_iser_conn->p_iscsi_conn; Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 4984) +++ ulp/iser/iser_initiator.c (working copy) @@ -57,7 +57,7 @@ static void iser_dma_unmap_task_data(str static int iser_reg_rdma_mem(struct iscsi_iser_cmd_task 
*p_iser_task, enum iser_data_dir cmd_dir) { - struct iser_adaptor *p_iser_adaptor = p_iser_task->conn->ib_conn->p_adaptor; + struct iser_conn *p_iser_conn = p_iser_task->conn->ib_conn; struct iser_page_vec *page_vec = NULL; struct iser_regd_buf *p_regd_buf = NULL; enum ib_access_flags priv_flags = IB_ACCESS_LOCAL_WRITE; @@ -106,7 +106,7 @@ static int iser_reg_rdma_mem(struct iscs return -ENOMEM; } page_vec_len = iser_page_vec_build(p_mem, page_vec, 0, cnt_to_reg); - err = iser_reg_phys_mem(p_iser_adaptor, page_vec, priv_flags, + err = iser_reg_phys_mem(p_iser_conn, page_vec, priv_flags, &p_regd_buf->reg); iser_page_vec_free(page_vec); if (err) { Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 4984) +++ ulp/iser/iscsi_iser.c (working copy) @@ -892,10 +892,10 @@ static struct scsi_host_template iscsi_i ISCSI_VERSION_STR, .queuecommand = iscsi_iser_queuecommand, .can_queue = ISCSI_ISER_XMIT_CMDS_MAX - 1, - .sg_tablesize = ISER_SG_TABLESIZE, - .cmd_per_lun = ISER_CMD_PER_LUN, + .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, + .cmd_per_lun = ISCSI_ISER_CMD_PER_LUN, .eh_abort_handler = iscsi_iser_eh_abort, - .eh_host_reset_handler = iscsi_iser_eh_host_reset, + .eh_host_reset_handler = iscsi_iser_eh_host_reset, .use_clustering = DISABLE_CLUSTERING, .proc_name = "iscsi_iser", .this_id = -1, @@ -1318,6 +1318,9 @@ static int iscsi_iser_conn_set_param(isc spin_unlock_bh(&session->lock); switch (param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + /* TBD */ + break; case ISCSI_PARAM_MAX_XMIT_DLENGTH: conn->max_xmit_dlength = value; break; From sashak at voltaire.com Sun Jan 15 04:42:52 2006 From: sashak at voltaire.com (Sasha Khapyrsky) Date: Sun, 15 Jan 2006 14:42:52 +0200 Subject: [openib-general] [patch] userspace/management/Makefile small fixes Message-ID: <1137328972.6718.21.camel@sashak.voltaire.com> Hello Hal, I found small outdates in userspace/management/Makefile. Fixes are attached. Sasha. 

Small fixes in Makefile:
- remove nonexistent 'UTIL' dir
- fix DIAG dir definition (so 'make all' will make diags as well)

Signed-off-by: Sasha Khapyorsky

Index: Makefile
===================================================================
--- Makefile	(revision 4999)
+++ Makefile	(working copy)
@@ -2,10 +2,9 @@
 LIBS:=libibcommon libibumad libibmad
 OSM:=osm
 OSMLIBS:=complib libvendor
-UTIL:=$(wildcard util/*)
-DIAG:=$(wildcard diags/*)
+DIAG:=diags
 
-SUBDIRS=$(UTIL) $(DIAG) $(OSM)
+SUBDIRS=$(DIAG) $(OSM)
 
 all: BUILD_TARG=all
 all: libs_install subdirs
@@ -31,12 +30,12 @@
 			if !(cd $(OSM)/$$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi\
 		fi\
 	done
-	@for i in $(UTIL) $(DIAG) $(OSM)/opensm; do\
+	@for i in $(DIAG) $(OSM)/opensm; do\
 		if [ -x $$i/autogen.sh ]; then\
 			if !(cd $$i; ./autogen.sh && ./configure); then exit 1; fi\
 		fi\
 	done
-	@for i in $(UTIL) $(DIAG) $(OSM)/opensm; do\
+	@for i in $(DIAG) $(OSM)/opensm; do\
 		if [ -x $$i/autogen.sh ]; then\
 			if !(cd $$i; make && make install); then exit 1; fi\
 		fi\
@@ -46,7 +45,7 @@
 install: subdirs
 	@echo Install done
 
-clean: SUBDIRS=$(LIBS) $(UTIL) $(DIAG) $(OSM)
+clean: SUBDIRS=$(LIBS) $(DIAG) $(OSM)
 clean: BUILD_TARG=clean
 clean: subdirs
 	@rm -f build_tag
@@ -55,7 +54,7 @@
 rmdep:
 	find $(SUBDIRS) -name ".depend" | xargs rm -f
 
-depend: SUBDIRS=$(LIBS) $(UTIL) $(DIAG) $(OSM)
+depend: SUBDIRS=$(LIBS) $(DIAG) $(OSM)
 depend: BUILD_TARG=depend
 depend: rmdep subdirs
 	@echo Depend done

From halr at voltaire.com Sun Jan 15 07:31:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Jan 2006 10:31:12 -0500
Subject: [openib-general] IPoIB BUG at shutdown with latest OpenIB svn
Message-ID: <1137339072.4336.240.camel@hal.voltaire.com>

With latest OpenIB svn on an i386, when shutting down the machine with
IPoIB, I got the following on the console:

BUG: spinlock lockup on CPU #0, ipoib/6181, cefeca80

The traceback showed:

__ipoib_reap_ah+0x24/0xdb
ipoib_reap_ah+0xb

This was only the last message.
The others scrolled off the screen. Not sure if I can reproduce this but
thought it should be reported.

-- Hal

From halr at voltaire.com Sun Jan 15 08:10:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Jan 2006 11:10:43 -0500
Subject: [openib-general] Re: [patch] userspace/management/Makefile small fixes
In-Reply-To: <1137328972.6718.21.camel@sashak.voltaire.com>
References: <1137328972.6718.21.camel@sashak.voltaire.com>
Message-ID: <1137341442.4336.327.camel@hal.voltaire.com>

Hi Sasha,

On Sun, 2006-01-15 at 07:42, Sasha Khapyrsky wrote:
> Hello Hal,
>
> I found small outdates in userspace/management/Makefile. Fixes are
> attached.

Thanks. They are a vestige of history; the Makefile was not kept in sync
with these changes.

Thanks. Applied. Please see below for a comment about emailing the patch.

-- Hal

> Sasha.
>
>
> Small fixes in Makefile:
> - remove nonexistent 'UTIL' dir
> - fix DIAG dir definition (so 'make all' will make diags as well)
>
> Signed-off-by: Sasha Khapyorsky
>
> Index: Makefile
> ===================================================================
> --- Makefile	(revision 4999)
> +++ Makefile	(working copy)
> @@ -2,10 +2,9 @@
>  LIBS:=libibcommon libibumad libibmad
>  OSM:=osm
>  OSMLIBS:=complib libvendor
> -UTIL:=$(wildcard util/*)
> -DIAG:=$(wildcard diags/*)
> +DIAG:=diags
>
> -SUBDIRS=$(UTIL) $(DIAG) $(OSM)
> +SUBDIRS=$(DIAG) $(OSM)
>
>  all: BUILD_TARG=all
>  all: libs_install subdirs
> @@ -31,12 +30,12 @@
>  			if !(cd $(OSM)/$$i; ./autogen.sh && ./configure && make && make
> install); then exit 1; fi\

Need to set up the mailer not to line wrap (this should have been one line
in the patch, and it was split into two). I fixed this up by hand.
> fi\ > done > - @for i in $(UTIL) $(DIAG) $(OSM)/opensm; do\ > + @for i in $(DIAG) $(OSM)/opensm; do\ > if [ -x $$i/autogen.sh ]; then\ > if !(cd $$i; ./autogen.sh && ./configure); then exit 1; fi\ > fi\ > done > - @for i in $(UTIL) $(DIAG) $(OSM)/opensm; do\ > + @for i in $(DIAG) $(OSM)/opensm; do\ > if [ -x $$i/autogen.sh ]; then\ > if !(cd $$i; make && make install); then exit 1; fi\ > fi\ > @@ -46,7 +45,7 @@ > install: subdirs > @echo Install done > > -clean: SUBDIRS=$(LIBS) $(UTIL) $(DIAG) $(OSM) > +clean: SUBDIRS=$(LIBS) $(DIAG) $(OSM) > clean: BUILD_TARG=clean > clean: subdirs > @rm -f build_tag > @@ -55,7 +54,7 @@ > rmdep: > find $(SUBDIRS) -name ".depend" | xargs rm -f > > -depend: SUBDIRS=$(LIBS) $(UTIL) $(DIAG) $(OSM) > +depend: SUBDIRS=$(LIBS) $(DIAG) $(OSM) > depend: BUILD_TARG=depend > depend: rmdep subdirs > @echo Depend done > > From nacc at us.ibm.com Sun Jan 15 08:48:58 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 15 Jan 2006 08:48:58 -0800 Subject: [openib-general] kernel build failure Message-ID: <20060115164858.GM1129@us.ibm.com> Hi Roland, IPOIB is failing to build with 2.6.15 and svn 4981 and 5009 In file included from drivers/infiniband/core/at.c:55: drivers/infiniband/ulp/ipoib/ipoib.h:52:34: linux/mutex-backport.h: No such file or directory In file included from drivers/infiniband/core/at.c:55: drivers/infiniband/ulp/ipoib/ipoib.h:133: error: field `mcast_mutex' has incomplete type drivers/infiniband/ulp/ipoib/ipoib.h:134: error: field `vlan_mutex' has incomplete type Is this related the semaphore to mutex conversions from Ingo? If so, does this mean that it's only expected to build with 2.6.15-git{6,7,...} or so? (Remember, I'm stuck with 2.6.15 until 2.6.16-rc1 comes out so that I can use ISER without having to patch again). 

Thanks,
Nish

From rdreier at cisco.com Sun Jan 15 08:56:48 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 15 Jan 2006 08:56:48 -0800
Subject: [openib-general] Re: kernel build failure
In-Reply-To: <20060115164858.GM1129@us.ibm.com> (Nishanth Aravamudan's message of "Sun, 15 Jan 2006 08:48:58 -0800")
References: <20060115164858.GM1129@us.ibm.com>
Message-ID:

    Nishanth> Is this related the semaphore to mutex conversions from
    Nishanth> Ingo? If so, does this mean that it's only expected to
    Nishanth> build with 2.6.15-git{6,7,...} or so? (Remember, I'm
    Nishanth> stuck with 2.6.15 until 2.6.16-rc1 comes out so that I
    Nishanth> can use ISER without having to patch again).

Yes, it's related, but 2.6.15 should still work. I'm surprised that
svn doesn't build for you against 2.6.15 -- I created
<linux/mutex-backport.h> under the include directory in the svn
repository, and my tree does build against 2.6.15 for me.

There is the same old story that until 2.6.16-rc1 comes out (which
should be in the next day or so), I don't know of a way to tell the
difference between a real 2.6.15 kernel, and a 2.6.15-gitX kernel. So
there may be trouble with svn against a Linus git kernel.

 - R.

From rdreier at cisco.com Sun Jan 15 09:04:29 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 15 Jan 2006 09:04:29 -0800
Subject: [openib-general] Re: IPoIB BUG at shutdown with latest OpenIB svn
In-Reply-To: <1137339072.4336.240.camel@hal.voltaire.com> (Hal Rosenstock's message of "15 Jan 2006 10:31:12 -0500")
References: <1137339072.4336.240.camel@hal.voltaire.com>
Message-ID:

    > BUG: spinlock lockup on CPU #0, ipoib/6181, cefeca80
    >
    > The traceback showed:
    >
    > __ipoib_reap_ah+0x24/0xdb
    > ipoib_reap_ah+0xb

    > This was only the last message. The others scrolled off the screen.

Unfortunately that traceback is probably just the symptom of something
else crashing while holding a lock. The earlier messages are what we
really need for figuring out what went wrong.

 - R.

From halr at voltaire.com Sun Jan 15 09:20:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Jan 2006 12:20:12 -0500
Subject: [openib-general] Re: IPoIB BUG at shutdown with latest OpenIB svn
In-Reply-To:
References: <1137339072.4336.240.camel@hal.voltaire.com>
Message-ID: <1137345612.4336.341.camel@hal.voltaire.com>

On Sun, 2006-01-15 at 12:04, Roland Dreier wrote:
> > BUG: spinlock lockup on CPU #0, ipoib/6181, cefeca80
> >
> > The traceback showed:
> >
> > __ipoib_reap_ah+0x24/0xdb
> > ipoib_reap_ah+0xb
>
> > This was only the last message. The others scrolled off the screen.
>
> Unfortunately that traceback is probably just the symptom of something
> else crashing while holding a lock.

Maybe not a crash but perhaps a deadlock ?

> The earlier messages are what we
> really need for figuring out what went wrong.
>
> - R.

From halr at voltaire.com Sun Jan 15 09:24:59 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Jan 2006 12:24:59 -0500
Subject: [openib-general] kernel build failure
In-Reply-To: <20060115164858.GM1129@us.ibm.com>
References: <20060115164858.GM1129@us.ibm.com>
Message-ID: <1137345899.4336.353.camel@hal.voltaire.com>

On Sun, 2006-01-15 at 11:48, Nishanth Aravamudan wrote:
> Hi Roland,
>
> IPOIB is failing to build with 2.6.15 and svn 4981 and 5009
>
> In file included from drivers/infiniband/core/at.c:55:
> drivers/infiniband/ulp/ipoib/ipoib.h:52:34: linux/mutex-backport.h: No such file or directory
> In file included from drivers/infiniband/core/at.c:55:
> drivers/infiniband/ulp/ipoib/ipoib.h:133: error: field `mcast_mutex' has incomplete type
> drivers/infiniband/ulp/ipoib/ipoib.h:134: error: field `vlan_mutex' has incomplete type
>
> Is this related the semaphore to mutex conversions from Ingo? If so,
> does this mean that it's only expected to build with 2.6.15-git{6,7,...}
> or so? (Remember, I'm stuck with 2.6.15 until 2.6.16-rc1 comes out so
> that I can use ISER without having to patch again).

The issue appears that there is no linux/mutex-backport.h for 2.6.15. It
needs to point at the one Roland put in the tree. Not sure whether there
will be other issues after this.

Also, I believe AT is largely deprecated at this point.

-- Hal

From rdreier at cisco.com Sun Jan 15 09:42:35 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 15 Jan 2006 09:42:35 -0800
Subject: [openib-general] Re: IPoIB BUG at shutdown with latest OpenIB svn
In-Reply-To: <1137345612.4336.341.camel@hal.voltaire.com> (Hal Rosenstock's message of "15 Jan 2006 12:20:12 -0500")
References: <1137339072.4336.240.camel@hal.voltaire.com> <1137345612.4336.341.camel@hal.voltaire.com>
Message-ID:

    Hal> Maybe not a crash but perhaps a deadlock ?

It's possible but given that you said there were other messages that
scrolled off your screen, it seems more likely that some of them were
an oops that prevented something else from releasing the lock.

 - R.

From rdreier at cisco.com Sun Jan 15 09:43:47 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 15 Jan 2006 09:43:47 -0800
Subject: [openib-general] kernel build failure
In-Reply-To: <1137345899.4336.353.camel@hal.voltaire.com> (Hal Rosenstock's message of "15 Jan 2006 12:24:59 -0500")
References: <20060115164858.GM1129@us.ibm.com> <1137345899.4336.353.camel@hal.voltaire.com>
Message-ID:

    Hal> The issue appears that there is no linux/mutex-backport.h for
    Hal> 2.6.15. It needs to point at the one Roland put in the
    Hal> tree. Not sure whether there will be other issues after this.

<linux/mutex-backport.h> is purely a compatibility header that I
created. It will never appear in any Linux tree. However, the svn build
should pick it up the same way that it picks up the svn versions of the
headers.

 - R.
From nacc at us.ibm.com Sun Jan 15 16:09:45 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 15 Jan 2006 16:09:45 -0800 Subject: [openib-general] Re: kernel build failure In-Reply-To: References: <20060115164858.GM1129@us.ibm.com> Message-ID: <20060116000945.GN1129@us.ibm.com> On 15.01.2006 [08:56:48 -0800], Roland Dreier wrote: > Nishanth> Is this related the semaphore to mutex conversions from > Nishanth> Ingo? If so, does this mean that it's only expected to > Nishanth> build with 2.6.15-git{6,7,...} or so? (Remember, I'm > Nishanth> stuck with 2.6.15 until 2.6.16-rc1 comes out so that I > Nishanth> can use ISER without having to patch again). > > Yes, it's related, but 2.6.15 should still work. I'm surprised that > svn doesn't build for you against 2.6.15 -- I created > under the include directory in the svn > repository, and my tree does build against 2.6.15 for me. I think it was a bug in my build script, actually, I was only cp -R the rdma directory, not everything that might be under include/. I'm going to rerun the test now to see if that fixes things. > There is the same old story that until 2.6.16-rc1 comes out (which > should be in the next day or so), I don't know of a way to tell the > difference between a real 2.6.15 kernel, and a 2.6.15-gitX kernel. So > there may be trouble with svn against a Linus git kernel. Yup, I'll let you know soon with an updated status. Thanks, Nish From nacc at us.ibm.com Sun Jan 15 17:36:50 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 15 Jan 2006 17:36:50 -0800 Subject: [openib-general] Re: kernel build failure In-Reply-To: <20060116000945.GN1129@us.ibm.com> References: <20060115164858.GM1129@us.ibm.com> <20060116000945.GN1129@us.ibm.com> Message-ID: <20060116013650.GP1129@us.ibm.com> On 15.01.2006 [16:09:45 -0800], Nishanth Aravamudan wrote: > On 15.01.2006 [08:56:48 -0800], Roland Dreier wrote: > > Nishanth> Is this related the semaphore to mutex conversions from > > Nishanth> Ingo? 
If so, does this mean that it's only expected to > > Nishanth> build with 2.6.15-git{6,7,...} or so? (Remember, I'm > > Nishanth> stuck with 2.6.15 until 2.6.16-rc1 comes out so that I > > Nishanth> can use ISER without having to patch again). > > > > Yes, it's related, but 2.6.15 should still work. I'm surprised that > > svn doesn't build for you against 2.6.15 -- I created > > under the include directory in the svn > > repository, and my tree does build against 2.6.15 for me. > > I think it was a bug in my build script, actually, I was only cp -R the > rdma directory, not everything that might be under include/. I'm going > to rerun the test now to see if that fixes things. Yup, completely my fault... Fixed up the script to do the right thing and will let the jobs run, should have another 3 sets to post soon. Thanks, Nish From rdreier at cisco.com Sun Jan 15 17:39:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 15 Jan 2006 17:39:55 -0800 Subject: [openib-general] Re: IPoIB BUG at shutdown with latest OpenIB svn In-Reply-To: (Roland Dreier's message of "Sun, 15 Jan 2006 09:42:35 -0800") References: <1137339072.4336.240.camel@hal.voltaire.com> <1137345612.4336.341.camel@hal.voltaire.com> Message-ID: BTW, if you're not able to set up a serial console to catch kernel messages, then I highly recommend setting up netconsole (Documentation/networking/netconsole.txt) on any systems where you might want to be able to capture kernel output. Netconsole should work on just about any system, and all you need is one spare host running syslogd to store the messages. - R. 
From nacc at us.ibm.com Sun Jan 15 17:41:06 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 15 Jan 2006 17:41:06 -0800 Subject: [openib-general] kernel build failure In-Reply-To: <1137345899.4336.353.camel@hal.voltaire.com> References: <20060115164858.GM1129@us.ibm.com> <1137345899.4336.353.camel@hal.voltaire.com> Message-ID: <20060116014106.GQ1129@us.ibm.com> On 15.01.2006 [12:24:59 -0500], Hal Rosenstock wrote: > On Sun, 2006-01-15 at 11:48, Nishanth Aravamudan wrote: > > Hi Roland, > > > > IPOIB is failing to build with 2.6.15 and svn 4981 and 5009 > > > > In file included from drivers/infiniband/core/at.c:55: > > drivers/infiniband/ulp/ipoib/ipoib.h:52:34: linux/mutex-backport.h: No such file or directory > > In file included from drivers/infiniband/core/at.c:55: > > drivers/infiniband/ulp/ipoib/ipoib.h:133: error: field `mcast_mutex' has incomplete type > > drivers/infiniband/ulp/ipoib/ipoib.h:134: error: field `vlan_mutex' has incomplete type > > > > Is this related the semaphore to mutex conversions from Ingo? If so, > > does this mean that it's only expected to build with 2.6.15-git{6,7,...} > > or so? (Remember, I'm stuck with 2.6.15 until 2.6.16-rc1 comes out so > > that I can use ISER without having to patch again). > Also, I believe AT is largely deprecated at this point. That's fine, but my test script basically takes any Kconfig options in the mainline or subversion directories (as appropriate) and sets them to module (I did not see enough variance yet between modular and built-in to keep testing both -- and only doing modular halves the test time required (around 10 hours right now for 8 runs)). So, once AT is removed, it won't be tested, per se. But as long as it's in svn, it will be built.
Thanks, Nish From panda at cse.ohio-state.edu Sun Jan 15 18:29:14 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 15 Jan 2006 21:29:14 -0500 (EST) Subject: [openib-general] Announcing the release of MVAPICH2 0.9.2 (MPI-2 for InfiniBand and other RDMA Interconnects) Message-ID: <200601160229.k0G2TEAq002732@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the release of MVAPICH2 0.9.2 with OpenIB/Gen2, VAPI and uDAPL transport interfaces. It also has support for the standard TCP/IP (provided by MPICH2 stack). It is optimized for the following platforms, OS, compilers and InfiniBand adapters: - Platforms: EM64T, Opteron, IA-32 and Mac G5 - Operating Systems: Linux, Solaris and Mac OSX - Compilers: gcc, intel, pathscale and pgi - InfiniBand Adapters: Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards) Starting with this release, MVAPICH2 enables InfiniBand support for OpenIB/Gen2. All features available for the VAPI and uDAPL interfaces are now available for the OpenIB/Gen2 interface. MVAPICH2 0.9.2 is being distributed as a single integrated package (with MPICH2 1.0.2p1 and MVICH). It is available under BSD license. This new release has the following features: - single code base with multiple underlying transport interfaces: VAPI, OpenIB/Gen2, uDAPL and TCP/IP - high-performance and optimized support for many MPI-2 functionalities (one-sided, collectives, datatype) - support for other MPI-2 functionalities (as provided by MPICH2 1.0.2p1). 
- high-performance and optimized support for all MPI-1 functionalities (including two-sided) - high performance and optimized support for all one-sided operations (Get, Put, and Accumulate) - support for both active and passive synchronization - optimized two-sided operations with RDMA support - efficient memory registration/de-registration schemes for RDMA operations - optimized intra-node shared memory support (bus-based and NUMA) - shared library support for existing binary MPI programs to run - ROMIO support for MPI-IO - uDAPL support for portability across networks and OS (tested for InfiniBand on Linux and Solaris; and Myrinet) - scalable job start-up - optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express with SDR and DDR) - support for multiple compilers (gcc, icc, pathscale and pgi) - single code base for all of the above platforms and OS - integrated and easy-to-use build script for installing the code on various platforms, OS, compilers, devices, and InfiniBand adapters - incorporates a set of runtime and compiler time tunable parameters for convenient tuning on large-scale clusters Other features of this release include: - Excellent performance: Sample performance numbers include: - OpenIB/Gen2 on EM64T with PCI-Ex and IBA-DDR: Two-sided operations: - 3.08 microsec one-way latency (4 bytes) - 1476 MB/sec unidirectional bandwidth - 2661 MB/sec bidirectional bandwidth One-sided operations: - 4.84 microsec Put latency - 1483 MB/sec unidirectional Put bandwidth - 2661 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 on EM64T with PCI-Ex and IBA-SDR: Two-sided operations: - 3.35 microsec one-way latency (4 bytes) - 964 MB/sec unidirectional bandwidth - 1846 MB/sec bidirectional bandwidth One-sided operations: - 5.43 microsec Put latency - 964 MB/sec unidirectional Put bandwidth - 1846 MB/sec bidirectional Put bandwidth - OpenIB/Gen2 on Opteron with PCI-Ex and IBA-SDR: Two-sided operations: - 3.27 microsec one-way 
latency (4 bytes) - 968 MB/sec unidirectional bandwidth - 1896 MB/sec bidirectional bandwidth One-sided operations: - 5.95 microsec Put latency - 968 MB/sec unidirectional Put bandwidth - 1896 MB/sec bidirectional Put bandwidth - Solaris uDAPL/IBTL on Opteron with PCI-X and IBA-SDR: Two-sided operations: - 5.58 microsec one-way latency (4 bytes) - 655 MB/sec unidirectional bandwidth - 799 MB/sec bidirectional bandwidth - OpenIB/Gen2 uDAPL on Opteron with PCI-Ex and IBA-SDR: Two-sided operations: - 3.63 microsec one-way latency (4 bytes) - 962 MB/sec unidirectional bandwidth - 1869 MB/sec bidirectional bandwidth Performance numbers for all other platforms, system configurations and operations can be viewed by visiting `Performance Results' section of the project's web page. - Similar performance with MVAPICH: With the new ADI-3-level design, MVAPICH2 0.9.2 delivers similar performance for two-sided operations compared to MVAPICH 0.9.6. Organizations and users interested in getting the best performance for both two-sided and one-sided operations may migrate from MVAPICH code base to MVAPICH2 code base. - A set of benchmarks to evaluate both two-sided and one-sided operations (Put, Get, and Accumulate) - An enhanced and detailed `User Guide' to assist users: - to install this package on different platforms with interfaces (VAPI, uDAPL, OpenIB/Gen2 and TCP/IP) and different options - to vary different parameters of the MPI installation to extract maximum performance and achieve scalability, especially on large-scale systems. You are welcome to download the MVAPICH2 0.9.2 package and access relevant information from the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A stripped down version of this release is available at the OpenIB SVN. A successive version with additional features and integrated with MPICH2 1.0.3 will be available soon. All feedbacks, including bug reports and hints for performance tuning, are welcome. 
Please send an e-mail to mvapich-help at cse.ohio-state.edu. Thanks, MVAPICH Team at OSU/NBCL ---------- PS: If you would like to be removed from this mailing list, please send an e-mail to mvapich_request at cse.ohio-state.edu. ====================================================================== MVAPICH/MVAPICH2 project is currently supported with funding from U.S. National Science Foundation, U.S. DOE Office of Science, Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; and with equipment support from AMD, Apple, Appro, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm and Sun Microsystems. Other technology partners include Etnus. ====================================================================== From andrey.slepuhin at t-platforms.ru Mon Jan 16 02:48:32 2006 From: andrey.slepuhin at t-platforms.ru (Andrey Slepuhin) Date: Mon, 16 Jan 2006 13:48:32 +0300 Subject: [openib-general] OpenSM doesn't start on p5 570 Message-ID: <20060116104832.GA18902@forest.lab.t-platforms.ru> Dear folks, I have a problem starting opensm on a p5 570 machine.
The following messages appear in the opensm log file: ****************************************************************** ******************** INITIATING HEAVY SWEEP ********************** ****************************************************************** Jan 16 13:30:55 737114 [40018DC0] -> osm_req_get: [ Jan 16 13:30:55 737130 [40018DC0] -> osm_mad_pool_get: [ Jan 16 13:30:55 737147 [40018DC0] -> osm_vendor_get: [ Jan 16 13:30:55 737161 [40018DC0] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x100747dc, size = 256 Jan 16 13:30:55 737176 [40018DC0] -> osm_vendor_get: Acquired UMAD 0x1008ee40, size = 256 Jan 16 13:30:55 737192 [40018DC0] -> osm_vendor_get: ] Jan 16 13:30:55 737208 [40018DC0] -> osm_mad_pool_get: Acquired p_madw = 0x100747d0, p_mad = 0x1008ee78, size = 256 Jan 16 13:30:55 737223 [40018DC0] -> osm_mad_pool_get: ] Jan 16 13:30:55 737238 [40018DC0] -> osm_req_get: Getting NodeInfo (0x11), modifier = 0x0, TID = 0x1234 Jan 16 13:30:55 737255 [40018DC0] -> osm_vl15_post: [ Jan 16 13:30:55 737269 [40018DC0] -> osm_vl15_post: Posting p_madw = 0x0x100747d0 Jan 16 13:30:55 737284 [40018DC0] -> osm_vl15_post: 0 QP0 MADs on wire, 1 QP0 MADs outstanding Jan 16 13:30:55 737299 [40018DC0] -> osm_vl15_poll: [ Jan 16 13:30:55 737313 [40018DC0] -> osm_vl15_poll: Signalling poller thread Jan 16 13:30:55 737334 [40018DC0] -> osm_vl15_poll: ] Jan 16 13:30:55 737338 [42827B20] -> __osm_vl15_poller: Servicing p_madw = 0x100747d0 Jan 16 13:30:55 737352 [40018DC0] -> osm_vl15_post: ] Jan 16 13:30:55 737388 [40018DC0] -> osm_req_get: ] Jan 16 13:30:55 737404 [40018DC0] -> __osm_state_mgr_sweep_hop_0: ] Jan 16 13:30:55 737420 [40018DC0] -> osm_state_mgr_process: ] Jan 16 13:30:55 737436 [40018DC0] -> osm_sm_sweep: ] Jan 16 13:30:55 737464 [42827B20] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 
hop_count...............0x0 trans_id................0x1234 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0] Return path: [0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Jan 16 13:30:55 737604 [42827B20] -> osm_vendor_send: [ Jan 16 13:30:55 737742 [42827B20] -> osm_vendor_send: Completed Sending Request p_madw = 0x100747d0 Jan 16 13:30:55 737761 [42827B20] -> osm_vendor_send: ] Jan 16 13:30:55 737768 [43027B20] -> osm_mad_pool_get: [ Jan 16 13:30:55 737784 [42827B20] -> __osm_vl15_poller: 1 QP0 MADs on wire, 1 outstanding, 0 unicasts sent, 1 total sent Jan 16 13:30:55 737812 [43027B20] -> osm_vendor_get: [ Jan 16 13:30:55 737848 [43027B20] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x10074724, size = 256 Jan 16 13:30:55 737866 [43027B20] -> osm_vendor_get: Acquired UMAD 0x1008ef80, size = 256 Jan 16 13:30:55 737883 [43027B20] -> osm_vendor_get: ] Jan 16 13:30:55 737897 [43027B20] -> osm_mad_pool_get: Acquired p_madw = 0x10074718, p_mad = 0x1008efb8, size = 256 Jan 16 13:30:55 737915 [43027B20] -> osm_mad_pool_get: ] Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping Jan 16 13:30:55 737960 [43027B20] -> osm_mad_pool_put: [ Jan 16 13:30:55 737975 [43027B20] -> osm_mad_pool_put: Releasing p_madw = 0x10074718, p_mad = 0x1008ed00 Jan 16 13:30:55 737993 [43027B20] -> osm_vendor_put: [ Jan 16 13:30:55 738008 [43027B20] -> osm_vendor_put: Retiring UMAD 0x1008ecc8 Jan 16 13:30:55 738026 [43027B20] -> osm_vendor_put: ] Jan 16 13:30:55 738041 [43027B20] -> osm_mad_pool_put: ] My configuration consists of two 23108 HCAs directly connected 
without a switch, firmware is 3.3.3, kernel is 2.6.15-4 from OpenSUSE repository, userspace revision is 4978. Any help will be much appreciated Best regards, Andrey From halr at voltaire.com Mon Jan 16 03:31:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jan 2006 06:31:46 -0500 Subject: [openib-general] OpenSM doesn't start on p5 570 In-Reply-To: <20060116104832.GA18902@forest.lab.t-platforms.ru> References: <20060116104832.GA18902@forest.lab.t-platforms.ru> Message-ID: <1137411106.4336.1475.camel@hal.voltaire.com> Hi, On Mon, 2006-01-16 at 05:48, Andrey Slepuhin wrote: > Dear folks, > > I have a problem starting opensm on a p5 570 machine. Is this the first time trying this on a p5 machine ? > The following messages > appear in the opensm log file: > > ****************************************************************** > ******************** INITIATING HEAVY SWEEP ********************** > ****************************************************************** > > > Jan 16 13:30:55 737114 [40018DC0] -> osm_req_get: [ > Jan 16 13:30:55 737130 [40018DC0] -> osm_mad_pool_get: [ > Jan 16 13:30:55 737147 [40018DC0] -> osm_vendor_get: [ > Jan 16 13:30:55 737161 [40018DC0] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x100747dc, size = 256 > Jan 16 13:30:55 737176 [40018DC0] -> osm_vendor_get: Acquired UMAD 0x1008ee40, size = 256 > Jan 16 13:30:55 737192 [40018DC0] -> osm_vendor_get: ] > Jan 16 13:30:55 737208 [40018DC0] -> osm_mad_pool_get: Acquired p_madw = 0x100747d0, p_mad = 0x1008ee78, size = 256 > Jan 16 13:30:55 737223 [40018DC0] -> osm_mad_pool_get: ] > Jan 16 13:30:55 737238 [40018DC0] -> osm_req_get: Getting NodeInfo (0x11), modifier = 0x0, TID = 0x1234 > Jan 16 13:30:55 737255 [40018DC0] -> osm_vl15_post: [ > Jan 16 13:30:55 737269 [40018DC0] -> osm_vl15_post: Posting p_madw = 0x0x100747d0 > Jan 16 13:30:55 737284 [40018DC0] -> osm_vl15_post: 0 QP0 MADs on wire, 1 QP0 MADs outstanding > Jan 16 13:30:55 737299 [40018DC0] -> osm_vl15_poll: [ > Jan 16 
13:30:55 737313 [40018DC0] -> osm_vl15_poll: Signalling poller thread > Jan 16 13:30:55 737334 [40018DC0] -> osm_vl15_poll: ] > Jan 16 13:30:55 737338 [42827B20] -> __osm_vl15_poller: Servicing p_madw = 0x100747d0 > Jan 16 13:30:55 737352 [40018DC0] -> osm_vl15_post: ] > Jan 16 13:30:55 737388 [40018DC0] -> osm_req_get: ] > Jan 16 13:30:55 737404 [40018DC0] -> __osm_state_mgr_sweep_hop_0: ] > Jan 16 13:30:55 737420 [40018DC0] -> osm_state_mgr_process: ] > Jan 16 13:30:55 737436 [40018DC0] -> osm_sm_sweep: ] > Jan 16 13:30:55 737464 [42827B20] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x0 > trans_id................0x1234 > attr_id.................0x11 (NodeInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0] > Return path: [0] > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Jan 16 13:30:55 737604 [42827B20] -> osm_vendor_send: [ > Jan 16 13:30:55 737742 [42827B20] -> osm_vendor_send: Completed Sending Request p_madw = 0x100747d0 > Jan 16 13:30:55 737761 [42827B20] -> osm_vendor_send: ] > Jan 16 13:30:55 737768 [43027B20] -> osm_mad_pool_get: [ > Jan 16 13:30:55 737784 [42827B20] -> __osm_vl15_poller: 1 QP0 MADs on wire, 1 outstanding, 0 unicasts sent, 1 total sent > Jan 16 13:30:55 737812 [43027B20] -> osm_vendor_get: [ > Jan 16 13:30:55 737848 [43027B20] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x10074724, size = 256 > Jan 16 13:30:55 737866 [43027B20] -> osm_vendor_get: Acquired UMAD 0x1008ef80, size = 256 
> Jan 16 13:30:55 737883 [43027B20] -> osm_vendor_get: ] > Jan 16 13:30:55 737897 [43027B20] -> osm_mad_pool_get: Acquired p_madw = 0x10074718, p_mad = 0x1008efb8, size = 256 > Jan 16 13:30:55 737915 [43027B20] -> osm_mad_pool_get: ] > Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 > attr=0x11) -- dropping This means that no matching transaction was found in transaction match table. This may be an endian problem with the tid. Can you validate the tid (print them out) in both get_madw and put_madw in osm_vendor_ibumad.c ? Since this seems to happen early on, there shouldn't be too many of these. Thanks. > Jan 16 13:30:55 737960 [43027B20] -> osm_mad_pool_put: [ > Jan 16 13:30:55 737975 [43027B20] -> osm_mad_pool_put: Releasing p_madw = 0x10074718, p_mad = 0x1008ed00 > Jan 16 13:30:55 737993 [43027B20] -> osm_vendor_put: [ > Jan 16 13:30:55 738008 [43027B20] -> osm_vendor_put: Retiring UMAD 0x1008ecc8 > Jan 16 13:30:55 738026 [43027B20] -> osm_vendor_put: ] > Jan 16 13:30:55 738041 [43027B20] -> osm_mad_pool_put: ] > > > My configuration consists of two 23108 HCAs directly connected without a switch, > firmware is 3.3.3, kernel is 2.6.15-4 from OpenSUSE repository, userspace > revision is 4978. Are the two HCAs on separate machines ? 
-- Hal > Any help will be much appreciated > > Best regards, > Andrey > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From andrey.slepuhin at t-platforms.ru Mon Jan 16 03:56:09 2006 From: andrey.slepuhin at t-platforms.ru (Andrey Slepuhin) Date: Mon, 16 Jan 2006 14:56:09 +0300 Subject: [openib-general] OpenSM doesn't start on p5 570 In-Reply-To: <1137411106.4336.1475.camel@hal.voltaire.com> References: <20060116104832.GA18902@forest.lab.t-platforms.ru> <1137411106.4336.1475.camel@hal.voltaire.com> Message-ID: <20060116115609.GB18902@forest.lab.t-platforms.ru> On Mon, Jan 16, 2006 at 06:31:46AM -0500, Hal Rosenstock wrote: > Hi, > > On Mon, 2006-01-16 at 05:48, Andrey Slepuhin wrote: > > Dear folks, > > > > I have a problem starting opensm on a p5 570 machine. > > Is this the first time trying this on a p5 machine ? > Yes. > > Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 > > attr=0x11) -- dropping > > This means that no matching transaction was found in transaction match > table. This may be an endian problem with the tid. > > Can you validate the tid (print them out) in both get_madw and put_madw > in osm_vendor_ibumad.c ? Since this seems to happen early on, there > shouldn't be too many of these. Thanks. I got the following: put_madw: tid=0x1234 get_madw: tid=0x1b00001234 > > Are the two HCAs on separate machines ? No, at the moment they are on the same machine. Best regards, Andrey
From halr at voltaire.com Mon Jan 16 05:44:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jan 2006 08:44:30 -0500 Subject: [openib-general] OpenSM doesn't start on p5 570 In-Reply-To: <20060116115609.GB18902@forest.lab.t-platforms.ru> References: <20060116104832.GA18902@forest.lab.t-platforms.ru> <1137411106.4336.1475.camel@hal.voltaire.com> <20060116115609.GB18902@forest.lab.t-platforms.ru> Message-ID: <1137419070.4346.17.camel@localhost.localdomain> On Mon, 2006-01-16 at 06:56, Andrey Slepuhin wrote: > On Mon, Jan 16, 2006 at 06:31:46AM -0500, Hal Rosenstock wrote: > > Hi, > > > > On Mon, 2006-01-16 at 05:48, Andrey Slepuhin wrote: > > > Dear folks, > > > > > > I have a problem starting opensm on a p5 570 machine. > > > > Is this the first time trying this on a p5 machine ? > > > > Yes. > > > > > Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 > > > attr=0x11) -- dropping > > > > This means that no matching transaction was found in transaction match > > table. This may be an endian problem with the tid. > > > > Can you validate the tid (print them out) in both get_madw and put_madw > > in osm_vendor_ibumad.c ? Since this seems to happen early on, there > > shouldn't be too many of these. Thanks. > > I got the following: > > put_madw: tid=0x1234 > get_madw: tid=0x1b00001234 This looks like an endian issue. I will have a patch for you to try later. Stay tuned. Thanks. > > > > Are the two HCAs on separate machines ? > > No, at the moment they are on the same machine.
You should be able to run this in loopback. I have done this. Just wondering about the topology just to be sure... -- Hal > Best regards, > Andrey From mst at mellanox.co.il Mon Jan 16 06:05:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 16:05:56 +0200 Subject: [openib-general] [PATCH] mthca: cosmetic replace 4096 with MTHCA_ICM_PAGE_SIZE Message-ID: <20060116140556.GH22260@mellanox.co.il> The following looks to me like a good idea. Roland? --- Use a constant for the size of icm page in arbel. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 15:37:02.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-12 16:52:41.000000000 +0200 @@ -599,8 +599,9 @@ static int mthca_map_cmd(struct mthca_de * address or size and use that as our log2 size. 
*/ lg = ffs(mthca_icm_addr(&iter) | mthca_icm_size(&iter)) - 1; - if (lg < 12) { - mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%lx).\n", + if (lg < MTHCA_ICM_PAGE_SHIFT) { + mthca_warn(dev, "Got FW area not aligned to %d (%llx/%lx).\n", + MTHCA_ICM_PAGE_SIZE, (unsigned long long) mthca_icm_addr(&iter), mthca_icm_size(&iter)); err = -EINVAL; @@ -612,8 +613,9 @@ static int mthca_map_cmd(struct mthca_de virt += 1 << lg; } - pages[nent * 2 + 1] = cpu_to_be64((mthca_icm_addr(&iter) + - (i << lg)) | (lg - 12)); + pages[nent * 2 + 1] = + cpu_to_be64((mthca_icm_addr(&iter) + (i << lg)) | + (lg - MTHCA_ICM_PAGE_SHIFT)); ts += 1 << (lg - 10); ++tc; Index: last_stable/drivers/infiniband/hw/mthca/mthca_memfree.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-01-12 15:33:08.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-01-12 15:41:45.000000000 +0200 @@ -202,7 +202,7 @@ void mthca_table_put(struct mthca_dev *d if (--table->icm[i]->refcount == 0) { mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, - MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE >> MTHCA_ICM_PAGE_SHIFT, &status); mthca_free_icm(dev, table->icm[i]); table->icm[i] = NULL; } @@ -336,7 +336,8 @@ err: for (i = 0; i < num_icm; ++i) if (table->icm[i]) { mthca_UNMAP_ICM(dev, virt + i * MTHCA_TABLE_CHUNK_SIZE, - MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE >> MTHCA_ICM_PAGE_SHIFT, + &status); mthca_free_icm(dev, table->icm[i]); } @@ -353,7 +354,7 @@ void mthca_free_icm_table(struct mthca_d for (i = 0; i < table->num_icm; ++i) if (table->icm[i]) { mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, - MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE >> MTHCA_ICM_PAGE_SHIFT, &status); mthca_free_icm(dev, table->icm[i]); } @@ -364,7 +365,7 @@ static u64 mthca_uarc_virt(struct mthca_ { return dev->uar_table.uarc_base 
+ uar->index * dev->uar_table.uarc_size + - page * 4096; + page << MTHCA_ICM_PAGE_SHIFT; } int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, @@ -401,7 +402,7 @@ int mthca_map_user_db(struct mthca_dev * if (ret < 0) goto out; - db_tab->page[i].mem.length = 4096; + db_tab->page[i].mem.length = MTHCA_ICM_PAGE_SIZE; db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK; ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); @@ -455,7 +456,7 @@ struct mthca_user_db_table *mthca_init_u if (!mthca_is_memfree(dev)) return NULL; - npages = dev->uar_table.uarc_size / 4096; + npages = dev->uar_table.uarc_size >> MTHCA_ICM_PAGE_SHIFT; db_tab = kmalloc(sizeof *db_tab + npages * sizeof *db_tab->page, GFP_KERNEL); if (!db_tab) return ERR_PTR(-ENOMEM); @@ -478,7 +479,7 @@ void mthca_cleanup_user_db_tab(struct mt if (!mthca_is_memfree(dev)) return; - for (i = 0; i < dev->uar_table.uarc_size / 4096; ++i) { + for (i = 0; i < dev->uar_table.uarc_size >> MTHCA_ICM_PAGE_SHIFT; ++i) { if (db_tab->page[i].uvirt) { mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1, &status); pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); @@ -551,20 +552,20 @@ int mthca_alloc_db(struct mthca_dev *dev page = dev->db_tab->page + end; alloc: - page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096, + page->db_rec = dma_alloc_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, &page->mapping, GFP_KERNEL); if (!page->db_rec) { ret = -ENOMEM; goto out; } - memset(page->db_rec, 0, 4096); + memset(page->db_rec, 0, MTHCA_ICM_PAGE_SIZE); ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, &dev->driver_uar, i), &status); if (!ret && status) ret = -EINVAL; if (ret) { - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, page->db_rec, page->mapping); goto out; } @@ -612,7 +613,7 @@ void mthca_free_db(struct mthca_dev *dev i >= dev->db_tab->max_group1 - 1) { mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, 
&dev->driver_uar, i), 1, &status); - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, page->db_rec, page->mapping); page->db_rec = NULL; @@ -640,7 +641,7 @@ int mthca_init_db_tab(struct mthca_dev * init_MUTEX(&dev->db_tab->mutex); - dev->db_tab->npages = dev->uar_table.uarc_size / 4096; + dev->db_tab->npages = dev->uar_table.uarc_size >> MTHCA_ICM_PAGE_SHIFT; dev->db_tab->max_group1 = 0; dev->db_tab->min_group2 = dev->db_tab->npages - 1; @@ -681,7 +682,7 @@ void mthca_cleanup_db_tab(struct mthca_d mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status); - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, dev->db_tab->page[i].db_rec, dev->db_tab->page[i].mapping); } Index: last_stable/drivers/infiniband/hw/mthca/mthca_memfree.h =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2006-01-12 15:33:08.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_memfree.h 2006-01-12 16:49:23.000000000 +0200 @@ -54,6 +54,9 @@ typedef unsigned int gfp_t; ((256 - sizeof (struct list_head) - 2 * sizeof (int)) / \ (sizeof (struct scatterlist))) +#define MTHCA_ICM_PAGE_SHIFT 12 +#define MTHCA_ICM_PAGE_SIZE (1 << MTHCA_ICM_PAGE_SHIFT) + struct mthca_icm_chunk { struct list_head list; int npages; @@ -141,7 +144,7 @@ static inline unsigned long mthca_icm_si } enum { - MTHCA_DB_REC_PER_PAGE = 4096 / 8 + MTHCA_DB_REC_PER_PAGE = MTHCA_ICM_PAGE_SIZE / 8 }; struct mthca_db_page { Index: last_stable/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-12 15:22:45.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-12 16:55:46.000000000 +0200 @@ -40,6 +40,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" 
#include "mthca_config_reg.h" enum { @@ -823,7 +824,8 @@ void __devexit mthca_unmap_eq_icm(struct { u8 status; - mthca_UNMAP_ICM(dev, dev->eq_table.icm_virt, PAGE_SIZE / 4096, &status); + mthca_UNMAP_ICM(dev, dev->eq_table.icm_virt, + PAGE_SIZE >> MTHCA_ICM_PAGE_SHIFT, &status); pci_unmap_page(dev->pdev, dev->eq_table.icm_dma, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); __free_page(dev->eq_table.icm_page); -- MST From halr at voltaire.com Mon Jan 16 06:39:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jan 2006 09:39:25 -0500 Subject: [openib-general] OpenSM doesn't start on p5 570 In-Reply-To: <1137419070.4346.17.camel@localhost.localdomain> References: <20060116104832.GA18902@forest.lab.t-platforms.ru> <1137411106.4336.1475.camel@hal.voltaire.com> <20060116115609.GB18902@forest.lab.t-platforms.ru> <1137419070.4346.17.camel@localhost.localdomain> Message-ID: <1137421985.4346.131.camel@localhost.localdomain> Hi, On Mon, 2006-01-16 at 08:44, Hal Rosenstock wrote: > On Mon, 2006-01-16 at 06:56, Andrey Slepuhin wrote: > > On Mon, Jan 16, 2006 at 06:31:46AM -0500, Hal Rosenstock wrote: > > > Hi, > > > > > > On Mon, 2006-01-16 at 05:48, Andrey Slepuhin wrote: > > > > Dear folks, > > > > > > > > I have a problem starting opensm on a p5 570 machine. > > > > > > Is this the first time trying this on a p5 machine ? > > > > > > > Yes. > > > > > > > > Jan 16 13:30:55 737939 [43027B20] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 > > > > attr=0x11) -- dropping > > > > > > This means that no matching transaction was found in transaction match > > > table. This may be an endian problem with the tid. > > > > > > Can you validate the tid (print them out) in both get_madw and put_madw > > > in osm_vendor_ibumad.c ? Since this seems to happen early on, there > > > shouldn't be too many of these. Thanks. > > > > I got the following: > > > > put_madw: tid=0x1234 > > get_madw: tid=0x1b00001234 > > This looks like an endian issue. 
I will have a patch for you to try > later. Stay tuned. Thanks. Can you try this patch and let me know if this works for you ? Thanks. -- Hal Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 5016) +++ osm_vendor_ibumad.c (working copy) @@ -137,7 +137,7 @@ static osm_madw_t * get_madw(osm_vendor_t *p_vend, ib_net64_t *tid) { umad_match_t *m, *e; - ib_net64_t mtid = (*tid & 0xffffffff00000000llu); + ib_net64_t mtid = (*tid & cl_ntoh64(0x00000000ffffffffllu)); cl_spinlock_acquire( &p_vend->match_tbl_lock ); for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) { > > > Are the two HCAs on separate machines ? > > > > No, at the moment they are on the same machine. > > You should be able to run this in loopback. I have done this. Just > wondering about the topology just to be sure... > > -- Hal > > > Best regards, > > Andrey > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From andrey.slepuhin at t-platforms.ru Mon Jan 16 06:51:59 2006 From: andrey.slepuhin at t-platforms.ru (Andrey Slepuhin) Date: Mon, 16 Jan 2006 17:51:59 +0300 Subject: [openib-general] OpenSM doesn't start on p5 570 In-Reply-To: <1137421985.4346.131.camel@localhost.localdomain> References: <20060116104832.GA18902@forest.lab.t-platforms.ru> <1137411106.4336.1475.camel@hal.voltaire.com> <20060116115609.GB18902@forest.lab.t-platforms.ru> <1137419070.4346.17.camel@localhost.localdomain> <1137421985.4346.131.camel@localhost.localdomain> Message-ID: <20060116145159.GC18902@forest.lab.t-platforms.ru> On Mon, Jan 16, 2006 at 09:39:25AM -0500, Hal Rosenstock wrote: > > Can you try this patch and let me know if this works for you ? Thanks. Thanks, this patch solved the problem. 
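The mask change in the patch above can be illustrated with a small standalone sketch. It assumes the tid is kept in network byte order and that only its low 32 bits (in the host-order sense) identify the transaction, which is what the put_madw/get_madw output in this thread suggests. The helper names to_net64 and match_tid are hypothetical; to_net64 merely stands in for cl_ntoh64 and is not OpenSM code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cl_ntoh64: convert a host-order 64-bit value
 * to network (big-endian) order by writing the bytes out MSB first, so the
 * result is the same in-memory byte sequence on any host. */
static uint64_t to_net64(uint64_t host)
{
    unsigned char b[8];
    uint64_t net;
    int i;

    for (i = 0; i < 8; i++)
        b[i] = (unsigned char)(host >> (56 - 8 * i));
    memcpy(&net, b, sizeof net);
    return net;
}

/* Hypothetical helper: reduce a network-order tid to the part used for
 * matching. The mask is built in host order and then converted, so it
 * selects the same four bytes on big- and little-endian machines. */
static uint64_t match_tid(uint64_t net_tid)
{
    return net_tid & to_net64(0x00000000ffffffffULL);
}
```

With a mask built this way, the tid 0x1b00001234 reported by get_madw reduces to the same key as the 0x1234 stored by put_madw on either kind of host, whereas a literal such as 0xffffffff00000000 selects the intended bytes on a little-endian host only.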
Best regards, Andrey From dotanb at mellanox.co.il Mon Jan 16 07:20:39 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 16 Jan 2006 17:20:39 +0200 Subject: [openib-general] adding query_srq and query_qp verbs Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDB8D@mtlexch01.mtl.com> Hi. QP and SRQ are objects that can be changed by the HCA (QP: state, SRQ: limit value). We would like to have query SRQ / QP in order to have better debug capabilities. If we will send you a patch that adds this functionality (verbs and mthca), will you add it to the trunk? thanks Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Mon Jan 16 07:35:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:35:10 +0200 Subject: [openib-general] [PATCHv2] mthca: cosmetic replace 4096 with MTHCA_ICM_PAGE_SIZE In-Reply-To: <20060116140556.GH22260@mellanox.co.il> References: <20060116140556.GH22260@mellanox.co.il> Message-ID: <20060116153510.GI22260@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] mthca: cosmetic replace 4096 withMTHCA_ICM_PAGE_SIZE > > The following looks to me like a good idea. > Roland? Here's a slighly updated version. Some lines are still longer than 80 chars. What do you think? --- Replace 4096 and 12 with MTHCA_ICM_PAGE_SIZE and MTHCA_ICM_PAGE_SHIFT where appropriate. Signed-off-by: Ishai Rabinovitz Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-14 18:40:12.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-01-16 17:24:53.000000000 +0200 @@ -599,8 +599,9 @@ * address or size and use that as our log2 size. */ lg = ffs(mthca_icm_addr(&iter) | mthca_icm_size(&iter)) - 1; - if (lg < 12) { - mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%lx).\n", + if (lg < MTHCA_ICM_PAGE_SHIFT) { + mthca_warn(dev, "Got FW area not aligned to %d (%llx/%lx).\n", + MTHCA_ICM_PAGE_SIZE, (unsigned long long) mthca_icm_addr(&iter), mthca_icm_size(&iter)); err = -EINVAL; @@ -612,8 +613,9 @@ virt += 1 << lg; } - pages[nent * 2 + 1] = cpu_to_be64((mthca_icm_addr(&iter) + - (i << lg)) | (lg - 12)); + pages[nent * 2 + 1] = + cpu_to_be64((mthca_icm_addr(&iter) + (i << lg)) | + (lg - MTHCA_ICM_PAGE_SHIFT)); ts += 1 << (lg - 10); ++tc; Index: openib/drivers/infiniband/hw/mthca/mthca_memfree.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-12-16 00:07:16.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-01-16 17:19:00.000000000 +0200 @@ -202,7 +202,8 @@ if (--table->icm[i]->refcount == 0) { mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, - MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE / MTHCA_ICM_PAGE_SIZE, + &status); mthca_free_icm(dev, table->icm[i]); table->icm[i] = NULL; } @@ -336,7 +337,8 @@ for (i = 0; i < num_icm; ++i) if (table->icm[i]) { mthca_UNMAP_ICM(dev, virt + i * MTHCA_TABLE_CHUNK_SIZE, - MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE / MTHCA_ICM_PAGE_SIZE, + &status); mthca_free_icm(dev, table->icm[i]); } @@ -353,7 +355,8 @@ for (i = 0; i < table->num_icm; ++i) if (table->icm[i]) { mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, - 
MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + MTHCA_TABLE_CHUNK_SIZE / MTHCA_ICM_PAGE_SIZE, + &status); mthca_free_icm(dev, table->icm[i]); } @@ -364,7 +367,7 @@ { return dev->uar_table.uarc_base + uar->index * dev->uar_table.uarc_size + - page * 4096; + page * MTHCA_ICM_PAGE_SIZE; } int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, @@ -401,7 +404,7 @@ if (ret < 0) goto out; - db_tab->page[i].mem.length = 4096; + db_tab->page[i].mem.length = MTHCA_ICM_PAGE_SIZE; db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK; ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); @@ -455,7 +458,7 @@ if (!mthca_is_memfree(dev)) return NULL; - npages = dev->uar_table.uarc_size / 4096; + npages = dev->uar_table.uarc_size / MTHCA_ICM_PAGE_SIZE; db_tab = kmalloc(sizeof *db_tab + npages * sizeof *db_tab->page, GFP_KERNEL); if (!db_tab) return ERR_PTR(-ENOMEM); @@ -478,7 +481,7 @@ if (!mthca_is_memfree(dev)) return; - for (i = 0; i < dev->uar_table.uarc_size / 4096; ++i) { + for (i = 0; i < dev->uar_table.uarc_size / MTHCA_ICM_PAGE_SIZE; ++i) { if (db_tab->page[i].uvirt) { mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1, &status); pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); @@ -551,20 +554,20 @@ page = dev->db_tab->page + end; alloc: - page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096, + page->db_rec = dma_alloc_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, &page->mapping, GFP_KERNEL); if (!page->db_rec) { ret = -ENOMEM; goto out; } - memset(page->db_rec, 0, 4096); + memset(page->db_rec, 0, MTHCA_ICM_PAGE_SIZE); ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, &dev->driver_uar, i), &status); if (!ret && status) ret = -EINVAL; if (ret) { - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, page->db_rec, page->mapping); goto out; } @@ -612,7 +615,7 @@ i >= dev->db_tab->max_group1 - 1) { mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, 
&status); - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, page->db_rec, page->mapping); page->db_rec = NULL; @@ -640,7 +643,7 @@ init_MUTEX(&dev->db_tab->mutex); - dev->db_tab->npages = dev->uar_table.uarc_size / 4096; + dev->db_tab->npages = dev->uar_table.uarc_size / MTHCA_ICM_PAGE_SIZE; dev->db_tab->max_group1 = 0; dev->db_tab->min_group2 = dev->db_tab->npages - 1; @@ -681,7 +684,7 @@ mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status); - dma_free_coherent(&dev->pdev->dev, 4096, + dma_free_coherent(&dev->pdev->dev, MTHCA_ICM_PAGE_SIZE, dev->db_tab->page[i].db_rec, dev->db_tab->page[i].mapping); } Index: openib/drivers/infiniband/hw/mthca/mthca_memfree.h =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2006-01-15 15:46:09.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_memfree.h 2006-01-16 16:03:10.000000000 +0200 @@ -46,6 +46,9 @@ ((256 - sizeof (struct list_head) - 2 * sizeof (int)) / \ (sizeof (struct scatterlist))) +#define MTHCA_ICM_PAGE_SHIFT 12 +#define MTHCA_ICM_PAGE_SIZE (1 << MTHCA_ICM_PAGE_SHIFT) + struct mthca_icm_chunk { struct list_head list; int npages; @@ -133,7 +136,7 @@ } enum { - MTHCA_DB_REC_PER_PAGE = 4096 / 8 + MTHCA_DB_REC_PER_PAGE = MTHCA_ICM_PAGE_SIZE / 8 }; struct mthca_db_page { Index: openib/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-12 16:48:43.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_eq.c 2006-01-16 17:19:36.000000000 +0200 @@ -40,6 +40,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" #include "mthca_config_reg.h" enum { @@ -825,7 +826,8 @@ { u8 status; - mthca_UNMAP_ICM(dev, dev->eq_table.icm_virt, PAGE_SIZE / 4096, &status); + mthca_UNMAP_ICM(dev, dev->eq_table.icm_virt, PAGE_SIZE / 
MTHCA_ICM_PAGE_SIZE, + &status); pci_unmap_page(dev->pdev, dev->eq_table.icm_dma, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL); __free_page(dev->eq_table.icm_page); -- MST From mst at mellanox.co.il Mon Jan 16 07:37:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:37:17 +0200 Subject: [openib-general] [PATCH] ipoib: pkt_queue Message-ID: <20060116153717.GA706@mellanox.co.il> The following patch fixes a crash we saw in testing. It replaces ipoib_multicast_drop_counter.patch --- Protect accesses to mcast->pkt_queue by tx_lock. Count multicast packets removed from pkt_queue as dropped. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-15 10:04:52.790884000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-15 16:02:45.975223000 +0200 @@ -123,8 +123,12 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->stats.tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } kfree(mcast); } @@ -276,8 +280,10 @@ static int ipoib_mcast_join_finish(struc } /* actually send any queued packets */ + spin_lock_irq(&priv->tx_lock); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + spin_unlock_irq(&priv->tx_lock); skb->dev = dev; @@ -288,7 +294,9 @@ static int ipoib_mcast_join_finish(struc if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + spin_lock_irq(&priv->tx_lock); } + spin_unlock_irq(&priv->tx_lock); return 0; } @@ -300,6 +308,7 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev 
= mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -310,8 +319,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + spin_lock_irq(&priv->tx_lock); + while (!skb_queue_empty(&mcast->pkt_queue)) { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } + spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -687,6 +700,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -700,8 +714,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " -- MST From mst at mellanox.co.il Mon Jan 16 07:38:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:38:14 +0200 Subject: [openib-general] [PATCH] ipoib: path->ah Message-ID: <20060116153814.GA759@mellanox.co.il> The following patch fixes a crash we saw in testing. --- SA query completion could initialize dlid before callback initializes path->ah, so we must test path->ah rather than dlid. Signed-off-by: Michael S. 
Tsirkin Index: last_stable/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ last_stable/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -505,7 +505,7 @@ static void neigh_add_path(struct sk_buf list_add_tail(&neigh->list, &path->neigh_list); - if (path->pathrec.dlid) { + if (path->ah) { kref_get(&path->ah->ref); neigh->ah = path->ah; @@ -589,7 +589,7 @@ static void unicast_arp_send(struct sk_b return; } - if (path->pathrec.dlid) { + if (path->ah) { ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); -- MST From mst at mellanox.co.il Mon Jan 16 07:40:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:40:01 +0200 Subject: [openib-general] [PATCH] uverbs: flush scheduled_work Message-ID: <20060116154000.GA765@mellanox.co.il> The following was found by code review: uverbs_mem.c does schedule_work, so we must flush scheduled work before unloading the module. --- Flush work scheduled from uverbs_mem.c Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/core/uverbs_main.c =================================================================== --- openib/drivers/infiniband/core/uverbs_main.c (revision 4985) +++ openib/drivers/infiniband/core/uverbs_main.c (working copy) @@ -896,6 +896,7 @@ out: static void __exit ib_uverbs_cleanup(void) { + flush_scheduled_work(); ib_unregister_client(&uverbs_client); mntput(uverbs_event_mnt); unregister_filesystem(&uverbs_event_fs); -- MST From mst at mellanox.co.il Mon Jan 16 07:42:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:42:33 +0200 Subject: [openib-general] [PATCH] fix crash on hotplug Message-ID: <20060116154233.GA868@mellanox.co.il> The following fixes a crash on module unload that we saw in testing. SA even handler does schedule_work so we must flush_scheduled_work after deregistering event handler. 
--- Flush work scheduled by sa event handler. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/core/sa_query.c =================================================================== --- openib/drivers/infiniband/core/sa_query.c (revision 4985) +++ openib/drivers/infiniband/core/sa_query.c (working copy) @@ -955,6 +955,8 @@ static void ib_sa_remove_one(struct ib_d ib_unregister_event_handler(&sa_dev->event_handler); + flush_scheduled_work(); + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); -- MST From mst at mellanox.co.il Mon Jan 16 07:43:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jan 2006 17:43:29 +0200 Subject: [openib-general] patch status update In-Reply-To: <20060112151322.GK16938@mellanox.co.il> References: <20060112151322.GK16938@mellanox.co.il> Message-ID: <20060116154329.GJ22260@mellanox.co.il> Hello, Roland! I have done the following changes to contrib/mellanox/patches: Removed patches: ipoib_multicast_drop_counter.patch - superceded by ipoib_multicast_pkt_queue.patch New patches: Crashes in testing: ipoib_multicast_pkt_queue.patch ipoib_path_ah.patch sa_query_flush.patch Fix for race found by code review: uverbs_flush.patch Cosmetic: mthca_cosmetic_icm_page_size.patch -- MST From mshefty at ichips.intel.com Mon Jan 16 09:54:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 16 Jan 2006 09:54:46 -0800 Subject: [openib-general] adding query_srq and query_qp verbs In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDB8D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDB8D@mtlexch01.mtl.com> Message-ID: <43CBDDE6.9050808@ichips.intel.com> Dotan Barak wrote: > QP and SRQ are objects that can be changed by the HCA (QP: state, SRQ: > limit value). > We would like to have query SRQ / QP in order to have better debug > capabilities. 
> > If we will send you a patch that adds this functionality (verbs and > mthca), will you add it to the trunk? These calls should already be in verbs. - Sean From ralphc at pathscale.com Mon Jan 16 11:17:59 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 16 Jan 2006 11:17:59 -0800 Subject: [openib-general] [PATCH] Problem with directed route SMPs with beginning or ending LID routed parts Message-ID: <1137439079.4520.269.camel@brick.internal.keyresearch.com> OK. Here is a much simplified patch which fixes the problem of a directed route SMP with a with beginning or ending LID routed part. Signed-off-by: Ralph Campbell Index: core/mad.c =================================================================== --- core/mad.c (revision 5030) +++ core/mad.c (working copy) @@ -665,7 +665,15 @@ struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + /* + * Directed route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * If we are at the start of the LID routed part, don't update the + * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. + */ + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == + IB_LID_PERMISSIVE && + !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; -- Ralph Campbell From mshefty at ichips.intel.com Mon Jan 16 12:27:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 16 Jan 2006 12:27:45 -0800 Subject: [openib-general] SA cache design In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4FA@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B4FA@mtlexch01.mtl.com> Message-ID: <43CC01C1.8020603@ichips.intel.com> Eitan Zahavi wrote: > [EZ] The scalability issues we see today are what I most worry about. 
I think that we have a couple scalability issues at the core of this problem. I think that a cache can solve part of the problem, but to fully address the issues, we eventually may need to extend our APIs and underlying protocols. One issue that I see is that the CMA, IB CM, and DAPL APIs support only point-to-point connections. Trying to layer a many-to-many connection model over these is leading to the inefficiencies. For example, the CMA generates one SA query per connection. Another issue is that even if if the number of queries were reduced, the fabric will still see O(n^2) connection messages. Based on the code, the only SA query of interest to most users will be a path record query by gids/pkey. To speed up applications written to the current CMA, DAPL, and Intel's MPI (hey, I gotta eat), my actual implementation has a very limited path record cache in the kernel. The cache uses an index with O(1) insertion, removal, and retrieval. (I plan on re-using the index to help improve the performance of the IB CM as well.) I'm still working on ideas to address the many-to-many connection model. One idea is to have a centralized connection manager to coordinate the connections between the various endpoints. The drawback is that this requires defining a proprietary protocol. Any implementation work in this area will be deferred for now though. - Sean 
From ralphc at pathscale.com Mon Jan 16 13:09:19 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 16 Jan 2006 13:09:19 -0800 Subject: [openib-general] SDP status? Message-ID: <1137445759.4520.278.camel@brick.internal.keyresearch.com> I tried getting the latest OpenIB tree from trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. Everything seems to be working OK except that I can't establish a SDP connection. I get: ib_sdp WARN: <0> <2100> Path record completion error <-101> Any ideas what is happening? -- Ralph Campbell From eitan at mellanox.co.il Mon Jan 16 13:39:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 16 Jan 2006 23:39:47 +0200 Subject: [openib-general] SDP status? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B535@mtlexch01.mtl.com> Do you have SM running on any node? Can you ping to the other node by its ib0 IP address? Eitan Zahavi > > I tried getting the latest OpenIB tree from > trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. 
> Everything seems to be working OK except that I can't > establish a SDP connection. I get: > > ib_sdp WARN: <0> <2100> Path record completion error <-101> > > Any ideas what is happening? > > -- > Ralph Campbell > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Mon Jan 16 13:49:19 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 16 Jan 2006 23:49:19 +0200 Subject: [openib-general] SA cache design Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B536@mtlexch01.mtl.com> Hi Sean > Eitan Zahavi wrote: > > [EZ] The scalability issues we see today are what I most worry about. > > > One issue that I see is that the CMA, IB CM, and DAPL APIs support only > point-to-point connections. Trying to layer a many-to-many connection model > over these is leading to the inefficiencies. For example, the CMA generates one > SA query per connection. Another issue is that even if if the number of queries > were reduced, the fabric will still see O(n^2) connection messages. [EZ] Having N^2 messages is not a big problem if they do not all go one target... CM is distributed and this is good. Only the PathRecord section of the connection establishment is going today to one node (SA) and you are about to fix it... During initial connections setup you will not have anything in the SA cache and thus the SA will need to answer N^2 PathRecords. Smart exponential back-off can resolve that DOS attack on the SA at bring-up. > > Based on the code, the only SA query of interest to most users will be a path > record query by gids/pkey. To speed up applications written to the current CMA, > DAPL, and Intel's MPI (hey, I gotta eat), my actual implementation has a very > limited path record cache in the kernel. 
The cache uses an index with O(1) > insertion, removal, and retrieval. (I plan on re-using the index to help > improve the performance of the IB CM as well.) [EZ] We might need a little more in the key for QoS support (to come). > > I'm still working on ideas to address the many-to-many connection model. One [EZ] I would try and make sure the connections are not done in a manner such that all nodes try to establish connections to a single node at the same time. This is an application issue but can be easily resolve. Do the MPI connection in a loop like: for (target = (myRank + 1) % numNodes ; target != myRank; (target++) % numNodes) { /* establish connection to node target */ } > idea is to have a centralized connection manager to coordinate the connections > between the various endpoints. The drawback is that this requires defining a > proprietary protocol. Any implementation work in this area will be deferred for > now though. [EZ] I think a centralized CM is a only going to make things worse. > > - Sean From eitan at mellanox.co.il Mon Jan 16 14:02:31 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 17 Jan 2006 00:02:31 +0200 Subject: [openib-general] SA cache design Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B537@mtlexch01.mtl.com> What was I thinking ... for (target = (myRank + 1) % numNodes ; target != myRank; target = (target + 1)% numNodes) { /* establish connection to node target */ } > [EZ] I would try and make sure the connections are not done in a manner > such that all nodes try to establish connections to a single node at the > same time. This is an application issue but can be easily resolve. 
Do > the MPI connection in a loop like: > > for (target = (myRank + 1) % numNodes ; target != myRank; (target++) % > numNodes) { > /* establish connection to node target */ > } > From ralphc at pathscale.com Mon Jan 16 13:59:53 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 16 Jan 2006 13:59:53 -0800 Subject: [openib-general] SDP status? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B535@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B535@mtlexch01.mtl.com> Message-ID: <1137448793.4520.284.camel@brick.internal.keyresearch.com> I can ping via IPoIB. The SM/SA is running on the switch (I forget what manufacturer). On Mon, 2006-01-16 at 23:39 +0200, Eitan Zahavi wrote: > Do you have SM running on any node? > Can you ping to the other node by its ib0 IP address? > > Eitan Zahavi > > > > I tried getting the latest OpenIB tree from > > trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. > > Everything seems to be working OK except that I can't > > establish a SDP connection. I get: > > > > ib_sdp WARN: <0> <2100> Path record completion error <-101> > > > > Any ideas what is happening? > > > > -- > > Ralph Campbell > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general -- Ralph Campbell From eitan at mellanox.co.il Mon Jan 16 14:14:18 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 17 Jan 2006 00:14:18 +0200 Subject: [openib-general] SDP status? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B539@mtlexch01.mtl.com> You might try using osmtest to see if the SM is returning all path records: osmtest -f c This should generate a file ./osmtest.dat with all path records See if you get any lines that start with PATH in that file. 
That is as far as I can help today (just about midnight here) Eitan Zahavi > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Ralph Campbell > Sent: Tuesday, January 17, 2006 12:00 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SDP status? > > I can ping via IPoIB. The SM/SA is running on the switch > (I forget what manufacturer). > > On Mon, 2006-01-16 at 23:39 +0200, Eitan Zahavi wrote: > > Do you have SM running on any node? > > Can you ping to the other node by its ib0 IP address? > > > > Eitan Zahavi > > > > > > I tried getting the latest OpenIB tree from > > > trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. > > > Everything seems to be working OK except that I can't > > > establish a SDP connection. I get: > > > > > > ib_sdp WARN: <0> <2100> Path record completion error <-101> > > > > > > Any ideas what is happening? > > > > > > -- > > > Ralph Campbell > -- > Ralph Campbell From trimmer at silverstorm.com Mon Jan 16 14:28:14 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Mon, 16 Jan 2006 17:28:14 -0500 Subject: [openib-general] SA cache design Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0972@mercury.infiniconsys.com> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > What was I thinking ... 
> for (target = (myRank + 1) % numNodes ; target != myRank; target = > (target + 1)% numNodes) { /* establish connection to node target > */ > } This can be even simpler for MPI. Given that some nodes must listen and others must connect, use an approach in which higher-rank processes connect to lower-rank processes. Then it's simply: initiate listen on my endpoint /* could omit this for highest rank in job */ for (target=(my_rank-1); target>0; target--) initiate connect to target For even greater efficiency, the "initiate connect to target" could be done in parallel batches. E.g., start 50 outbound connects, wait for some or all of them to complete, then start the next batch. Such as: for (target=(my_rank-1); target>0; target--) while (num_outstanding > limit) wait num_outstanding++ initiate connect to target Then the callback for completing a connection sequence could decrement num_outstanding and wake up the waiter (or the waiter could be a sleep/poll type loop). We have been successfully using the algorithms above for about 2-3 years now and they work very well. Todd Rimmer From mshefty at ichips.intel.com Mon Jan 16 14:30:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 16 Jan 2006 14:30:20 -0800 Subject: [openib-general] SA cache design In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B536@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B536@mtlexch01.mtl.com> Message-ID: <43CC1E7C.7090008@ichips.intel.com> Eitan Zahavi wrote: > [EZ] Having N^2 messages is not a big problem if they do not all go one > target... > CM is distributed and this is good. Only the PathRecord section of the > connection establishment is going today to one node (SA) and you are > about to fix it... I expect that we'll start having issues scaling when the number of nodes starts to exceed the size of the CM's QP. Your idea below should help.
> During initial connections setup you will not have anything in the SA > cache and thus the SA will need to answer N^2 PathRecords. Smart > exponential back-off can resolve that DOS attack on the SA at bring-up. I'll post the code for the cache once I complete my testing, but it issues a single query to fill the cache. The SA will only see O(n) requests. The cache also supports an update delay, or settle time, and minimum update time to prevent spamming the SA with back to back requests. > [EZ] We might need a little more in the key for QoS support (to come). This would need to be exposed through our APIs as well. Alternate paths are also not yet supported. > [EZ] I would try and make sure the connections are not done in a manner > such that all nodes try to establish connections to a single node at the > same time. This is an application issue but can be easily resolve. I agree. > [EZ] I think a centralized CM is a only going to make things worse. It can reduce the number of messages on the network from O(n^2) to O(n). The idea is that instead of all nodes sending connection requests to all other nodes, they send a single connection request -- containing an array of QP information -- to one node. (The array could be sent over an established connection, rather than in MADs.) The amount of traffic to that one node should be only slightly worse than the all to all case. - Sean From arlin.r.davis at intel.com Mon Jan 16 14:54:59 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 16 Jan 2006 14:54:59 -0800 Subject: [openib-general] [RFC] DAT 2.0 immediate data proposal Message-ID: Arkady, The attached proposal adds immediate data options as standard API's instead of extensions for the following calls. dat_ep_post_send_immed() dat_ep_post_recv_immed() dat_ep_post_write_immed() The patch should be ready by tomorrow. Thanks, -arlin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: DAT_immediate_data.pdf Type: application/pdf Size: 49419 bytes Desc: not available URL: From arlin.r.davis at intel.com Mon Jan 16 14:55:03 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 16 Jan 2006 14:55:03 -0800 Subject: [openib-general] [RFC] DAT 2.0 extension proposal Message-ID: <59278FC0C48A994BABABD069571E45680D9C723E@orsmsx401.amr.corp.intel.com> Arkady, The attached proposal adds generic DTO extensions and provider specific atomic operations as follow. dat_ep_post_cmp_and_swap() dat_ep_post_fetch_and_add() The patch should be ready by tomorrow. Thanks, -arlin -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: DAT_Extensions.pdf Type: application/octet-stream Size: 76434 bytes Desc: DAT_Extensions.pdf URL: From ralphc at pathscale.com Mon Jan 16 15:02:39 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 16 Jan 2006 15:02:39 -0800 Subject: [openib-general] [PATCH] race in pingpong -e Message-ID: <1137452559.4520.295.camel@brick.internal.keyresearch.com> The example pingpong programs have a race when using events where the client sends the first packet but the server hasn't yet armed the CQ by calling ibv_req_notify_cq() thus waiting forever in ibv_get_cq_event(). The fix is to move the call to ibv_req_notify_cq() before signaling the client to "start". 
Signed-off-by: Ralph Campbell Index: libibverbs/examples/rc_pingpong.c =================================================================== --- libibverbs/examples/rc_pingpong.c (revision 5031) +++ libibverbs/examples/rc_pingpong.c (working copy) @@ -568,6 +568,12 @@ return 1; } + if (use_event) + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + my_dest.lid = pp_get_local_lid(ctx, ib_port); my_dest.qpn = ctx->qp->qp_num; my_dest.psn = lrand48() & 0xffffff; @@ -594,12 +600,6 @@ if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; - if (use_event) - if (ibv_req_notify_cq(ctx->cq, 0)) { - fprintf(stderr, "Couldn't request CQ notification\n"); - return 1; - } - ctx->pending = PINGPONG_RECV_WRID; if (servername) { Index: libibverbs/examples/uc_pingpong.c =================================================================== --- libibverbs/examples/uc_pingpong.c (revision 5031) +++ libibverbs/examples/uc_pingpong.c (working copy) @@ -556,6 +556,12 @@ return 1; } + if (use_event) + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + my_dest.lid = pp_get_local_lid(ctx, ib_port); my_dest.qpn = ctx->qp->qp_num; my_dest.psn = lrand48() & 0xffffff; @@ -582,12 +588,6 @@ if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; - if (use_event) - if (ibv_req_notify_cq(ctx->cq, 0)) { - fprintf(stderr, "Couldn't request CQ notification\n"); - return 1; - } - ctx->pending = PINGPONG_RECV_WRID; if (servername) { Index: libibverbs/examples/ud_pingpong.c =================================================================== --- libibverbs/examples/ud_pingpong.c (revision 5031) +++ libibverbs/examples/ud_pingpong.c (working copy) @@ -564,6 +564,12 @@ return 1; } + if (use_event) + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + my_dest.lid = pp_get_local_lid(ctx, ib_port); my_dest.qpn = 
ctx->qp->qp_num; my_dest.psn = lrand48() & 0xffffff; @@ -590,12 +596,6 @@ if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; - if (use_event) - if (ibv_req_notify_cq(ctx->cq, 0)) { - fprintf(stderr, "Couldn't request CQ notification\n"); - return 1; - } - ctx->pending = PINGPONG_RECV_WRID; if (servername) { Index: libibverbs/examples/srq_pingpong.c =================================================================== --- libibverbs/examples/srq_pingpong.c (revision 5031) +++ libibverbs/examples/srq_pingpong.c (working copy) @@ -649,6 +649,12 @@ return 1; } + if (use_event) + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + memset(my_dest, 0, sizeof my_dest); for (i = 0; i < num_qp; ++i) { @@ -680,12 +686,6 @@ if (pp_connect_ctx(ctx, ib_port, my_dest, rem_dest)) return 1; - if (use_event) - if (ibv_req_notify_cq(ctx->cq, 0)) { - fprintf(stderr, "Couldn't request CQ notification\n"); - return 1; - } - if (servername) for (i = 0; i < num_qp; ++i) { if (pp_post_send(ctx, i)) { -- Ralph Campbell From ralphc at pathscale.com Mon Jan 16 15:09:08 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 16 Jan 2006 15:09:08 -0800 Subject: [openib-general] SDP status? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B539@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B539@mtlexch01.mtl.com> Message-ID: <1137452948.4520.299.camel@brick.internal.keyresearch.com> I guess our switch software needs updating :-( # osmtest -f c Command Line Arguments Done with args Flow = Create Inventory using default guid 0x1175000004e007 Jan 16 15:03:29 218782 [AABCC280] -> osm_vendor_bind: Binding to port 0x1175000004e007. 
Jan 16 15:03:29 231748 [AABCC280] -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x2 resp_time_val:0x11000000 ----------------------------- Jan 16 15:03:29 245098 [41001960] -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0006 Jan 16 15:03:29 245124 [41001960] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR). Jan 16 15:03:29 245150 [AABCC280] -> osmtest_get_all_recs: ERR 0064: ib_query failed (IB_REMOTE_ERROR). Jan 16 15:03:29 245163 [AABCC280] -> osmtest_get_all_recs: Remote error = IB_SA_MAD_STATUS_INSUF_COMPS. Jan 16 15:03:29 245172 [AABCC280] -> osmtest_write_all_path_recs: ERR 0025: osmtest_get_all_recs failed (IB_REMOTE_ERROR) Jan 16 15:03:29 245194 [AABCC280] -> osmtest_run: ERR 00139: Inventory file create failed (IB_REMOTE_ERROR) OSMTEST: TEST "Create Inventory" FAIL On Tue, 2006-01-17 at 00:14 +0200, Eitan Zahavi wrote: > You might try using osmtest to see if the SM is returning all path > records: > > osmtest -f c > This should generate a file ./osmtest.dat with all path records > > See if you get any lines that start with PATH in that file. > > That is as far as I can help today (just about midnight here) > > Eitan Zahavi > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Ralph Campbell > > Sent: Tuesday, January 17, 2006 12:00 AM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] SDP status? > > > > I can ping via IPoIB. The SM/SA is running on the switch > > (I forget what manufacturer). > > > > On Mon, 2006-01-16 at 23:39 +0200, Eitan Zahavi wrote: > > > Do you have SM running on any node? > > > Can you ping to the other node by its ib0 IP address? > > > > > > Eitan Zahavi > > > > > > > > I tried getting the latest OpenIB tree from > > > > trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. 
> > > > Everything seems to be working OK except that I can't > > > > establish a SDP connection. I get: > > > > > > > > ib_sdp WARN: <0> <2100> Path record completion error <-101> > > > > > > > > Any ideas what is happening? > > > > > > > > -- > > > > Ralph Campbell > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > -- > > Ralph Campbell > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general -- Ralph Campbell From robert.j.woodruff at intel.com Mon Jan 16 15:51:05 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 16 Jan 2006 15:51:05 -0800 Subject: [openib-general] Missing file, trunk/src/linux-kernel/include/scsi/srp.h. in SVN5031 Message-ID: Hi Roland, There use to be a file trunk/src/linux-kernel/include/scsi/srp.h in SVN4900. This file has now been deleted from SVN5031, causing srp.c to fail to build on older kernels. I also see that this file is in the latest 2.6.15 kernel in linux-2.6.15/include/scsi/srp.h and that there is a backport patch for older kernels to put it into the linux/include/scsi directory. However, for the latest development version, shouldn't that remain in SVN at infiniband/scsi/srp.h ? Somewhat confused ? woody From eitan at mellanox.co.il Mon Jan 16 22:00:12 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 17 Jan 2006 08:00:12 +0200 Subject: [openib-general] SDP status? 
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B53B@mtlexch01.mtl.com> Hi Ralph I think that the error reported: IB_SA_MAD_STATUS_INSUF_COMPS means that for some reason the SA on the switch requires some more bits component mask for PathRecord query (component mask bits flag the SA by which fields it should filter paths). This might as well be the cause for SDP fail to get PathRecord response. My only advice would be to contact the SA vendor you use and figure out from the SM logs what is going on. Eitan Zahavi > -----Original Message----- > From: Ralph Campbell [mailto:ralphc at pathscale.com] > Sent: Tuesday, January 17, 2006 1:09 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SDP status? > > I guess our switch software needs updating :-( > > # osmtest -f c > > Command Line Arguments > Done with args > Flow = Create Inventory > using default guid 0x1175000004e007 > Jan 16 15:03:29 218782 [AABCC280] -> osm_vendor_bind: Binding to port > 0x1175000004e007. > Jan 16 15:03:29 231748 [AABCC280] -> osmtest_validate_sa_class_port_info: > ----------------------------- > SA Class Port Info: > base_ver:1 > class_ver:2 > cap_mask:0x2 > resp_time_val:0x11000000 > ----------------------------- > Jan 16 15:03:29 245098 [41001960] -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote > error:0x0006 > Jan 16 15:03:29 245124 [41001960] -> osmtest_query_res_cb: ERR 0003: Error on query > (IB_REMOTE_ERROR). > Jan 16 15:03:29 245150 [AABCC280] -> osmtest_get_all_recs: ERR 0064: ib_query failed > (IB_REMOTE_ERROR). > Jan 16 15:03:29 245163 [AABCC280] -> osmtest_get_all_recs: Remote error = > IB_SA_MAD_STATUS_INSUF_COMPS. 
> Jan 16 15:03:29 245172 [AABCC280] -> osmtest_write_all_path_recs: ERR 0025: > osmtest_get_all_recs failed (IB_REMOTE_ERROR) > Jan 16 15:03:29 245194 [AABCC280] -> osmtest_run: ERR 00139: Inventory file create > failed (IB_REMOTE_ERROR) > OSMTEST: TEST "Create Inventory" FAIL > > > > On Tue, 2006-01-17 at 00:14 +0200, Eitan Zahavi wrote: > > You might try using osmtest to see if the SM is returning all path > > records: > > > > osmtest -f c > > This should generate a file ./osmtest.dat with all path records > > > > See if you get any lines that start with PATH in that file. > > > > That is as far as I can help today (just about midnight here) > > > > Eitan Zahavi > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Ralph Campbell > > > Sent: Tuesday, January 17, 2006 12:00 AM > > > To: Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: RE: [openib-general] SDP status? > > > > > > I can ping via IPoIB. The SM/SA is running on the switch > > > (I forget what manufacturer). > > > > > > On Mon, 2006-01-16 at 23:39 +0200, Eitan Zahavi wrote: > > > > Do you have SM running on any node? > > > > Can you ping to the other node by its ib0 IP address? > > > > > > > > Eitan Zahavi > > > > > > > > > > I tried getting the latest OpenIB tree from > > > > > trunk/src/linux-kernel/infiniband with my 2.6.15 kernel. > > > > > Everything seems to be working OK except that I can't > > > > > establish a SDP connection. I get: > > > > > > > > > > ib_sdp WARN: <0> <2100> Path record completion error <-101> > > > > > > > > > > Any ideas what is happening? 
> > > > > > > > > > -- > > > > > Ralph Campbell > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > > > -- > > > Ralph Campbell > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > -- > Ralph Campbell > From dotanb at mellanox.co.il Mon Jan 16 22:39:11 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 17 Jan 2006 08:39:11 +0200 Subject: [openib-general] adding query_srq and query_qp verbs Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDC7B@mtlexch01.mtl.com> > > If we will send you a patch that adds this functionality (verbs and > > mthca), will you add it to the trunk? > > These calls should already be in verbs. > > - Sean > Those calls exists in kernel level only. we would like to use them in user level as well. Dotan From rdreier at cisco.com Mon Jan 16 22:53:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jan 2006 22:53:41 -0800 Subject: [openib-general] Re: Missing file, trunk/src/linux-kernel/include/scsi/srp.h. in SVN5031 In-Reply-To: (Bob Woodruff's message of "Mon, 16 Jan 2006 15:51:05 -0800") References: Message-ID: Bob> There use to be a file Bob> trunk/src/linux-kernel/include/scsi/srp.h Bob> in SVN4900. This file has now been deleted from SVN5031, Bob> causing srp.c to fail to build on older kernels. I also see Bob> that this file is in the latest 2.6.15 kernel in Bob> linux-2.6.15/include/scsi/srp.h and that there is a backport Bob> patch for older kernels to put it into the linux/include/scsi Bob> directory. 
However, for the latest development version, Bob> shouldn't that remain in SVN at infiniband/scsi/srp.h ? My reasoning was that now that the file is upstream, it is really part of the Linux SCSI stack and shouldn't really be maintained in the openib svn repo. There's nothing IB-specific about it, and in fact I need to migrate drivers/scsi/ibmvscsi to using it. However my reasoning could easily be all wrong -- what are the advantages of having it in openib svn? - R. From rdreier at cisco.com Mon Jan 16 22:56:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jan 2006 22:56:52 -0800 Subject: [openib-general] Re: adding query_srq and query_qp verbs In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDB8D@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 16 Jan 2006 17:20:39 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDB8D@mtlexch01.mtl.com> Message-ID: Dotan> If we will send you a patch that adds this functionality Dotan> (verbs and mthca), will you add it to the trunk? Yes (obviously pending review of the code itself). I would like to declare the libibverbs API frozen for a 1.0 release within, say, three or four weeks, so please try to get this done quickly. (The last thing on my list of things to implement before 1.0 is the resize CQ verb) - R. 
From ogerlitz at voltaire.com Mon Jan 16 23:13:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 17 Jan 2006 09:13:51 +0200 (IST) Subject: [openib-general] [PATCH] iser: use host->host_lock instead of conn->lock Message-ID: applied to r5032 moved to use host->host_lock instead of conn->lock in conn_destroy, removed conn->lock and conn->max_outstanding_cmds Signed-off-by: Or Gerlitz Signed-off-by: Erez Zilber Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5024) +++ ulp/iser/iscsi_iser.h (working copy) @@ -198,11 +198,9 @@ struct iscsi_iser_conn uint32_t exp_statsn; int id; /* iSCSI CID */ - spinlock_t lock; /* MERGE_FIXME: can it be removed */ int max_recv_dlength; /* == initiator_max_recv_dsl */ int max_xmit_dlength; /* == target_max_recv_dsl */ - unsigned int max_outstand_cmds; /* MERGE_FIXME need2 review */ /* abort */ wait_queue_head_t ehwait; /* used in eh_abort() */ Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5024) +++ ulp/iser/iscsi_iser.c (working copy) @@ -643,12 +643,10 @@ static inline void iscsi_iser_ctask_clea } if (sc->sc_data_direction == DMA_TO_DEVICE) { struct iscsi_iser_data_task *dtask, *n; - spin_lock(&conn->lock); list_for_each_entry_safe(dtask, n, &ctask->dataqueue, item) { list_del(&dtask->item); mempool_free(dtask, ctask->datapool); } - spin_unlock(&conn->lock); } ctask->sc = NULL; @@ -1116,8 +1114,6 @@ static iscsi_connh_t iscsi_iser_conn_cre conn->exp_statsn = 0; - spin_lock_init(&conn->lock); - /* initialize general xmit PDU commands queue */ conn->xmitqueue = kfifo_alloc(session->cmds_max * sizeof(void*), GFP_KERNEL, NULL); @@ -1231,6 +1227,7 @@ static void iscsi_iser_conn_destroy(iscs { struct iscsi_iser_conn *conn = iscsi_ptr(connh); struct iscsi_iser_session *session = conn->session; + unsigned long flags; debug_iser("%s: enter\n", __FUNCTION__); @@ 
-1260,13 +1257,13 @@ static void iscsi_iser_conn_destroy(iscs * time out or fail. */ for (;;) { - spin_lock_bh(&conn->lock); + spin_lock_irqsave(session->host->host_lock, flags); if (!session->host->host_busy) { /* OK for ERL == 0 */ - spin_unlock_bh(&conn->lock); - debug_iser("%s: released conn->lock (host's not busy)\n", __FUNCTION__); + spin_unlock_irqrestore(session->host->host_lock, flags); + debug_iser("%s: released host_lock (host's not busy)\n", __FUNCTION__); break; } - spin_unlock_bh(&conn->lock); + spin_unlock_irqrestore(session->host->host_lock, flags); msleep_interruptible(500); debug_iser("conn_destroy(): host = 0x%p, host_busy %d host_failed %d\n", session->host, session->host->host_busy, session->host->host_failed); From ogerlitz at voltaire.com Mon Jan 16 23:38:00 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 17 Jan 2006 09:38:00 +0200 (IST) Subject: [openib-general] [PATCH] iser: use semaphore instead of spinlock for critical section Message-ID: use semaphore instead of spinlock for the adaptor lookup/creation critical section, this should eliminate the case where the spinlock debug code wrong CPU assertion was activated. 
Signed-off-by: Or Gerlitz Signed-off-by: Dan Bar Dov Index: ulp/iser/iser_mod.c =================================================================== --- ulp/iser/iser_mod.c (revision 5032) +++ ulp/iser/iser_mod.c (revision 5033) @@ -119,7 +119,7 @@ int init_module(void) return -ENOMEM; /* adaptor init is called only after the first addr resolution */ - spin_lock_init(&ig.adaptor_list_lock); + init_MUTEX(&ig.adaptor_list_sem); INIT_LIST_HEAD(&ig.adaptor_list); ig.num_adaptors = 0; Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 5032) +++ ulp/iser/iser_verbs.c (revision 5033) @@ -235,7 +235,7 @@ struct iser_adaptor *iser_adaptor_find_b struct list_head *p_list; struct iser_adaptor *p_adaptor = NULL; - spin_lock(&ig.adaptor_list_lock); + down(&ig.adaptor_list_sem); p_list = ig.adaptor_list.next; while (p_list != &ig.adaptor_list) { @@ -257,7 +257,7 @@ struct iser_adaptor *iser_adaptor_find_b list_add(&p_adaptor->ig_list, &ig.adaptor_list); } end: - spin_unlock(&ig.adaptor_list_lock); + up(&ig.adaptor_list_sem); return p_adaptor; } Index: ulp/iser/iser.h =================================================================== --- ulp/iser/iser.h (revision 5032) +++ ulp/iser/iser.h (revision 5033) @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -126,8 +127,7 @@ struct iser_adaptor { */ struct iser_global { unsigned int num_adaptors; - - spinlock_t adaptor_list_lock; /* */ + struct semaphore adaptor_list_sem; /* */ struct list_head adaptor_list; /* all iSER adaptors */ kmem_cache_t *dto_cache; /* slab for iser_dto */ From yaeli at mellanox.co.il Tue Jan 17 00:30:16 2006 From: yaeli at mellanox.co.il (Yael Shenhav) Date: Tue, 17 Jan 2006 10:30:16 +0200 Subject: [openib-general] Problems with dmcli on 64b hosts Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDCE7@mtlexch01.mtl.com> Hi Roland, I am using dmcli Python utility for SRP purposes. 
With x86_64, FedoraCore4, I get an error: Traceback (most recent call last): File "/usr/local/ibg2/bin/dmcli", line 185, in ? main() File "/usr/local/ibg2/bin/dmcli", line 149, in main agt = f.reg_agent(1) File "/usr/local/ibg2/lib64/python/umad.py", line 121, in reg_agent if fcntl.ioctl(self._fd, REGISTER_AGENT, buf, 1): TypeError: ioctl requires a file or file descriptor, an integer and optionally a integer or buffer argument I saw you identified this issue as being a Python bug and proposed a workaround inspired by https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1112949&group_id=5470 Are you planning to get this workaround into dmcli? Thanks. Regards, Yael Shenhav From ogerlitz at voltaire.com Tue Jan 17 02:05:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 17 Jan 2006 12:05:08 +0200 (IST) Subject: [openib-general] [PATCH] enable the fmr pool user to set the page size Message-ID: Roland, This patch allows the consumer to set the page size of the "pages" mapped by the pool FMRs, a feature that already exists in the ib_verbs API. On the cosmetic side, it renames the ib_fmr_attr.page_size field to page_shift. Note that I did not go as far as renaming mpt_entry->page_size, so it's up to you whether to keep the page_size convention there. A patch to convert the FMR consumers to the new API is below; if this API change is accepted, we will enhance the iser code, e.g. to use FMRs at 4K "page" resolution. Or.
Signed-off-by: Or Gerlitz Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 4911) +++ include/rdma/ib_verbs.h (working copy) @@ -650,7 +650,7 @@ struct ib_mw_bind { struct ib_fmr_attr { int max_pages; int max_maps; - u8 page_size; + u8 page_shift; }; struct ib_ucontext { Index: include/rdma/ib_fmr_pool.h =================================================================== --- include/rdma/ib_fmr_pool.h (revision 4911) +++ include/rdma/ib_fmr_pool.h (working copy) @@ -43,6 +43,7 @@ struct ib_fmr_pool; /** * struct ib_fmr_pool_param - Parameters for creating FMR pool * @max_pages_per_fmr:Maximum number of pages per map request. + * @page_shift: Log2 of sizeof "pages" mapped by this fmr * @access:Access flags for FMRs in pool. * @pool_size:Number of FMRs to allocate for pool. * @dirty_watermark:Flush is triggered when @dirty_watermark dirty @@ -55,6 +56,7 @@ struct ib_fmr_pool; */ struct ib_fmr_pool_param { int max_pages_per_fmr; + int page_shift; enum ib_access_flags access; int pool_size; int dirty_watermark; Index: core/fmr_pool.c =================================================================== --- core/fmr_pool.c (revision 4911) +++ core/fmr_pool.c (working copy) @@ -280,7 +280,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(s struct ib_fmr_attr attr = { .max_pages = params->max_pages_per_fmr, .max_maps = IB_FMR_MAX_REMAPS, - .page_size = PAGE_SHIFT + .page_shift = params->page_shift }; for (i = 0; i < params->pool_size; ++i) { Index: hw/mthca/mthca_mr.c =================================================================== --- hw/mthca/mthca_mr.c (revision 4911) +++ hw/mthca/mthca_mr.c (working copy) @@ -497,7 +497,7 @@ int mthca_fmr_alloc(struct mthca_dev *de might_sleep(); - if (mr->attr.page_size < 12 || mr->attr.page_size >= 32) + if (mr->attr.page_shift < 12 || mr->attr.page_shift >= 32) return -EINVAL; /* For Arbel, all MTTs must fit in the same page.
*/ @@ -549,7 +549,7 @@ int mthca_fmr_alloc(struct mthca_dev *de MTHCA_MPT_FLAG_REGION | access); - mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12); + mpt_entry->page_size = cpu_to_be32(mr->attr.page_shift - 12); mpt_entry->key = cpu_to_be32(key); mpt_entry->pd = cpu_to_be32(pd); memset(&mpt_entry->start, 0, Index: ulp/sdp/sdp_conn.c =================================================================== --- ulp/sdp/sdp_conn.c (revision 4911) +++ ulp/sdp/sdp_conn.c (working copy) @@ -1759,6 +1759,7 @@ static void sdp_device_init_one(struct i /* * FMR allocation */ + fmr_param_s.page_shift = PAGE_SHIFT; fmr_param_s.pool_size = SDP_FMR_POOL_SIZE; fmr_param_s.dirty_watermark = SDP_FMR_DIRTY_SIZE; fmr_param_s.cache = 1; Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 5033) +++ ulp/iser/iser_verbs.c (working copy) @@ -150,6 +150,7 @@ int iser_create_ib_conn_res(struct iser_ p_iser_adaptor = p_iser_conn->p_adaptor; + params.page_shift = PAGE_SHIFT; params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE; params.pool_size = ISCSI_ISER_XMIT_CMDS_MAX; params.dirty_watermark = 32; From sashak at voltaire.com Tue Jan 17 02:10:00 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Jan 2006 12:10:00 +0200 Subject: [openib-general] [patch] userspace/management/diags/src/sminfo.c - cmdline processing fix Message-ID: <20060117101000.GA8053@sashak.voltaire.com> Hello Hal, There is a small bug in sminfo's cmdline processing: it will segfault when an option argument is missing (like 'sminfo -a'). The "fast and dirty" fix is inlined. The same problem exists with most diag tools, so I think we need to rework the ARGBEGIN { ... } ARGEND stuff (actually remove it from libibcommon, since it is used by diag tools only). I can do it if there are no objections. Regards, Sasha. This is a fast fix for invalid ARGF() usage in sminfo.c.
Signed-off-by: Sasha Khapyorsky Index: diags/src/sminfo.c =================================================================== --- diags/src/sminfo.c (revision 5017) +++ diags/src/sminfo.c (working copy) @@ -49,6 +49,8 @@ #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +#define SAFE_ARGF() (*(argv+1) ? ARGF() : ( usage(), NULL ) ) + static void iberror(const char *fn, char *msg, ...) { @@ -116,10 +118,10 @@ ARGBEGIN { case 'C': - ca = ARGF(); + ca = SAFE_ARGF(); break; case 'P': - ca_port = strtoul(ARGF(), 0, 0); + ca_port = strtoul(SAFE_ARGF(), 0, 0); break; case 'd': ibdebug++; @@ -137,17 +139,17 @@ dest_type = IB_DEST_GUID; break; case 't': - timeout = strtoul(ARGF(), 0, 0); + timeout = strtoul(SAFE_ARGF(), 0, 0); madrpc_set_timeout(timeout); break; case 'a': - act = strtoul(ARGF(), 0, 0); + act = strtoul(SAFE_ARGF(), 0, 0); break; case 's': - state = strtoul(ARGF(), 0, 0); + state = strtoul(SAFE_ARGF(), 0, 0); break; case 'p': - prio = strtoul(ARGF(), 0, 0); + prio = strtoul(SAFE_ARGF(), 0, 0); break; case 'V': fprintf(stderr, "%s %s\n", argv0, get_build_version() ); From mst at mellanox.co.il Tue Jan 17 02:19:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 12:19:50 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the pagesize In-Reply-To: References: Message-ID: <20060117101950.GP22260@mellanox.co.il> Quoting Or Gerlitz : > A patch to convert the fmr consumers to the new api is below, if this > api change is accepted we will enhance iser code eg to fmr in 4K "pages" > resolution. Out of curiosity, why would you want to make the page size smaller than PAGE_SIZE? Bigger pages typically give you better performance, isnt that true for iser? -- MST From mst at mellanox.co.il Tue Jan 17 02:30:56 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 17 Jan 2006 12:30:56 +0200 Subject: [openib-general] Re: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix In-Reply-To: <20060117101000.GA8053@sashak.voltaire.com> References: <20060117101000.GA8053@sashak.voltaire.com> Message-ID: <20060117103056.GQ22260@mellanox.co.il> Quoting Sasha Khapyorsky : > Subject: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix > > Hello Hal, > > There is small bug in sminfo's cmdline processing, this will segfault > when option argument is missing (like 'sminfo -a'). The "fast and dirty" > fix is inlined. > > The same problem exists with most diag tools, so I think we need to > rework AGRBEGIN { ... } ARGEND stuff (actually remove it from > libibcommon since it is used by diag tools only). I can do it if there > are no objections. > > Regards, > Sasha. > > > This fast fix for invalid ARGF() usage in sminfo.c. > > Signed-off-by: Sasha Khapyorsky BTW, why arent the diags using the standard getopt_long? That would solve the problem above in a clean way and help us get rid of the ARGxxx macros completely. Hal? -- MST From sashak at voltaire.com Tue Jan 17 03:57:02 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Jan 2006 13:57:02 +0200 Subject: [openib-general] Re: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix In-Reply-To: <20060117103056.GQ22260@mellanox.co.il> References: <20060117101000.GA8053@sashak.voltaire.com> <20060117103056.GQ22260@mellanox.co.il> Message-ID: <20060117115702.GB8053@sashak.voltaire.com> On 12:30 Tue 17 Jan , Michael S. Tsirkin wrote: > > BTW, why arent the diags using the standard getopt_long? > That would solve the problem above in a clean way and help us get rid of > the ARGxxx macros completely. Agree. Sasha. 
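For reference, the getopt_long approach suggested above could look roughly like the sketch below. It is only an illustration under assumptions, not the diags code: struct diag_opts, parse_args() and the long option names are made up for the example, and only a few of the sminfo flags (-C, -P, -t, -a, -d) are shown. A leading ':' in the option string makes getopt_long return ':' when an option argument is missing, so a call like 'sminfo -a' turns into a usage error rather than a crash.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for a subset of the sminfo options. */
struct diag_opts {
	const char *ca;        /* -C <ca_name> */
	unsigned long ca_port; /* -P <ca_port> */
	unsigned long timeout; /* -t <timeout_ms> */
	unsigned long act;     /* -a <activity> */
	int debug;             /* -d, may be repeated */
};

/*
 * Parse argv into *o. Returns 0 on success and -1 on a usage error
 * (unknown option or missing option argument).
 */
static int parse_args(int argc, char **argv, struct diag_opts *o)
{
	/* Long option names are assumptions made for this sketch. */
	static const struct option long_opts[] = {
		{ "Ca",       required_argument, NULL, 'C' },
		{ "Port",     required_argument, NULL, 'P' },
		{ "timeout",  required_argument, NULL, 't' },
		{ "activity", required_argument, NULL, 'a' },
		{ "debug",    no_argument,       NULL, 'd' },
		{ NULL, 0, NULL, 0 }
	};
	int c;

	optind = 1; /* start scanning at the first argument */
	opterr = 0; /* we print our own error messages */

	while ((c = getopt_long(argc, argv, ":C:P:t:a:d", long_opts, NULL)) != -1) {
		switch (c) {
		case 'C':
			o->ca = optarg;
			break;
		case 'P':
			o->ca_port = strtoul(optarg, NULL, 0);
			break;
		case 't':
			o->timeout = strtoul(optarg, NULL, 0);
			break;
		case 'a':
			o->act = strtoul(optarg, NULL, 0);
			break;
		case 'd':
			o->debug++;
			break;
		case ':': /* option argument is missing */
			fprintf(stderr, "option requires an argument -- '%c'\n", optopt);
			return -1;
		default:  /* '?': unknown option */
			fprintf(stderr, "unknown option\n");
			return -1;
		}
	}
	return 0;
}
```

A tool's main() would call parse_args(argc, argv, &opts) and print its usage text when -1 comes back; optarg is never dereferenced unless getopt_long has actually found an argument.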
From halr at voltaire.com Tue Jan 17 03:59:37 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 17 Jan 2006 13:59:37 +0200 Subject: [openib-general] RE: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABCE@taurus.voltaire.com> Hi Michael, I believe this is largely historical. I will put this on the list TODO for the diags and hopefully get to it in the not too distant future. -- Hal ________________________________ From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Tue 1/17/2006 5:30 AM To: Sasha Khapyorsky Cc: Hal Rosenstock; openib Subject: Re: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix Quoting Sasha Khapyorsky : > Subject: [patch] userspace/management/diags/src/sminfo.c -cmdline processing fix > > Hello Hal, > > There is small bug in sminfo's cmdline processing, this will segfault > when option argument is missing (like 'sminfo -a'). The "fast and dirty" > fix is inlined. > > The same problem exists with most diag tools, so I think we need to > rework AGRBEGIN { ... } ARGEND stuff (actually remove it from > libibcommon since it is used by diag tools only). I can do it if there > are no objections. > > Regards, > Sasha. > > > This fast fix for invalid ARGF() usage in sminfo.c. > > Signed-off-by: Sasha Khapyorsky BTW, why arent the diags using the standard getopt_long? That would solve the problem above in a clean way and help us get rid of the ARGxxx macros completely. Hal? -- MST From ogerlitz at voltaire.com Tue Jan 17 04:13:59 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 17 Jan 2006 14:13:59 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the pagesize In-Reply-To: <20060117101950.GP22260@mellanox.co.il> References: <20060117101950.GP22260@mellanox.co.il> Message-ID: <43CCDF87.8020507@voltaire.com> Michael S. 
Tsirkin wrote: > Out of curiosity, why would you want to make the page size smaller than > PAGE_SIZE? Bigger pages typically give you better performance, isnt that true > for iser? First just for the sake of clarity it is important to emphasize in the verbs api level the decoupling of the OS page notation to the "page" used by the HCA to map bunch of buffers to one network VA. Second and indeed more important, from our experience, there are eventually IB consumers such as the Linux SCSI Mid-Layer which sometimes generate Scatter-Gather lists that are "RDMA aligned" when treated in a resolution different from the system PAGE_SHIFT. Example to that we saw with ia64 SLES9 SP1/2 kernels. So if you work in PAGE_SHIFT you can not produce one VA for many of the SG submitted by the mid-layer. Or. From mst at mellanox.co.il Tue Jan 17 04:24:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 14:24:56 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the pagesize In-Reply-To: <43CCDF87.8020507@voltaire.com> References: <43CCDF87.8020507@voltaire.com> Message-ID: <20060117122455.GR22260@mellanox.co.il> Quoting Or Gerlitz : > Second and indeed more important, from our experience, there are > eventually IB consumers such as the Linux SCSI Mid-Layer which sometimes > generate Scatter-Gather lists that are "RDMA aligned" when treated in a > resolution different from the system PAGE_SHIFT. Example to that we saw > with ia64 SLES9 SP1/2 kernels. So if you work in PAGE_SHIFT you can not > produce one VA for many of the SG submitted by the mid-layer. Interesting. Where does the mid-layer get the 4K (not PAGE_SIZE) aligned buffers? Any idea? -- MST From rdreier at cisco.com Tue Jan 17 04:59:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 04:59:08 -0800 Subject: [openib-general] Comments on ehca updates In-Reply-To: (Heiko J. 
Schick's message of "Tue, 10 Jan 2006 10:26:16 +0100")
References:
Message-ID:

Hi, I noticed that you checked in some ehca changes. A couple of
comments:

1) While most of the changes to your #includes are correct, like

-#include
+#include "hipz_fns_core.h"

since you should use "" instead of <> for includes in your own local
directory to make the kernel build work, things in the kernel's own
include/ directory should still use <>, so for example

-#include
+#include "linux/version.h"

Also, I never noticed this before but

-#include
+#include "ib_mad.h"

should really just be #include .

2) How can the changes like

@@ -75,7 +74,7 @@ int ehca_post_send(struct ib_qp *qp,
 my_qp, qp->qp_num, send_wr, bad_send_wr);

 /* LOCK the QUEUE */
- spin_lock_irqsave(&my_qp->spinlock_s, spin_flags);
+ spin_lock(&my_qp->spinlock_s);

be correct? ehca_post_send() is called directly as your device's
post_send method, which means that a consumer can call it from both
process and interrupt context. So using plain spin_lock() can
deadlock if a process context call is interrupted by an interrupt
context call.

The same comment applies to your other changes like this, at least in
your post_recv and poll_cq methods.

 - R.

From rdreier at cisco.com Tue Jan 17 05:01:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jan 2006 05:01:59 -0800
Subject: [openib-general] Re: Problems with dmcli on 64b hosts
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDCE7@mtlexch01.mtl.com> (Yael Shenhav's message of "Tue, 17 Jan 2006 10:30:16 +0200")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDCE7@mtlexch01.mtl.com>
Message-ID:

    Yael> Are you planning to get this workaround in dmcli?

Not really -- I think the C version of the DM client that I posted
later is more useful. By the way, I still need to integrate Alexander
Beyn's fix to that code, and I will do that soon.

Even better would be if someone ambitious created a DM tool that
automatically connects to the targets it discovers, etc.

 - R.
From SCHICKHJ at de.ibm.com Tue Jan 17 04:40:48 2006 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 17 Jan 2006 13:40:48 +0100 Subject: [openib-general] Re: Comments on ehca updates In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Jan 17 06:20:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 17 Jan 2006 16:20:22 +0200 Subject: [openib-general] RE: [patch] userspace/management/diags/src/sminfo.c - cmdline processing fix Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABD6@taurus.voltaire.com> Hi Sasha, Thanks. Applied. I would welcome such a patch. -- Hal ________________________________ From: Sasha Khapyorsky [mailto:sashak at voltaire.com] Sent: Tue 1/17/2006 5:10 AM To: Hal Rosenstock Cc: openib Subject: [patch] userspace/management/diags/src/sminfo.c - cmdline processing fix Hello Hal, There is small bug in sminfo's cmdline processing, this will segfault when option argument is missing (like 'sminfo -a'). The "fast and dirty" fix is inlined. The same problem exists with most diag tools, so I think we need to rework AGRBEGIN { ... } ARGEND stuff (actually remove it from libibcommon since it is used by diag tools only). I can do it if there are no objections. Regards, Sasha. This fast fix for invalid ARGF() usage in sminfo.c. Signed-off-by: Sasha Khapyorsky Index: diags/src/sminfo.c =================================================================== --- diags/src/sminfo.c (revision 5017) +++ diags/src/sminfo.c (working copy) @@ -49,6 +49,8 @@ #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +#define SAFE_ARGF() (*(argv+1) ? ARGF() : ( usage(), NULL ) ) + static void iberror(const char *fn, char *msg, ...) 
 {
@@ -116,10 +118,10 @@
 	ARGBEGIN {
 	case 'C':
-		ca = ARGF();
+		ca = SAFE_ARGF();
 		break;
 	case 'P':
-		ca_port = strtoul(ARGF(), 0, 0);
+		ca_port = strtoul(SAFE_ARGF(), 0, 0);
 		break;
 	case 'd':
 		ibdebug++;
@@ -137,17 +139,17 @@
 		dest_type = IB_DEST_GUID;
 		break;
 	case 't':
-		timeout = strtoul(ARGF(), 0, 0);
+		timeout = strtoul(SAFE_ARGF(), 0, 0);
 		madrpc_set_timeout(timeout);
 		break;
 	case 'a':
-		act = strtoul(ARGF(), 0, 0);
+		act = strtoul(SAFE_ARGF(), 0, 0);
 		break;
 	case 's':
-		state = strtoul(ARGF(), 0, 0);
+		state = strtoul(SAFE_ARGF(), 0, 0);
 		break;
 	case 'p':
-		prio = strtoul(ARGF(), 0, 0);
+		prio = strtoul(SAFE_ARGF(), 0, 0);
 		break;
 	case 'V':
 		fprintf(stderr, "%s %s\n", argv0, get_build_version() );

From steve.apo at googlemail.com Tue Jan 17 07:02:31 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Tue, 17 Jan 2006 15:02:31 +0000
Subject: [openib-general] Unknown symbol ip_dev_find (2.6.15.1 kernel)
Message-ID: <2cfcf21e0601170702w30f111cs@mail.gmail.com>

Hi,

I was updating my kernel and openib drivers (haven't done so for a
couple of months) and I've got stuck on the following problem.

When you do "make modules_install" you get the following warnings at
the end:

WARNING: /lib/modules/2.6.15.1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find
WARNING: /lib/modules/2.6.15.1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find
WARNING: /lib/modules/2.6.15.1/kernel/drivers/infiniband/core/ib_addr.ko needs unknown symbol ip_dev_find

I tried to reboot anyway, but these modules do indeed fail to load due
to this problem.

I notice this was fixed for 2.6.14 with a patch that exported the
ip_dev_find symbol. Do we need one for 2.6.15.1 or have I missed a
step out of my installation process?

Thanks for the help.

Cheers,

Steve.
From Arkady.Kanevsky at netapp.com Tue Jan 17 07:16:27 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 17 Jan 2006 10:16:27 -0500
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
Message-ID:

Arlin,
a few things need to be addressed.

1. Correlation with local and remote invalidate.
   This potentially affects both DAT_DTOs and post operations.
2. Need a precise definition of CONFIRM_FLAG in a transport-independent fashion.
   What guarantees does the DAT Provider "provide" on successful local completion?
   A remote-end guarantee?
   My understanding is that you are trying to create 2 models, one for IB and one for iWARP.
   So for IB, Consumers will use CONFIRM_FLAG, and for iWARP, IMMED_FLAG.
   The Provider will indicate in Provider_attr which model it supports.
   The issue I have with this is that I do not see a model that a Consumer can use
   to write transport-independent code.
   It looks like Immed_flag can be made transport independent.
   But with the "sender" specifying the behavior, a protocol extension is needed for IB.
   IB will always deliver Immediate data in the header, not the payload, and the remote
   Provider can control how it is delivered to a Consumer.
   But this means that there is no need for DTO_flags on the Send side.
   Instead it can be used on the Recv side or controlled purely by the Provider.
3. Need to define error behavior for new operations, async errors, EP behavior.
4. Need to define DAT_Provider attributes for immediate data and dto_flags behavior.
5. Is the Solicited_wait completion_flag value now applicable for RDMA_write with immediate data?
6. Does dto_completion_data xfer_length include the immediate_data size or not?
7. What memory privileges are needed for a recv buffer for immediate data?
8. SRQ interaction?
9. What happens if a buffer posted by a recv operation (NOT recv_immed) is matched
   for an incoming recv/rdma_write op?
10. Change dat_ep_post_write_immed to dat_ep_post_rdma_write_immed
    to be consistent with current terminology.
11.
Need to clean up the operation description to make it clear that the
Send|RDMA_write and immediate data part is a single atomic operation.
The current "followed by" language is misleading.
Make it explicit that there is a single local DTO completion and a single
remote DTO completion.
12. Is your intention that post_recv_immed can ONLY accept immediate data
    and is not able to recv any message?
13. size should be num_segments for dat_ep_post_recv_immed()

Arkady

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

________________________________
From: Arlin Davis [mailto:arlin.r.davis at intel.com]
Sent: Monday, January 16, 2006 5:55 PM
To: Kanevsky, Arkady; Lentini, James
Cc: dat-discussions at yahoogroups.com; openib-general at openib.org
Subject: [RFC] DAT 2.0 immediate data proposal

Arkady,

The attached proposal adds immediate data options as standard APIs
instead of extensions for the following calls.

dat_ep_post_send_immed()
dat_ep_post_recv_immed()
dat_ep_post_write_immed()

The patch should be ready by tomorrow.

Thanks,

-arlin

From jlentini at netapp.com Tue Jan 17 07:21:10 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 17 Jan 2006 10:21:10 -0500 (EST)
Subject: [openib-general] Unknown symbol ip_dev_find (2.6.15.1 kernel)
In-Reply-To: <2cfcf21e0601170702w30f111cs@mail.gmail.com>
References: <2cfcf21e0601170702w30f111cs@mail.gmail.com>
Message-ID:

> I notice this was fixed for 2.6.14 with a patch that exported the
> ip_dev_find symbol. Do we need one for 2.6.15.1 or have I missed a
> step out of my installation process?

The same fix is necessary. You can apply the 2.6.14 patch to 2.6.15.1.
From Arkady.Kanevsky at netapp.com Tue Jan 17 07:47:56 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 17 Jan 2006 10:47:56 -0500
Subject: [openib-general] RE: [RFC] DAT 2.0 extension proposal
Message-ID:

Arlin,

1. Does it mean that existing DAT providers will have to be modified
   so they report DAT_NOT_IMPLEMENTED for each extension?
2. Why is there DAT_INVALID in DAT_DTOS?
3. Do you want to use DAT_EXTENSION_DATA or DAT_EXT_DATA?
4. The proposed operations are operations on EP and they are DTOs.
   Why not define DAT_DTO_EXT_OP instead of DAT_EXT_OP?
   My concern is that if these are not DTOs then we have a new event stream
   type for "extensions" and we need to define rules for this event stream,
   including ordering rules and interactions with other event streams,
   provider attributes for stream mixing and so on...
   If we restrict extensions to DTO operation extensions we avoid all these
   issues and simplify APIs.
   On the negative side these extensions are restrictive.
5. Memory protection extension for atomic operations
6. Error returns for extensions?

Arkady

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

________________________________
From: Davis, Arlin R [mailto:arlin.r.davis at intel.com]
Sent: Monday, January 16, 2006 5:55 PM
To: Kanevsky, Arkady; Lentini, James
Cc: dat-discussions at yahoogroups.com; openib-general at openib.org
Subject: [RFC] DAT 2.0 extension proposal

Arkady,

The attached proposal adds generic DTO extensions and provider specific
atomic operations as follows.

dat_ep_post_cmp_and_swap()
dat_ep_post_fetch_and_add()

The patch should be ready by tomorrow.

Thanks,

-arlin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From oferg at mellanox.co.il Tue Jan 17 08:19:08 2006 From: oferg at mellanox.co.il (Ofer Gigi) Date: Tue, 17 Jan 2006 18:19:08 +0200 Subject: [openib-general] Send packets above ibumad and waiting for a response Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3011DDED7@mtlexch01.mtl.com> Hi, I am trying to write a test that will send/receive packets directly above IBUMAD. Does anyone have such a chunk of code that do this? Thanks a lot in advance! Ofer -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Jan 17 08:19:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 17 Jan 2006 18:19:56 +0200 Subject: [openib-general] Send packets above ibumad and waiting for aresponse Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABDE@taurus.voltaire.com> Hi Ofer, Are you referring to the kernelmodule or the library here ? -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Ofer Gigi Sent: Tue 1/17/2006 11:19 AM To: openib-general at openib.org Subject: [openib-general] Send packets above ibumad and waiting for aresponse Hi, I am trying to write a test that will send/receive packets directly above IBUMAD. Does anyone have such a chunk of code that do this? Thanks a lot in advance! Ofer From rdreier at cisco.com Tue Jan 17 08:25:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 08:25:04 -0800 Subject: [openib-general] [PATCH] ibsrpdm: use the proper HCA and port with non-default umad device In-Reply-To: <439F8C89.7040507@datadirectnet.com> (Alexander Beyn's message of "Tue, 13 Dec 2005 19:07:53 -0800") References: <439F8C89.7040507@datadirectnet.com> Message-ID: At long last I've integrated your patch into srptools. I took this as an excuse to check the DM package into svn under https://openib.org/svn/gen2/trunk/src/userspace/srptools as well, rather than passing tarballs around on the mailing list. 
Thanks, Roland From robert.j.woodruff at intel.com Tue Jan 17 08:43:01 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 17 Jan 2006 08:43:01 -0800 Subject: [openib-general] RE: Missing file, trunk/src/linux-kernel/include/scsi/srp.h. in SVN5031 In-Reply-To: Message-ID: Roland wrote, >However my reasoning could easily be all wrong -- what are the >advantages of having it in openib svn? >- R. I guess one could go either way. The benefit of keeping it in SVN is that it that if any changes are needed, they are in some database until they are pushed upstream. The benefit of just having the kernel.org version being the latest is that there is only one copy and no confusion as to what is the latest version. It seems odd though to have the .h file only in the kernel.org tree and the srp.c files in openib SVN. One could argue that once a component is accepted upstream, that the kernel.org version is the latest, but then it makes tracking in between kernel.org releases a bit more difficult. Anyone else have a comment on this one ? woody From halr at voltaire.com Tue Jan 17 09:00:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 17 Jan 2006 19:00:52 +0200 Subject: [openib-general] [PATCH] Problem with directed route SMPs withbeginning or ending LID routed parts Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABE1@taurus.voltaire.com> Hi Ralph, This is much simpler :-) Thanks! Applied. I tested this both in an operational network with some different topologies as well as passing the previously failed compliance C14-11. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Ralph Campbell Sent: Mon 1/16/2006 2:17 PM To: openib-general at openib.org Subject: [openib-general] [PATCH] Problem with directed route SMPs withbeginning or ending LID routed parts OK. Here is a much simplified patch which fixes the problem of a directed route SMP with a with beginning or ending LID routed part. 
Signed-off-by: Ralph Campbell Index: core/mad.c =================================================================== --- core/mad.c (revision 5030) +++ core/mad.c (working copy) @@ -665,7 +665,15 @@ struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + /* + * Directed route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * If we are at the start of the LID routed part, don't update the + * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. + */ + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == + IB_LID_PERMISSIVE && + !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; -- Ralph Campbell _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bardov at gmail.com Tue Jan 17 09:02:52 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 17 Jan 2006 19:02:52 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the pagesize In-Reply-To: <20060117122455.GR22260@mellanox.co.il> References: <43CCDF87.8020507@voltaire.com> <20060117122455.GR22260@mellanox.co.il> Message-ID: On 1/17/06, Michael S. Tsirkin wrote: > Quoting Or Gerlitz : > > Second and indeed more important, from our experience, there are > > eventually IB consumers such as the Linux SCSI Mid-Layer which sometimes > > generate Scatter-Gather lists that are "RDMA aligned" when treated in a > > resolution different from the system PAGE_SHIFT. Example to that we saw > > with ia64 SLES9 SP1/2 kernels. So if you work in PAGE_SHIFT you can not > > produce one VA for many of the SG submitted by the mid-layer. > > Interesting. 
Where does the mid-layer get the 4K (not PAGE_SIZE) aligned > buffers? Any idea? > Not really. We suspect a different allocation unit in the buffer cache or some file systems not doing what we think they do.. We saw it also on Itanium with 16K page size that was sending 4K sg elements. What it made obvious is that the memory registration restrictions are the HCA restrictions, and those have nothing to do with kernel restrictions, yet the driver code relies on kernel restrictions. Dan > > -- > MST > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue Jan 17 09:15:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:15:31 -0800 Subject: [openib-general] [PATCH] race in pingpong -e In-Reply-To: <1137452559.4520.295.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Mon, 16 Jan 2006 15:02:39 -0800") References: <1137452559.4520.295.camel@brick.internal.keyresearch.com> Message-ID: Thanks, applied. From rdreier at cisco.com Tue Jan 17 09:22:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:22:11 -0800 Subject: [openib-general] Re: [PATCH] ipoib: path->ah In-Reply-To: <20060116153814.GA759@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Jan 2006 17:38:14 +0200") References: <20060116153814.GA759@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Tue Jan 17 09:31:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:31:45 -0800 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: <20060116153717.GA706@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 16 Jan 2006 17:37:17 +0200") References: <20060116153717.GA706@mellanox.co.il> Message-ID: > - while (!skb_queue_empty(&mcast->pkt_queue)) > + while (!skb_queue_empty(&mcast->pkt_queue)) { > + spin_lock_irqsave(&priv->tx_lock, flags); > + ++priv->stats.tx_dropped; > + spin_unlock_irqrestore(&priv->tx_lock, flags); > dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); > + } Any reason to drop the lock every time around this loop? Would it make more sense to count the number of packets and then just add it in after the loop? > + spin_lock_irq(&priv->tx_lock); > while (!skb_queue_empty(&mcast->pkt_queue)) { > struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); > + spin_unlock_irq(&priv->tx_lock); Again, why are we dropping the lock every time through this loop? Is it just to reduce the lock hold time here? - R. From rdreier at cisco.com Tue Jan 17 09:42:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:42:09 -0800 Subject: [openib-general] Re: [PATCH] uverbs: flush scheduled_work In-Reply-To: <20060116154000.GA765@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Jan 2006 17:40:01 +0200") References: <20060116154000.GA765@mellanox.co.il> Message-ID: Thanks, applied. From mst at mellanox.co.il Tue Jan 17 09:50:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 19:50:25 +0200 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: References: Message-ID: <20060117175025.GA11561@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: pkt_queue > > > - while (!skb_queue_empty(&mcast->pkt_queue)) > > + while (!skb_queue_empty(&mcast->pkt_queue)) { > > + spin_lock_irqsave(&priv->tx_lock, flags); > > + ++priv->stats.tx_dropped; > > + spin_unlock_irqrestore(&priv->tx_lock, flags); > > dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); > > + } > > Any reason to drop the lock every time around this loop? 
Would it > make more sense to count the number of packets and then just add it in > after the loop? Makes sense. > > + spin_lock_irq(&priv->tx_lock); > > while (!skb_queue_empty(&mcast->pkt_queue)) { > > struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); > > + spin_unlock_irq(&priv->tx_lock); > > Again, why are we dropping the lock every time through this loop? Is > it just to reduce the lock hold time here? We seem to be doing operations that cant be called under tx_lock a few lines below. -- MST From rdreier at cisco.com Tue Jan 17 09:55:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:55:40 -0800 Subject: [openib-general] Re: [PATCH] fix crash on hotplug In-Reply-To: <20060116154233.GA868@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Jan 2006 17:42:33 +0200") References: <20060116154233.GA868@mellanox.co.il> Message-ID: thanks, applied. From rdreier at cisco.com Tue Jan 17 09:59:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 09:59:56 -0800 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: <20060117175025.GA11561@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jan 2006 19:50:25 +0200") References: <20060117175025.GA11561@mellanox.co.il> Message-ID: Michael> We seem to be doing operations that cant be called under Michael> tx_lock a few lines below. Do you mean dev_queue_xmit()? Can that call directly back into our xmit function (I honestly don't know the locking rules here)? - R. From mst at mellanox.co.il Tue Jan 17 10:03:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 20:03:10 +0200 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: References: Message-ID: <20060117180310.GB11561@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue > > Michael> We seem to be doing operations that cant be called under > Michael> tx_lock a few lines below. > > Do you mean dev_queue_xmit()? 
Can that call directly back into our > xmit function (I honestly don't know the locking rules here)? Yes, exactly. -- MST From mst at mellanox.co.il Tue Jan 17 10:10:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 20:10:48 +0200 Subject: [openib-general] Re: Re: [PATCH] ipoib: pkt_queue In-Reply-To: <20060117180310.GB11561@mellanox.co.il> References: <20060117180310.GB11561@mellanox.co.il> Message-ID: <20060117181048.GC11561@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: Re: [PATCH] ipoib: pkt_queue > > Quoting r. Roland Dreier : > > Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue > > > > Michael> We seem to be doing operations that cant be called under > > Michael> tx_lock a few lines below. > > > > Do you mean dev_queue_xmit()? Can that call directly back into our > > xmit function (I honestly don't know the locking rules here)? > > Yes, exactly. Here it is from net/core/dev.c /** * dev_queue_xmit - transmit a buffer * @skb: buffer to transmit * * Queue a buffer for transmission to a network device. The caller must * have set the device and priority and built the buffer before calling * this function. The function can be called from an interrupt. * * A negative errno code is returned on a failure. A success does not * guarantee the frame will be transmitted as it may be dropped due * to congestion or traffic shaping. * * ----------------------------------------------------------------------------------- * I notice this method can also return errors from the queue disciplines, * including NET_XMIT_DROP, which is a positive value. So, errors can also * be positive. * * Regardless of the return value, the skb is consumed, so it is currently * difficult to retry a send to this method. (You can bump the ref count * before sending to hold a reference for retry if you are careful.) * * When calling this method, interrupts MUST be enabled. 
This is because * the BH enable code must have IRQs enabled so that it will not deadlock. * --BLG */ int dev_queue_xmit(struct sk_buff *skb) { struct net_device *dev = skb->dev; struct Qdisc *q; int rc = -ENOMEM; if (skb_shinfo(skb)->frag_list && !(dev->features & NETIF_F_FRAGLIST) && __skb_linearize(skb, GFP_ATOMIC)) goto out_kfree_skb; /* Fragmented skb is linearized if device does not support SG, * or if at least one of fragments is in highmem and device * does not support DMA from it. */ if (skb_shinfo(skb)->nr_frags && (!(dev->features & NETIF_F_SG) || illegal_highdma(dev, skb)) && __skb_linearize(skb, GFP_ATOMIC)) goto out_kfree_skb; /* If packet is not checksummed and device does not support * checksumming for this protocol, complete checksumming here. */ if (skb->ip_summed == CHECKSUM_HW && (!(dev->features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM)) && (!(dev->features & NETIF_F_IP_CSUM) || skb->protocol != htons(ETH_P_IP)))) if (skb_checksum_help(skb, 0)) goto out_kfree_skb; spin_lock_prefetch(&dev->queue_lock); /* Disable soft irqs for various locks below. Also * stops preemption for RCU. */ local_bh_disable(); /* Updates of qdisc are serialized by queue_lock. * The struct Qdisc which is pointed to by qdisc is now a * rcu structure - it may be accessed without acquiring * a lock (but the structure may be stale.) The freeing of the * qdisc will be deferred until it's known that there are no * more references to it. * * If the qdisc has an enqueue function, we still need to * hold the queue_lock before calling it, since queue_lock * also serializes access to the device queue. */ q = rcu_dereference(dev->qdisc); #ifdef CONFIG_NET_CLS_ACT skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); #endif if (q->enqueue) { /* Grab device queue */ spin_lock(&dev->queue_lock); rc = q->enqueue(skb, q); qdisc_run(dev); spin_unlock(&dev->queue_lock); rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; goto out; } /* The device has no queue. 
Common case for software devices: loopback, all the sorts of tunnels... Really, it is unlikely that xmit_lock protection is necessary here. (f.e. loopback and IP tunnels are clean ignoring statistics counters.) However, it is possible, that they rely on protection made by us here. Check this and shot the lock. It is not prone from deadlocks. Either shot noqueue qdisc, it is even simpler 8) */ if (dev->flags & IFF_UP) { int cpu = smp_processor_id(); /* ok because BHs are off */ if (dev->xmit_lock_owner != cpu) { HARD_TX_LOCK(dev, cpu); if (!netif_queue_stopped(dev)) { if (netdev_nit) dev_queue_xmit_nit(skb, dev); rc = 0; if (!dev->hard_start_xmit(skb, dev)) { HARD_TX_UNLOCK(dev); goto out; } } HARD_TX_UNLOCK(dev); if (net_ratelimit()) printk(KERN_CRIT "Virtual device %s asks to " "queue packet!\n", dev->name); } else { /* Recursion is detected! It is possible, * unfortunately */ if (net_ratelimit()) printk(KERN_CRIT "Dead loop on virtual device " "%s, fix it urgently!\n", dev->name); } } rc = -ENETDOWN; local_bh_enable(); out_kfree_skb: kfree_skb(skb); return rc; out: local_bh_enable(); return rc; } -- MST From mst at mellanox.co.il Tue Jan 17 11:12:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 21:12:40 +0200 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: References: Message-ID: <20060117191240.GA12211@mellanox.co.il> Quoting r. Roland Dreier : > Any reason to drop the lock every time around this loop? Would it > make more sense to count the number of packets and then just add it in > after the loop? Is this better? -- Protect accesses to mcast->pkt_queue by tx_lock. Count multicast packets removed from pkt_queue as dropped. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-15 16:14:00.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-17 21:11:39.000000000 +0200 @@ -97,6 +97,7 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; + int tx_dropped = 0; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -123,8 +124,13 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + ++tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } + spin_lock_irqsave(&priv->tx_lock, flags); + priv->stats.tx_dropped += tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); kfree(mcast); } @@ -276,8 +282,10 @@ static int ipoib_mcast_join_finish(struc } /* actually send any queued packets */ + spin_lock_irq(&priv->tx_lock); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + spin_unlock_irq(&priv->tx_lock); skb->dev = dev; @@ -288,7 +296,9 @@ static int ipoib_mcast_join_finish(struc if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + spin_lock_irq(&priv->tx_lock); } + spin_unlock_irq(&priv->tx_lock); return 0; } @@ -300,6 +310,7 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -310,8 +321,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + spin_lock_irq(&priv->tx_lock); + 
while (!skb_queue_empty(&mcast->pkt_queue)) { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } + spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -687,6 +702,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -700,8 +716,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " -- MST From mst at mellanox.co.il Tue Jan 17 11:21:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 21:21:07 +0200 Subject: [openib-general] ipoib_mcast_send.patch In-Reply-To: <20060112213248.GH9256@mellanox.co.il> References: <20060112213248.GH9256@mellanox.co.il> Message-ID: <20060117192107.GA12456@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: Re: ipoib: outstanding patches > > Quoting Roland Dreier : > > > ipoib_mcast_send.patch > > > > Could we reuse the IPOIB_MCAST_RUN bit rather than adding a new bit? > > It seems that we could kill mcast_mutex and replace uses with > > priv->lock instead -- I don't see anything that sleeps inside mcast_mutex. > > Yes, I now believe that we should be able to do it this way. Something like this? --- Fix the following race scenario: Device is up. Port event or set mcast list triggers ipoib_mcast_stop_thread, this cancels the query and waits on mcast "done" completion. Completion is called and "done" is set. Meanwhile, ipoib_mcast_send arrives and starts a new query, re-initializing "done". 
Further, there's an additional issue that I saw in testing: ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if the device was downed and then upped internally because of a port event). If this happens and the sendonly join request gets completed before priv->broadcast is set, we get an oops. ---- Do not send multicasts if mcast thread is stopped or if priv->broadcast is not set. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-17 21:13:43.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-17 21:18:36.000000000 +0200 @@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level, "Enable multicast debug tracing if > 0"); #endif -static DEFINE_MUTEX(mcast_mutex); - /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; @@ -385,10 +383,10 @@ static void ipoib_mcast_join_complete(in if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { mcast->backoff = 1; - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); complete(&mcast->done); return; } @@ -418,7 +416,7 @@ static void ipoib_mcast_join_complete(in mcast->query = NULL; - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) queue_work(ipoib_workqueue, &priv->mcast_task); @@ -427,7 +425,7 @@ static void ipoib_mcast_join_complete(in mcast->backoff * HZ); } else complete(&mcast->done); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); return; } @@ -482,12 +480,12 @@ static void ipoib_mcast_join(struct net_ if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = 
IPOIB_MAX_BACKOFF_SECONDS; - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, mcast->backoff * HZ); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); } else mcast->query_id = ret; } @@ -520,11 +518,11 @@ void ipoib_mcast_join_task(void *dev_ptr priv->broadcast = ipoib_mcast_alloc(dev, 1); if (!priv->broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, HZ); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); return; } @@ -580,10 +578,10 @@ int ipoib_mcast_start_thread(struct net_ ipoib_dbg_mcast(priv, "starting multicast thread\n"); - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); return 0; } @@ -595,10 +593,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); - mutex_lock(&mcast_mutex); + spin_lock_irq(&priv->lock); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); - mutex_unlock(&mcast_mutex); + spin_unlock_irq(&priv->lock); if (flush) flush_workqueue(ipoib_workqueue); @@ -677,6 +675,11 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags) || !priv->broadcast) { + dev_kfree_skb_any(skb); + goto unlock; + } + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -735,6 +738,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } -- MST From ardavis at ichips.intel.com Tue Jan 17 11:28:03 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 17 Jan 2006 
11:28:03 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 extension proposal In-Reply-To: References: Message-ID: <43CD4543.10201@ichips.intel.com> Kanevsky, Arkady wrote: > Arlin, > > 1. Does it mean that existing DAT providers will have to be modified > so they report > DAT_NOT_IMPLEMENTED for each extension? No. During the open, a DAT library built to support extensions makes a query call to verify that the provider supports extensions and sets a global flag accordingly. This flag is checked via our single dat_extension call in dat_api. Take a look at the patch for all the details. > > 2. Why is there DAT_INVALID in DAT_DTOS? No reason. I can get rid of it. I will go ahead and keep this in sync with the latest 1.3 (2.0) definitions. > > 3. Do you want to use DAT_EXTENSION_DATA or DAT_EXT_DATA? sure. > > 4. The proposed operations are operation on EP and they are DTOs. > Why not define DAT_DTO_EXT_OP instead of DAT_EXT_OP? Yes, it makes more sense if we decide to limit these extensions to DTO types. > > MY concern is that if these are not DTO then we have a new event > stream type > for "extensions" and we need to define rules for this event stream > including > ordering rules and interactions with other event streams, provider > attributes > for stream mixing and so on... > > If we restrict extensions to DTO operation extension we avoid all > these issues > and simplify APIs. On the negative side these extension are restrictive. I have no problem limiting this proposal and work to DTO extensions. However, we should get consensus on this. > > 5. Memory protection extension for atomic operations > > 6. error returns for extensions? Yes and yes; I will work these into the next patch and update the proposal. 
-arlin From bos at serpentine.com Tue Jan 17 11:33:41 2006 From: bos at serpentine.com (Bryan O'Sullivan) Date: Tue, 17 Jan 2006 11:33:41 -0800 Subject: [openib-general] Code with questionable license in OpenIB tree Message-ID: <1137526421.2527.12.camel@serpentine.pathscale.com> Hi, Michael - I have found some code in the OpenIB Subversion repo that appears to have been committed by you, and which has Mellanox proprietary licenses in the header files. Most of the files in the src/userspace/imgen directory contain the following boilerplate text: * - Mellanox Confidential and Proprietary - * * Copyright (C) July 2002, Mellanox Technologies Ltd. ALL RIGHTS RESERVED. * * Except as specifically permitted herein, no portion of the information, * including but not limited to object code and source code, may be reproduced, * modified, distributed, republished or otherwise exploited in any form or by * any means for any purpose without the prior written permission of Mellanox * Technologies Ltd. Use of software subject to the terms and conditions * detailed in the file "LICENSE.txt". There is no LICENSE.txt file in that portion of the tree. The only file by that name anywhere in the tree is a copy of the Common Public License in src/userspace/dapl/LICENSE.txt. However, it is not at all clear that this is the license that you intended to reference. I would appreciate a modification of the licensing language on those files, to something that is more in line with the rest of the openib.org tree (i.e. open to free redistribution and modification). Regards, References: <43CD4543.10201@ichips.intel.com> Message-ID: <43CD4AE6.5020509@ichips.intel.com> Arlin Davis wrote: > Kanevsky, Arkady wrote: > >> >> 5. Memory protection extension for atomic operations >> >> 6. error returns for extensions? > > > yes and yes; I will work these into the next patch and update the > proposal. 
For error returns I am thinking about carving up the return type, adding a new mask, and an extension get-type macro. Suggestions on carving up the following? Carve into type or subtype? Other suggestions? Current layout: DAT_RETURN_CLASS in bits 31-30, DAT_RETURN_TYPE in bits 29-16, DAT_RETURN_SUBTYPE in bits 15-0. -arlin From rdreier at cisco.com Tue Jan 17 11:54:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 11:54:22 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060117192107.GA12456@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jan 2006 21:21:07 +0200") References: <20060112213248.GH9256@mellanox.co.il> <20060117192107.GA12456@mellanox.co.il> Message-ID: Does this actually work? > + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags) || !priv->broadcast) { > + dev_kfree_skb_any(skb); > + goto unlock; > + } It seems that this code at the end of ipoib_mcast_join_task() might screw things up: ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); clear_bit(IPOIB_MCAST_RUN, &priv->flags); Probably the semantics of IPOIB_MCAST_RUN need to change slightly. I'm not sure this necessarily can be made to work -- maybe we just need more than one bit of status information to handle everything. Also should we count dropped packets here? - R. From rdreier at cisco.com Tue Jan 17 12:19:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 12:19:50 -0800 Subject: [openib-general] Re: [PATCH] ipoib: pkt_queue In-Reply-To: <20060117191240.GA12211@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jan 2006 21:12:40 +0200") References: <20060117191240.GA12211@mellanox.co.il> Message-ID: Yes, looks good ... 
committed From rdreier at cisco.com Tue Jan 17 12:23:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 12:23:19 -0800 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the page size In-Reply-To: (Or Gerlitz's message of "Tue, 17 Jan 2006 12:05:08 +0200 (IST)") References: Message-ID: Seems reasonable. Unfortunately we just missed the 2.6.16-rc1 window so I think this should wait for the 2.6.17 window. BTW, do you ever see the SCSI layer giving you 512 byte blocks? - R. From mst at mellanox.co.il Tue Jan 17 12:31:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 22:31:36 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060117203136.GB12484@mellanox.co.il> Quoting Roland Dreier : > It seems that this code at the end of ipoib_mcast_join_task() might > screw things up: > > ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); > > clear_bit(IPOIB_MCAST_RUN, &priv->flags); Right. That probably was the reason I invented MCAST_STARTED. > Probably the semantics of IPOIB_MCAST_RUN need to change slightly. > I'm not sure this necessarily can be made to work -- maybe we just > need more than one bit of status information to handle everything. Kind of like what original patch in svn does? -- MST From rdreier at cisco.com Tue Jan 17 12:36:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jan 2006 12:36:52 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060117203136.GB12484@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jan 2006 22:31:36 +0200") References: <20060117203136.GB12484@mellanox.co.il> Message-ID: Michael> Kind of like what original patch in svn does? Yeah -- my original question about reusing the MCAST_RUN bit was an honest question -- and it seems the answer is probably, "no, we need another bit to make it work." 
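A minimal sketch of why one bit is probably not enough, written as plain userspace C rather than driver code. The flag names MCAST_RUN and MCAST_STARTED, the struct, and the helper functions are all hypothetical stand-ins (a plain bitmask replaces the kernel's test_bit/clear_bit), and the actual semantics of the bit in the svn patch are not shown in this thread. The point it illustrates is the one raised above: the join task clears the RUN bit once every group is joined, so a send path gated on RUN alone would refuse to send exactly when joining has succeeded.

```c
#include <stdbool.h>

/* Hypothetical flag bits; names are illustrative, not the driver's. */
enum {
	MCAST_RUN     = 1u << 0, /* join task still has work queued */
	MCAST_STARTED = 1u << 1, /* multicast thread started, not yet stopped */
};

struct priv {
	unsigned int flags;
	bool have_broadcast;	/* stands in for priv->broadcast != NULL */
};

static void mcast_start(struct priv *p)
{
	p->flags |= MCAST_RUN | MCAST_STARTED;
}

/* Mirrors the end of the join task: all groups joined, RUN is cleared. */
static void mcast_join_done(struct priv *p)
{
	p->have_broadcast = true;
	p->flags &= ~MCAST_RUN;
}

static void mcast_stop(struct priv *p)
{
	p->flags &= ~(MCAST_RUN | MCAST_STARTED);
}

/* Gate on RUN only: wrongly refuses to send after a successful join. */
static bool may_send_one_bit(const struct priv *p)
{
	return (p->flags & MCAST_RUN) && p->have_broadcast;
}

/* Gate on STARTED: stays true from start until stop. */
static bool may_send_two_bits(const struct priv *p)
{
	return (p->flags & MCAST_STARTED) && p->have_broadcast;
}
```

With the second bit, the send check tracks "started and not stopped" independently of whether the join task currently has anything left to do.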
It seems that killing mcast_mutex might be a good, independent cleanup. From mst at mellanox.co.il Tue Jan 17 12:41:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jan 2006 22:41:18 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060117204118.GA12964@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] Re: ipoib_mcast_send.patch > > Michael> Kind of like what original patch in svn does? > > Yeah -- my original question about reusing the MCAST_RUN bit was an > honest question -- and it seems the answer is probably, "no, we need > another bit to make it work." > > It seems that killing mcast_mutex might be a good, independent cleanup. pkey_mutex too. -- MST From ralphc at pathscale.com Tue Jan 17 13:55:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 17 Jan 2006 13:55:02 -0800 Subject: [openib-general] [PATCH] fix minor typo in SDP Message-ID: <1137534902.4520.345.camel@brick.internal.keyresearch.com> This patch fixes a minor misspelling in SDP. Signed-off-by: Ralph Campbell Index: ulp/sdp/sdp_inet.c =================================================================== --- ulp/sdp/sdp_inet.c (revision 5055) +++ ulp/sdp/sdp_inet.c (working copy) @@ -836,7 +836,7 @@ /* * file and/or wait can be NULL, once poll is asleep and needs to - * recheck the falgs on being woken. + * recheck the flags on being woken. 
*/ sk = sock->sk; conn = sdp_sk(sk); Index: ulp/sdp/sdp_recv.c =================================================================== --- ulp/sdp/sdp_recv.c (revision 5055) +++ ulp/sdp/sdp_recv.c (working copy) @@ -1234,7 +1234,7 @@ sk = sock->sk; conn = sdp_sk(sk); - sdp_dbg_data(conn, "state <%08x> size <%Zu> pending <%d> falgs <%08x>", + sdp_dbg_data(conn, "state <%08x> size <%Zu> pending <%d> flags <%08x>", conn->state, size, conn->byte_strm, flags); sdp_dbg_data(conn, "read IOCB <%d> addr <%p> users <%d> flags <%08lx>", req->ki_key, msg->msg_iov->iov_base, -- Ralph Campbell From mst at mellanox.co.il Tue Jan 17 14:20:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 00:20:21 +0200 Subject: [openib-general] [PATCH 1 of 3] move destructor to struct neigh_parms In-Reply-To: <20060112162438.GO16938@mellanox.co.il> References: <20060112162438.GO16938@mellanox.co.il> Message-ID: <20060117222021.GA13186@mellanox.co.il> Quoting Michael S. Tsirkin : > Subject: [PATCH 1 of 3] move destructor to struct neigh_parms > > This is an alternative approach to the one presented in > ipoib_all_neigh_issues_2.patch. > > --- > > Move destructor from neigh_ops (which is shared between devices) > to neigh_parms which is not, so that multiple drivers can set > it safely. > > Signed-off-by: Michael S. 
Tsirkin > > Index: linux-2.6.15/net/core/neighbour.c > =================================================================== > --- linux-2.6.15.orig/net/core/neighbour.c 2006-01-12 11:58:15.000000000 +0200 > +++ linux-2.6.15/net/core/neighbour.c 2006-01-12 20:10:00.000000000 +0200 > @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei > kfree(hh); > } > > - if (neigh->ops && neigh->ops->destructor) > - (neigh->ops->destructor)(neigh); > + if (neigh->parms->neigh_destructor) > + (neigh->parms->neigh_destructor)(neigh); > > skb_queue_purge(&neigh->arp_queue); > > Index: linux-2.6.15/include/net/neighbour.h > =================================================================== > --- linux-2.6.15.orig/include/net/neighbour.h 2006-01-03 05:21:10.000000000 +0200 > +++ linux-2.6.15/include/net/neighbour.h 2006-01-12 20:09:27.000000000 +0200 > @@ -68,6 +68,7 @@ struct neigh_parms > struct net_device *dev; > struct neigh_parms *next; > int (*neigh_setup)(struct neighbour *); > + void (*neigh_destructor)(struct neighbour *); > struct neigh_table *tbl; > > void *sysctl_table; > @@ -145,7 +146,6 @@ struct neighbour > struct neigh_ops > { > int family; > - void (*destructor)(struct neighbour *); > void (*solicit)(struct neighbour *, struct sk_buff*); > void (*error_report)(struct neighbour *, struct sk_buff*); > int (*output)(struct sk_buff*); > Roland, what do you say to this approach? We still could try this for 2.6.16, couldn't we? It's small and the interface is unused in kernel except by us. Otherwise for 2.6.16 and earlier we'll have to maintain the global list of neighbours along the lines of ipoib_all_neigh_issues_2.patch. 
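The motivation for the move can be sketched in userspace C, assuming simplified stand-in structs rather than the kernel's real definitions. The two drivers and their counters are hypothetical. An ops table shared between devices has a single destructor slot, so whichever driver sets it last wins; per-device parms give each driver its own slot.

```c
#include <stddef.h>

struct neighbour;

/* Shared between devices in this model: one destructor slot for everyone. */
struct neigh_ops {
	void (*destructor)(struct neighbour *);
};

/* One instance per device: each driver can set its own hook. */
struct neigh_parms {
	void (*neigh_destructor)(struct neighbour *);
};

struct neighbour {
	struct neigh_ops *ops;
	struct neigh_parms *parms;
};

/* Call counters for two hypothetical drivers. */
static int driver_a_calls, driver_b_calls;

static void driver_a_destructor(struct neighbour *n) { (void)n; driver_a_calls++; }
static void driver_b_destructor(struct neighbour *n) { (void)n; driver_b_calls++; }

/* Old scheme: hook looked up through the shared ops table. */
static void destroy_via_ops(struct neighbour *n)
{
	if (n->ops && n->ops->destructor)
		n->ops->destructor(n);
}

/* Proposed scheme: hook looked up through the per-device parms. */
static void destroy_via_parms(struct neighbour *n)
{
	if (n->parms->neigh_destructor)
		n->parms->neigh_destructor(n);
}
```

In this model, a neighbour belonging to driver A that is destroyed through the shared ops table ends up in driver B's hook if B registered last, while the parms lookup always reaches the hook of the owning device.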
-- MST From sean.hefty at intel.com Tue Jan 17 15:16:04 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:16:04 -0800 Subject: [openib-general] [PATCH 0/5] [RFC] Infiniband: connection abstraction Message-ID: The following set of patches defines a connection abstraction for Infiniband and other RDMA devices, and serves several purposes: * It implements a connection protocol over Infiniband based on IP addressing. This greatly simplifies clients wishing to establish connections over Infiniband. * It defines a connection abstraction that works over multiple RDMA devices. The submitted implementation targets Infiniband, but has been tested over other RDMA devices as well. * It handles RDMA device insertion and removal on behalf of its clients. The changes have been broken into 5 separate patches. The basic purpose of each patch is: 1. Provide common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. 2. Extend the Infiniband CM to include private data comparisons as part of its connection request matching process. 3. Provide an address translation service that maps IP addresses to Infiniband addresses (GIDs). This patch touches outside of the Infiniband core, so I'm including the netdev mailing list. 4. Implement the kernel mode RDMA connection management agent. 5. Implement the userspace RDMA connection management agent kernel support module. Please copy the openib-general mailing list on any replies. Thanks, Sean From sean.hefty at intel.com Tue Jan 17 15:21:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:21:38 -0800 Subject: [openib-general] [PATCH 1/5] [RFC] Infiniband: connection abstraction In-Reply-To: Message-ID: The following patch provides common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. 
Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 10:25:27.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 15:34:15.000000000 -0800 @@ -16,4 +16,5 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 15:34:15.000000000 -0800 @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $ + * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); @@ -203,36 +204,6 @@ error: return NULL; } -static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, - struct ib_sa_path_rec *kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); - memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit = kpath->hop_limit; - upath->traffic_class = kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path = kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector = 
kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { @@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct ureq->srq = kreq->srq; ureq->port = kreq->port; - ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); - ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); + ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path); + if (kreq->alternate_path) + ib_copy_path_rec_to_user(&ureq->alternate_path, + kreq->alternate_path); } static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, @@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i info = evt->param.rej_rcvd.ari; break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, - evt->param.lap_rcvd.alternate_path); + ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path, + evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; @@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } -static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, - struct ib_ah_attr *src_attr) -{ - memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, - sizeof src_attr->grh.dgid); - dest_attr->grh_flow_label = src_attr->grh.flow_label; - dest_attr->grh_sgid_index = src_attr->grh.sgid_index; - dest_attr->grh_hop_limit = src_attr->grh.hop_limit; - dest_attr->grh_traffic_class = src_attr->grh.traffic_class; - - dest_attr->dlid = src_attr->dlid; - dest_attr->sl = src_attr->sl; - dest_attr->src_path_bits = src_attr->src_path_bits; - dest_attr->static_rate = src_attr->static_rate; - dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH); - dest_attr->port_num = src_attr->port_num; -} - 
-static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, - struct ib_qp_attr *src_attr) -{ - dest_attr->cur_qp_state = src_attr->cur_qp_state; - dest_attr->path_mtu = src_attr->path_mtu; - dest_attr->path_mig_state = src_attr->path_mig_state; - dest_attr->qkey = src_attr->qkey; - dest_attr->rq_psn = src_attr->rq_psn; - dest_attr->sq_psn = src_attr->sq_psn; - dest_attr->dest_qp_num = src_attr->dest_qp_num; - dest_attr->qp_access_flags = src_attr->qp_access_flags; - - dest_attr->max_send_wr = src_attr->cap.max_send_wr; - dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; - dest_attr->max_send_sge = src_attr->cap.max_send_sge; - dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; - dest_attr->max_inline_data = src_attr->cap.max_inline_data; - - ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); - ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); - - dest_attr->pkey_index = src_attr->pkey_index; - dest_attr->alt_pkey_index = src_attr->alt_pkey_index; - dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; - dest_attr->sq_draining = src_attr->sq_draining; - dest_attr->max_rd_atomic = src_attr->max_rd_atomic; - dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; - dest_attr->min_rnr_timer = src_attr->min_rnr_timer; - dest_attr->port_num = src_attr->port_num; - dest_attr->timeout = src_attr->timeout; - dest_attr->retry_cnt = src_attr->retry_cnt; - dest_attr->rnr_retry = src_attr->rnr_retry; - dest_attr->alt_port_num = src_attr->alt_port_num; - dest_attr->alt_timeout = src_attr->alt_timeout; -} - static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) { - struct ib_ucm_init_qp_attr_resp resp; + struct ib_uverbs_qp_attr resp; struct ib_ucm_init_qp_attr cmd; struct ib_ucm_context *ctx; struct ib_qp_attr qp_attr; @@ -716,7 +635,7 @@ static ssize_t ib_ucm_init_qp_attr(struc if (result) goto out; - ib_ucm_copy_qp_attr(&resp, &qp_attr); + 
ib_copy_qp_attr_to_user(&resp, &qp_attr); if (copy_to_user((void __user *)(unsigned long)cmd.response, &resp, sizeof(resp))) @@ -791,7 +710,7 @@ static int ib_ucm_alloc_data(const void static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) { - struct ib_ucm_path_rec ucm_path; + struct ib_user_path_rec upath; struct ib_sa_path_rec *sa_path; *path = NULL; @@ -803,36 +722,14 @@ static int ib_ucm_path_get(struct ib_sa_ if (!sa_path) return -ENOMEM; - if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, - sizeof(ucm_path))) { + if (copy_from_user(&upath, (void __user *)(unsigned long)src, + sizeof(upath))) { kfree(sa_path); return -EFAULT; } - memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof sa_path->dgid); - memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof sa_path->sgid); - - sa_path->dlid = ucm_path.dlid; - sa_path->slid = ucm_path.slid; - sa_path->raw_traffic = ucm_path.raw_traffic; - sa_path->flow_label = ucm_path.flow_label; - sa_path->hop_limit = ucm_path.hop_limit; - sa_path->traffic_class = ucm_path.traffic_class; - sa_path->reversible = ucm_path.reversible; - sa_path->numb_path = ucm_path.numb_path; - sa_path->pkey = ucm_path.pkey; - sa_path->sl = ucm_path.sl; - sa_path->mtu_selector = ucm_path.mtu_selector; - sa_path->mtu = ucm_path.mtu; - sa_path->rate_selector = ucm_path.rate_selector; - sa_path->rate = ucm_path.rate; - sa_path->packet_life_time = ucm_path.packet_life_time; - sa_path->preference = ucm_path.preference; - - sa_path->packet_life_time_selector = - ucm_path.packet_life_time_selector; - + ib_copy_path_rec_from_user(sa_path, &upath); *path = sa_path; return 0; } @@ -1243,8 +1140,10 @@ static unsigned int ib_ucm_poll(struct f poll_wait(filp, &file->poll_wait, wait); + down(&file->mutex); if (!list_empty(&file->events)) mask = POLLIN | POLLRDNORM; + up(&file->mutex); return mask; } diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/uverbs_marshall.c 
linux-2.6.ib/drivers/infiniband/core/uverbs_marshall.c --- linux-2.6.git/drivers/infiniband/core/uverbs_marshall.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/uverbs_marshall.c 2006-01-16 15:34:15.000000000 -0800 @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src) +{ + memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof src->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->is_global = src->ah_flags & IB_AH_GRH ? 1 : 0; + dst->port_num = src->port_num; +} + +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src) +{ + dst->cur_qp_state = src->cur_qp_state; + dst->path_mtu = src->path_mtu; + dst->path_mig_state = src->path_mig_state; + dst->qkey = src->qkey; + dst->rq_psn = src->rq_psn; + dst->sq_psn = src->sq_psn; + dst->dest_qp_num = src->dest_qp_num; + dst->qp_access_flags = src->qp_access_flags; + + dst->max_send_wr = src->cap.max_send_wr; + dst->max_recv_wr = src->cap.max_recv_wr; + dst->max_send_sge = src->cap.max_send_sge; + dst->max_recv_sge = src->cap.max_recv_sge; + dst->max_inline_data = src->cap.max_inline_data; + + ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr); + ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr); + + dst->pkey_index = src->pkey_index; + dst->alt_pkey_index = src->alt_pkey_index; + dst->en_sqd_async_notify = src->en_sqd_async_notify; + dst->sq_draining = src->sq_draining; + dst->max_rd_atomic = src->max_rd_atomic; + dst->max_dest_rd_atomic = src->max_dest_rd_atomic; + dst->min_rnr_timer = src->min_rnr_timer; + dst->port_num = src->port_num; + dst->timeout = src->timeout; + dst->retry_cnt = src->retry_cnt; + dst->rnr_retry = src->rnr_retry; + dst->alt_port_num = src->alt_port_num; + dst->alt_timeout = src->alt_timeout; +} +EXPORT_SYMBOL(ib_copy_qp_attr_to_user); + +void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, + struct ib_sa_path_rec *src) +{ + 
memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); + memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} +EXPORT_SYMBOL(ib_copy_path_rec_to_user); + +void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst, + struct ib_user_path_rec *src) +{ + memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); + memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} +EXPORT_SYMBOL(ib_copy_path_rec_from_user); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_marshall.h linux-2.6.ib/include/rdma/ib_marshall.h --- linux-2.6.git/include/rdma/ib_marshall.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_marshall.h 2006-01-16 15:34:15.000000000 -0800 @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#if !defined(IB_USER_MARSHALL_H) +#define IB_USER_MARSHALL_H + +#include +#include +#include +#include + +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src); + +void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, + struct ib_sa_path_rec *src); + +void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst, + struct ib_user_path_rec *src); + +#endif /* IB_USER_MARSHALL_H */ diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_cm.h linux-2.6.ib/include/rdma/ib_user_cm.h --- linux-2.6.git/include/rdma/ib_user_cm.h 2006-01-16 10:26:47.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_user_cm.h 2006-01-16 15:34:15.000000000 -0800 @@ -30,13 +30,13 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_cm.h 2576 2005-06-09 17:00:30Z libor $ + * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_CM_H #define IB_USER_CM_H -#include +#include #define IB_USER_CM_ABI_VERSION 4 @@ -110,58 +110,6 @@ struct ib_ucm_init_qp_attr { __u32 qp_state; }; -struct ib_ucm_ah_attr { - __u8 grh_dgid[16]; - __u32 grh_flow_label; - __u16 dlid; - __u16 reserved; - __u8 grh_sgid_index; - __u8 grh_hop_limit; - __u8 grh_traffic_class; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; -}; - -struct ib_ucm_init_qp_attr_resp { - __u32 qp_attr_mask; - __u32 qp_state; - __u32 cur_qp_state; - __u32 path_mtu; - __u32 path_mig_state; - __u32 qkey; - __u32 rq_psn; - __u32 sq_psn; - __u32 dest_qp_num; - __u32 qp_access_flags; - - struct ib_ucm_ah_attr ah_attr; - struct ib_ucm_ah_attr alt_ah_attr; - - /* ib_qp_cap */ - __u32 max_send_wr; - __u32 max_recv_wr; - __u32 max_send_sge; - __u32 max_recv_sge; - __u32 max_inline_data; - - __u16 pkey_index; - __u16 alt_pkey_index; - __u8 en_sqd_async_notify; - __u8 sq_draining; - __u8 max_rd_atomic; - __u8 max_dest_rd_atomic; - __u8 min_rnr_timer; - __u8 port_num; - __u8 timeout; 
- __u8 retry_cnt; - __u8 rnr_retry; - __u8 alt_port_num; - __u8 alt_timeout; -}; - struct ib_ucm_listen { __be64 service_id; __be64 service_mask; @@ -180,28 +128,6 @@ struct ib_ucm_private_data { __u8 reserved[3]; }; -struct ib_ucm_path_rec { - __u8 dgid[16]; - __u8 sgid[16]; - __be16 dlid; - __be16 slid; - __u32 raw_traffic; - __be32 flow_label; - __u32 reversible; - __u32 mtu; - __be16 pkey; - __u8 hop_limit; - __u8 traffic_class; - __u8 numb_path; - __u8 sl; - __u8 mtu_selector; - __u8 rate_selector; - __u8 rate; - __u8 packet_life_time_selector; - __u8 packet_life_time; - __u8 preference; -}; - struct ib_ucm_req { __u32 id; __u32 qpn; @@ -304,8 +230,8 @@ struct ib_ucm_event_get { }; struct ib_ucm_req_event_resp { - struct ib_ucm_path_rec primary_path; - struct ib_ucm_path_rec alternate_path; + struct ib_user_path_rec primary_path; + struct ib_user_path_rec alternate_path; __be64 remote_ca_guid; __u32 remote_qkey; __u32 remote_qpn; @@ -349,7 +275,7 @@ struct ib_ucm_mra_event_resp { }; struct ib_ucm_lap_event_resp { - struct ib_ucm_path_rec path; + struct ib_user_path_rec path; }; struct ib_ucm_apr_event_resp { diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_sa.h linux-2.6.ib/include/rdma/ib_user_sa.h --- linux-2.6.git/include/rdma/ib_user_sa.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_user_sa.h 2006-01-16 15:34:15.000000000 -0800 @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef IB_USER_SA_H +#define IB_USER_SA_H + +#include + +struct ib_user_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __be16 dlid; + __be16 slid; + __u32 raw_traffic; + __be32 flow_label; + __u32 reversible; + __u32 mtu; + __be16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +#endif /* IB_USER_SA_H */ diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_verbs.h linux-2.6.ib/include/rdma/ib_user_verbs.h --- linux-2.6.git/include/rdma/ib_user_verbs.h 2006-01-16 10:26:47.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_user_verbs.h 2006-01-16 15:34:15.000000000 -0800 @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_verbs.h 2708 2005-06-24 17:27:21Z roland $ + * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_VERBS_H @@ -311,6 +311,64 @@ struct ib_uverbs_destroy_cq_resp { __u32 async_events_reported; }; +struct ib_uverbs_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ib_uverbs_ah_attr { + struct ib_uverbs_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_qp_attr { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ib_uverbs_ah_attr ah_attr; + struct ib_uverbs_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 
max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 reserved[5]; +}; + struct ib_uverbs_create_qp { __u64 response; __u64 user_handle; @@ -487,26 +545,6 @@ struct ib_uverbs_post_srq_recv_resp { __u32 bad_wr; }; -struct ib_uverbs_global_route { - __u8 dgid[16]; - __u32 flow_label; - __u8 sgid_index; - __u8 hop_limit; - __u8 traffic_class; - __u8 reserved; -}; - -struct ib_uverbs_ah_attr { - struct ib_uverbs_global_route grh; - __u16 dlid; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; - __u8 reserved; -}; - struct ib_uverbs_create_ah { __u64 response; __u64 user_handle; From sean.hefty at intel.com Tue Jan 17 15:24:37 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:24:37 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: Message-ID: The following patch extends matching connection requests to listens in the Infiniband CM to include private data. Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c --- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.000000000 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $ + * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -130,6 +130,7 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_private_data_compare *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,40 @@ static struct cm_id_private * cm_acquire return cm_id_priv; } +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} + +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_compare_data(src, src_data->data, dst_data->mask); + cm_mask_compare_data(dst, dst_data->data, src_data->mask); + return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_compare_data(src, private_data, dst_data->mask); + return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - 
(cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ } static struct cm_id_private * cm_find_listen(struct ib_device *device, - __be64 service_id) + __be64 service_id, + u8 *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device)) + (cm_id_priv->id.device == device) && !data_cmp) return cm_id_priv; if (device < cm_id_priv->id.device) @@ -405,6 +452,10 @@ static struct cm_id_private * cm_find_li node = node->rb_right; else if (service_id < cm_id_priv->id.service_id) node = node->rb_left; + else if (service_id > cm_id_priv->id.service_id) + node = node->rb_right; + else if (data_cmp < 0) + node = node->rb_left; else node = node->rb_right; } @@ -728,15 +779,14 @@ retest: wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount)); while ((work = cm_dequeue_work(cm_id_priv)) != NULL) cm_free_work(work); - if (cm_id_priv->private_data && cm_id_priv->private_data_len) - kfree(cm_id_priv->private_data); + kfree(cm_id_priv->compare_data); + kfree(cm_id_priv->private_data); kfree(cm_id_priv); } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, - __be64 service_id, - 
__be64 service_mask) +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_private_data_compare *compare_data) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; unsigned long flags; @@ -750,7 +800,19 @@ int ib_cm_listen(struct ib_cm_id *cm_id, return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - BUG_ON(cm_id->state != IB_CM_IDLE); + if (cm_id->state != IB_CM_IDLE) + return -EINVAL; + + if (compare_data) { + cm_id_priv->compare_data = kzalloc(sizeof *compare_data, + GFP_KERNEL); + if (!cm_id_priv->compare_data) + return -ENOMEM; + cm_mask_compare_data(cm_id_priv->compare_data->data, + compare_data->data, compare_data->mask); + memcpy(cm_id_priv->compare_data->mask, compare_data->mask, + IB_CM_PRIVATE_DATA_COMPARE_SIZE); + } cm_id->state = IB_CM_LISTEN; @@ -767,6 +829,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, if (cur_cm_id_priv) { cm_id->state = IB_CM_IDLE; + kfree(cm_id_priv->compare_data); + cm_id_priv->compare_data = NULL; ret = -EBUSY; } return ret; @@ -1239,7 +1303,8 @@ static struct cm_id_private * cm_match_r /* Find matching listen request. */ listen_cm_id_priv = cm_find_listen(cm_id_priv->id.device, - req_msg->service_id); + req_msg->service_id, + req_msg->private_data); if (!listen_cm_id_priv) { spin_unlock_irqrestore(&cm.lock, flags); cm_issue_rej(work->port, work->mad_recv_wc, @@ -2646,7 +2711,8 @@ static int cm_sidr_req_handler(struct cm goto out; /* Duplicate message. 
*/ } cur_cm_id_priv = cm_find_listen(cm_id->device, - sidr_req_msg->service_id); + sidr_req_msg->service_id, + sidr_req_msg->private_data); if (!cur_cm_id_priv) { rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table); spin_unlock_irqrestore(&cm.lock, flags); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 16:03:08.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 16:03:35.000000000 -0800 @@ -646,6 +646,17 @@ out: return result; } +static int ucm_validate_listen(__be64 service_id, __be64 service_mask) +{ + service_id &= service_mask; + + if (((service_id & IB_CMA_SERVICE_ID_MASK) == IB_CMA_SERVICE_ID) || + ((service_id & IB_SDP_SERVICE_ID_MASK) == IB_SDP_SERVICE_ID)) + return -EINVAL; + + return 0; +} + static ssize_t ib_ucm_listen(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) @@ -661,7 +672,13 @@ static ssize_t ib_ucm_listen(struct ib_u if (IS_ERR(ctx)) return PTR_ERR(ctx); - result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask); + result = ucm_validate_listen(cmd.service_id, cmd.service_mask); + if (result) + goto out; + + result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask, + NULL); +out: ib_ucm_ctx_put(ctx); return result; } diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_cm.h linux-2.6.ib/include/rdma/ib_cm.h --- linux-2.6.git/include/rdma/ib_cm.h 2006-01-16 10:26:47.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_cm.h 2006-01-16 16:03:35.000000000 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_cm.h 2730 2005-06-28 16:43:03Z sean.hefty $ + * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ */ #if !defined(IB_CM_H) #define IB_CM_H @@ -102,7 +102,8 @@ enum ib_cm_data_size { IB_CM_APR_INFO_LENGTH = 72, IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE = 216, IB_CM_SIDR_REP_PRIVATE_DATA_SIZE = 136, - IB_CM_SIDR_REP_INFO_LENGTH = 72 + IB_CM_SIDR_REP_INFO_LENGTH = 72, + IB_CM_PRIVATE_DATA_COMPARE_SIZE = 64 }; struct ib_cm_id; @@ -238,7 +239,6 @@ struct ib_cm_sidr_rep_event_param { u32 qpn; void *info; u8 info_len; - }; struct ib_cm_event { @@ -317,6 +317,15 @@ void ib_destroy_cm_id(struct ib_cm_id *c #define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL) #define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL) +#define IB_CMA_SERVICE_ID __constant_cpu_to_be64(0x0000000001000000ULL) +#define IB_CMA_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFF000000ULL) +#define IB_SDP_SERVICE_ID __constant_cpu_to_be64(0x0000000000010000ULL) +#define IB_SDP_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFFFF0000ULL) + +struct ib_cm_private_data_compare { + u8 data[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 mask[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; +}; /** * ib_cm_listen - Initiates listening on the specified service ID for @@ -330,10 +339,12 @@ void ib_destroy_cm_id(struct ib_cm_id *c * range of service IDs. If set to 0, the service ID is matched * exactly. This parameter is ignored if %service_id is set to * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. 
*/ -int ib_cm_listen(struct ib_cm_id *cm_id, - __be64 service_id, - __be64 service_mask); +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_private_data_compare *compare_data); struct ib_cm_req_param { struct ib_sa_path_rec *primary_path; From sean.hefty at intel.com Tue Jan 17 15:28:17 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:28:17 -0800 Subject: [openib-general] [PATCH 3/5] [RFC] Infiniband: connection abstraction In-Reply-To: Message-ID: The following provides an address translation service that maps IP addresses to Infiniband addresses (GIDs) using IPoIB. Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c --- linux-2.6.git/drivers/infiniband/core/addr.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.000000000 -0800 @@ -0,0 +1,356 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. 
+ * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct rdma_dev_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DECLARE_MUTEX(mutex); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); + +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + unsigned char *dst_dev_addr) +{ + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; +} + +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + ret = copy_addr(dev_addr, dev, NULL); + dev_put(dev); + return ret; +} +EXPORT_SYMBOL(rdma_translate_ip); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + 
queue_delayed_work(rdma_wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + down(&mutex); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + up(&mutex); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, + rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL); + ip_rt_put(rt); +} + +static int addr_resolve_remote(struct sockaddr_in *src_in, + struct sockaddr_in *dst_in, + struct rdma_dev_addr *addr) +{ + u32 src_ip = src_in->sin_addr.s_addr; + u32 dst_ip = dst_in->sin_addr.s_addr; + struct flowi fl; + struct rtable *rt; + struct neighbour *neigh; + int ret; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + fl.nl_u.ip4_u.saddr = src_ip; + ret = ip_route_output_key(&rt, &fl); + if (ret) + goto out; + + neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev); + if (!neigh) { + ret = -ENODATA; + goto err1; + } + + if (!(neigh->nud_state & NUD_VALID)) { + ret = -ENODATA; + goto err2; + } + + if (!src_ip) { + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = rt->rt_src; + } + + ret = copy_addr(addr, neigh->dev, neigh->ha); +err2: + neigh_release(neigh); +err1: + ip_rt_put(rt); +out: + return ret; +} + +static void process_req(void *data) +{ + struct addr_req *req, *temp_req; + struct sockaddr_in *src_in, *dst_in; + struct list_head done_list; + + INIT_LIST_HEAD(&done_list); + + down(&mutex); + list_for_each_entry_safe(req, temp_req, &req_list, list) { + if (req->status) { + src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = 
(struct sockaddr_in *) &req->dst_addr; + req->status = addr_resolve_remote(src_in, dst_in, + req->addr); + } + if (req->status && time_after(jiffies, req->timeout)) + req->status = -ETIMEDOUT; + else if (req->status == -ENODATA) + continue; + + list_del(&req->list); + list_add_tail(&req->list, &done_list); + } + + if (!list_empty(&req_list)) { + req = list_entry(req_list.next, struct addr_req, list); + set_timeout(req->timeout); + } + up(&mutex); + + list_for_each_entry_safe(req, temp_req, &done_list, list) { + list_del(&req->list); + req->callback(req->status, &req->src_addr, req->addr, + req->context); + kfree(req); + } +} + +static int addr_resolve_local(struct sockaddr_in *src_in, + struct sockaddr_in *dst_in, + struct rdma_dev_addr *addr) +{ + struct net_device *dev; + u32 src_ip = src_in->sin_addr.s_addr; + u32 dst_ip = dst_in->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(dst_ip); + if (!dev) + return -EADDRNOTAVAIL; + + if (!src_ip) { + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = dst_ip; + ret = copy_addr(addr, dev, dev->dev_addr); + } else { + ret = rdma_translate_ip((struct sockaddr *)src_in, addr); + if (!ret) + memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + } + + dev_put(dev); + return ret; +} + +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context) +{ + struct sockaddr_in *src_in, *dst_in; + struct addr_req *req; + int ret = 0; + + req = kmalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + memset(req, 0, sizeof *req); + + if (src_addr) + memcpy(&req->src_addr, src_addr, ip_addr_size(src_addr)); + memcpy(&req->dst_addr, dst_addr, ip_addr_size(dst_addr)); + req->addr = addr; + req->callback = callback; + req->context = context; + + src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = (struct sockaddr_in *) 
&req->dst_addr; + + req->status = addr_resolve_local(src_in, dst_in, addr); + if (req->status == -EADDRNOTAVAIL) + req->status = addr_resolve_remote(src_in, dst_in, addr); + + switch (req->status) { + case 0: + req->timeout = jiffies; + queue_req(req); + break; + case -ENODATA: + req->timeout = msecs_to_jiffies(timeout_ms) + jiffies; + queue_req(req); + addr_send_arp(dst_in); + break; + default: + ret = req->status; + kfree(req); + break; + } + return ret; +} +EXPORT_SYMBOL(rdma_resolve_ip); + +void rdma_addr_cancel(struct rdma_dev_addr *addr) +{ + struct addr_req *req, *temp_req; + + down(&mutex); + list_for_each_entry_safe(req, temp_req, &req_list, list) { + if (req->addr == addr) { + req->status = -ECANCELED; + req->timeout = jiffies; + list_del(&req->list); + list_add(&req->list, &req_list); + set_timeout(req->timeout); + break; + } + } + up(&mutex); +} +EXPORT_SYMBOL(rdma_addr_cancel); + +static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pkt, struct net_device *orig_dev) +{ + struct arphdr *arp_hdr; + + arp_hdr = (struct arphdr *) skb->nh.raw; + + if (dev->type == ARPHRD_INFINIBAND && + (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || + arp_hdr->ar_op == __constant_htons(ARPOP_REPLY))) + set_timeout(jiffies); + + kfree_skb(skb); + return 0; +} + +static struct packet_type addr_arp = { + .type = __constant_htons(ETH_P_ARP), + .func = addr_arp_recv, + .af_packet_priv = (void*) 1, +}; + +static int addr_init(void) +{ + rdma_wq = create_singlethread_workqueue("rdma_wq"); + if (!rdma_wq) + return -ENOMEM; + + dev_add_pack(&addr_arp); + return 0; +} + +static void addr_cleanup(void) +{ + dev_remove_pack(&addr_arp); + destroy_workqueue(rdma_wq); +} + +module_init(addr_init); +module_exit(addr_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:03:08.000000000 
-0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:14:24.000000000 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o + ib_cm.o ib_addr.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -12,6 +12,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +ib_addr-y := addr.o + ib_umad-y := user_mad.o ib_ucm-y := ucm.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_addr.h linux-2.6.ib/include/rdma/ib_addr.h --- linux-2.6.git/include/rdma/ib_addr.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_addr.h 2006-01-16 16:14:24.000000000 -0800 @@ -0,0 +1,117 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(IB_ADDR_H) +#define IB_ADDR_H + +#include +#include +#include +#include +#include + +extern struct workqueue_struct *rdma_wq; + +struct rdma_dev_addr { + unsigned char src_dev_addr[MAX_ADDR_LEN]; + unsigned char dst_dev_addr[MAX_ADDR_LEN]; + unsigned char broadcast[MAX_ADDR_LEN]; + enum ib_node_type dev_type; +}; + +/** + * rdma_translate_ip - Translate a local IP address to an RDMA hardware + * address. + */ +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr); + +/** + * rdma_resolve_ip - Resolve source and destination IP addresses to + * RDMA hardware addresses. + * @src_addr: An optional source address to use in the resolution. If a + * source address is not provided, a usable address will be returned via + * the callback. + * @dst_addr: The destination address to resolve. + * @addr: A reference to a data location that will receive the resolved + * addresses. The data location must remain valid until the callback has + * been invoked. + * @timeout_ms: Amount of time to wait for the address resolution to complete. + * @callback: Call invoked once address resolution has completed, timed out, + * or been canceled. A status of 0 indicates success. + * @context: User-specified context associated with the call. + */ +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context); + +void rdma_addr_cancel(struct rdma_dev_addr *addr); + +static inline int ip_addr_size(struct sockaddr *addr) +{ + return addr->sa_family == AF_INET6 ? 
+ sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); +} + +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) +{ + return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9]; +} + +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey) +{ + dev_addr->broadcast[8] = pkey >> 8; + dev_addr->broadcast[9] = (unsigned char) pkey; +} + +static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->src_dev_addr + 4); +} + +static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); +} + +static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->dst_dev_addr + 4); +} + +static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); +} + +#endif /* IB_ADDR_H */ + diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/net/ipv4/fib_frontend.c linux-2.6.ib/net/ipv4/fib_frontend.c --- linux-2.6.git/net/ipv4/fib_frontend.c 2006-01-16 10:28:29.000000000 -0800 +++ linux-2.6.ib/net/ipv4/fib_frontend.c 2006-01-16 16:14:24.000000000 -0800 @@ -666,4 +666,5 @@ void __init ip_fib_init(void) } EXPORT_SYMBOL(inet_addr_type); +EXPORT_SYMBOL(ip_dev_find); EXPORT_SYMBOL(ip_rt_ioctl); From sean.hefty at intel.com Tue Jan 17 15:37:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:37:38 -0800 Subject: [openib-general] [PATCH 4/5] [RFC] Infiniband: connection abstraction In-Reply-To: Message-ID: The following patch implements a kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. It also handles RDMA device hotplug events on behalf of clients. 
- Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.000000000 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Guy German"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add = cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DECLARE_MUTEX(mutex); + +struct cma_device { + struct list_head list; + struct ib_device *device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_t refcount; + struct list_head id_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. 
+ */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_head list; + struct list_head listen_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + wait_queue_head_t wait_remove; + atomic_t dev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +#define CMA_VERSION 0x10 +#define SDP_VERSION 0x22 + +static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + ret = (id_priv->state == comp); + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static int cma_comp_exch(struct rdma_id_private *id_priv, + enum cma_state comp, enum cma_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + if ((ret = (id_priv->state == comp))) + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static enum cma_state cma_exch(struct rdma_id_private *id_priv, + enum cma_state exch) +{ + unsigned long flags; + enum cma_state old; + + spin_lock_irqsave(&id_priv->lock, flags); + old = id_priv->state; + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return old; +} + +static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) +{ 
+ return hdr->ip_version >> 4; +} + +static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) +{ + hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF); +} + +static inline u8 sdp_get_ip_ver(struct sdp_hh *hh) +{ + return hh->ip_version >> 4; +} + +static inline void sdp_set_ip_ver(struct sdp_hh *hh, u8 ip_ver) +{ + hh->ip_version = (ip_ver << 4) | (hh->ip_version & 0xF); +} + +static void cma_attach_to_dev(struct rdma_id_private *id_priv, + struct cma_device *cma_dev) +{ + atomic_inc(&cma_dev->refcount); + id_priv->cma_dev = cma_dev; + id_priv->id.device = cma_dev->device; + list_add_tail(&id_priv->list, &cma_dev->id_list); +} + +static void cma_detach_from_dev(struct rdma_id_private *id_priv) +{ + list_del(&id_priv->list); + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) + wake_up(&id_priv->cma_dev->wait); + id_priv->cma_dev = NULL; +} + +static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + union ib_gid *gid; + int ret = -ENODEV; + + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + + down(&mutex); + list_for_each_entry(cma_dev, &dev_list, list) { + ret = ib_find_cached_gid(cma_dev->device, gid, + &id_priv->id.port_num, NULL); + if (!ret) { + cma_attach_to_dev(id_priv, cma_dev); + break; + } + } + up(&mutex); + return ret; +} + +static int cma_acquire_dev(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.route.addr.dev_addr.dev_type) { + case IB_NODE_CA: + return cma_acquire_ib_dev(id_priv); + default: + return -ENODEV; + } +} + +static void cma_deref_id(struct rdma_id_private *id_priv) +{ + if (atomic_dec_and_test(&id_priv->refcount)) + wake_up(&id_priv->wait); +} + +static void cma_release_remove(struct rdma_id_private *id_priv) +{ + if (atomic_dec_and_test(&id_priv->dev_remove)) + wake_up(&id_priv->wait_remove); +} + +struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps) +{ + struct rdma_id_private *id_priv; + + id_priv 
= kzalloc(sizeof *id_priv, GFP_KERNEL); + if (!id_priv) + return ERR_PTR(-ENOMEM); + + id_priv->state = CMA_IDLE; + id_priv->id.context = context; + id_priv->id.event_handler = event_handler; + id_priv->id.ps = ps; + spin_lock_init(&id_priv->lock); + init_waitqueue_head(&id_priv->wait); + atomic_set(&id_priv->refcount, 1); + init_waitqueue_head(&id_priv->wait_remove); + atomic_set(&id_priv->dev_remove, 0); + INIT_LIST_HEAD(&id_priv->listen_list); + get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); + + return &id_priv->id; +} +EXPORT_SYMBOL(rdma_create_id); + +static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + struct rdma_dev_addr *dev_addr; + int ret; + + dev_addr = &id_priv->id.route.addr.dev_addr; + ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, + ib_addr_get_pkey(dev_addr), + &qp_attr.pkey_index); + if (ret) + return ret; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr.port_num = id_priv->id.port_num; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | IB_QP_PORT); +} + +int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct rdma_id_private *id_priv; + struct ib_qp *qp; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (id->device != pd->device) + return -EINVAL; + + qp = ib_create_qp(pd, qp_init_attr); + if (IS_ERR(qp)) + return PTR_ERR(qp); + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_init_ib_qp(id_priv, qp); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) + goto err; + + id->qp = qp; + id_priv->qp_num = qp->qp_num; + id_priv->qp_type = qp->qp_type; + id_priv->srq = (qp->srq != NULL); + return 0; +err: + ib_destroy_qp(qp); + return ret; +} +EXPORT_SYMBOL(rdma_create_qp); + +void rdma_destroy_qp(struct rdma_cm_id *id) +{ + ib_destroy_qp(id->qp); +} 
+EXPORT_SYMBOL(rdma_destroy_qp); + +static int cma_modify_qp_rtr(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + if (!id->qp) + return 0; + + /* Need to update QP attributes from default values. */ + qp_attr.qp_state = IB_QPS_INIT; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + if (ret) + return ret; + + qp_attr.qp_state = IB_QPS_RTR; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_qp_rts(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IB_QPS_RTS; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_qp_err(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); +} + +int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + qp_attr_mask); + if (qp_attr->qp_state == IB_QPS_RTR) + qp_attr->rq_psn = id_priv->seq_num; + break; + default: + ret = -ENOSYS; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_init_qp_attr); + +static inline int cma_any_addr(struct sockaddr *addr) +{ + struct in6_addr *ip6; + + if (addr->sa_family == AF_INET) + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + INADDR_ANY; + else { + ip6 = &((struct sockaddr_in6 *) addr)->sin6_addr; + return (ip6->s6_addr32[0] | ip6->s6_addr32[1] | + ip6->s6_addr32[2] | ip6->s6_addr32[3]) == 0;
+ } +} + +static inline int cma_loopback_addr(struct sockaddr *addr) +{ + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + ntohl(INADDR_LOOPBACK); +} + +static int cma_get_net_info(void *hdr, enum rdma_port_space ps, + u8 *ip_ver, __u16 *port, + union cma_ip_addr **src, union cma_ip_addr **dst) +{ + switch (ps) { + case RDMA_PS_SDP: + if (((struct sdp_hh *) hdr)->sdp_version != SDP_VERSION) + return -EINVAL; + + *ip_ver = sdp_get_ip_ver(hdr); + *port = ((struct sdp_hh *) hdr)->port; + *src = &((struct sdp_hh *) hdr)->src_addr; + *dst = &((struct sdp_hh *) hdr)->dst_addr; + break; + default: + if (((struct cma_hdr *) hdr)->cma_version != CMA_VERSION) + return -EINVAL; + + *ip_ver = cma_get_ip_ver(hdr); + *port = ((struct cma_hdr *) hdr)->port; + *src = &((struct cma_hdr *) hdr)->src_addr; + *dst = &((struct cma_hdr *) hdr)->dst_addr; + break; + } + return 0; +} + +static void cma_save_net_info(struct rdma_addr *addr, + struct rdma_addr *listen_addr, + u8 ip_ver, __u16 port, + union cma_ip_addr *src, union cma_ip_addr *dst) +{ + struct sockaddr_in *listen4, *ip4; + struct sockaddr_in6 *listen6, *ip6; + + switch (ip_ver) { + case 4: + listen4 = (struct sockaddr_in *) &listen_addr->src_addr; + ip4 = (struct sockaddr_in *) &addr->src_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = dst->ip4.addr; + ip4->sin_port = listen4->sin_port; + + ip4 = (struct sockaddr_in *) &addr->dst_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = src->ip4.addr; + ip4->sin_port = port; + break; + case 6: + listen6 = (struct sockaddr_in6 *) &listen_addr->src_addr; + ip6 = (struct sockaddr_in6 *) &addr->src_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = dst->ip6; + ip6->sin6_port = listen6->sin6_port; + + ip6 = (struct sockaddr_in6 *) &addr->dst_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = src->ip6; + ip6->sin6_port = port; + break; + default: + break; + } +} + +static inline int 
cma_user_data_offset(enum rdma_port_space ps) +{ + switch (ps) { + case RDMA_PS_SDP: + return 0; + default: + return sizeof(struct cma_hdr); + } +} + +static int cma_notify_user(struct rdma_id_private *id_priv, + enum rdma_cm_event_type type, int status, + void *data, u8 data_len) +{ + struct rdma_cm_event event; + + event.event = type; + event.status = status; + event.private_data = data; + event.private_data_len = data_len; + + return id_priv->id.event_handler(&id_priv->id, &event); +} + +static void cma_cancel_addr(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); + break; + default: + break; + } +} + +static void cma_cancel_route(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ib_sa_cancel_query(id_priv->query_id, id_priv->query); + break; + default: + break; + } +} + +static inline int cma_internal_listen(struct rdma_id_private *id_priv) +{ + return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && + cma_any_addr(&id_priv->id.route.addr.src_addr); +} + +static void cma_destroy_listen(struct rdma_id_private *id_priv) +{ + cma_exch(id_priv, CMA_DESTROYING); + + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) + ib_destroy_cm_id(id_priv->cm_id); + + list_del(&id_priv->listen_list); + if (id_priv->cma_dev) + cma_detach_from_dev(id_priv); + + atomic_dec(&id_priv->refcount); + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); + + kfree(id_priv); +} + +static void cma_cancel_listens(struct rdma_id_private *id_priv) +{ + struct rdma_id_private *dev_id_priv; + + down(&mutex); + list_del(&id_priv->list); + + while (!list_empty(&id_priv->listen_list)) { + dev_id_priv = list_entry(id_priv->listen_list.next, + struct rdma_id_private, listen_list); + cma_destroy_listen(dev_id_priv); + } + up(&mutex); +} + +static void cma_cancel_operation(struct rdma_id_private *id_priv, + enum cma_state state) +{ + switch 
(state) { + case CMA_ADDR_QUERY: + cma_cancel_addr(id_priv); + break; + case CMA_ROUTE_QUERY: + cma_cancel_route(id_priv); + break; + case CMA_LISTEN: + if (cma_any_addr(&id_priv->id.route.addr.src_addr) && + !id_priv->cma_dev) + cma_cancel_listens(id_priv); + break; + default: + break; + } +} + +void rdma_destroy_id(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + enum cma_state state; + + id_priv = container_of(id, struct rdma_id_private, id); + state = cma_exch(id_priv, CMA_DESTROYING); + cma_cancel_operation(id_priv, state); + + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) + ib_destroy_cm_id(id_priv->cm_id); + + if (id_priv->cma_dev) { + down(&mutex); + cma_detach_from_dev(id_priv); + up(&mutex); + } + + atomic_dec(&id_priv->refcount); + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); + + kfree(id_priv->id.route.path_rec); + kfree(id_priv); +} +EXPORT_SYMBOL(rdma_destroy_id); + +static int cma_rep_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + goto reject; + + ret = cma_modify_qp_rts(&id_priv->id); + if (ret) + goto reject; + + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_rtu_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_qp_rts(&id_priv->id); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv = cm_id->context; + enum rdma_cm_event_type event; + u8 private_data_len = 0; + int ret = 0, status = 0; + + if (!cma_comp(id_priv, CMA_CONNECT)) + return 0; + + atomic_inc(&id_priv->dev_remove); + switch 
(ib_event->event) { + case IB_CM_REQ_ERROR: + case IB_CM_REP_ERROR: + event = RDMA_CM_EVENT_UNREACHABLE; + status = -ETIMEDOUT; + break; + case IB_CM_REP_RECEIVED: + if (id_priv->id.qp) { + status = cma_rep_recv(id_priv); + event = status ? RDMA_CM_EVENT_CONNECT_ERROR : + RDMA_CM_EVENT_ESTABLISHED; + } else + event = RDMA_CM_EVENT_CONNECT_RESPONSE; + private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + break; + case IB_CM_RTU_RECEIVED: + status = cma_rtu_recv(id_priv); + event = status ? RDMA_CM_EVENT_CONNECT_ERROR : + RDMA_CM_EVENT_ESTABLISHED; + break; + case IB_CM_DREQ_ERROR: + status = -ETIMEDOUT; /* fall through */ + case IB_CM_DREQ_RECEIVED: + case IB_CM_DREP_RECEIVED: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IB_CM_TIMEWAIT_EXIT: + case IB_CM_MRA_RECEIVED: + /* ignore event */ + goto out; + case IB_CM_REJ_RECEIVED: + cma_modify_qp_err(&id_priv->id); + status = ib_event->param.rej_rcvd.reason; + event = RDMA_CM_EVENT_REJECTED; + break; + default: + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d\n", + ib_event->event); + goto out; + } + + ret = cma_notify_user(id_priv, event, status, ib_event->private_data, + private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. */ + id_priv->cm_id = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } +out: + cma_release_remove(id_priv); + return ret; +} + +static struct rdma_id_private* cma_new_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv; + struct rdma_cm_id *id; + struct rdma_route *rt; + union cma_ip_addr *src, *dst; + __u16 port; + u8 ip_ver; + + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); + if (IS_ERR(id)) + return NULL; + + rt = &id->route; + rt->num_paths = ib_event->param.req_rcvd.alternate_path ?
2 : 1; + rt->path_rec = kmalloc(sizeof *rt->path_rec * rt->num_paths, GFP_KERNEL); + if (!rt->path_rec) + goto err; + + if (cma_get_net_info(ib_event->private_data, listen_id->ps, + &ip_ver, &port, &src, &dst)) + goto err; + + cma_save_net_info(&id->route.addr, &listen_id->route.addr, + ip_ver, port, src, dst); + rt->path_rec[0] = *ib_event->param.req_rcvd.primary_path; + if (rt->num_paths == 2) + rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; + + ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); + ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); + ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); + rt->addr.dev_addr.dev_type = IB_NODE_CA; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->state = CMA_CONNECT; + return id_priv; +err: + rdma_destroy_id(id); + return NULL; +} + +static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *listen_id, *conn_id; + int offset, ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + conn_id = cma_new_id(&listen_id->id, ib_event); + if (!conn_id) { + ret = -ENOMEM; + goto out; + } + + atomic_inc(&conn_id->dev_remove); + ret = cma_acquire_ib_dev(conn_id); + if (ret) { + ret = -ENODEV; + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + goto out; + } + + conn_id->cm_id = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_ib_handler; + + offset = cma_user_data_offset(listen_id->id.ps); + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + ib_event->private_data + offset, + IB_CM_REQ_PRIVATE_DATA_SIZE - offset); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } +out: + cma_release_remove(listen_id); + return ret; +} + +static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) +{ + return cpu_to_be64(((u64)ps << 16) + + ((struct sockaddr_in *) addr)->sin_port); +} + +static void cma_set_compare_data(struct sockaddr *addr, + struct ib_cm_private_data_compare *compare) +{ + struct cma_hdr *data, *mask; + + memset(compare, 0, sizeof *compare); + data = (void *) compare->data; + mask = (void *) compare->mask; + + switch (addr->sa_family) { + case AF_INET: + cma_set_ip_ver(data, 4); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip4.addr = ((struct sockaddr_in *) addr)-> + sin_addr.s_addr; + mask->dst_addr.ip4.addr = ~0; + break; + case AF_INET6: + cma_set_ip_ver(data, 6); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip6 = ((struct sockaddr_in6 *) addr)-> + sin6_addr; + memset(&mask->dst_addr.ip6, 0xFF, sizeof mask->dst_addr.ip6); + break; + default: + break; + } +} + +static int cma_ib_listen(struct rdma_id_private *id_priv) +{ + struct ib_cm_private_data_compare compare_data; + struct sockaddr *addr; + __be64 svc_id; + int ret; + + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id)) + return PTR_ERR(id_priv->cm_id); + + addr = &id_priv->id.route.addr.src_addr; + svc_id = cma_get_service_id(id_priv->id.ps, addr); + if (cma_any_addr(addr)) + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); + else { + cma_set_compare_data(addr, &compare_data); + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); + } + + if (ret) { + ib_destroy_cm_id(id_priv->cm_id); + id_priv->cm_id = NULL; + } + + return ret; +} + +static int cma_duplicate_listen(struct rdma_id_private *id_priv) +{ + struct rdma_id_private *cur_id_priv; + struct sockaddr_in *cur_addr, *new_addr; + + new_addr = (struct sockaddr_in *)
&id_priv->id.route.addr.src_addr; + list_for_each_entry(cur_id_priv, &listen_any_list, listen_list) { + cur_addr = (struct sockaddr_in *) + &cur_id_priv->id.route.addr.src_addr; + if (cur_addr->sin_port == new_addr->sin_port) + return -EADDRINUSE; + } + return 0; +} + +static int cma_listen_handler(struct rdma_cm_id *id, + struct rdma_cm_event *event) +{ + struct rdma_id_private *id_priv = id->context; + + id->context = id_priv->id.context; + id->event_handler = id_priv->id.event_handler; + return id_priv->id.event_handler(id, event); +} + +static void cma_listen_on_dev(struct rdma_id_private *id_priv, + struct cma_device *cma_dev) +{ + struct rdma_id_private *dev_id_priv; + struct rdma_cm_id *id; + int ret; + + id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); + if (IS_ERR(id)) + return; + + dev_id_priv = container_of(id, struct rdma_id_private, id); + ret = rdma_bind_addr(id, &id_priv->id.route.addr.src_addr); + if (ret) + goto err; + + cma_attach_to_dev(dev_id_priv, cma_dev); + list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); + + ret = rdma_listen(id, id_priv->backlog); + if (ret) + goto err; + + return; +err: + cma_destroy_listen(dev_id_priv); +} + +static int cma_listen_on_all(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + int ret; + + down(&mutex); + ret = cma_duplicate_listen(id_priv); + if (ret) + goto out; + + list_add_tail(&id_priv->list, &listen_any_list); + list_for_each_entry(cma_dev, &dev_list, list) + cma_listen_on_dev(id_priv, cma_dev); +out: + up(&mutex); + return ret; +} + +int rdma_listen(struct rdma_cm_id *id, int backlog) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) + return -EINVAL; + + if (id->device) { + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_ib_listen(id_priv); + break; + default: + ret = -ENOSYS; + break; + } + } else + ret = 
cma_listen_on_all(id_priv); + + if (ret) + goto err; + + id_priv->backlog = backlog; + return 0; +err: + cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); + return ret; +}; +EXPORT_SYMBOL(rdma_listen); + +static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, + void *context) +{ + struct rdma_id_private *id_priv = context; + struct rdma_route *route = &id_priv->id.route; + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + + atomic_inc(&id_priv->dev_remove); + if (!status) { + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); + if (route->path_rec) { + route->num_paths = 1; + *route->path_rec = *path_rec; + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED)) { + kfree(route->path_rec); + goto out; + } + } else + status = -ENOMEM; + } + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) + goto out; + event = RDMA_CM_EVENT_ROUTE_ERROR; + } + + if (cma_notify_user(id_priv, event, status, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + struct ib_sa_path_rec path_rec; + + memset(&path_rec, 0, sizeof path_rec); + path_rec.sgid = *ib_addr_get_sgid(addr); + path_rec.dgid = *ib_addr_get_dgid(addr); + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); + path_rec.numb_path = 1; + + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->id.port_num, &path_rec, + IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, + timeout_ms, GFP_KERNEL, + cma_query_handler, id_priv, &id_priv->query); + + return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; +} + +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY)) + return -EINVAL; + + atomic_inc(&id_priv->refcount); + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_resolve_ib_route(id_priv, timeout_ms); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED); + cma_deref_id(id_priv); + return ret; +} +EXPORT_SYMBOL(rdma_resolve_route); + +static int cma_bind_loopback(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + union ib_gid *gid; + u16 pkey; + int ret; + + down(&mutex); + if (list_empty(&dev_list)) { + ret = -ENODEV; + goto out; + } + + cma_dev = list_entry(dev_list.next, struct cma_device, list); + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + ret = ib_get_cached_gid(cma_dev->device, 1, 0, gid); + if (ret) + goto out; + + ret = ib_get_cached_pkey(cma_dev->device, 1, 0, &pkey); + if (ret) + goto out; + + ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); + id_priv->id.port_num = 1; + cma_attach_to_dev(id_priv, cma_dev); +out: + up(&mutex); + return ret; +} + +static void addr_handler(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *dev_addr, void *context) +{ + struct rdma_id_private *id_priv = context; + enum rdma_cm_event_type event; + enum cma_state old_state; + + atomic_inc(&id_priv->dev_remove); + if (!id_priv->cma_dev) { + old_state = CMA_IDLE; + if (!status) + status = cma_acquire_dev(id_priv); + } else + old_state = CMA_ADDR_BOUND; + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, old_state)) + goto out; + event = RDMA_CM_EVENT_ADDR_ERROR; + } else { + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + memcpy(&id_priv->id.route.addr.src_addr, 
src_addr, + ip_addr_size(src_addr)); + event = RDMA_CM_EVENT_ADDR_RESOLVED; + } + + if (cma_notify_user(id_priv, event, status, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static void loopback_addr_handler(void *data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_loopback(struct rdma_id_private *id_priv, + struct sockaddr *src_addr, enum cma_state state) +{ + struct cma_work *work; + struct rdma_dev_addr *dev_addr; + int ret; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + if (state == CMA_IDLE) { + ret = cma_bind_loopback(id_priv); + if (ret) + goto err; + dev_addr = &id_priv->id.route.addr.dev_addr; + ib_addr_set_dgid(dev_addr, ib_addr_get_sgid(dev_addr)); + if (!src_addr || cma_any_addr(src_addr)) + src_addr = &id_priv->id.route.addr.dst_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); + } + + work->id = id_priv; + INIT_WORK(&work->work, loopback_addr_handler, work); + queue_work(rdma_wq, &work->work); + return 0; +err: + kfree(work); + return ret; +} + +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms) +{ + struct rdma_id_private *id_priv; + enum cma_state expected_state; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (id_priv->cma_dev) { + expected_state = 
CMA_ADDR_BOUND; + src_addr = &id->route.addr.src_addr; + } else + expected_state = CMA_IDLE; + + if (!cma_comp_exch(id_priv, expected_state, CMA_ADDR_QUERY)) + return -EINVAL; + + atomic_inc(&id_priv->refcount); + memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); + if (cma_loopback_addr(dst_addr)) + ret = cma_resolve_loopback(id_priv, src_addr, expected_state); + else + ret = rdma_resolve_ip(src_addr, dst_addr, + &id->route.addr.dev_addr, + timeout_ms, addr_handler, id_priv); + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_ADDR_QUERY, expected_state); + cma_deref_id(id_priv); + return ret; +} +EXPORT_SYMBOL(rdma_resolve_addr); + +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct rdma_id_private *id_priv; + struct rdma_dev_addr *dev_addr; + int ret; + + if (addr->sa_family != AF_INET) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) + return -EINVAL; + + if (cma_any_addr(addr)) { + ret = 0; + } else if (cma_loopback_addr(addr)) { + ret = cma_bind_loopback(id_priv); + } else { + dev_addr = &id->route.addr.dev_addr; + ret = rdma_translate_ip(addr, dev_addr); + if (!ret) + ret = cma_acquire_dev(id_priv); + } + + if (ret) + goto err; + + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); + return 0; +err: + cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); + return ret; +} +EXPORT_SYMBOL(rdma_bind_addr); + +static void cma_format_hdr(void *hdr, enum rdma_port_space ps, + struct rdma_route *route) +{ + struct sockaddr_in *src4, *dst4; + struct cma_hdr *cma_hdr; + struct sdp_hh *sdp_hdr; + + src4 = (struct sockaddr_in *) &route->addr.src_addr; + dst4 = (struct sockaddr_in *) &route->addr.dst_addr; + + switch (ps) { + case RDMA_PS_SDP: + sdp_hdr = hdr; + sdp_hdr->sdp_version = SDP_VERSION; + sdp_set_ip_ver(sdp_hdr, 4); + sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + sdp_hdr->dst_addr.ip4.addr = 
dst4->sin_addr.s_addr; + sdp_hdr->port = src4->sin_port; + break; + default: + cma_hdr = hdr; + cma_hdr->cma_version = CMA_VERSION; + cma_set_ip_ver(cma_hdr, 4); + cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + cma_hdr->port = src4->sin_port; + break; + } +} + +static int cma_connect_ib(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct ib_cm_req_param req; + struct rdma_route *route; + void *private_data; + int offset, ret; + + memset(&req, 0, sizeof req); + offset = cma_user_data_offset(id_priv->id.ps); + req.private_data_len = offset + conn_param->private_data_len; + private_data = kzalloc(req.private_data_len, GFP_ATOMIC); + if (!private_data) + return -ENOMEM; + + if (conn_param->private_data && conn_param->private_data_len) + memcpy(private_data + offset, conn_param->private_data, + conn_param->private_data_len); + + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv); + if (IS_ERR(id_priv->cm_id)) { + ret = PTR_ERR(id_priv->cm_id); + goto out; + } + + route = &id_priv->id.route; + cma_format_hdr(private_data, id_priv->id.ps, route); + req.private_data = private_data; + + req.primary_path = &route->path_rec[0]; + if (route->num_paths == 2) + req.alternate_path = &route->path_rec[1]; + + req.service_id = cma_get_service_id(id_priv->id.ps, + &route->addr.dst_addr); + req.qp_num = id_priv->qp_num; + req.qp_type = id_priv->qp_type; + req.starting_psn = id_priv->seq_num; + req.responder_resources = conn_param->responder_resources; + req.initiator_depth = conn_param->initiator_depth; + req.flow_control = conn_param->flow_control; + req.retry_count = conn_param->retry_count; + req.rnr_retry_count = conn_param->rnr_retry_count; + req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; + req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; + req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.srq = id_priv->srq ? 
1 : 0; + + ret = ib_send_cm_req(id_priv->cm_id, &req); +out: + kfree(private_data); + return ret; +} + +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) + return -EINVAL; + + if (!id->qp) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_connect_ib(id_priv, conn_param); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED); + return ret; +} +EXPORT_SYMBOL(rdma_connect); + +static int cma_accept_ib(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct ib_cm_rep_param rep; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + memset(&rep, 0, sizeof rep); + rep.qp_num = id_priv->qp_num; + rep.starting_psn = id_priv->seq_num; + rep.private_data = conn_param->private_data; + rep.private_data_len = conn_param->private_data_len; + rep.responder_resources = conn_param->responder_resources; + rep.initiator_depth = conn_param->initiator_depth; + rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; + rep.failover_accepted = 0; + rep.flow_control = conn_param->flow_control; + rep.rnr_retry_count = conn_param->rnr_retry_count; + rep.srq = id_priv->srq ? 
1 : 0; + + return ib_send_cm_rep(id_priv->cm_id, &rep); +} + +int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + if (!id->qp && conn_param) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + + switch (id->device->node_type) { + case IB_NODE_CA: + if (conn_param) + ret = cma_accept_ib(id_priv, conn_param); + else + ret = cma_rep_recv(id_priv); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(id); + rdma_reject(id, NULL, 0); + return ret; +} +EXPORT_SYMBOL(rdma_accept); + +int rdma_reject(struct rdma_cm_id *id, const void *private_data, + u8 private_data_len) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, private_data, private_data_len); + break; + default: + ret = -ENOSYS; + break; + } + return ret; +}; +EXPORT_SYMBOL(rdma_reject); + +int rdma_disconnect(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + ret = cma_modify_qp_err(id); + if (ret) + goto out; + + switch (id->device->node_type) { + case IB_NODE_CA: + /* Initiate or respond to a disconnect. 
*/ + if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id, NULL, 0); + break; + default: + break; + } +out: + return ret; +} +EXPORT_SYMBOL(rdma_disconnect); + +static void cma_add_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + + cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL); + if (!cma_dev) + return; + + cma_dev->device = device; + cma_dev->node_guid = device->node_guid; + if (!cma_dev->node_guid) + goto err; + + init_waitqueue_head(&cma_dev->wait); + atomic_set(&cma_dev->refcount, 1); + INIT_LIST_HEAD(&cma_dev->id_list); + ib_set_client_data(device, &cma_client, cma_dev); + + down(&mutex); + list_add_tail(&cma_dev->list, &dev_list); + list_for_each_entry(id_priv, &listen_any_list, list) + cma_listen_on_dev(id_priv, cma_dev); + up(&mutex); + return; +err: + kfree(cma_dev); +} + +static int cma_remove_id_dev(struct rdma_id_private *id_priv) +{ + enum cma_state state; + + /* Record that we want to remove the device */ + state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); + if (state == CMA_DESTROYING) + return 0; + + cma_cancel_operation(id_priv, state); + wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + + /* Check for destruction from another callback. 
*/ + if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) + return 0; + + return cma_notify_user(id_priv, RDMA_CM_EVENT_DEVICE_REMOVAL, + 0, NULL, 0); +} + +static void cma_process_remove(struct cma_device *cma_dev) +{ + struct list_head remove_list; + struct rdma_id_private *id_priv; + int ret; + + INIT_LIST_HEAD(&remove_list); + + down(&mutex); + while (!list_empty(&cma_dev->id_list)) { + id_priv = list_entry(cma_dev->id_list.next, + struct rdma_id_private, list); + + if (cma_internal_listen(id_priv)) { + cma_destroy_listen(id_priv); + continue; + } + + list_del(&id_priv->list); + list_add_tail(&id_priv->list, &remove_list); + atomic_inc(&id_priv->refcount); + up(&mutex); + + ret = cma_remove_id_dev(id_priv); + cma_deref_id(id_priv); + if (ret) + rdma_destroy_id(&id_priv->id); + + down(&mutex); + } + up(&mutex); + + atomic_dec(&cma_dev->refcount); + wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount)); +} + +static void cma_remove_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + + cma_dev = ib_get_client_data(device, &cma_client); + if (!cma_dev) + return; + + down(&mutex); + list_del(&cma_dev->list); + up(&mutex); + + cma_process_remove(cma_dev); + kfree(cma_dev); +} + +static int cma_init(void) +{ + return ib_register_client(&cma_client); +} + +static void cma_cleanup(void) +{ + ib_unregister_client(&cma_client); +} + +module_init(cma_init); +module_exit(cma_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:16:18.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:35:48.000000000 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o + ib_cm.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -12,6 +12,8 @@ ib_sa-y := sa_query.o 
ib_cm-y := cm.o +rdma_cm-y := cma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/rdma_cm.h linux-2.6.ib/include/rdma/rdma_cm.h --- linux-2.6.git/include/rdma/rdma_cm.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/rdma_cm.h 2006-01-16 16:19:12.000000000 -0800 @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ + +#if !defined(RDMA_CM_H) +#define RDMA_CM_H + +#include +#include +#include +#include + +/* + * Upon receiving a device removal event, users must destroy the associated + * RDMA identifier and release all resources allocated with the device. 
+ */ +enum rdma_cm_event_type { + RDMA_CM_EVENT_ADDR_RESOLVED, + RDMA_CM_EVENT_ADDR_ERROR, + RDMA_CM_EVENT_ROUTE_RESOLVED, + RDMA_CM_EVENT_ROUTE_ERROR, + RDMA_CM_EVENT_CONNECT_REQUEST, + RDMA_CM_EVENT_CONNECT_RESPONSE, + RDMA_CM_EVENT_CONNECT_ERROR, + RDMA_CM_EVENT_UNREACHABLE, + RDMA_CM_EVENT_REJECTED, + RDMA_CM_EVENT_ESTABLISHED, + RDMA_CM_EVENT_DISCONNECTED, + RDMA_CM_EVENT_DEVICE_REMOVAL, +}; + +enum rdma_port_space { + RDMA_PS_SDP = 0x0001, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 +}; + +struct rdma_addr { + struct sockaddr src_addr; + u8 src_pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; + struct sockaddr dst_addr; + u8 dst_pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; + struct rdma_dev_addr dev_addr; +}; + +struct rdma_route { + struct rdma_addr addr; + struct ib_sa_path_rec *path_rec; + int num_paths; +}; + +struct rdma_cm_event { + enum rdma_cm_event_type event; + int status; + void *private_data; + u8 private_data_len; +}; + +struct rdma_cm_id; + +/** + * rdma_cm_event_handler - Callback used to report user events. + * + * Notes: Users may not call rdma_destroy_id from this callback to destroy + * the passed in id, or a corresponding listen id. Returning a + * non-zero value from the callback will destroy the corresponding id. + */ +typedef int (*rdma_cm_event_handler)(struct rdma_cm_id *id, + struct rdma_cm_event *event); + +struct rdma_cm_id { + struct ib_device *device; + void *context; + struct ib_qp *qp; + rdma_cm_event_handler event_handler; + struct rdma_route route; + enum rdma_port_space ps; + u8 port_num; +}; + +/** + * rdma_create_id - Create an RDMA identifier. + * + * @event_handler: User callback invoked to report events associated with the + * returned rdma_id. + * @context: User specified context associated with the id. + * @ps: RDMA port space. 
+ */ +struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps); + +void rdma_destroy_id(struct rdma_cm_id *id); + +/** + * rdma_bind_addr - Bind an RDMA identifier to a source address and + * associated RDMA device, if needed. + * + * @id: RDMA identifier. + * @addr: Local address information. Wildcard values are permitted. + * + * This associates a source address with the RDMA identifier before calling + * rdma_listen. If a specific local address is given, the RDMA identifier will + * be bound to a local RDMA device. + */ +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr); + +/** + * rdma_resolve_addr - Resolve destination and optional source addresses + * from IP addresses to an RDMA address. If successful, the specified + * rdma_cm_id will be bound to a local device. + * + * @id: RDMA identifier. + * @src_addr: Source address information. This parameter may be NULL. + * @dst_addr: Destination address information. + * @timeout_ms: Time to wait for resolution to complete. + */ +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms); + +/** + * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier + * into route information needed to establish a connection. + * + * This is called on the client side of a connection. + * Users must have first called rdma_resolve_addr to resolve a dst_addr + * into an RDMA address before calling this routine. + */ +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms); + +/** + * rdma_create_qp - Allocate a QP and associate it with the specified RDMA + * identifier. + * + * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA + * through their states. + */ +int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA + * identifier. 
+ * + * Users must destroy any QP associated with an RDMA identifier before + * destroying the RDMA ID. + */ +void rdma_destroy_qp(struct rdma_cm_id *id); + +/** + * rdma_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. + * + * Users that wish to have their QP automatically transitioned through its + * states can associate a QP with the rdma_cm_id by calling rdma_create_qp(). + */ +int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask); + +struct rdma_conn_param { + const void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 flow_control; + u8 retry_count; /* ignored when accepting */ + u8 rnr_retry_count; + /* Fields below ignored if a QP is created on the rdma_cm_id. */ + u8 srq; + u32 qp_num; + enum ib_qp_type qp_type; +}; + +/** + * rdma_connect - Initiate an active connection request. + * + * Users must have resolved a route for the rdma_cm_id to connect with + * by having called rdma_resolve_route before calling this routine. + */ +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_listen - This function is called by the passive side to + * listen for incoming connection requests. 
+ *
+ * Users must have bound the rdma_cm_id to a local address by calling
+ * rdma_bind_addr before calling this routine.
+ */
+int rdma_listen(struct rdma_cm_id *id, int backlog);
+
+/**
+ * rdma_accept - Called to accept a connection request or response.
+ * @id: Connection identifier associated with the request.
+ * @conn_param: Information needed to establish the connection. This must be
+ *   provided if accepting a connection request. If accepting a connection
+ *   response, this parameter must be NULL.
+ *
+ * Typically, this routine is only called by the listener to accept a connection
+ * request. It must also be called on the active side of a connection if the
+ * user is performing their own QP transitions.
+ */
+int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param);
+
+/**
+ * rdma_reject - Called on the passive side to reject a connection request.
+ */
+int rdma_reject(struct rdma_cm_id *id, const void *private_data,
+		u8 private_data_len);
+
+/**
+ * rdma_disconnect - This function disconnects the associated QP.
+ */
+int rdma_disconnect(struct rdma_cm_id *id);
+
+#endif /* RDMA_CM_H */
+

From shemminger at osdl.org Tue Jan 17 15:38:38 2006
From: shemminger at osdl.org (Stephen Hemminger)
Date: Tue, 17 Jan 2006 15:38:38 -0800
Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction
In-Reply-To:
References:
Message-ID: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net>

Minor nits.

On Tue, 17 Jan 2006 15:24:37 -0800
"Sean Hefty" wrote:

> The following patch extends matching connection requests to listens in the
> Infiniband CM to include private data.
>
> Signed-off-by: Sean Hefty
>
> ---
> +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)

static void cm_mask_compare_data(u8 *dst, const u8 *src, u8 *mask)

but I would rename it to cm_mask_copy since it doesn't really do a compare.
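The suggestion above (a const-qualified source and a name that says "copy" rather than "compare") can be sketched as a standalone userspace program. This is only an illustration, not the code from the patch: PRIV_DATA_COMPARE_SIZE is a placeholder for IB_CM_PRIVATE_DATA_COMPARE_SIZE, whose real value is not shown here; making mask const as well goes one step beyond what the review asked for; and cm_masked_equal is a hypothetical helper added only to show where an actual comparison would live.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder size (assumption): the patch defines the real
 * IB_CM_PRIVATE_DATA_COMPARE_SIZE, which is not shown in this thread. */
#define PRIV_DATA_COMPARE_SIZE 64

typedef uint8_t u8;	/* userspace stand-in for the kernel's u8 */

/*
 * Copy src into dst, keeping only the bits selected by mask.
 * src and mask are only read, so both are const here; the review
 * comment only asked for const on src.
 */
static void cm_mask_copy(u8 *dst, const u8 *src, const u8 *mask)
{
	size_t i;

	for (i = 0; i < PRIV_DATA_COMPARE_SIZE; i++)
		dst[i] = src[i] & mask[i];
}

/*
 * Hypothetical helper, not part of the patch: returns 1 if data,
 * once masked, equals pattern, and 0 otherwise. pattern is expected
 * to be already masked by the caller.
 */
static int cm_masked_equal(const u8 *data, const u8 *pattern, const u8 *mask)
{
	u8 masked[PRIV_DATA_COMPARE_SIZE];

	cm_mask_copy(masked, data, mask);
	return memcmp(masked, pattern, PRIV_DATA_COMPARE_SIZE) == 0;
}
```

Splitting the masking from the comparison keeps each name honest: the first function only copies, and the second is the one that actually compares.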
> +{
> +	int i;
> +
> +	for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
> +		dst[i] = src[i] & mask[i];
> +}
> +
> +static int cm_compare_data(struct ib_cm_private_data_compare *src_data,
> +			   struct ib_cm_private_data_compare *dst_data)

static int cm_compare_data(const struct ib_cm_private_data_compare *src,
			   const struct ib_cm_private_data_compare *dst)

Your data type names are getting too long ^^^^^^^^^^^^^^^^^^^^^^^^

Also should infiniband exports be EXPORT_SYMBOL_GPL, to make it clear that
binary drivers for this are not allowed??

--
Stephen Hemminger
OSDL http://developer.osdl.org/~shemminger

From sean.hefty at intel.com Tue Jan 17 15:44:48 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Jan 2006 15:44:48 -0800
Subject: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction
In-Reply-To:
Message-ID:

This patch adds the kernel component to support the userspace Infiniband/RDMA
connection agent library.

Signed-off-by: Sean Hefty

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.000000000 -0800
@@ -1,5 +1,5 @@
 obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
- ib_cm.o ib_addr.o rdma_cm.o
+ ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o

 obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o
@@ -14,6 +14,8 @@ ib_cm-y := cm.o

 rdma_cm-y := cma.o

+rdma_ucm-y := ucma.o
+
 ib_addr-y := addr.o

 ib_umad-y := user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.000000000 -0800
@@ -0,0 +1,788 @@
+/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG = 128 +}; + +struct ucma_file { + struct semaphore mutex; + struct file *filp; + struct list_head ctxs; + struct list_head events; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_t ref; + int events_reported; + int backlog; + + struct ucma_file *file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_head events; /* list of pending events. */ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DECLARE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + up(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + down(&ctx->file->mutex); + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, struct ucma_event, + ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + /* clear incoming connections. 
*/ + if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) + rdma_destroy_id(uevent->cm_id); + + kfree(uevent); + } + up(&ctx->file->mutex); +} + +static struct ucma_context* ucma_alloc_ctx(struct ucma_file *file) +{ + struct ucma_context *ctx; + int ret; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + atomic_set(&ctx->ref, 1); + init_waitqueue_head(&ctx->wait); + ctx->file = file; + INIT_LIST_HEAD(&ctx->events); + + do { + ret = idr_pre_get(&ctx_idr, GFP_KERNEL); + if (!ret) + goto error; + + down(&ctx_mutex); + ret = idr_get_new(&ctx_idr, ctx, &ctx->id); + up(&ctx_mutex); + } while (ret == -EAGAIN); + + if (ret) + goto error; + + list_add_tail(&ctx->file_list, &file->ctxs); + return ctx; + +error: + kfree(ctx); + return NULL; +} + +static int ucma_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + struct ucma_event *uevent; + struct ucma_context *ctx = cm_id->context; + int ret = 0; + + uevent = kzalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) + return event->event == RDMA_CM_EVENT_CONNECT_REQUEST; + + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; + uevent->resp.event = event->event; + uevent->resp.status = event->status; + if ((uevent->resp.private_data_len = event->private_data_len)) + memcpy(uevent->resp.private_data, event->private_data, + event->private_data_len); + + down(&ctx->file->mutex); + if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) { + if (!ctx->backlog) { + ret = -EDQUOT; + goto out; + } + ctx->backlog--; + } + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + wake_up_interruptible(&ctx->file->poll_wait); +out: + up(&ctx->file->mutex); + return ret; +} + +static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ucma_context *ctx; + struct rdma_ucm_get_event cmd; + struct ucma_event *uevent; + int ret = 0; + 
DEFINE_WAIT(wait); + + if (out_len < sizeof(struct rdma_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&file->mutex); + while (list_empty(&file->events)) { + if (file->filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + up(&file->mutex); + schedule(); + down(&file->mutex); + finish_wait(&file->poll_wait, &wait); + } + + if (ret) + goto done; + + uevent = list_entry(file->events.next, struct ucma_event, file_list); + + if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) { + ctx = ucma_alloc_ctx(file); + if (!ctx) { + ret = -ENOMEM; + goto done; + } + uevent->ctx->backlog++; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + ret = -EFAULT; + goto done; + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + uevent->ctx->events_reported++; + kfree(uevent); +done: + up(&file->mutex); + return ret; +} + +static ssize_t ucma_create_id(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_create_id cmd; + struct rdma_ucm_create_id_resp resp; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&file->mutex); + ctx = ucma_alloc_ctx(file); + up(&file->mutex); + if (!ctx) + return -ENOMEM; + + ctx->uid = cmd.uid; + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); + if (IS_ERR(ctx->cm_id)) { + ret = PTR_ERR(ctx->cm_id); + goto err1; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + ret = -EFAULT; + goto err2; + } + return 0; + +err2: + 
rdma_destroy_id(ctx->cm_id); +err1: + down(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + kfree(ctx); + return ret; +} + +static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_destroy_id cmd; + struct rdma_ucm_destroy_id_resp resp; + struct ucma_context *ctx; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the id. */ + rdma_destroy_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. */ + ucma_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + + kfree(ctx); + return ret; +} + +static ssize_t ucma_bind_addr(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_bind_addr cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_addr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_addr cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = 
rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr, + (struct sockaddr *) &cmd.dst_addr, + cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_route cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_resolve_route(ctx->cm_id, cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, + struct rdma_route *route) +{ + struct rdma_dev_addr *dev_addr; + + resp->num_paths = route->num_paths; + switch (route->num_paths) { + case 0: + dev_addr = &route->addr.dev_addr; + memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), + sizeof(union ib_gid)); + memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr), + sizeof(union ib_gid)); + resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); + break; + case 2: + ib_copy_path_rec_to_user(&resp->ib_route[1], + &route->path_rec[1]); + /* fall through */ + case 1: + ib_copy_path_rec_to_user(&resp->ib_route[0], + &route->path_rec[0]); + break; + default: + break; + } +} + +static ssize_t ucma_query_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_query_route cmd; + struct rdma_ucm_query_route_resp resp; + struct ucma_context *ctx; + struct sockaddr *addr; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (!ctx->cm_id->device) { + ret = -ENODEV; + goto out; + } + + addr = &ctx->cm_id->route.addr.src_addr; + memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ? 
+ sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in6)); + addr = &ctx->cm_id->route.addr.dst_addr; + memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ? + sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in6)); + resp.node_guid = ctx->cm_id->device->node_guid; + resp.port_num = ctx->cm_id->port_num; + switch (ctx->cm_id->device->node_type) { + case IB_NODE_CA: + ucma_copy_ib_route(&resp, &ctx->cm_id->route); + default: + break; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_conn_param(struct rdma_conn_param *dst_conn, + struct rdma_ucm_conn_param *src_conn) +{ + dst_conn->private_data = src_conn->private_data; + dst_conn->private_data_len = src_conn->private_data_len; + dst_conn->responder_resources =src_conn->responder_resources; + dst_conn->initiator_depth = src_conn->initiator_depth; + dst_conn->flow_control = src_conn->flow_control; + dst_conn->retry_count = src_conn->retry_count; + dst_conn->rnr_retry_count = src_conn->rnr_retry_count; + dst_conn->srq = src_conn->srq; + dst_conn->qp_num = src_conn->qp_num; + dst_conn->qp_type = src_conn->qp_type; +} + +static ssize_t ucma_connect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_connect cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + if (!cmd.conn_param.valid) + return -EINVAL; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_connect(ctx->cm_id, &conn_param); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_listen(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_listen cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, 
sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ctx->backlog = cmd.backlog > 0 && cmd.backlog < UCMA_MAX_BACKLOG ? + cmd.backlog : UCMA_MAX_BACKLOG; + ret = rdma_listen(ctx->cm_id, ctx->backlog); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_accept(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_accept cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (cmd.conn_param.valid) { + ctx->uid = cmd.uid; + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_accept(ctx->cm_id, &conn_param); + } else + ret = rdma_accept(ctx->cm_id, NULL); + + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_reject(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_reject cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_reject(ctx->cm_id, cmd.private_data, cmd.private_data_len); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_disconnect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_disconnect cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_disconnect(ctx->cm_id); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_init_qp_attr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_init_qp_attr cmd; + struct ib_uverbs_qp_attr resp; + struct ucma_context *ctx; + struct ib_qp_attr 
qp_attr; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.qp_attr_mask = 0; + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.qp_state = cmd.qp_state; + ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); + if (ret) + goto out; + + ib_copy_qp_attr_to_user(&resp, &qp_attr); + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id, + [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id, + [RDMA_USER_CM_CMD_BIND_ADDR] = ucma_bind_addr, + [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr, + [RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route, + [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route, + [RDMA_USER_CM_CMD_CONNECT] = ucma_connect, + [RDMA_USER_CM_CMD_LISTEN] = ucma_listen, + [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept, + [RDMA_USER_CM_CMD_REJECT] = ucma_reject, + [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, + [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event +}; + +static ssize_t ucma_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ucma_file *file = filp->private_data; + struct rdma_ucm_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + ret = ucma_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static unsigned int ucma_poll(struct file *filp, struct 
poll_table_struct *wait) +{ + struct ucma_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + down(&file->mutex); + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + up(&file->mutex); + + return mask; +} + +static int ucma_open(struct inode *inode, struct file *filp) +{ + struct ucma_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + return 0; +} + +static int ucma_close(struct inode *inode, struct file *filp) +{ + struct ucma_file *file = filp->private_data; + struct ucma_context *ctx; + + down(&file->mutex); + while (!list_empty(&file->ctxs)) { + ctx = list_entry(file->ctxs.next, struct ucma_context, + file_list); + up(&file->mutex); + + down(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + + rdma_destroy_id(ctx->cm_id); + ucma_cleanup_events(ctx); + kfree(ctx); + + down(&file->mutex); + } + up(&file->mutex); + kfree(file); + return 0; +} + +static struct file_operations ucma_fops = { + .owner = THIS_MODULE, + .open = ucma_open, + .release = ucma_close, + .write = ucma_write, + .poll = ucma_poll, +}; + +static struct miscdevice ucma_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "rdma_cm", + .fops = &ucma_fops, +}; + +static int __init ucma_init(void) +{ + return misc_register(&ucma_misc); +} + +static void __exit ucma_cleanup(void) +{ + misc_deregister(&ucma_misc); + idr_destroy(&ctx_idr); +} + +module_init(ucma_init); +module_exit(ucma_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/rdma_user_cm.h linux-2.6.ib/include/rdma/rdma_user_cm.h --- linux-2.6.git/include/rdma/rdma_user_cm.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/rdma_user_cm.h 2006-01-16 16:54:55.000000000 -0800 @@ -0,0 +1,186 
@@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef RDMA_USER_CM_H +#define RDMA_USER_CM_H + +#include +#include +#include +#include + +#define RDMA_USER_CM_ABI_VERSION 1 + +#define RDMA_MAX_PRIVATE_DATA 256 + +enum { + RDMA_USER_CM_CMD_CREATE_ID, + RDMA_USER_CM_CMD_DESTROY_ID, + RDMA_USER_CM_CMD_BIND_ADDR, + RDMA_USER_CM_CMD_RESOLVE_ADDR, + RDMA_USER_CM_CMD_RESOLVE_ROUTE, + RDMA_USER_CM_CMD_QUERY_ROUTE, + RDMA_USER_CM_CMD_CONNECT, + RDMA_USER_CM_CMD_LISTEN, + RDMA_USER_CM_CMD_ACCEPT, + RDMA_USER_CM_CMD_REJECT, + RDMA_USER_CM_CMD_DISCONNECT, + RDMA_USER_CM_CMD_INIT_QP_ATTR, + RDMA_USER_CM_CMD_GET_EVENT +}; + +/* + * command ABI structures. + */ +struct rdma_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct rdma_ucm_create_id { + __u64 uid; + __u64 response; +}; + +struct rdma_ucm_create_id_resp { + __u32 id; +}; + +struct rdma_ucm_destroy_id { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_destroy_id_resp { + __u32 events_reported; +}; + +struct rdma_ucm_bind_addr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + +struct rdma_ucm_resolve_addr { + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + __u32 id; + __u32 timeout_ms; +}; + +struct rdma_ucm_resolve_route { + __u32 id; + __u32 timeout_ms; +}; + +struct rdma_ucm_query_route { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_query_route_resp { + __u64 node_guid; + struct ib_user_path_rec ib_route[2]; + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + __u32 num_paths; + __u8 port_num; + __u8 reserved[3]; +}; + +struct rdma_ucm_conn_param { + __u32 qp_num; + __u32 qp_type; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; + __u8 private_data_len; + __u8 srq; + __u8 responder_resources; + __u8 initiator_depth; + __u8 flow_control; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 valid; +}; + +struct rdma_ucm_connect { + struct rdma_ucm_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_listen { + __u32 id; + __u32 backlog; 
+}; + +struct rdma_ucm_accept { + __u64 uid; + struct rdma_ucm_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_reject { + __u32 id; + __u8 private_data_len; + __u8 reserved[3]; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +struct rdma_ucm_disconnect { + __u32 id; +}; + +struct rdma_ucm_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct rdma_ucm_get_event { + __u64 response; +}; + +struct rdma_ucm_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u8 private_data_len; + __u8 reserved[3]; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +#endif /* RDMA_USER_CM_H */ From sean.hefty at intel.com Tue Jan 17 15:51:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jan 2006 15:51:23 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> Message-ID: >> +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) > >static void cm_mask_compare_data(u8 *dst, const u8 *src, u8 *mask) > >but I would rename it to cm_mask_copy since it doesn't really do a compare. I'll change this. The function is masking the "data to use in the comparison", but I can see the confusion. >> +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, >> + struct ib_cm_private_data_compare *dst_data) > >static int cm_compare_data(const struct ib_cm_private_data_compare *src, > const struct ib_cm_private_data_compare *dst) >Your data type names are getting too long ^^^^^^^^^^^^^^^^^^^^^^^^ I'll fix. Thanks for the comments. 
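For reference, the two review suggestions above (const-qualified inputs and the cm_mask_copy name) can be tried out in a small userspace sketch. This is only an illustration under stated assumptions, not the revised patch: uint8_t stands in for the kernel's u8, the 64-byte size is the IB_CM_PRIVATE_DATA_COMPARE_SIZE value from the patch, and making the mask argument const as well is an assumption that goes beyond what the review asked for.

```c
#include <stdint.h>
#include <string.h>

/* Value taken from the patch's enum ib_cm_data_size. */
#define IB_CM_PRIVATE_DATA_COMPARE_SIZE 64

/* Userspace stand-in for the structure added to ib_cm.h by the patch. */
struct ib_cm_private_data_compare {
	uint8_t data[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
	uint8_t mask[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
};

/*
 * Renamed from cm_mask_compare_data as suggested in the review: it only
 * writes the masked bytes of src into dst, it does not compare anything.
 * The const on mask is an assumption made for this sketch.
 */
static void cm_mask_copy(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
	int i;

	for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
		dst[i] = src[i] & mask[i];
}

/*
 * Same logic as cm_compare_data in the patch, with const inputs.
 * Returns 0 when the entries match under each other's mask, or when
 * either entry is absent.
 */
static int cm_compare_data(const struct ib_cm_private_data_compare *src_data,
			   const struct ib_cm_private_data_compare *dst_data)
{
	uint8_t src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
	uint8_t dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE];

	if (!src_data || !dst_data)
		return 0;

	cm_mask_copy(src, src_data->data, dst_data->mask);
	cm_mask_copy(dst, dst_data->data, src_data->mask);
	return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE);
}
```

Neither helper writes through its input pointers, so the const qualifiers document existing behaviour and should not require any change at the call sites.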
- Sean From iod00d at hp.com Tue Jan 17 18:03:42 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 17 Jan 2006 18:03:42 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: References: Message-ID: <20060118020342.GB3740@esmail.cup.hp.com> On Tue, Jan 17, 2006 at 03:24:37PM -0800, Sean Hefty wrote: > +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) > +{ > + int i; > + > + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) > + dst[i] = src[i] & mask[i]; > +} Is this code going to get invoked very often? If so, can the mask operation use a "native" size since IB_CM_PRIVATE_DATA_COMPARE_SIZE is hard coded to 64 byte? e.g something like: for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); i++) ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] & ((unsigned long *)mask)[i]; thanks, grant > + > +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, > + struct ib_cm_private_data_compare *dst_data) > +{ > + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; > + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; > + > + if (!src_data || !dst_data) > + return 0; > + > + cm_mask_compare_data(src, src_data->data, dst_data->mask); > + cm_mask_compare_data(dst, dst_data->data, src_data->mask); > + return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE); > +} > + > +static int cm_compare_private_data(u8 *private_data, > + struct ib_cm_private_data_compare *dst_data) > +{ > + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; > + > + if (!dst_data) > + return 0; > + > + cm_mask_compare_data(src, private_data, dst_data->mask); > + return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE); > +} > + > static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) > { > struct rb_node **link = &cm.listen_service_table.rb_node; > @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ > struct cm_id_private *cur_cm_id_priv; > __be64 service_id = cm_id_priv->id.service_id; 
> __be64 service_mask = cm_id_priv->id.service_mask; > + int data_cmp; > > while (*link) { > parent = *link; > cur_cm_id_priv = rb_entry(parent, struct cm_id_private, > service_node); > + data_cmp = cm_compare_data(cm_id_priv->compare_data, > + cur_cm_id_priv->compare_data); > if ((cur_cm_id_priv->id.service_mask & service_id) == > (service_mask & cur_cm_id_priv->id.service_id) && > - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) > + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && > + !data_cmp) > return cur_cm_id_priv; > > if (cm_id_priv->id.device < cur_cm_id_priv->id.device) > @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ > link = &(*link)->rb_right; > else if (service_id < cur_cm_id_priv->id.service_id) > link = &(*link)->rb_left; > + else if (service_id > cur_cm_id_priv->id.service_id) > + link = &(*link)->rb_right; > + else if (data_cmp < 0) > + link = &(*link)->rb_left; > else > link = &(*link)->rb_right; > } > @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ > } > > static struct cm_id_private * cm_find_listen(struct ib_device *device, > - __be64 service_id) > + __be64 service_id, > + u8 *private_data) > { > struct rb_node *node = cm.listen_service_table.rb_node; > struct cm_id_private *cm_id_priv; > + int data_cmp; > > while (node) { > cm_id_priv = rb_entry(node, struct cm_id_private, service_node); > + data_cmp = cm_compare_private_data(private_data, > + cm_id_priv->compare_data); > if ((cm_id_priv->id.service_mask & service_id) == > cm_id_priv->id.service_id && > - (cm_id_priv->id.device == device)) > + (cm_id_priv->id.device == device) && !data_cmp) > return cm_id_priv; > > if (device < cm_id_priv->id.device) > @@ -405,6 +452,10 @@ static struct cm_id_private * cm_find_li > node = node->rb_right; > else if (service_id < cm_id_priv->id.service_id) > node = node->rb_left; > + else if (service_id > cm_id_priv->id.service_id) > + node = node->rb_right; > + else if (data_cmp < 0) > + node = node->rb_left; > else 
> node = node->rb_right; > } > @@ -728,15 +779,14 @@ retest: > wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount)); > while ((work = cm_dequeue_work(cm_id_priv)) != NULL) > cm_free_work(work); > - if (cm_id_priv->private_data && cm_id_priv->private_data_len) > - kfree(cm_id_priv->private_data); > + kfree(cm_id_priv->compare_data); > + kfree(cm_id_priv->private_data); > kfree(cm_id_priv); > } > EXPORT_SYMBOL(ib_destroy_cm_id); > > -int ib_cm_listen(struct ib_cm_id *cm_id, > - __be64 service_id, > - __be64 service_mask) > +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, > + struct ib_cm_private_data_compare *compare_data) > { > struct cm_id_private *cm_id_priv, *cur_cm_id_priv; > unsigned long flags; > @@ -750,7 +800,19 @@ int ib_cm_listen(struct ib_cm_id *cm_id, > return -EINVAL; > > cm_id_priv = container_of(cm_id, struct cm_id_private, id); > - BUG_ON(cm_id->state != IB_CM_IDLE); > + if (cm_id->state != IB_CM_IDLE) > + return -EINVAL; > + > + if (compare_data) { > + cm_id_priv->compare_data = kzalloc(sizeof *compare_data, > + GFP_KERNEL); > + if (!cm_id_priv->compare_data) > + return -ENOMEM; > + cm_mask_compare_data(cm_id_priv->compare_data->data, > + compare_data->data, compare_data->mask); > + memcpy(cm_id_priv->compare_data->mask, compare_data->mask, > + IB_CM_PRIVATE_DATA_COMPARE_SIZE); > + } > > cm_id->state = IB_CM_LISTEN; > > @@ -767,6 +829,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, > > if (cur_cm_id_priv) { > cm_id->state = IB_CM_IDLE; > + kfree(cm_id_priv->compare_data); > + cm_id_priv->compare_data = NULL; > ret = -EBUSY; > } > return ret; > @@ -1239,7 +1303,8 @@ static struct cm_id_private * cm_match_r > > /* Find matching listen request. 
*/ > listen_cm_id_priv = cm_find_listen(cm_id_priv->id.device, > - req_msg->service_id); > + req_msg->service_id, > + req_msg->private_data); > if (!listen_cm_id_priv) { > spin_unlock_irqrestore(&cm.lock, flags); > cm_issue_rej(work->port, work->mad_recv_wc, > @@ -2646,7 +2711,8 @@ static int cm_sidr_req_handler(struct cm > goto out; /* Duplicate message. */ > } > cur_cm_id_priv = cm_find_listen(cm_id->device, > - sidr_req_msg->service_id); > + sidr_req_msg->service_id, > + sidr_req_msg->private_data); > if (!cur_cm_id_priv) { > rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table); > spin_unlock_irqrestore(&cm.lock, flags); > diff -uprN -X linux-2.6.git/Documentation/dontdiff > linux-2.6.git/drivers/infiniband/core/ucm.c > linux-2.6.ib/drivers/infiniband/core/ucm.c > --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 16:03:08.000000000 -0800 > +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 16:03:35.000000000 -0800 > @@ -646,6 +646,17 @@ out: > return result; > } > > +static int ucm_validate_listen(__be64 service_id, __be64 service_mask) > +{ > + service_id &= service_mask; > + > + if (((service_id & IB_CMA_SERVICE_ID_MASK) == IB_CMA_SERVICE_ID) || > + ((service_id & IB_SDP_SERVICE_ID_MASK) == IB_SDP_SERVICE_ID)) > + return -EINVAL; > + > + return 0; > +} > + > static ssize_t ib_ucm_listen(struct ib_ucm_file *file, > const char __user *inbuf, > int in_len, int out_len) > @@ -661,7 +672,13 @@ static ssize_t ib_ucm_listen(struct ib_u > if (IS_ERR(ctx)) > return PTR_ERR(ctx); > > - result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask); > + result = ucm_validate_listen(cmd.service_id, cmd.service_mask); > + if (result) > + goto out; > + > + result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask, > + NULL); > +out: > ib_ucm_ctx_put(ctx); > return result; > } > diff -uprN -X linux-2.6.git/Documentation/dontdiff > linux-2.6.git/include/rdma/ib_cm.h > linux-2.6.ib/include/rdma/ib_cm.h > --- 
linux-2.6.git/include/rdma/ib_cm.h 2006-01-16 10:26:47.000000000 -0800 > +++ linux-2.6.ib/include/rdma/ib_cm.h 2006-01-16 16:03:35.000000000 -0800 > @@ -32,7 +32,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: ib_cm.h 2730 2005-06-28 16:43:03Z sean.hefty $ > + * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ > */ > #if !defined(IB_CM_H) > #define IB_CM_H > @@ -102,7 +102,8 @@ enum ib_cm_data_size { > IB_CM_APR_INFO_LENGTH = 72, > IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE = 216, > IB_CM_SIDR_REP_PRIVATE_DATA_SIZE = 136, > - IB_CM_SIDR_REP_INFO_LENGTH = 72 > + IB_CM_SIDR_REP_INFO_LENGTH = 72, > + IB_CM_PRIVATE_DATA_COMPARE_SIZE = 64 > }; > > struct ib_cm_id; > @@ -238,7 +239,6 @@ struct ib_cm_sidr_rep_event_param { > u32 qpn; > void *info; > u8 info_len; > - > }; > > struct ib_cm_event { > @@ -317,6 +317,15 @@ void ib_destroy_cm_id(struct ib_cm_id *c > > #define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL) > #define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL) > +#define IB_CMA_SERVICE_ID __constant_cpu_to_be64(0x0000000001000000ULL) > +#define IB_CMA_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFF000000ULL) > +#define IB_SDP_SERVICE_ID __constant_cpu_to_be64(0x0000000000010000ULL) > +#define IB_SDP_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFFFF0000ULL) > + > +struct ib_cm_private_data_compare { > + u8 data[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; > + u8 mask[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; > +}; > > /** > * ib_cm_listen - Initiates listening on the specified service ID for > @@ -330,10 +339,12 @@ void ib_destroy_cm_id(struct ib_cm_id *c > * range of service IDs. If set to 0, the service ID is matched > * exactly. This parameter is ignored if %service_id is set to > * IB_CM_ASSIGN_SERVICE_ID. > + * @compare_data: This parameter is optional. It specifies data that must > + * appear in the private data of a connection request for the specified > + * listen request. 
> */ > -int ib_cm_listen(struct ib_cm_id *cm_id, > - __be64 service_id, > - __be64 service_mask); > +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, > + struct ib_cm_private_data_compare *compare_data); > > struct ib_cm_req_param { > struct ib_sa_path_rec *primary_path; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Wed Jan 18 01:13:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 11:13:13 +0200 Subject: [openib-general] [PATCH] mthca: fix sgid for port 2 mad Message-ID: <20060118091313.GY22260@mellanox.co.il> mthca_create_ah includes the port number in the gid index. The reverse needs to be done in mthca_read_ah. Noted by Hal Rosenstock. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-14 18:40:12.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-18 02:35:18.000000000 +0200 @@ -182,7 +182,7 @@ int mthca_read_ah(struct mthca_dev *dev, ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); ib_get_cached_gid(&dev->ib_dev, be32_to_cpu(ah->av->port_pd) >> 24, - ah->av->gid_index, + ah->av->gid_index % dev->limits.gid_table_len, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); -- MST From halr at voltaire.com Wed Jan 18 02:23:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 18 Jan 2006 12:23:19 +0200 Subject: [openib-general] [PATCH] mthca: fix sgid for port 2 mad Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AC05@taurus.voltaire.com> One comment/question on this is that while I know ib_get_cached_gid _should_ not fail here, it did because of 
this. Should the return be checked and handled just in case ? Actually, this (prior to this patch) had the interesting effect of sending a GRH with an SGID of 0 on port 2. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Michael S. Tsirkin Sent: Wed 1/18/2006 4:13 AM To: openib-general at openib.org; Roland Dreier Subject: [openib-general] [PATCH] mthca: fix sgid for port 2 mad mthca_create_ah includes the port number in the gid index. The reverse needs to be done in mthca_read_ah. Noted by Hal Rosenstock. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-14 18:40:12.000000000 +0200 +++ openib/drivers/infiniband/hw/mthca/mthca_av.c 2006-01-18 02:35:18.000000000 +0200 @@ -182,7 +182,7 @@ int mthca_read_ah(struct mthca_dev *dev, ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); ib_get_cached_gid(&dev->ib_dev, be32_to_cpu(ah->av->port_pd) >> 24, - ah->av->gid_index, + ah->av->gid_index % dev->limits.gid_table_len, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); -- MST From bardov at gmail.com Wed Jan 18 02:45:06 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Wed, 18 Jan 2006 12:45:06 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the page size In-Reply-To: References: Message-ID: On 1/17/06, Roland Dreier wrote: > Seems reasonable. Unfortunately we just missed the 2.6.16-rc1 window > so I think this should wait for the 2.6.17 window. > > BTW, do you ever see the SCSI layer giving you 512 byte blocks? 
Yes, mostly those are sg elements that are contiguous with other elements to page_size. In some rare cases, we see non-contiguous elements smaller than page_size (or 4K) which are non-rdma-able as a single unit; in such cases we must copy. > > - R. From jackm at mellanox.co.il Wed Jan 18 05:05:36 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 18 Jan 2006 15:05:36 +0200 Subject: [openib-general] [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID. Message-ID: <20060118130536.GA24415@mellanox.co.il> Prevent multiple outstanding MAD transactions with the same TID. This could happen if duplicate requests are posted. Signed-off-by: Jack Morgenstein Index: latest/drivers/infiniband/core/mad.c =================================================================== --- latest.orig/drivers/infiniband/core/mad.c +++ latest/drivers/infiniband/core/mad.c @@ -964,6 +964,12 @@ int ib_post_send_mad(struct ib_mad_send_ /* Reference MAD agent until send completes */ atomic_inc(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (ib_find_send_mad(mad_agent_priv, mad_send_wr->tid)) { + /* Duplicate send request */ + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + return -EBUSY; + } list_add_tail(&mad_send_wr->agent_list, &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); From ogerlitz at voltaire.com Wed Jan 18 05:42:18 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 18 Jan 2006 15:42:18 +0200 (IST) Subject: [openib-general] [PATCH] iser: add struct iser_hdr Message-ID: applied in r5060 remove iser_header.h, have the iser header in struct iser_hdr which is encoded for the wire 
Signed-off-by: Or Gerlitz Index: ulp/iser/iser.h =================================================================== --- ulp/iser/iser.h (revision 5033) +++ ulp/iser/iser.h (revision 5060) @@ -47,8 +47,6 @@ #include #include "iscsi_iser.h" -#include "iser_header.h" - #include #include #include @@ -59,8 +57,23 @@ #define PFX "iser:" +struct iser_hdr { + u8 flags; + u8 rsvd[3]; + __be32 write_stag; /* write rkey */ + __be64 write_va; + __be32 read_stag; /* read rkey */ + __be64 read_va; +} __attribute__((packed)); + +#define ISER_VER 0x10 +#define ISER_WSV 0x08 +#define ISER_RSV 0x04 + /* Constant PDU lengths calculations */ -#define ISER_PDU_BHS_LENGTH 48 +#define ISER_HDR_LEN sizeof (struct iser_hdr) +#define ISER_PDU_BHS_LENGTH sizeof (struct iscsi_hdr) + #define ISER_TOTAL_HEADERS_LEN \ (ISER_HDR_LEN + ISER_PDU_BHS_LENGTH) Index: ulp/iser/iser_dto.c =================================================================== --- ulp/iser/iser_dto.c (revision 5033) +++ ulp/iser/iser_dto.c (revision 5060) @@ -203,7 +203,8 @@ struct iser_dto *iser_dto_send_create(st p_iser_header = (unsigned char *)p_regd_hdr->virt_addr; memset(p_iser_header, 0, ISER_HDR_LEN); - ISER_HDR_SET_VERSION(p_iser_header); + ((struct iser_hdr *)p_iser_header)->flags = ISER_VER; + memcpy(p_iser_header + ISER_HDR_LEN, hdr, ISER_PDU_BHS_LENGTH); iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, USE_NO_OFFSET, USE_SIZE(ISER_TOTAL_HEADERS_LEN)); Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 5033) +++ ulp/iser/iser_initiator.c (revision 5060) @@ -135,6 +135,7 @@ static int iser_prepare_read_cmd(struct dma_addr_t dma_addr; int dma_nents; struct device *dma_device; + struct iser_hdr *hdr = (struct iser_hdr *)p_iser_header; p_iser_task->dir[ISER_DIR_IN] = 1; dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; @@ -177,13 +178,15 @@ static int iser_prepare_read_cmd(struct return err; } p_regd_buf = 
p_iser_task->rdma_regd[ISER_DIR_IN]; - ISER_HDR_SET_BITS(p_iser_header, RSV, 1); - ISER_HDR_R_VADDR(p_iser_header) = cpu_to_be64(p_regd_buf->reg.va); - ISER_HDR_R_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); - iser_dbg("Cmd itt:%d, READ tags, RKEY:0x%08X VA:0x%08lX\n", + hdr->flags |= ISER_RSV; + hdr->read_stag = cpu_to_be32(p_regd_buf->reg.rkey); + hdr->read_va = cpu_to_be64(p_regd_buf->reg.va); + + iser_err("Cmd itt:%d READ tags RKEY:%#.4X VA:%#llX\n", p_iser_task->itt, p_regd_buf->reg.rkey, - (unsigned long)p_regd_buf->reg.va); + (unsigned long long)p_regd_buf->reg.va); + return 0; } @@ -205,6 +208,7 @@ iser_prepare_write_cmd(struct iscsi_iser dma_addr_t dma_addr; int dma_nents; struct device *dma_device; + struct iser_hdr *hdr = (struct iser_hdr *)p_iser_header; p_iser_task->dir[ISER_DIR_OUT] = 1; dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; @@ -252,16 +256,14 @@ iser_prepare_write_cmd(struct iscsi_iser p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_OUT]; if(unsol_sz < edtl) { - ISER_HDR_SET_BITS(p_iser_header, WSV, 1); - ISER_HDR_W_VADDR(p_iser_header) = cpu_to_be64( - p_regd_buf->reg.va + unsol_sz); - ISER_HDR_W_RKEY(p_iser_header) = htonl(p_regd_buf->reg.rkey); + hdr->flags |= ISER_WSV; + hdr->write_stag = cpu_to_be32(p_regd_buf->reg.rkey); + hdr->write_va = cpu_to_be64(p_regd_buf->reg.va + unsol_sz); - iser_dbg("Cmd itt:%d, WRITE tags, RKEY:0x%08X " - "VA:0x%08lX + unsol:%d\n", + iser_err("Cmd itt:%d, WRITE tags, RKEY:%#.4X " + "VA:%#llX + unsol:%d\n", p_iser_task->itt, p_regd_buf->reg.rkey, - (unsigned long)p_regd_buf->reg.va, - unsol_sz); + (unsigned long long)p_regd_buf->reg.va, unsol_sz); } if (imm_sz > 0) { Index: ulp/iser/iser_header.h =================================================================== --- ulp/iser/iser_header.h (revision 5033) +++ ulp/iser/iser_header.h (revision 5060) @@ -1,164 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. 
- * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - -#ifndef __ISER_HEADER_H__ -#define __ISER_HEADER_H__ - -/* - BYTE |0 1 2 3 - |0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -DWORD | |W|R| | - 0 | 0001b |S|S| Reserved | - | |V|V| | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 1 | Write STag high (or N/A) = RKey | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 2 | Write STag low (or N/A) = Virt.Addr. 
| - 3 | | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 4 | Read STag high (or N/A) = RKey | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 5 | Read STag low (or N/A) = Virt.Addr. | - 6 | | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -*/ - -#define ISER_HDR_LEN 28 - -/* Version */ -#define ISER_HDR_VER_OFFSET 0 /* in byte 0 */ -#define ISER_HDR_VER_SHIFT 4 /* last bit 3 */ -#define ISER_HDR_VER_MASK 0x0F /* 4 bits */ - -/* WSV - Write STag Valid flag */ -#define ISER_HDR_WSV_OFFSET 0 /* in byte 0 */ -#define ISER_HDR_WSV_SHIFT 3 /* last bit 4 */ -#define ISER_HDR_WSV_MASK 0x01 /* 1 bit */ - -/* RSV - Read STag Valid flag */ -#define ISER_HDR_RSV_OFFSET 0 /* in byte 0 */ -#define ISER_HDR_RSV_SHIFT 2 /* last bit 5 */ -#define ISER_HDR_RSV_MASK 0x01 /* 1 bit */ - -/* Retrieve a bit field from a header byte array. - Possible field names are: VER, WSV, RSV -*/ -#define ISER_HDR_GET_BITS(p,field) \ - ((((unsigned char *)p)[ISER_HDR_ ## field ## _OFFSET] >> \ - ISER_HDR_ ## field ## _SHIFT) & \ - ISER_HDR_ ## field ## _MASK) - -/* Set the passed value in the bit field - Possible field names are: VER, WSV, RSV -*/ -#define ISER_HDR_SET_BITS(p,field,val) \ - do { \ - ((unsigned char *)p)[ISER_HDR_ ## field ## _OFFSET] &= \ - ~(ISER_HDR_ ## field ## _MASK << \ - ISER_HDR_ ## field ## _SHIFT); \ - ((unsigned char *)p)[ISER_HDR_ ## field ## _OFFSET] |= \ - ((val) & ISER_HDR_ ## field ## _MASK) << \ - ISER_HDR_ ## field ## _SHIFT; \ - } while(0) - -#define ISER_HDR_SET_VERSION(p) ISER_HDR_SET_BITS(p,VER,0x01) - -/* Access to the fields Read S-Tag, Write S-Tag by 32-bit halves. - Returns l-value. 
-*/ -#define ISER_HDR_W_RKEY(p) \ - (*(u32 *) (((unsigned char *) p) + 4)) -#define ISER_HDR_W_VADDR(p) \ - (*(u64 *) (((unsigned char *) p) + 8)) - -#define ISER_HDR_R_RKEY(p) \ - (*(u32 *) (((unsigned char *) p) + 16)) -#define ISER_HDR_R_VADDR(p) \ - (*(u64 *) (((unsigned char *) p) + 20)) - -/* - BYTE |0 1 2 3 - |0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -DWORD | - 0 | 0010b | Rsvd | MaxVer| MinVer| Reserved | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 1 | Reserved | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 2 | Reserved | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 3 | IBVer | IPVer | InitiatorRecvDataSegmentLength | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 4 | ICap | Reserved | Local Port | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 5 | Src IP (127-96) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 6 | Src IP ( 95-64) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 7 | Src IP ( 63-32) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 8 | Src IP ( 31-00) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 9 | Dst IP (127-96) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 10 | Dst IP ( 95-64) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 11 | Dst IP ( 63-32) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - 12 | Dst IP ( 31-00) | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -*/ - -#define ISER_HELLO_LEN 52 - -/* - BYTE |0 1 2 3 - |0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -DWORD | | |R| | | | - 0 | 0011b |Rsvd |E| MaxVer| CurVer| iSER-ORD | - | | |J| | | | 
- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- 1 | Reserved |
- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- 2 | Reserved |
- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- 3 | IBVer | TCap | TargetRecvDataSegmentLength |
- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-*/
-
-#define ISER_HELLO_REPLY_LEN 16
-
-#endif /* __ISER_HEADER_H__ */

From devesh28 at gmail.com Wed Jan 18 05:46:49 2006
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 18 Jan 2006 19:16:49 +0530
Subject: [openib-general] Significance of IB_SIGNAL_REQ_WR
Message-ID: <309a667c0601180546r58585abl8300b84e8aa43f94@mail.gmail.com>

Hi Hal and list,

I want to know the significance of the QP attribute IB_SIGNAL_REQ_WR. Is the concept of this flag similar to that of the DAT_COMPLETION_SOLICITED_WAIT flag?

From sashak at voltaire.com Wed Jan 18 06:28:53 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Jan 2006 16:28:53 +0200
Subject: [openib-general] [patch] userspace/management: ARGBEGIN() -> getopt() conversion for diags
Message-ID: <20060118142853.GA19642@sashak.voltaire.com>

Hi Hal,

The diag utils are converted to getopt(). This has had only basic testing, so please report bugs (if any).

Sasha.

This converts the diag utils to use the more standard getopt() instead of the buggy ARGBEGIN() macros. The now-unused ARGBEGIN()-related code is removed from libibcommon.

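For readers unfamiliar with the pattern, the conversion follows the usual getopt_long() loop. The sketch below is a cut-down, hypothetical example and not part of the patch: struct diag_opts and parse_diag_opts() are invented for illustration, only a handful of the options are shown, and a libc that provides getopt_long() (glibc, for instance) is assumed.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for a few of the options the diag tools share. */
struct diag_opts {
	const char *ca;			/* -C ca_name */
	int ca_port;			/* -P hca_port */
	unsigned long timeout_ms;	/* -t / --timeout */
	int debug;			/* -d / --debug, may be repeated */
	int verbose;			/* -v / --verbose, may be repeated */
};

/*
 * Parse options into *o. Returns the index of the first non-option
 * argument (the value of optind), or -1 if an unknown option or a
 * missing option argument was seen.
 */
static int parse_diag_opts(int argc, char **argv, struct diag_opts *o)
{
	static const char str_opts[] = "C:P:t:dv";
	static const struct option long_opts[] = {
		{ "debug",   no_argument,       NULL, 'd' },
		{ "verbose", no_argument,       NULL, 'v' },
		{ "timeout", required_argument, NULL, 't' },
		{ NULL, 0, NULL, 0 }	/* explicit terminator */
	};
	int ch;

	assert(o != NULL);

	while ((ch = getopt_long(argc, argv, str_opts, long_opts, NULL)) != -1) {
		switch (ch) {
		case 'C':
			o->ca = optarg;	/* optarg replaces ARGF() */
			break;
		case 'P':
			o->ca_port = (int)strtoul(optarg, NULL, 0);
			break;
		case 't':
			o->timeout_ms = strtoul(optarg, NULL, 0);
			break;
		case 'd':
			o->debug++;
			break;
		case 'v':
			o->verbose++;
			break;
		default:	/* '?' from getopt_long() */
			return -1;
		}
	}
	return optind;
}
```

A caller would then skip the parsed options with argc -= optind; argv += optind; exactly as the converted tools do, so that argv[0] is the first positional argument.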
Signed-off-by: Sasha Khapyorsky Index: diags/src/ibtracert.c =================================================================== --- diags/src/ibtracert.c (revision 5057) +++ diags/src/ibtracert.c (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -54,6 +55,8 @@ int force; FILE *f; +static char *argv0 = "ibtracert"; + #undef DEBUG #define DEBUG if (ibdebug || verbose) IBWARN #define VERBOSE if (ibdebug || verbose > 1) IBWARN @@ -726,7 +729,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -v(erbose) -D(irect_path_addrs) -G(uid_addrs) -v(erbose) -n(o_info) -C ca_name -P hca_port " - "-s smlid -t timeout_ms -m mlid] \n", + "-s smlid -t(imeout) timeout_ms -m mlid] \n", basename); fprintf(stderr, "\n\tUnicast examples:\n"); fprintf(stderr, "\t\t%s 4 16\t\t\t# show path between lids 4 and 16\n", basename); @@ -753,57 +756,85 @@ char *ca = 0; int ca_port = 0; + static char const str_opts[] = "C:P:t:s:m:dvfDGnVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "verbose", 0, 0, 'v'}, + { "force", 0, 0, 'f'}, + { "Direct_path_addrs", 0, 0, 'D'}, + { "Guid_addrs", 0, 0, 'G'}, + { "no_info", 0, 0, 'n'}, + { "timeout", 1, 0, 't'}, + { "s", 1, 0, 's'}, + { "m", 1, 0, 'm'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + + argv0 = argv[0]; + f = stderr; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'm': - multicast++; - mlid = strtoul(ARGF(), 0, 0); - break; - case 'f': - force++; - break; - case 'n': - dumplevel = 1; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id 
= &sm_portid; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - madrpc_show_errors(1); - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'm': + multicast++; + mlid = strtoul(optarg, 0, 0); + break; + case 'f': + force++; + break; + case 'n': + dumplevel = 1; + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + madrpc_show_errors(1); + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; if (argc < 2) usage(); Index: diags/src/ibnetdiscover.c =================================================================== --- diags/src/ibnetdiscover.c (revision 5057) +++ diags/src/ibnetdiscover.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -60,6 +61,8 @@ #define DEBUG if (verbose>1) IBWARN #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +static char *argv0 = "ibnetdiscover"; + void iberror(const char *fn, char *msg, ...) 
{ @@ -553,8 +556,8 @@ void usage(void) { - fprintf(stderr, "Usage: %s [-d(ebug)] -e(err_show) -v(erbose) -s(how) -l(ist) -H(ca_list) -S(witch_list) -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] []\n", + fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -H(ca_list) -S(witch_list) -V(ersion) -C ca_name -P hca_port " + "-t(imeout) timeout_ms] []\n", argv0); fprintf(stderr, "%s %s\n", argv0, get_build_version() ); exit(-1); @@ -570,50 +573,77 @@ char *ca = 0; int ca_port = 0; + static char const str_opts[] = "C:P:t:devslHSVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "verbose", 0, 0, 'v'}, + { "show", 0, 0, 's'}, + { "list", 0, 0, 'l'}, + { "Hca_list", 0, 0, 'H'}, + { "Switch_list", 0, 0, 'S'}, + { "timeout", 1, 0, 't'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + f = stdout; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - break; - case 'v': - verbose++; - dumplevel++; - break; - case 's': - dumplevel = 1; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'l': - list = HCA_NODE | SWITCH_NODE; - break; - case 'S': - list = SWITCH_NODE; - break; - case 'H': - list = HCA_NODE; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - } ARGEND; + argv0 = argv[0]; + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + break; + case 'v': 
+ verbose++; + dumplevel++; + break; + case 's': + dumplevel = 1; + break; + case 'e': + madrpc_show_errors(1); + break; + case 'l': + list = HCA_NODE | SWITCH_NODE; + break; + case 'S': + list = SWITCH_NODE; + break; + case 'H': + list = HCA_NODE; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc) if (!(f = fopen(argv[0], "w"))) IBERROR("can't open file %s for writing", argv[0]); Index: diags/src/ibportstate.c =================================================================== --- diags/src/ibportstate.c (revision 5057) +++ diags/src/ibportstate.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -54,6 +55,8 @@ static int dest_type = IB_DEST_LID; static int verbose; +static char *argv0 = "ibportstate"; + static void iberror(const char *fn, char *msg, ...) { @@ -142,7 +145,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] []\n", + "-t(imeout) timeout_ms] []\n", basename); fprintf(stderr, "\tsupported ops: enable, disable, query\n"); fprintf(stderr, "\n\texamples:\n"); @@ -166,48 +169,74 @@ char *err; char data[IB_SMP_DATA_SIZE]; - ARGBEGIN { - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, 
get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:devDGVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "verbose", 0, 0, 'v'}, + { "Direct", 0, 0, 'D'}, + { "Guid", 0, 0, 'G'}, + { "timeout", 1, 0, 't'}, + { "s", 1, 0, 's'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'e': + madrpc_show_errors(1); + break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc < 2) usage(); Index: diags/src/perfquery.c =================================================================== --- diags/src/perfquery.c (revision 5057) +++ diags/src/perfquery.c (working copy) @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -47,6 +48,8 @@ static uint8_t pc[1024]; +static char *argv0 = "perfquery"; + #define IBERROR(fmt, args...) 
iberror(__FUNCTION__, fmt, ## args) static void @@ -80,7 +83,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -G(uid_addr) -a(ll_ports) -r(reset_after_read) -C ca_name -P hca_port " - "-R(eset_only) -t timeout_ms -V(ersion) -h(elp)] [ [[port] [reset_mask]]]\n", + "-R(eset_only) -t(imeout) timeout_ms -V(ersion) -h(elp)] [ [[port] [reset_mask]]]\n", basename); fprintf(stderr, "\tExamples:\n"); fprintf(stderr, "\t\t%s\t\t# read local port's performance counters\n", basename); @@ -110,49 +113,75 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'a': - all++; - port = 0xff; - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 'r': - reset++; - break; - case 'R': - reset_only++; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:s:t:dGarRVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "Guid_addr", 0, 0, 'G'}, + { "all_ports", 0, 0, 'a'}, + { "reset_after_read", 0, 0, 'r'}, + { "Reset_only", 0, 0, 'R'}, + { "sm_portid", 1, 0, 's'}, + { "timeout", 1, 0, 't'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'a': + all++; + port = 0xff; + break; + case 
'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 'r': + reset++; + break; + case 'R': + reset_only++; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc > 1) port = strtoul(argv[1], 0, 0); if (argc > 2) Index: diags/src/smpdump.c =================================================================== --- diags/src/smpdump.c (revision 5057) +++ diags/src/smpdump.c (working copy) @@ -51,6 +51,7 @@ #include #include #include +#include #include #include #include @@ -75,6 +76,8 @@ static int debug; +static char *argv0 = "smpdump"; + typedef struct { char path[64]; int hop_cnt; @@ -238,31 +241,54 @@ uint8_t *desc; int length; - ARGBEGIN { - case 's': - dump_char++; - break; - case 'd': - debug++; - if (debug > 1) - umad_debug(debug-1); - break; - case 'D': - mgmt_class = CLASS_SUBN_DIRECTED_ROUTE; - break; - case 'C': - dev_name = ARGF(); - break; - case 'P': - dev_port = atoi(ARGF()); - break; - case 't': - timeout_ms = strtoul(ARGF(), 0, 0); - break; - default: - usage(); - } ARGEND; + fprintf(stderr, "Usage: %s [-s(ring) -C ca_name -P ca_port] [mod]\n", argv0); + static char const str_opts[] = "C:P:t:dsDhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "sring", 0, 0, 's'}, + { "Direct", 0, 0, 'D'}, + { "timeout", 1, 0, 't'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 
's': + dump_char++; + break; + case 'd': + debug++; + if (debug > 1) + umad_debug(debug-1); + break; + case 'D': + mgmt_class = CLASS_SUBN_DIRECTED_ROUTE; + break; + case 'C': + dev_name = optarg; + break; + case 'P': + dev_port = atoi(optarg); + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc < 2) usage(); Index: diags/src/ibsysstat.c =================================================================== --- diags/src/ibsysstat.c (revision 5057) +++ diags/src/ibsysstat.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -75,6 +76,8 @@ static char ipinfo[IB_VENDOR_RANGE2_DATA_SIZE] = "ipinfo"; static char ibinfo[IB_VENDOR_RANGE2_DATA_SIZE] = "ibinfo"; +static char *argv0 = "ibsysstat"; + static void iberror(const char *fn, char *msg, ...) { @@ -258,8 +261,8 @@ else basename++; - fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] [op params]\n", + fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -o oui -V(ersion) -C ca_name -P hca_port " + "-t(imeout) timeout_ms] [op params]\n", basename); exit(-1); } @@ -278,51 +281,78 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'o': - oui = strtoul(ARGF(), 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 'S': - server++; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; 
- case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:o:devDGVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "verbose", 0, 0, 'v'}, + { "Direct", 0, 0, 'D'}, + { "Guid", 0, 0, 'G'}, + { "timeout", 1, 0, 't'}, + { "s", 1, 0, 's'}, + { "o", 1, 0, 'o'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'e': + madrpc_show_errors(1); + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'o': + oui = strtoul(optarg, 0, 0); + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 'S': + server++; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (!argc && !server) usage(); Index: diags/src/ibaddr.c =================================================================== --- diags/src/ibaddr.c (revision 5057) +++ diags/src/ibaddr.c (working copy) @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -47,6 +48,8 @@ #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +static char *argv0 = "ibaddr"; + static void iberror(const char *fn, char *msg, ...) 
{ @@ -114,7 +117,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -D(irect_path_addr) -G(uid_addr) -l(id_show) -g(id_show) -C ca_name -P hca_port " - "-t timeout_ms -V(ersion) -h(elp)] []\n", + "-t(imeout) timeout_ms -V(ersion) -h(elp)] []\n", basename); fprintf(stderr, "\tExamples:\n"); fprintf(stderr, "\t\t%s\t\t\t# local port's address\n", basename); @@ -140,48 +143,75 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'g': - show_gid++; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'l': - show_lid++; - break; - case 'L': - show_lid = -100; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:dDGglLVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "Direct_path_addr", 0, 0, 'D'}, + { "Guid_addr", 0, 0, 'G'}, + { "gid_show", 0, 0, 'g'}, + { "lid_show", 0, 0, 'l'}, + { "Lid_show", 0, 0, 'L'}, + { "timeout", 1, 0, 't'}, + { "sm_port", 1, 0, 's'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + ibdebug++; + break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'g': + show_gid++; + break; + case 'G': + dest_type = IB_DEST_GUID; + 
break; + case 'l': + show_lid++; + break; + case 'L': + show_lid = -100; + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc > 1) port = strtoul(argv[1], 0, 0); Index: diags/src/smpquery.c =================================================================== --- diags/src/smpquery.c (revision 5057) +++ diags/src/smpquery.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -73,6 +74,8 @@ {0} }; +static char *argv0 = "smpquery"; + static void iberror(const char *fn, char *msg, ...) { @@ -234,7 +237,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] [op params]\n", + "-t(imeout) timeout_ms] [op params]\n", basename); fprintf(stderr, "\tsupported ops:\n"); fprintf(stderr, "\t\tnodeinfo \n"); @@ -262,48 +265,74 @@ char *err; op_fn_t *fn; - ARGBEGIN { - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - 
case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:devDGVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "verbose", 0, 0, 'v'}, + { "Direct", 0, 0, 'D'}, + { "Guid", 0, 0, 'G'}, + { "smlid", 1, 0, 's'}, + { "timeout", 1, 0, 't'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'e': + madrpc_show_errors(1); + break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc < 2) usage(); Index: diags/src/ibstat.c =================================================================== --- diags/src/ibstat.c (revision 5057) +++ diags/src/ibstat.c (working copy) @@ -51,6 +51,7 @@ #include #include #include +#include #include #include #include @@ -64,6 +65,8 @@ static int debug; +static char *argv0 = "ibstat"; + static char *node_type_str[] = { "???", "CA", @@ -202,23 +205,43 @@ int list_only = 0, short_format = 0, list_ports = 0; int n, i; - ARGBEGIN { - case 'd': - debug++; - break; - case 'l': - list_only++; - break; - case 's': - short_format++; - break; - case 'p': - 
list_ports++; - break; - default: - usage(); - } ARGEND; + static char const str_opts[] = "dlsphu"; + static const struct option long_opts[] = { + { "debug", 0, 0, 'd'}, + { "list_of_cas", 0, 0, 'l'}, + { "short", 0, 0, 's'}, + { "port_list", 0, 0, 'p'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'd': + debug++; + break; + case 'l': + list_only++; + break; + case 's': + short_format++; + break; + case 'p': + list_ports++; + break; + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc > 1) dev_port = strtol(argv[1], 0, 0); Index: diags/src/ibping.c =================================================================== --- diags/src/ibping.c (revision 5057) +++ diags/src/ibping.c (working copy) @@ -43,6 +43,7 @@ #include #include #include +#include #include #include @@ -57,6 +58,8 @@ static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE]; static char last_host[IB_VENDOR_RANGE2_DATA_SIZE]; +static char *argv0 = "ibping"; + static void iberror(const char *fn, char *msg, ...) 
{ @@ -180,7 +183,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -G(uid) -s smlid -V(ersion) -C ca_name -P hca_port" - "-t timeout_ms -c ping_count -f(lood) -o oui -S(erver)] \n", + "-t(imeout) timeout_ms -c ping_count -f(lood) -o oui -S(erver)] \n", basename); exit(-1); } @@ -237,57 +240,86 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'c': - count = strtoul(ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'f': - flood++; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'o': - oui = strtoul(ARGF(), 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 'S': - server++; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:c:o:devGfSVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "verbose", 0, 0, 'v'}, + { "Guid", 0, 0, 'G'}, + { "s", 1, 0, 's'}, + { "timeout", 1, 0, 't'}, + { "c", 1, 0, 'c'}, + { "flood", 0, 0, 'f'}, + { "o", 1, 0, 'o'}, + { "Server", 0, 0, 'S'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'c': + count = strtoul(optarg, 0, 0); + 
break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'e': + madrpc_show_errors(1); + break; + case 'f': + flood++; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'o': + oui = strtoul(optarg, 0, 0); + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 'S': + server++; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (!argc && !server) usage(); Index: diags/src/ibroute.c =================================================================== --- diags/src/ibroute.c (revision 5057) +++ diags/src/ibroute.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -57,6 +58,8 @@ static int verbose; static int dump_all; +static char *argv0 = "ibroute"; + static void iberror(const char *fn, char *msg, ...) 
{ @@ -377,7 +380,7 @@ basename++; fprintf(stderr, "Usage: %s [-d(ebug)] -a(ll) -n(o_dests) -v(erbose) -D(irect) -G(uid) -M(ulticast) -s smlid -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] [ [ []]]\n", + "-t(imeout) timeout_ms] [ [ []]]\n", basename); fprintf(stderr, "\n\tUnicast examples:\n"); fprintf(stderr, "\t\t%s 4\t# dump all lids with valid out ports of switch with lid 4\n", basename); @@ -408,52 +411,80 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = ARGF(); - break; - case 'P': - ca_port = strtoul(ARGF(), 0, 0); - break; - case 'a': - dump_all++; - break; - case 'd': - ibdebug++; - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'M': - multicast++; - break; - case 'n': - brief++; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, ARGF(), IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", ARGF()); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - madrpc_show_errors(1); - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'h': - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:danvDGMVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "all", 0, 0, 'a'}, + { "no_dests", 0, 0, 'n'}, + { "verbose", 0, 0, 'v'}, + { "Direct", 0, 0, 'D'}, + { "Guid", 0, 0, 'G'}, + { "Multicast", 0, 0, 'M'}, + { "timeout", 1, 0, 't'}, + { "s", 1, 0, 's'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'a': + dump_all++; + break; + case 'd': + ibdebug++; + 
break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 'M': + multicast++; + break; + case 'n': + brief++; + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("can't resolve SM destination port %s", optarg); + sm_id = &sm_portid; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'v': + madrpc_show_errors(1); + verbose++; + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + case 'h': + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (!argc) usage(); Index: diags/src/sminfo.c =================================================================== --- diags/src/sminfo.c (revision 5057) +++ diags/src/sminfo.c (working copy) @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -47,10 +48,10 @@ static uint8_t sminfo[1024]; +static char *argv0 = "sminfo"; + #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) -#define SAFE_ARGF() (*(argv+1) ? ARGF() : ( usage(), NULL ) ) - static void iberror(const char *fn, char *msg, ...) 
{ @@ -74,8 +75,8 @@ void usage(void) { - fprintf(stderr, "Usage: %s [-d(ebug) -s state -p prio -a activity -D(irect) -G(uid) -V(ersion) -C ca_name -P hca_port " - "-t timeout_ms] [modifier]\n", + fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -s state -p prio -a activity -D(irect) -G(uid) -V(ersion) -C ca_name -P hca_port " + "-t(imeout) timeout_ms] [modifier]\n", argv0); fprintf(stderr, "%s %s\n", argv0, get_build_version() ); exit(-1); @@ -116,48 +117,75 @@ char *ca = 0; int ca_port = 0; - ARGBEGIN { - case 'C': - ca = SAFE_ARGF(); - break; - case 'P': - ca_port = strtoul(SAFE_ARGF(), 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 't': - timeout = strtoul(SAFE_ARGF(), 0, 0); - madrpc_set_timeout(timeout); - break; - case 'a': - act = strtoul(SAFE_ARGF(), 0, 0); - break; - case 's': - state = strtoul(SAFE_ARGF(), 0, 0); - break; - case 'p': - prio = strtoul(SAFE_ARGF(), 0, 0); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - } ARGEND; + static char const str_opts[] = "C:P:t:s:p:a:deDGVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "err_show", 0, 0, 'e'}, + { "s", 1, 0, 's'}, + { "p", 1, 0, 'p'}, + { "a", 1, 0, 'a'}, + { "Direct", 0, 0, 'D'}, + { "Guid", 0, 0, 'G'}, + { "Version", 0, 0, 'V'}, + { "timeout", 1, 0, 't'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(udebug); + udebug++; + break; + case 'e': + 
madrpc_show_errors(1); + break; + case 'D': + dest_type = IB_DEST_DRPATH; + break; + case 'G': + dest_type = IB_DEST_GUID; + break; + case 't': + timeout = strtoul(optarg, 0, 0); + madrpc_set_timeout(timeout); + break; + case 'a': + act = strtoul(optarg, 0, 0); + break; + case 's': + state = strtoul(optarg, 0, 0); + break; + case 'p': + prio = strtoul(optarg, 0, 0); + break; + case 'V': + fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + exit(-1); + default: + usage(); + } + } + argc -= optind; + argv += optind; + if (argc > 1) mod = atoi(argv[1]); Index: libibcommon/include/infiniband/common.h =================================================================== --- libibcommon/include/infiniband/common.h (revision 5057) +++ libibcommon/include/infiniband/common.h (working copy) @@ -83,25 +83,6 @@ * COMMON MACHINE INDEPENDENT */ -/* argc, argv parsing */ - -/** Begin arguments parsing block */ -#define ARGBEGIN {char *_ss;\ - for (argv0 = *argv++; *argv && *argv[0] == '-'; argv++, argc--)\ - for (_ss = *argv + 1; *_ss; _ss++) switch (*_ss) - -/** End arguments parsing block */ -#define ARGEND argc--;}; - -/** Return current option argument */ -#define ARGF() (argc--, *++argv) - -/** Return current option character */ -#define ARGC() (*_ss) - -/** global application names (was argv[0]) */ -extern char *argv0; - /* Misc. macros: */ /** align value \a l to \a size (ceil) */ #define ALIGN(l, size) (((l) + ((size) - 1)) / (size) * (size)) Index: libibcommon/src/vars.c =================================================================== --- libibcommon/src/vars.c (revision 5057) +++ libibcommon/src/vars.c (working copy) @@ -1,35 +0,0 @@ -/* - * Copyright (c) 2004,2005 Voltaire Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- *
- * $Id$
- */
-
-char *argv0;
Index: libibcommon/Makefile.am
===================================================================
--- libibcommon/Makefile.am	(revision 5057)
+++ libibcommon/Makefile.am	(working copy)
@@ -13,7 +13,7 @@
 libibcommon_version_script =
 endif
 
-libibcommon_la_SOURCES = src/stack.c src/sysfs.c src/util.c src/vars.c src/time.c src/hash.c
+libibcommon_la_SOURCES = src/stack.c src/sysfs.c src/util.c src/time.c src/hash.c
 libibcommon_la_LDFLAGS = -version-info 1 -export-dynamic \
 	$(libibcommon_version_script)
 libibcommon_la_DEPENDENCIES = $(srcdir)/src/libibcommon.map

From ianjiang.ict at gmail.com Wed Jan 18 07:02:17 2006
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Wed, 18 Jan 2006 23:02:17 +0800
Subject: [openib-general] [multicast]examples using multicast
Message-ID: <7b2fa1820601180702w384ff786p7779bbba15a022d8@mail.gmail.com>

I want to find some examples that use the IB multicast verbs. Are there any
other ones except osmtest?
Thanks!

--
Ian Jiang
ianjiang.ict at gmail.com

Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences

From halr at voltaire.com Wed Jan 18 07:04:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 18 Jan 2006 17:04:19 +0200
Subject: [openib-general] [multicast]examples using multicast
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AC09@taurus.voltaire.com>

Hi Ian,

osmtest only deals with multicast groups (creating, querying, joining,
leaving, deleting) and does not send any data. Multicast verbs would be used
to send data from user space. Is that what you are looking for ?
-- Hal

________________________________

From: openib-general-bounces at openib.org on behalf of Ian Jiang
Sent: Wed 1/18/2006 10:02 AM
To: openib-general
Subject: [openib-general] [multicast]examples using multicast

I want to find some examples that use the IB multicast verbs. Are there any
other ones except osmtest?
Thanks!

--
Ian Jiang
ianjiang.ict at gmail.com

Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences

From mst at mellanox.co.il Wed Jan 18 07:26:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jan 2006 17:26:05 +0200
Subject: [openib-general] [PATCH] srptools on FC4
Message-ID: <20060118152605.GE22260@mellanox.co.il>

----- Forwarded message from Yael Shenhav -----

Roland, I get errors when trying to compile ibsrpdm on FedoraCore4:

In file included from src/srp-dm.c:41:
src/ib_user_mad.h:80: error: syntax error before '__be32'
src/ib_user_mad.h:80: warning: no semicolon at end of struct or union
etc.

On FC4, __be32 is not defined by userspace headers.
I think you can use __u32 instead.

----- End forwarded message -----

Replace __beXX with __uXX for userspace code.

Signed-off-by: Yael Shenhav
Signed-off-by: Michael S. Tsirkin

Index: openib/src/userspace/srptools/src/ib_user_mad.h
===================================================================
--- openib/src/userspace/srptools/src/ib_user_mad.h
+++ openib/src/userspace/srptools/src/ib_user_mad.h
@@ -77,9 +77,9 @@ struct ib_user_mad_hdr {
 	__u32	timeout_ms;
 	__u32	retries;
 	__u32	length;
-	__be32	qpn;
-	__be32	qkey;
-	__be16	lid;
+	__u32	qpn;
+	__u32	qkey;
+	__u16	lid;
 	__u8	sl;
 	__u8	path_bits;
 	__u8	grh_present;
@@ -87,7 +87,7 @@ struct ib_user_mad_hdr {
 	__u8	hop_limit;
 	__u8	traffic_class;
 	__u8	gid[16];
-	__be32	flow_label;
+	__u32	flow_label;
 };
 
 /**

--
MST

From mst at mellanox.co.il Wed Jan 18 07:32:04 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 18 Jan 2006 17:32:04 +0200
Subject: [openib-general] Re: [patch] userspace/management: ARGBEGIN() ->getopt() conversion for diags
In-Reply-To: <20060118142853.GA19642@sashak.voltaire.com>
References: <20060118142853.GA19642@sashak.voltaire.com>
Message-ID: <20060118153204.GH22260@mellanox.co.il>

Quoting r. Sasha Khapyorsky :
> Subject: [patch] userspace/management: ARGBEGIN() ->getopt() conversion for diags
>
> Hi Hal,
>
> Diag utils are converted to getopt(). It is just basically tested,
> so please report bugs (if any).
>
> Sasha.
>
>
> This converts the diag utils to use the more standard getopt() instead
> of the buggy ARGBEGIN() macros. The now-unused ARGBEGIN()-related code
> is removed from libibcommon.
>

Good stuff.

> Signed-off-by: Sasha Khapyorsky
>
> Index: diags/src/ibtracert.c
> ===================================================================
> --- diags/src/ibtracert.c	(revision 5057)
> +++ diags/src/ibtracert.c	(working copy)
> @@ -41,6 +41,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include
> @@ -54,6 +55,8 @@
>  int force;
>  FILE *f;
>
> +static char *argv0 = "ibtracert";
> +

Is there some reason to initialize it?
You seem to set it to argv[0] below ...
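
For readers following the review, the getopt_long() pattern that the quoted
patch introduces can be sketched as a small standalone helper. This is a
minimal sketch, not code from the patch: struct diag_opts, parse_diag_opts()
and the reduced option set are hypothetical names chosen for illustration,
while getopt_long(), optarg and optind are the standard <getopt.h> interfaces
the patch itself relies on.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical option set, loosely modelled on the diag tools in the patch. */
struct diag_opts {
	int debug;             /* -d / --debug, may be repeated */
	unsigned long timeout; /* -t / --timeout <ms> */
	const char *ca;        /* -C <ca_name> */
	int nargs;             /* number of positional arguments left over */
	char **args;           /* points at the first positional argument */
};

/* Returns 0 on success, -1 on -h, an unknown option or a missing argument. */
static int parse_diag_opts(int argc, char **argv, struct diag_opts *o)
{
	static const char str_opts[] = "C:t:dh";
	static const struct option long_opts[] = {
		{ "debug",   no_argument,       0, 'd' },
		{ "timeout", required_argument, 0, 't' },
		{ "help",    no_argument,       0, 'h' },
		{ 0, 0, 0, 0 }
	};
	int ch;

	while ((ch = getopt_long(argc, argv, str_opts, long_opts, NULL)) != -1) {
		switch (ch) {
		case 'C':
			o->ca = optarg;
			break;
		case 'd':
			o->debug++;
			break;
		case 't':
			o->timeout = strtoul(optarg, 0, 0);
			break;
		case 'h':
		default:
			/* the real tools call usage() here, which exits */
			return -1;
		}
	}

	/* Same effect as "argc -= optind; argv += optind;" in the patch. */
	o->nargs = argc - optind;
	o->args = argv + optind;
	return 0;
}
```

After a successful call, args[0] is the first positional argument and nargs
counts the remaining ones, which is the layout the converted tools obtain
once they have skipped past optind.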
>  #undef DEBUG
>  #define DEBUG if (ibdebug || verbose) IBWARN
>  #define VERBOSE if (ibdebug || verbose > 1) IBWARN
> @@ -726,7 +729,7 @@
>  		basename++;
>
>  	fprintf(stderr, "Usage: %s [-d(ebug) -v(erbose) -D(irect_path_addrs) -G(uid_addrs) -v(erbose) -n(o_info) -C ca_name -P hca_port "
> -		"-s smlid -t timeout_ms -m mlid] \n",
> +		"-s smlid -t(imeout) timeout_ms -m mlid] \n",
>  		basename);
>  	fprintf(stderr, "\n\tUnicast examples:\n");
>  	fprintf(stderr, "\t\t%s 4 16\t\t\t# show path between lids 4 and 16\n", basename);
> @@ -753,57 +756,85 @@
>  	char *ca = 0;
>  	int ca_port = 0;
>
> +	static char const str_opts[] = "C:P:t:s:m:dvfDGnVhu";
> +	static const struct option long_opts[] = {
> +		{ "C", 1, 0, 'C'},
> +		{ "P", 1, 0, 'P'},
> +		{ "debug", 0, 0, 'd'},
> +		{ "verbose", 0, 0, 'v'},
> +		{ "force", 0, 0, 'f'},
> +		{ "Direct_path_addrs", 0, 0, 'D'},
> +		{ "Guid_addrs", 0, 0, 'G'},
> +		{ "no_info", 0, 0, 'n'},
> +		{ "timeout", 1, 0, 't'},
> +		{ "s", 1, 0, 's'},
> +		{ "m", 1, 0, 'm'},
> +		{ "Version", 0, 0, 'V'},
> +		{ "help", 0, 0, 'h'},
> +		{ "usage", 0, 0, 'u'},
> +		{ }
> +	};
> +
> +	argv0 = argv[0];
> +
>  	f = stderr;

--
MST

From ogerlitz at voltaire.com Wed Jan 18 07:36:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jan 2006 17:36:05 +0200
Subject: [openib-general] Re: was [PATCH] enable the fmr pool user to set the page size - 2 Q on upstream pushes
In-Reply-To:
References:
Message-ID: <43CE6065.8060909@voltaire.com>

Roland Dreier wrote:
> Seems reasonable. Unfortunately we just missed the 2.6.16-rc1 window
> so I think this should wait for the 2.6.17 window.

We have two questions related to upstream pushes, and it would be really
helpful if you could help resolve them:

First, what does it mean that the 2.6.16-rc1 window was missed? What
changes make sense for non -rc1 releases? Is it correct that the parity
of the X minor number in 2.6.X relates to which changes can be pushed?
That is, for even X (e.g. 2.6.14/16) only bug fixes, and for odd X (e.g.
2.6.15/17) also new features (e.g. srp & open-iscsi "waited" for 2.6.15)?

Second, I've noticed that upstream pushes (e.g. Sean's and yours) are done
from git source trees. The natural code flow would be from the openib svn
tree to the git tree. But because of the openib convention of working
against the latest stable kernel (2.6.15), one can't commit to svn any
changes that depend on code which is not yet in the latest stable kernel.
So I understand you keep such changes in a git tree, and when the next
kernel is released you merge them into svn (e.g. the replacement of all
semaphores with mutexes). We would be happy just to know that this is more
or less the process.

Is there an FAQ explaining all this...?

(many) thanks,

Or.

From jeff.walls at hp.com Wed Jan 18 07:39:24 2006
From: jeff.walls at hp.com (Walls, Jeffrey Joel)
Date: Wed, 18 Jan 2006 10:39:24 -0500
Subject: [openib-general] Debugging Infiniband?
Message-ID:

Hi,

I first must admit that I'm new to Infiniband and Infiniband programming.
I began writing my first commercial application using IB late last year.
I'm very familiar with socket programming (TCP, Multicast, etc), though.

I'm wondering what techniques experienced IB programmers use to debug IB
applications. My situation is that I'm running a data producer on Windows
XP and a set of data consumers on Linux. So for Windows, I'm using WinIB
(gen1) and for Linux I'm using OpenIB (gen2). I have both sides implemented
according to some of the example code I've seen and also according to the
documents I've been able to find. The connections all seem to be set up
properly and my producer successfully posts all of its sends (at least
according to my CQE's returned).

The problem is that my receiver never sees any of the IB packets. I post
the receive and then wait forever polling the CQ.
I've run out of ideas on what to even look at and am now looking for
suggestions on how to best figure out this problem.

If you have any ideas or need more clarification, I'd love to hear from
you. Also, if this isn't the proper forum for such discussions, I would
greatly appreciate it if you could point me in the right direction.

Best Regards,
Jeff

From rdreier at cisco.com Wed Jan 18 07:43:09 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jan 2006 07:43:09 -0800
Subject: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction
In-Reply-To: (Sean Hefty's message of "Tue, 17 Jan 2006 15:44:48 -0800")
References:
Message-ID:

 > +struct ucma_file {
 > +	struct semaphore	mutex;

This should be a struct mutex instead, I think.

 > +static DECLARE_MUTEX(ctx_mutex);

Same here.

 - R.

From rdreier at cisco.com Wed Jan 18 07:44:33 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jan 2006 07:44:33 -0800
Subject: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction
In-Reply-To: (Sean Hefty's message of "Tue, 17 Jan 2006 15:44:48 -0800")
References:
Message-ID:

 > +	UCMA_MAX_BACKLOG = 128

Is there any reason that we might want to make this a tunable? Maybe
as a module parameter that's writable in sysfs...

 - R.

From mulix at mulix.org Wed Jan 18 07:47:29 2006
From: mulix at mulix.org (Muli Ben-Yehuda)
Date: Wed, 18 Jan 2006 17:47:29 +0200
Subject: [openib-general] Re: was [PATCH] enable the fmr pool user to set the page size - 2 Q on upstream pushes
In-Reply-To: <43CE6065.8060909@voltaire.com>
References: <43CE6065.8060909@voltaire.com>
Message-ID: <20060118154728.GB22449@granada.merseine.nu>

On Wed, Jan 18, 2006 at 05:36:05PM +0200, Or Gerlitz wrote:

> first, what does it mean that the 2.6.16-rc1 window was missed?
Current kernel development follows a model where there's an
approximately two-week merge window after each major kernel release
(e.g., 2.6.15). Then follow approximately six weeks of stabilizing the
massive amount of code that went in during the merge window.

> what
> changes makes sense for non -rc1 releases?

Theoretically anything can go into -rc1 (assuming it is appropriate
to go in in the first place). Other rc's have much stricter criteria
for what can go in. It's the maintainer's decision what should go in.

> is it correct that the parity
> of the X minor number in 2.6.X relates to which changes can be pushed?
> that is for even X (eg 2.6.14/16) only bug fixes and for odd X (eg
> 2.6.15/17) also new features (eg srp & open-iscsi "waited" for
> 2.6.15)?

No. Once upon a time odd-numbered kernels (e.g., 2.3.X, 2.5.X) were
experimental and even-numbered kernels (2.2.X, 2.4.X) were
stable. Nowadays all 2.6.X kernels are equally experimental or stable,
depending on your point of view.

> Is there an FAQ explaining all this...?

Not really, since it's a constantly changing, constantly evolving
process. This stuff is usually discussed on the kernel mailing list,
and summarized in one of the many summaries (kerneltraffic, lwn.net's
weekly kernel page).

Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

From dotanb at mellanox.co.il Wed Jan 18 07:53:28 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Wed, 18 Jan 2006 17:53:28 +0200
Subject: [openib-general] Debugging Infiniband?
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3012D6550@mtlexch01.mtl.com>

Hi Jeff.
There are some things you need to check:

- WRs were posted to the RQ of the remote QP before the WR was posted to
  the SQ on the local side
- both QPs are alive and in valid states (at least RTR for the responder
  and RTS for the requester)
- the QP parameters are in sync (for example, the PSN)
- the route you are using is valid (port, remote QP number, remote LID)
- if you are using UD/UC QPs, the packets may have been dropped
- if you have an IB analyzer, check that the packet was sent to the
  expected QP number
- check the port counters to see how much data was sent / received on
  each IB port

I hope I gave you some useful information.

[Dotan Barak]

-----Original Message-----
From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org]On Behalf Of Walls, Jeffrey Joel
Sent: Wednesday, January 18, 2006 5:39 PM
To: openib-general
Subject: [openib-general] Debugging Infiniband?

Hi,

I first must admit that I'm new to Infiniband and Infiniband programming.
I began writing my first commercial application using IB late last year.
I'm very familiar with socket programming (TCP, Multicast, etc), though.

I'm wondering what techniques experienced IB programmers use to debug IB
applications. My situation is that I'm running a data producer on Windows
XP and a set of data consumers on Linux. So for Windows, I'm using WinIB
(gen1) and for Linux I'm using OpenIB (gen2). I have both sides implemented
according to some of the example code I've seen and also according to the
documents I've been able to find. The connections all seem to be set up
properly and my producer successfully posts all of its sends (at least
according to my CQE's returned).

The problem is that my receiver never sees any of the IB packets. I post
the receive and then wait forever polling the CQ.

I've run out of ideas on what to even look at and am now looking for
suggestions on how to best figure out this problem.
If you have any ideas or need more clarification, I'd love to hear from
you. Also, if this isn't the proper forum for such discussions, I would
greatly appreciate it if you could point me in the right direction.

Best Regards,
Jeff

From rdreier at cisco.com Wed Jan 18 07:53:11 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jan 2006 07:53:11 -0800
Subject: [openib-general] Re: [PATCH 1 of 3] move destructor to struct neigh_parms
In-Reply-To: <20060117222021.GA13186@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jan 2006 00:20:21 +0200")
References: <20060112162438.GO16938@mellanox.co.il> <20060117222021.GA13186@mellanox.co.il>
Message-ID:

    Michael> Roland, what do you say to this approach? We still could
    Michael> try this for 2.6.16, couldn't we? It's small and the
    Michael> interface is unused in kernel except by us.

Yes, it makes sense. I haven't had a chance to send it to netdev for
comments yet though. It would be good to know what the design was
behind putting the destructor method there in the first place. If you
get a chance to send it, that would be a good thing to do.

    Michael> Otherwise for 2.6.16 and earlier we'll have to maintain
    Michael> the global list of neighbours along the lines of
    Michael> ipoib_all_neigh_issues_2.patch.

Well, we already have the problem for 2.6.15 and earlier, so I'm not
sure if this is that big an issue.

 - R.

From mst at mellanox.co.il Wed Jan 18 07:55:17 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 18 Jan 2006 17:55:17 +0200
Subject: [openib-general] Re: was [PATCH] enable the fmr pool user to set the page size - 2 Q on upstream pushes
In-Reply-To: <43CE6065.8060909@voltaire.com>
References: <43CE6065.8060909@voltaire.com>
Message-ID: <20060118155517.GI22260@mellanox.co.il>

Quoting Or Gerlitz :
> Is it correct that the parity
> of the X minor number in 2.6.X relates to which changes can be pushed?
> that is for even X (eg 2.6.14/16) only bug fixes and for odd X (eg
> 2.6.15/17) also new features (eg srp & open-iscsi "waited" for 2.6.15)?

Take a look at http://lwn.net/Articles/95312/

More stuff here http://lwn.net/Kernel/Index/

--
MST

From eitan at mellanox.co.il Wed Jan 18 07:58:11 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 18 Jan 2006 17:58:11 +0200
Subject: [openib-general] [PATCH] osm: lib vendor race cause OpenSM crashes
Message-ID: <86vewh8r58.fsf@mtl066.yok.mtl.com>

Hi Hal

We have found a race in OpenSM that can cause an active madw to be
returned during the transaction. This is a fatal, high-priority bug, as
it is very likely to cause a crash.

Eitan

Signed-off-by: Eitan Zahavi

Index: libvendor/osm_vendor_ibumad.c
===================================================================
--- libvendor/osm_vendor_ibumad.c	(revision 5009)
+++ libvendor/osm_vendor_ibumad.c	(working copy)
@@ -138,14 +138,16 @@ get_madw(osm_vendor_t *p_vend, ib_net64_
 {
 	umad_match_t *m, *e;
 	ib_net64_t mtid = (*tid & 0xffffffff00000000llu);
+	osm_madw_t *res;
 
 	cl_spinlock_acquire( &p_vend->match_tbl_lock );
 	for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) {
 		if (m->tid == mtid) {
 			m->tid = 0;
 			*tid = mtid;
+			res = m->v;
 			cl_spinlock_release( &p_vend->match_tbl_lock );
-			return m->v;
+			return res;
 		}
 	}

From mst at mellanox.co.il Wed Jan 18 08:10:20 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 18 Jan 2006 18:10:20 +0200 Subject: [openib-general] Fwd: [PATCH 1 of 3] move destructor to struct neigh_parms Message-ID: <20060118161020.GJ22260@mellanox.co.il> Hi! struct neigh_ops currently has a destructor field, unused by in-kernel drivers outside the infiniband subtree. infiniband ipoib in-tree driver currently uses this field, and we've run into problems: since the destructor is shared between neighbours that belong to different net devices, there's no way to set/clear it safely. It would be good to know what the design was behind putting the destructor method there in the first place. The following patch moves this field to neigh_parms where it can be safely set, together with its twin neigh_setup. Two additional patches in the patch series update ipoib to use this new interface. --- Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/net/core/neighbour.c =================================================================== --- linux-2.6.15.orig/net/core/neighbour.c 2006-01-12 11:58:15.000000000 +0200 +++ linux-2.6.15/net/core/neighbour.c 2006-01-12 20:10:00.000000000 +0200 @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); Index: linux-2.6.15/include/net/neighbour.h =================================================================== --- linux-2.6.15.orig/include/net/neighbour.h 2006-01-03 05:21:10.000000000 +0200 +++ linux-2.6.15/include/net/neighbour.h 2006-01-12 20:09:27.000000000 +0200 @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; 
void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); -- MST From mst at mellanox.co.il Wed Jan 18 08:12:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 18:12:41 +0200 Subject: [openib-general] [PATCH 2 of 3] ipoib: move destructor to struct neigh_parms Message-ID: <20060118161240.GK22260@mellanox.co.il> Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:30:52.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:31:26.000000000 +0200 @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. 
- */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } -- MST From mst at mellanox.co.il Wed Jan 18 08:15:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 18:15:02 +0200 Subject: [openib-general] [PATCH 3 of 3] ipoib: fix error handling Message-ID: <20060118161502.GL22260@mellanox.co.il> The following patch is not directly related to the destructor issue, but I'm posting it here for completeness since it needs to be applied on top of the previous pair of patches in the destructor series. --- Fix error handling in neigh_add_path. Reduce code duplication by implementing alloc/free functions for ipoib_neigh. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:06.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:43.000000000 +0200 @@ -246,8 +246,7 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -475,7 +474,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -483,8 +482,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -497,7 +494,7 @@ static 
void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -527,10 +524,9 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - kfree(neigh); - +err_path: + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +753,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -766,6 +761,26 @@ static void ipoib_neigh_destructor(struc if (ah) ipoib_put_ah(ah); } + +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +{ + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; +} + +void ipoib_neigh_free(struct ipoib_neigh *neigh) +{ + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); +} static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:32:08.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:48:43.000000000 +0200 @@ -113,8 +113,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -720,13 +719,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh 
*neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:27:47.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:48:43.000000000 +0200 @@ -222,6 +222,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From mst at mellanox.co.il Wed Jan 18 08:16:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 18:16:03 +0200 Subject: [openib-general] Re: [PATCH 1 of 3] move destructor to struct neigh_parms In-Reply-To: References: Message-ID: <20060118161603.GM22260@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: [PATCH 1 of 3] move destructor to struct neigh_parms > > Michael> Roland, what do you say to this approach? We still could > Michael> try this for 2.6.16, couldn't we? It's small and the > Michael> interface is unused in kernel except by us. > > Yes, it makes sense. I haven't had a chance to send it to netdev for > comments yet though. It would be good to know what the design was > behind putting the destructor method there in the first place. If you > get a chance to send it, that would be a good thing to do. OK, I just did that. 
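For readers following the destructor series, here is a minimal userspace sketch of the problem the patches describe. It is not kernel code: struct ops, struct parms, struct neigh and the two demo functions are hypothetical stand-ins, loosely modelled on neigh_ops and neigh_parms as they appear in the diffs. It only illustrates why a callback stored in a table shared by all devices leaks across drivers, while a per-device field does not.

```c
#include <assert.h>
#include <stddef.h>

struct neigh;

/* Hypothetical stand-in for neigh_ops: one table shared by every device. */
struct ops   { void (*destructor)(struct neigh *); };

/* Hypothetical stand-in for neigh_parms: one instance per device. */
struct parms { void (*neigh_destructor)(struct neigh *); };

struct neigh {
    struct ops   *ops;
    struct parms *parms;
    char          destroyed_by;  /* which driver's destructor ran, 0 if none */
};

static void driver_a_destructor(struct neigh *n) { n->destroyed_by = 'A'; }

/* Old scheme: the callback is looked up in the shared table. */
static void destroy_via_ops(struct neigh *n)
{
    if (n->ops && n->ops->destructor)
        n->ops->destructor(n);
}

/* New scheme: the callback is looked up in the per-device parameters. */
static void destroy_via_parms(struct neigh *n)
{
    if (n->parms->neigh_destructor)
        n->parms->neigh_destructor(n);
}

/* Driver A registers through the shared table; returns the tag left on
 * a neighbour that belongs to device B. */
char shared_table_demo(void)
{
    struct ops shared = { NULL };
    struct parms pb = { NULL };
    struct neigh nb = { &shared, &pb, 0 };

    shared.destructor = driver_a_destructor;
    destroy_via_ops(&nb);
    return nb.destroyed_by;
}

/* Driver A registers through its own parms; returns the tag left on
 * a neighbour that belongs to device B. */
char per_device_demo(void)
{
    struct ops shared = { NULL };
    struct parms pa = { NULL }, pb = { NULL };
    struct neigh na = { &shared, &pa, 0 };
    struct neigh nb = { &shared, &pb, 0 };

    pa.neigh_destructor = driver_a_destructor;
    destroy_via_parms(&na);
    destroy_via_parms(&nb);
    assert(na.destroyed_by == 'A');   /* A's own neighbour is still handled */
    return nb.destroyed_by;
}
```

In this sketch shared_table_demo() returns 'A', meaning device B's neighbour was torn down by driver A's destructor, while per_device_demo() returns 0 because B's parms never had a destructor installed.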
-- MST From jlentini at netapp.com Wed Jan 18 08:43:01 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 18 Jan 2006 11:43:01 -0500 (EST) Subject: [openib-general] [CMA][PATCH] port byte order fix Message-ID: The CMA appears to assume that a struct sockaddr_in's sin_port value will be in host byte order. This is incorrect. Here is a quick fix. I hope this will make it into the initial kernel.org submission. Signed-off-by: James Lentini Index: core/cma.c =================================================================== --- core/cma.c (revision 5063) +++ core/cma.c (working copy) @@ -842,7 +842,7 @@ out: static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) { return cpu_to_be64(((u64)ps << 16) + - ((struct sockaddr_in *) addr)->sin_port); + be16_to_cpu(((struct sockaddr_in *) addr)->sin_port)); } static void cma_set_compare_data(struct sockaddr *addr, From mshefty at ichips.intel.com Wed Jan 18 09:21:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 09:21:26 -0800 Subject: [openib-general] [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID. In-Reply-To: <20060118130536.GA24415@mellanox.co.il> References: <20060118130536.GA24415@mellanox.co.il> Message-ID: <43CE7916.2070909@ichips.intel.com> Jack Morgenstein wrote: > Prevent multiple outstanding MAD transactions with the same TID. > Could happen if duplicate requests are posted. This is an issue only for a single client, so I think that each client should be responsible for handling this. - Sean From mshefty at ichips.intel.com Wed Jan 18 09:39:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 09:39:27 -0800 Subject: [openib-general] Re: [CMA][PATCH] port byte order fix In-Reply-To: References: Message-ID: <43CE7D4F.5030806@ichips.intel.com> James Lentini wrote: > The CMA appears to assume that a struct sockaddr_in's sin_port value > will be in host byte order. This is incorrect. 
Thanks - I'm fairly certain that the CMA will need some updates before final acceptance into the kernel. I'll add this in beforehand. - Sean From mshefty at ichips.intel.com Wed Jan 18 09:46:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 09:46:03 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <20060118020342.GB3740@esmail.cup.hp.com> References: <20060118020342.GB3740@esmail.cup.hp.com> Message-ID: <43CE7EDB.7030201@ichips.intel.com> Grant Grundler wrote: >>+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) >>+{ >>+ int i; >>+ >>+ for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) >>+ dst[i] = src[i] & mask[i]; >>+} > > Is this code going to get invoked very often? In practice, it would be invoked when matching any listen requests originating from the CMA (RDMA connection abstraction). > If so, can the mask operation use a "native" size since > IB_CM_PRIVATE_DATA_COMPARE_SIZE is hard coded to 64 byte? > > e.g something like: > for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); > i++) > ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] > & ((unsigned long *)mask)[i]; Yes - something like this should work. Thanks. - Sean From iod00d at hp.com Wed Jan 18 10:02:43 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 18 Jan 2006 10:02:43 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <43CE7EDB.7030201@ichips.intel.com> References: <20060118020342.GB3740@esmail.cup.hp.com> <43CE7EDB.7030201@ichips.intel.com> Message-ID: <20060118180243.GD6818@esmail.cup.hp.com> On Wed, Jan 18, 2006 at 09:46:03AM -0800, Sean Hefty wrote: > Grant Grundler wrote: > >>+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) ... > >Is this code going to get invoked very often? > > In practice, it would be invoked when matching any listen requests > originating from the CMA (RDMA connection abstraction). 
Hrm... I'm not sure how to translate your answer into a workload. e.g. which netperf or netpipe test would exercise this a lot? Or would it take something like MPI or specweb/ttcp? > >If so, can the mask operation use a "native" size since > >IB_CM_PRIVATE_DATA_COMPARE_SIZE is hard coded to 64 byte? > > > >e.g something like: > > for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned > > long); > > i++) > > ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] > > & ((unsigned long *)mask)[i]; > > Yes - something like this should work. Thanks. Do you need a patch? I can submit one but it will be untested. thanks, grant From mshefty at ichips.intel.com Wed Jan 18 10:13:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 10:13:34 -0800 Subject: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <20060118180243.GD6818@esmail.cup.hp.com> References: <20060118020342.GB3740@esmail.cup.hp.com> <43CE7EDB.7030201@ichips.intel.com> <20060118180243.GD6818@esmail.cup.hp.com> Message-ID: <43CE854E.4060703@ichips.intel.com> Grant Grundler wrote: >>>Is this code going to get invoked very often? >> >>In practice, it would be invoked when matching any listen requests >>originating from the CMA (RDMA connection abstraction). > > Hrm... I'm not sure how to translate your answer into a workload. > e.g. which netperf or netpipe test would exercise this a lot? > Or would it take something like MPI or specweb/ttcp? The code will be invoked at least once for every connection that is established. >>>e.g something like: >>> for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned >>> long); >>> i++) >>> ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] >>> & ((unsigned long *)mask)[i]; >> >>Yes - something like this should work. Thanks. > > > Do you need a patch? > I can submit one but it will be untested. I will incorporate the change with the next set of updates. 
Someone else pointed out that I'd need to make sure that there won't be any alignment issues. - Sean From mshefty at ichips.intel.com Wed Jan 18 10:19:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 10:19:01 -0800 Subject: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction In-Reply-To: References: Message-ID: <43CE8695.9080401@ichips.intel.com> Roland Dreier wrote: > > + UCMA_MAX_BACKLOG = 128 > > Is there any reason that we might want to make this a tunable? Maybe > as a module parameter that's writable in sysfs... There's no reason not to make this tunable. - Sean From iod00d at hp.com Wed Jan 18 10:49:45 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 18 Jan 2006 10:49:45 -0800 Subject: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction In-Reply-To: <43CE8695.9080401@ichips.intel.com> References: <43CE8695.9080401@ichips.intel.com> Message-ID: <20060118184945.GG6818@esmail.cup.hp.com> On Wed, Jan 18, 2006 at 10:19:01AM -0800, Sean Hefty wrote: > Roland Dreier wrote: > > > + UCMA_MAX_BACKLOG = 128 > > > >Is there any reason that we might want to make this a tunable? Maybe > >as a module parameter that's writable in sysfs... > > There's no reason not to make this tunable. Yes, there are reasons to NOT make something a tunable:
o increases system complexity (admin)
o increases the amount of documentation (learning curve)
o increases test matrix/cost (devel/support cost)
o generally hurts performance (var vs a constant of the same value)
Any reason to make something a tunable has to compensate for the above drawbacks. An answer to Roland's question is a reasonable prerequisite if someone wants to add a tunable. IB doesn't have much in /sys/class/infiniband* or module parameters and I think that's a Good Thing. 
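On the cm_mask_compare_data exchange in the [PATCH 2/5] thread above, and the alignment concern Sean mentions: one way to mask a word at a time without assuming anything about pointer alignment is to move each word through a local variable with memcpy. The following is a userspace sketch under assumptions, not the code that was posted or merged. COMPARE_SIZE mirrors the 64-byte size mentioned in the thread, and mask_bytes, mask_words and mask_versions_agree are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COMPARE_SIZE 64  /* mirrors the 64-byte size mentioned in the thread */

/* Byte-at-a-time version, same shape as the quoted patch. */
void mask_bytes(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
    int i;

    for (i = 0; i < COMPARE_SIZE; i++)
        dst[i] = src[i] & mask[i];
}

/* Word-at-a-time version. memcpy moves each word into a local variable,
 * so the buffers need not be aligned to sizeof(unsigned long).
 * Assumes COMPARE_SIZE is a multiple of sizeof(unsigned long). */
void mask_words(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
    size_t i;

    for (i = 0; i < COMPARE_SIZE; i += sizeof(unsigned long)) {
        unsigned long s, m;

        memcpy(&s, src + i, sizeof s);
        memcpy(&m, mask + i, sizeof m);
        s &= m;
        memcpy(dst + i, &s, sizeof s);
    }
}

/* Returns 1 if both versions agree on deliberately misaligned buffers. */
int mask_versions_agree(void)
{
    uint8_t src[COMPARE_SIZE + 1], mask[COMPARE_SIZE + 1];
    uint8_t a[COMPARE_SIZE + 1], b[COMPARE_SIZE + 1];
    int i;

    for (i = 0; i < COMPARE_SIZE + 1; i++) {
        src[i]  = (uint8_t)(i * 37 + 11);
        mask[i] = (uint8_t)(i * 101 + 3);
    }
    /* Offset by one byte so the word version sees unaligned pointers. */
    mask_bytes(a + 1, src + 1, mask + 1);
    mask_words(b + 1, src + 1, mask + 1);
    return memcmp(a + 1, b + 1, COMPARE_SIZE) == 0;
}
```

Compilers commonly turn fixed-size memcpy calls like these into plain loads and stores, but whether this beats the byte loop on a given target would need measuring, in line with Grant's question about which workload exercises the path.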
grant From bos at serpentine.com Wed Jan 18 12:27:32 2006 From: bos at serpentine.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 12:27:32 -0800 Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <1137568107.3005.69.camel@laptopd505.fenrus.org> References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> Message-ID: <1137616052.4757.85.camel@serpentine.pathscale.com> On Wed, 2006-01-18 at 08:08 +0100, Arjan van de Ven wrote: > the dual license text needs a bit of clarification I suspect to make > explicit that the "or BSD" part only applies when used entirely outside > the linux kernel. (that already is the case, just it's not explicit. > Making that explicit would be good). One appropriate way to do that would be to mark all IB-related exported symbols as EXPORT_SYMBOL_GPL. Message-ID: >+ if (neigh->parms->neigh_destructor) >+ (neigh->parms->neigh_destructor)(neigh); Is that safe without checking neigh->parms here? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Wed Jan 18 12:54:01 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 18 Jan 2006 15:54:01 -0500 (EST) Subject: [openib-general] uDAPL presentation Message-ID: Hi Bill, Would you like Arlin and I to create a uDAPL presentation for the workshop? james -- James Lentini | Network Appliance | 781-768-5359 | jlentini at netapp.com From mst at mellanox.co.il Wed Jan 18 13:19:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 23:19:26 +0200 Subject: [openib-general] Repost [PATCH 1 of 3] move destructor to struct neigh_parms Message-ID: <20060118211926.GE31280@mellanox.co.il> Sorry about reposting: the message didn't seem to make it to netdev. --- Hi! 
struct neigh_ops currently has a destructor field, unused by in-kernel drivers outside the infiniband subtree. infiniband ipoib in-tree driver currently uses this field, and we've run into problems: since the destructor is shared between neighbours that belong to different net devices, there's no way to set/clear it safely. It would be good to know what the design was behind putting the destructor method there in the first place. The following patch moves this field to neigh_parms where it can be safely set, together with its twin neigh_setup. Two additional patches in the patch series update ipoib to use this new interface. Please Cc me on replies, I'm not on the list. --- Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/net/core/neighbour.c =================================================================== --- linux-2.6.15.orig/net/core/neighbour.c 2006-01-12 11:58:15.000000000 +0200 +++ linux-2.6.15/net/core/neighbour.c 2006-01-12 20:10:00.000000000 +0200 @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); Index: linux-2.6.15/include/net/neighbour.h =================================================================== --- linux-2.6.15.orig/include/net/neighbour.h 2006-01-03 05:21:10.000000000 +0200 +++ linux-2.6.15/include/net/neighbour.h 2006-01-12 20:09:27.000000000 +0200 @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void 
(*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); -- MST From mst at mellanox.co.il Wed Jan 18 13:19:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 23:19:53 +0200 Subject: [openib-general] Repost [PATCH 2 of 3] ipoib: move destructor to struct neigh_parms Message-ID: <20060118211953.GF31280@mellanox.co.il> Sorry about reposting: the message didn't seem to make it to netdev. --- Move destructor from neigh_ops (which is shared between devices) to neigh_parms which is not, so that multiple drivers can set it safely. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:30:52.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:31:26.000000000 +0200 @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. 
- */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } -- MST From mst at mellanox.co.il Wed Jan 18 13:20:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 23:20:18 +0200 Subject: [openib-general] Repost [PATCH 3 of 3] ipoib: fix error handling Message-ID: <20060118212018.GG31280@mellanox.co.il> Sorry about reposting: the message didn't seem to make it to netdev. --- The following patch is not directly related to the destructor issue, but I'm posting it here for completeness since it needs to be applied on top of the previous pair of patches in the destructor series. --- Fix error handling in neigh_add_path. Reduce code duplication by implementing alloc/free functions for ipoib_neigh. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:06.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-01-12 20:48:43.000000000 +0200 @@ -246,8 +246,7 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -475,7 +474,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -483,8 +482,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * 
We can only be called from ipoib_start_xmit, so we're @@ -497,7 +494,7 @@ static void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -527,10 +524,9 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - kfree(neigh); - +err_path: + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +753,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -766,6 +761,26 @@ static void ipoib_neigh_destructor(struc if (ah) ipoib_put_ah(ah); } + +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +{ + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; +} + +void ipoib_neigh_free(struct ipoib_neigh *neigh) +{ + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); +} static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:32:08.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-12 20:48:43.000000000 +0200 @@ -113,8 +113,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -720,13 +719,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - 
struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:27:47.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h 2006-01-12 20:48:43.000000000 +0200 @@ -222,6 +222,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From mst at mellanox.co.il Wed Jan 18 13:27:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jan 2006 23:27:32 +0200 Subject: [openib-general] Re: Fwd: [PATCH 1 of 3] move destructor to struct neigh_parms In-Reply-To: References: Message-ID: <20060118212732.GA32283@mellanox.co.il> Quoting Shirley Ma : > Subject: Re: Fwd: [PATCH 1 of 3] move destructor to struct neigh_parms > > > >+ if (neigh->parms->neigh_destructor) > >+ (neigh->parms->neigh_destructor)(neigh); > > Is that safe without checking neigh->parms here? Yes, we have neigh_parms_put(neigh->parms); a couple of lines below. 
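Separately from the neigh->parms question, the error-path change in patch 3 of this series can be sketched in userspace. As I read the diff, the new err_path label lets a failure that happens before the entry is put on a list skip list_del(), while both failure paths share one free helper that clears the back-pointer. Everything below is a hypothetical stand-in (struct owner for struct neighbour, the slot field for *to_ipoib_neigh(), an on_list flag for the list linkage, and a fail_at switch to force each path); it is not the ipoib code.

```c
#include <assert.h>
#include <stdlib.h>

struct owner;                                      /* stand-in for struct neighbour */
struct entry { struct owner *owner; int on_list; };
struct owner { struct entry *slot; };              /* slot stands in for *to_ipoib_neigh() */

/* Allocate an entry and publish it in the owner's back-pointer slot. */
struct entry *entry_alloc(struct owner *o)
{
    struct entry *e = malloc(sizeof *e);

    if (!e)
        return NULL;
    e->owner = o;
    e->on_list = 0;
    o->slot = e;
    return e;
}

/* Clear the back-pointer before freeing so no dangling pointer survives. */
void entry_free(struct entry *e)
{
    e->owner->slot = NULL;
    free(e);
}

/* fail_at: 0 = succeed, 1 = fail before the entry is linked,
 * 2 = fail after it is linked. Returns 0 on success, -1 on failure. */
int add_path(struct owner *o, int fail_at)
{
    struct entry *e = entry_alloc(o);

    if (!e)
        return -1;

    if (fail_at == 1)
        goto err_path;      /* nothing linked yet: skip the unlink step */

    e->on_list = 1;         /* stands in for list_add_tail() */

    if (fail_at == 2)
        goto err;

    return 0;

err:
    e->on_list = 0;         /* stands in for list_del() */
err_path:
    entry_free(e);
    return -1;
}
```

Ordering the labels so that later failures fall through the earlier cleanup steps keeps each undo action in exactly one place, which is the same duplication the alloc/free helpers remove.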
-- MST From halr at voltaire.com Wed Jan 18 14:11:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Jan 2006 17:11:05 -0500 Subject: [openib-general] Re: [patch] userspace/management/diags/src/sminfo.c - cmdline processing fix In-Reply-To: <20060117101000.GA8053@sashak.voltaire.com> References: <20060117101000.GA8053@sashak.voltaire.com> Message-ID: <1137507072.4337.41.camel@localhost.localdomain> Hi Sasha, On Tue, 2006-01-17 at 05:10, Sasha Khapyorsky wrote: > Hello Hal, > > There is a small bug in sminfo's cmdline processing, this will segfault > when option argument is missing (like 'sminfo -a'). The "fast and dirty" > fix is inlined. Thanks. Applied. > The same problem exists with most diag tools, so I think we need to > rework ARGBEGIN { ... } ARGEND stuff (actually remove it from > libibcommon since it is used by diag tools only). I can do it if there > are no objections. I would welcome such a patch. Thanks. -- Hal > Regards, > Sasha. > > > This is a fast fix for invalid ARGF() usage in sminfo.c. > > Signed-off-by: Sasha Khapyorsky > > Index: diags/src/sminfo.c > =================================================================== > --- diags/src/sminfo.c (revision 5017) > +++ diags/src/sminfo.c (working copy) > @@ -49,6 +49,8 @@ > > #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) > > +#define SAFE_ARGF() (*(argv+1) ? ARGF() : ( usage(), NULL ) ) > + > static void > iberror(const char *fn, char *msg, ...) 
> { > @@ -116,10 +118,10 @@ > > ARGBEGIN { > case 'C': > - ca = ARGF(); > + ca = SAFE_ARGF(); > break; > case 'P': > - ca_port = strtoul(ARGF(), 0, 0); > + ca_port = strtoul(SAFE_ARGF(), 0, 0); > break; > case 'd': > ibdebug++; > @@ -137,17 +139,17 @@ > dest_type = IB_DEST_GUID; > break; > case 't': > - timeout = strtoul(ARGF(), 0, 0); > + timeout = strtoul(SAFE_ARGF(), 0, 0); > madrpc_set_timeout(timeout); > break; > case 'a': > - act = strtoul(ARGF(), 0, 0); > + act = strtoul(SAFE_ARGF(), 0, 0); > break; > case 's': > - state = strtoul(ARGF(), 0, 0); > + state = strtoul(SAFE_ARGF(), 0, 0); > break; > case 'p': > - prio = strtoul(ARGF(), 0, 0); > + prio = strtoul(SAFE_ARGF(), 0, 0); > break; > case 'V': > fprintf(stderr, "%s %s\n", argv0, get_build_version() ); From robert.j.woodruff at intel.com Wed Jan 18 14:17:06 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 18 Jan 2006 14:17:06 -0800 Subject: [openib-general] RE: Offer to do a uDAPL presentation in Sonoma Message-ID: <1AC79F16F5C5284499BB9591B33D6F0006A31C4E@orsmsx408> I don't think that Arlin is planning on attending. woody ________________________________ From: Bill Boas [mailto:bill.boas at gmail.com] Sent: Wednesday, January 18, 2006 2:11 PM To: James Lentini Cc: openib-general; Arlin Davis; Matt L. Leininger; Hal Rosenstock; Woodruff, Robert J; spoole at lanl.gov; Asaf Somekh; Bob Pearson Subject: Offer to do a uDAPL presentation in Sonoma James, I think so - lets get agreement and suggestions from the people really working on the content like Hal, Woody, Matt, Steve, Asaf, Bob and others. Thank you. Bill. On 1/18/06, James Lentini wrote: Hi Bill, Would you like Arlin and I to create a uDAPL presentation for the workshop? james -- James Lentini | Network Appliance | 781-768-5359 | jlentini at netapp.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arlin.r.davis at intel.com Wed Jan 18 14:23:00 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 18 Jan 2006 14:23:00 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 extension proposal Message-ID: <59278FC0C48A994BABABD069571E45680DA66188@orsmsx401.amr.corp.intel.com> A new proposal, addressing the following questions, is attached for review. -arlin ________________________________ From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] Sent: Tuesday, January 17, 2006 7:48 AM To: Davis, Arlin R; Lentini, James Cc: dat-discussions at yahoogroups.com; openib-general at openib.org Subject: RE: [RFC] DAT 2.0 extension proposal Arlin, 1. Does it mean that existing DAT providers will have to be modified so they report DAT_NOT_IMPLEMENTED for each extension? 2. Why is there DAT_INVALID in DAT_DTOS? 3. Do you want to use DAT_EXTENSION_DATA or DAT_EXT_DATA? 4. The proposed operations are operation on EP and they are DTOs. Why not define DAT_DTO_EXT_OP instead of DAT_EXT_OP? MY concern is that if these are not DTO then we have a new event stream type for "extensions" and we need to define rules for this event stream including ordering rules and interactions with other event streams, provider attributes for stream mixing and so on... If we restrict extensions to DTO operation extension we avoid all these issues and simplify APIs. On the negative side these extension are restrictive. 5. Memory protection extension for atomic operations 6. error returns for extensions? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: DAT_Extensions_Rev2.pdf Type: application/octet-stream Size: 84372 bytes Desc: DAT_Extensions_Rev2.pdf URL: From jlentini at netapp.com Wed Jan 18 14:24:04 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 18 Jan 2006 17:24:04 -0500 (EST) Subject: [openib-general] RE: Offer to do a uDAPL presentation in Sonoma In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0006A31C4E@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0006A31C4E@orsmsx408> Message-ID: Arlin and I spoke about that. He likely won't be able to attend the workshop, but he was interested in co-authoring the presentation. I think this is a great opportunity to present the new uDAPL interfaces he has been working on. james On Wed, 18 Jan 2006, Woodruff, Robert J wrote: > I don't think that Arlin is planning on attending. > > woody > > > ________________________________ > > From: Bill Boas [mailto:bill.boas at gmail.com] > Sent: Wednesday, January 18, 2006 2:11 PM > To: James Lentini > Cc: openib-general; Arlin Davis; Matt L. Leininger; Hal Rosenstock; > Woodruff, Robert J; spoole at lanl.gov; Asaf Somekh; Bob Pearson > Subject: Offer to do a uDAPL presentation in Sonoma > > > James, > > I think so - lets get agreement and suggestions from the people really > working on the content like Hal, Woody, Matt, Steve, Asaf, Bob and > others. > > Thank you. > > Bill. > > > On 1/18/06, James Lentini wrote: > > > Hi Bill, > > Would you like Arlin and I to create a uDAPL presentation for > the > workshop? > > james > > -- > James Lentini | Network Appliance | 781-768-5359 | > jlentini at netapp.com > > > > From arlin.r.davis at intel.com Wed Jan 18 16:37:11 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 18 Jan 2006 16:37:11 -0800 Subject: [openib-general] [RFC][PATCH][REV2] uDAPL atomic extensions and immediate data changes Message-ID: James, Attached is a patch that includes immediate data as a standard API and atomics via extensions. 
Changes have been made to the extensions based on discussions with Arkady
(name changes and addition of extended dat_return and memory privileges).
This matches the DAT 2.0 extension proposal that I sent out earlier.

-arlin

Signed-off-by: Arlin Davis

Index: test/dtest/dtest_ext.c
===================================================================
--- test/dtest/dtest_ext.c (revision 0)
+++ test/dtest/dtest_ext.c (revision 0)
@@ -0,0 +1,952 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "dat/udat.h"
+
+/*
+ * Map DAT_RETURN values to readable strings,
+ * but don't assume the values are zero-based or contiguous.
+ */
+char errmsg[256] = {0};
+char majmsg[64] = {0};
+char minmsg[64] = {0};
+char extmsg[64] = {0};
+const char *
+DT_RetToString (DAT_RETURN ret_value)
+{
+    const char *major_msg = majmsg;
+    const char *minor_msg = minmsg;
+    const char *ext_msg = extmsg;
+    int sz;
+
+    /* DAT_NOT_IMPLEMENTED definition masked improperly in dat_error.h */
+    if (ret_value == DAT_NOT_IMPLEMENTED) {
+        strcpy(errmsg, "DAT_NOT_IMPLEMENTED");
+        return errmsg;
+    }
+
+    dat_strerror(ret_value, &major_msg, &minor_msg);
+    dat_strerror_extension(ret_value, &ext_msg);
+    strcpy(errmsg, major_msg);
+    strcat(errmsg, " ");
+    strcat(errmsg, ext_msg);
+    strcat(errmsg, " ");
+    strcat(errmsg, minor_msg);
+
+    return errmsg;
+}
+
+/*
+ * Map DAT_EVENT_CODE values to readable strings
+ */
+const char *
+DT_EventToSTr (DAT_EVENT_NUMBER event_code)
+{
+    unsigned int i;
+    static struct {
+        const char *name;
+        DAT_RETURN value;
+    }
+    dat_events[] =
+    {
+    # define DATxx(x) { # x, x }
+        DATxx (DAT_DTO_COMPLETION_EVENT),
+        DATxx (DAT_RMR_BIND_COMPLETION_EVENT),
+        DATxx (DAT_CONNECTION_REQUEST_EVENT),
+        DATxx (DAT_CONNECTION_EVENT_ESTABLISHED),
+        DATxx (DAT_CONNECTION_EVENT_PEER_REJECTED),
+        DATxx (DAT_CONNECTION_EVENT_NON_PEER_REJECTED),
+        DATxx (DAT_CONNECTION_EVENT_ACCEPT_COMPLETION_ERROR),
+        DATxx (DAT_CONNECTION_EVENT_DISCONNECTED),
+ DATxx (DAT_CONNECTION_EVENT_BROKEN), + DATxx (DAT_CONNECTION_EVENT_TIMED_OUT), + DATxx (DAT_CONNECTION_EVENT_UNREACHABLE), + DATxx (DAT_ASYNC_ERROR_EVD_OVERFLOW), + DATxx (DAT_ASYNC_ERROR_IA_CATASTROPHIC), + DATxx (DAT_ASYNC_ERROR_EP_BROKEN), + DATxx (DAT_ASYNC_ERROR_TIMED_OUT), + DATxx (DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR), + DATxx (DAT_SOFTWARE_EVENT) + # undef DATxx + }; + # define NUM_EVENTS (sizeof(dat_events)/sizeof(dat_events[0])) + + for (i = 0; i < NUM_EVENTS; i++) { + if (dat_events[i].value == event_code) + { + return ( dat_events[i].name ); + } + } + return ( "Invalid_DAT_EVENT_NUMBER" ); +} + +/* + * Map DAT_EVENT_CODE values to readable strings + */ +const char * +DT_DtoStatusToSTr (DAT_DTO_COMPLETION_STATUS dto_status ) +{ + unsigned int i; + static struct { + const char *name; + DAT_RETURN value; + } + dat_dto[] = + { + # define DATxx(x) { # x, x } + DATxx (DAT_DTO_SUCCESS), + DATxx (DAT_DTO_ERR_FLUSHED), + DATxx (DAT_DTO_ERR_LOCAL_LENGTH), + DATxx (DAT_DTO_ERR_LOCAL_EP), + DATxx (DAT_DTO_ERR_LOCAL_PROTECTION), + DATxx (DAT_DTO_ERR_BAD_RESPONSE), + DATxx (DAT_DTO_ERR_REMOTE_ACCESS), + DATxx (DAT_DTO_ERR_REMOTE_RESPONDER), + DATxx (DAT_DTO_ERR_TRANSPORT), + DATxx (DAT_DTO_ERR_RECEIVER_NOT_READY), + DATxx (DAT_DTO_ERR_PARTIAL_PACKET), + DATxx (DAT_RMR_OPERATION_FAILED) + # undef DATxx + }; + # define NUM_DTO_ERRS (sizeof(dat_dto)/sizeof(dat_dto[0])) + + for (i = 0; i < NUM_DTO_ERRS; i++) { + if (dat_dto[i].value == dto_status) + { + return ( dat_dto[i].name ); + } + } + return ( "Invalid DAT_DTO_COMPLETION_STATUS" ); +} + +#define _OK( status, str ) \ +{ \ + if ( status != DAT_SUCCESS ) { \ + fprintf(stderr, str " returned %s\n", \ + DT_RetToString(status) ); \ + exit ( 1 ); \ + } \ +} + +#define _OK_EVENT( event, status, str ) \ +{ \ + if ( status != DAT_SUCCESS ) { \ + fprintf(stderr, str " event %s status %s\n", \ + DT_EventToSTr(event), DT_DtoStatusToSTr(status)); \ + exit ( 1 ); \ + } \ +} + +#define SECONDS( secs ) (1000*1000*secs) + 
+#define SERVER_CONN_QUAL 31111 +#define BUF_SIZE 256 +#define BUF_SIZE_ATOMIC 8 +#define REG_MEM_COUNT 10 + +#define SND_RDMA_BUF_INDEX 0 +#define RCV_RDMA_BUF_INDEX 1 +#define SEND_BUF_INDEX 2 +#define RECV_BUF_INDEX 3 + +u_int64_t *atomic_buf; +DAT_LMR_HANDLE lmr_atomic; +DAT_LMR_CONTEXT lmr_atomic_context; +DAT_RMR_CONTEXT rmr_atomic_context; +DAT_VLEN reg_atomic_size; +DAT_VADDR reg_atomic_addr; + +DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; +DAT_LMR_CONTEXT lmr_context[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; +DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; +DAT_VLEN reg_size[ REG_MEM_COUNT ]; +DAT_VADDR reg_addr[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; +DAT_EP_HANDLE ep; +DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; +DAT_IA_HANDLE ia = DAT_HANDLE_NULL; +DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; +DAT_EVD_HANDLE cr_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; +DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; +DAT_CR_HANDLE cr = DAT_HANDLE_NULL; +int server = 1; + +void +send_msg( + void *data, + DAT_COUNT size, + DAT_LMR_CONTEXT context, + DAT_DTO_COOKIE cookie, + DAT_COMPLETION_FLAGS flags ) +{ + DAT_LMR_TRIPLET iov; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_RETURN status; + + iov.lmr_context = context; + iov.pad = 0; + iov.virtual_address = (DAT_VADDR)(unsigned long)data; + iov.segment_length = (DAT_VLEN)size; + + status = dat_ep_post_send( ep, + 1, + &iov, + cookie, + flags ); + _OK( status, "dat_ep_post_send" ); + + if ( ! 
(flags & DAT_COMPLETION_SUPPRESS_FLAG) ) {
+        status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore );
+        _OK( status, "dat_evd_wait after dat_ep_post_send" );
+
+        if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) {
+            printf("unexpected event waiting for post_send completion - 0x%x\n", event.event_number);
+            exit ( 1 );
+        }
+
+        _OK( event.event_data.dto_completion_event_data.status, "event status for post_send" );
+    }
+}
+
+int
+connect_ep( char *hostname )
+{
+    DAT_SOCK_ADDR remote_addr;
+    DAT_EP_ATTR ep_attr;
+    DAT_RETURN status;
+    DAT_REGION_DESCRIPTION region;
+    DAT_EVENT event;
+    DAT_COUNT nmore;
+    DAT_LMR_TRIPLET iov;
+    DAT_RMR_TRIPLET r_iov;
+    DAT_DTO_COOKIE cookie;
+    DAT_PROVIDER_ATTR provider_attr;
+    DAT_NAMED_ATTR named_attrs;
+    int i,ext_cnt;
+    DAT_DTO_COMPLETION_EVENT_DATA *dto_event = &event.event_data.dto_completion_event_data;
+
+    status = dat_ia_open( "OpenIB-ib0", 8, &async_evd, &ia );
+    _OK( status, "dat_ia_open" );
+
+    /* query for immediate data and atomic operation extensions */
+    status = dat_ia_query( ia, NULL, DAT_IA_FIELD_NONE, NULL,
+                           DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR,
+                           &provider_attr );
+    _OK( status, "dat_ia_query" );
+
+    /* look for extension support, ALL or nothing */
+    ext_cnt=0;
+    printf(" Extension Attributes:\n");
+    for (i=0;iai_addr)->sin_addr.s_addr;
+        printf ("Server Name: %s \n", hostname);
+        printf ("Server Net Address: %d.%d.%d.%d\n",
+                (rval >> 0) & 0xff,
+                (rval >> 8) & 0xff,
+                (rval >> 16) & 0xff,
+                (rval >> 24) & 0xff);
+
+        remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr);
+
+        strcpy( (char*)buf[ SND_RDMA_BUF_INDEX ], "client written data" );
+
+        status = dat_ep_connect( ep,
+                                 &remote_addr,
+                                 SERVER_CONN_QUAL,
+                                 SECONDS( 20 ),
+                                 0,
+                                 (DAT_PVOID)0,
+                                 0,
+                                 DAT_CONNECT_DEFAULT_FLAG );
+        _OK( status, "dat_ep_connect" );
+
+
+    }
+
+    printf("Client waiting for connect response\n");
+    status = dat_evd_wait( con_evd, SECONDS( 5 ), 1, &event, &nmore );
+    _OK( status, "connect dat_evd_wait" );
+
+    if (
event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED ) {
+        printf("unexpected event after dat_ep_connect: 0x%x\n", event.event_number);
+        exit ( 1 );
+    }
+
+    printf("Connected!\n");
+
+    /*
+     * Setup our remote memory and tell the other side about it
+     */
+    printf("Sending RMR data to remote\n");
+    r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ];
+    r_iov.pad = 0;
+    r_iov.target_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]);
+    r_iov.segment_length = BUF_SIZE;
+
+    *buf[ SEND_BUF_INDEX ] = r_iov;
+
+    send_msg(buf[ SEND_BUF_INDEX ],
+             sizeof( DAT_RMR_TRIPLET ),
+             lmr_context[ SEND_BUF_INDEX ],
+             cookie,
+             DAT_COMPLETION_SUPPRESS_FLAG );
+
+    /*
+     * Wait for their RMR
+     */
+    printf("Waiting for remote to send RMR data\n");
+    status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore );
+    _OK( status, "dat_evd_wait after dat_ep_post_send" );
+
+    if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) {
+        printf("unexpected event waiting for RMR context - 0x%x\n",
+               event.event_number);
+        exit ( 1 );
+    }
+
+    _OK_EVENT( event.event_number, dto_event->status, " post_send" );
+    if ( (dto_event->transfered_length != sizeof( DAT_RMR_TRIPLET )) ||
+         (dto_event->user_cookie.as_64 != RECV_BUF_INDEX) ) {
+        printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n",
+               (int)dto_event->transfered_length,
+               (int)dto_event->user_cookie.as_64,
+               (int)sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX);
+        exit ( 1 );
+    }
+
+    r_iov = *buf[ RECV_BUF_INDEX ];
+
+    printf("Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len=%d\n",
+           r_iov.rmr_context,
+           r_iov.pad,
+           (void*)(unsigned long)r_iov.target_address,
+           (int)r_iov.segment_length );
+
+    return ( 0 );
+}
+
+int
+disconnect_ep( )
+{
+    DAT_RETURN status;
+    int i;
+
+    status = dat_ep_disconnect( ep, DAT_CLOSE_DEFAULT );
+    _OK( status, "dat_ep_disconnect" );
+
+    printf("EP disconnected\n");
+
+    if ( server ) {
+        status = dat_psp_free( psp );
+        _OK( status, "dat_psp_free" );
+    }
+
+    for ( i = 0; i <
REG_MEM_COUNT; i++ ) { + status = dat_lmr_free( lmr[ i ] ); + _OK( status, "dat_lmr_free" ); + } + + status = dat_lmr_free( lmr_atomic ); + _OK( status, "dat_lmr_free_atomic" ); + + status = dat_ep_free( ep ); + _OK( status, "dat_ep_free" ); + + status = dat_evd_free( dto_evd ); + _OK( status, "dat_evd_free DTO" ); + status = dat_evd_free( con_evd ); + _OK( status, "dat_evd_free CON" ); + status = dat_evd_free( cr_evd ); + _OK( status, "dat_evd_free CR" ); + + status = dat_pz_free( pz ); + _OK( status, "dat_pz_free" ); + + status = dat_ia_close( ia, DAT_CLOSE_DEFAULT ); + _OK( status, "dat_ia_close" ); + + return ( 0 ); +} + +int +do_immediate( ) +{ + DAT_REGION_DESCRIPTION region; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET iov; + DAT_RMR_TRIPLET r_iov; + DAT_DTO_COOKIE cookie; + DAT_RMR_CONTEXT their_context; + DAT_RETURN status; + DAT_UINT32 immed_data; + DAT_UINT32 immed_data_recv; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); + + if ( server ) { + immed_data = 0x1111; + } else { + immed_data = 0x7777; + } + + cookie.as_64 = 0x5555; + + r_iov = *buf[ RECV_BUF_INDEX ]; + + iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; + iov.pad = 0; + iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; + iov.segment_length = BUF_SIZE; + + cookie.as_64 = 0x9999; + + status = dat_ep_post_rdma_write_immed( ep, // ep_handle + 1, // num_segments + &iov, // LMR + cookie, // user_cookie + &r_iov, // RMR + immed_data, + DAT_DTO_IMMED_FLAG, + DAT_COMPLETION_DEFAULT_FLAG ); + + _OK( status, "dat_ep_post_rdma_write_immed" ); + printf("dat_ep_post_rdma_write_immed posted\n"); + + /* + * Collect first event, write completion or the inbound rdma with immed + */ + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_rdma_write" ); + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) + { + 
printf("unexpected event waiting for RMR context - 0x%x\n", + event.event_number ); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " rdma_write_immed" ); + if (dto_event->operation == DAT_RDMA_WRITE_IMMED ) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999) ) + { + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64); + exit ( 1 ); + } + } + else if (dto_event->operation == DAT_RECEIVE_RDMA_IMMED ) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) { + printf("unexpected event data of immediate write:" + "len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit ( 1 ); + } + /* get immediate data from DTO event */ + immed_data_recv = dto_event->immed_data; + } + else + { + printf("unexpected operation type - 0x%x, 0x%x\n", + event.event_number, dto_event->operation); + exit ( 1 ); + } + + /* + * Collect second event, write completion or the inbound rdma with immed + */ + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_rdma_write" ); + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) + { + printf("unexpected event waiting for RMR context - 0x%x\n", + event.event_number ); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " rdma_write_immed" ); + if (dto_event->operation == DAT_RDMA_WRITE_IMMED ) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999) ) + { + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64); + exit ( 1 ); + } + } + else if (dto_event->operation == DAT_RECEIVE_RDMA_IMMED ) + { + if 
((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) { + printf("unexpected event data of immediate write:" + "len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit ( 1 ); + } + /* get immediate data from DTO event */ + immed_data_recv = dto_event->immed_data; + } + else + { + printf("unexpected operation type - 0x%x, 0x%x\n", + event.event_number, dto_event->operation); + exit ( 1 ); + } + + if ((server) && (immed_data_recv != 0x7777)) + { + printf("Server got unexpected immed_data_recv 0x%x/0x%x\n", + 0x7777, immed_data_recv ); + exit ( 1 ); + } + else if ((!server) && (immed_data_recv != 0x1111)) + { + printf("Client got unexpected immed_data_recv 0x%x/0x%x\n", + 0x1111, immed_data_recv ); + exit ( 1 ); + } + + if (server) + printf("Server received immed_data=0x%x, expected 0x7777\n", immed_data_recv ); + else + printf("Client received immed_data=0x%x, expected 0x1111\n", immed_data_recv ); + + printf("RCV buffer %p contains: %s\n", + buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); + + return ( 0 ); +} + +int +do_cmp_swap() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + printf("\nDoing CMP and SWAP\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.pad = 0; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 3333; + if ( server ) { + *target = 0x12345; + sleep(1); + /* server does not compare and should not swap */ + status = dat_ep_post_cmp_and_swap( ep, + (DAT_UINT64)0x654321, + (DAT_UINT64)0x6789A, + &l_iov, + cookie, + &r_iov, + 
DAT_COMPLETION_DEFAULT_FLAG); + } else { + *target = 0x54321; + sleep(1); + /* client does compare and should swap */ + status = dat_ep_post_cmp_and_swap( ep, + (DAT_UINT64)0x12345, + (DAT_UINT64)0x98765, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK( status, "dat_ep_post_cmp_and_swap" ); + printf("dat_ep_post_cmp_and_swap posted\n"); + + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for compare and swap" ); + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) { + printf("unexpected event after post_cmp_and_swap: 0x%x\n", + event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " cmp_swap" ); + if ( dto_event->extension.type != DAT_DTO_EXTENSION_CMP_AND_SWAP ) { + printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", + (int)dto_event->extension.type, + (int)dto_event->user_cookie.as_64, + *atomic_buf); + exit ( 1 ); + } + sleep(1); + if ( server ) { + printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); + } else { + printf("Client got original data = 0x%llx, expected 0x12345\n",*atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); + } + return(0); +} + +int +do_fetch_add() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + printf("\nDoing FETCH and ADD\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.pad = 0; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 0x7777; + if ( server ) { + *target = 0x10; + 
sleep( 1 ); + status = dat_ep_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + *target = 0x100; + sleep( 1 ); + status = dat_ep_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK( status, "dat_ep_post_fetch_and_add" ); + printf("dat_ep_post_fetch_and_add posted\n"); + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for fetch and add" ); + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) { + printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " fetch_add" ); + if ( dto_event->extension.type != DAT_DTO_EXTENSION_FETCH_AND_ADD ) { + printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", + (int)dto_event->extension.type, + (int)dto_event->user_cookie.as_64, + (int)*atomic_buf ); + exit ( 1 ); + } + + if ( server ) { + printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf ); + } else { + printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf ); + } + + sleep( 1 ); + + if ( server ) { + status = dat_ep_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + status = dat_ep_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for second fetch and add" ); + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) { + printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " fetch_add" ); + if ( dto_event->extension.type != DAT_DTO_EXTENSION_FETCH_AND_ADD ) { + printf("unexpected event data of 
second fetch and add : type=%d cookie=%d original%d\n", + (int)dto_event->extension.type, + (int)dto_event->user_cookie.as_64, + (long)atomic_buf); + exit ( 1 ); + } + + sleep( 1 ); + if ( server ) { + printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); + } else { + printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); + } + + return ( 0 ); +} + +void print_usage() +{ + printf("\n dtest_ext usage \n\n"); + printf("s: server\n"); + printf("h: hostname\n"); + printf("\n"); +} + +int +main(int argc, char **argv) +{ + int i,c; + char hostname[100]; + + /* parse arguments */ + while ((c = getopt(argc, argv, "sh:")) != -1) + { + switch(c) + { + case 's': + server = 1; + break; + case 'h': + server = 0; + strcpy(hostname, optarg); + break; + default: + print_usage(); + exit(1); + } + } + + if (server) + printf("Server: using provider OpenIB-ib0\n"); + else + printf("Client: using provider OpenIB-ib0, connect to %s \n", + hostname); + + /* connect and send rdma buffer information */ + if (connect_ep(hostname)) + exit(1); + + if (do_immediate()) + exit(1); + + if (do_cmp_swap()) + exit(1); + + if (do_fetch_add()) + exit(1); + + return (disconnect_ep()); +} Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 5065) +++ test/dtest/makefile (working copy) @@ -4,13 +4,18 @@ CFLAGS = -O2 -g DAT_INC = ../../dat/include DAT_LIB = /usr/local/lib -all: dtest +all: dtest dtest_ext clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest_ext dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat +dtest_ext: ./dtest_ext.c + $(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \ + -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ + -I 
$(DAT_INC) -L $(DAT_LIB) -ldat + Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 5065) +++ dapl/include/dapl.h (working copy) @@ -1,25 +1,28 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
@@ -61,6 +64,8 @@ #include "dapl_dummy_util.h" #elif OPENIB #include "dapl_ib_util.h" +#elif DET +#include "dapl_det_util.h" #endif /********************************************************************* @@ -178,6 +183,11 @@ typedef enum dapl_qp_state #define DAT_ERROR(Type, SubType) ((DAT_RETURN)(DAT_CLASS_ERROR | Type | SubType)) +#ifdef DAT_EXTENSIONS +#define DAT_ERROR_EXTENSION(Type, ExtType, SubType) \ + ((DAT_RETURN)(DAT_CLASS_ERROR | Type | ExtType | SubType)) +#endif + /********************************************************************* * * * Typedefs * @@ -563,6 +573,15 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAT_IMMEDIATE_DATA + DAPL_DTO_TYPE_SEND_IMMED, + DAPL_DTO_TYPE_RECV_IMMED, + DAPL_DTO_TYPE_RDMA_WRITE_IMMED, +#endif +#ifdef DAT_EXTENSIONS + DAPL_DTO_TYPE_EXTENSION +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +589,7 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, + } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -578,6 +598,9 @@ struct dapl_dto_cookie DAPL_DTO_TYPE type; DAT_DTO_COOKIE cookie; DAT_COUNT size; /* used for SEND and RDMA write */ +#ifdef DAT_EXTENSIONS + DAT_PVOID extension; /* extended DTO ops */ +#endif }; /* DAPL_RMR_COOKIE used as context for bind WQEs */ @@ -1116,6 +1139,42 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_IMMEDIATE_DATA +extern DAT_RETURN dapl_ep_post_send_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ + +extern DAT_RETURN dapl_ep_post_recv_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, 
/* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ + +extern DAT_RETURN dapl_ep_post_rdma_write_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN const DAT_RMR_TRIPLET *,/* remote_iov */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ +#endif + +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_DTO_EXTENSION_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/include/dapl_debug.h =================================================================== --- dapl/include/dapl_debug.h (revision 5065) +++ dapl/include/dapl_debug.h (working copy) @@ -112,7 +112,16 @@ extern void dapl_internal_dbg_log ( DAPL #define DCNT_EVD_DEQUEUE_NOT_FOUND 18 #define DCNT_TIMER_SET 19 #define DCNT_TIMER_CANCEL 20 + +#ifdef DAT_IMMEDIATE_DATA +#define DCNT_POST_SEND_IMMED 21 +#define DCNT_POST_RECV_IMMED 22 +#define DCNT_POST_RDMA_WRITE_IMMED 23 +#define DCNT_NUM_COUNTERS 24 +#else #define DCNT_NUM_COUNTERS 21 +#endif + #define DCNT_ALL_COUNTERS DCNT_NUM_COUNTERS #if defined(DAPL_COUNTERS) Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 5065) +++ dapl/udapl/Makefile (working copy) @@ -80,6 +80,13 @@ ifdef OS_VENDOR CFLAGS += -D$(OS_VENDOR) endif +# If an implementation supports immdiate data and extensions +CFLAGS += -DDAT_EXTENSIONS +CFLAGS += -DDAT_IMMEDIATE_DATA + +# If an implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + # # dummy provider # @@ -283,6 +290,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c 
dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ @@ -320,6 +329,9 @@ COMMON_SRCS = dapl_cookie.c \ dapl_ep_post_rdma_write.c \ dapl_ep_post_recv.c \ dapl_ep_post_send.c \ + dapl_ep_post_recv_immed.c \ + dapl_ep_post_send_immed.c \ + dapl_ep_post_rdma_write_immed.c \ dapl_ep_query.c \ dapl_ep_util.c \ dapl_evd_dequeue.c \ Index: dapl/common/dapl_ep_post_recv_immed.c =================================================================== --- dapl/common/dapl_ep_post_recv_immed.c (revision 0) +++ dapl/common/dapl_ep_post_recv_immed.c (revision 0) @@ -0,0 +1,136 @@ +/* + * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */
+
+/**********************************************************************
+ *
+ * MODULE: dapl_ep_post_recv_immed.c
+ *
+ * PURPOSE: Endpoint management
+ * Description: Interfaces in this file are completely described in
+ *              the DAPL 1.1 API, Chapter 6, section 5
+ *
+ * $Id:$
+ **********************************************************************/
+
+#include "dapl.h"
+#include "dapl_cookie.h"
+#include "dapl_adapter_util.h"
+
+/*
+ * dapl_ep_post_recv_immed
+ *
+ * DAPL Requirements Version xxx, 6.5.11
+ *
+ * Request to receive data over the connection of ep handle into
+ * local_iov.
+ *
+ * Provide additional message buffer space for 32 bit immediate data.
+ *
+ * Input:
+ *     ep_handle
+ *     num_segments
+ *     local_iov
+ *     user_cookie
+ *     completion_flags
+ *
+ * Output:
+ *     None.
+ *
+ * Returns:
+ *     DAT_SUCCESS
+ *     DAT_INSUFFICIENT_RESOURCES
+ *     DAT_INVALID_PARAMETER
+ *     DAT_INVALID_STATE
+ *     DAT_PROTECTION_VIOLATION
+ *     DAT_PRIVILEGES_VIOLATION
+ */
+DAT_RETURN
+dapl_ep_post_recv_immed (
+    IN DAT_EP_HANDLE ep_handle,
+    IN DAT_COUNT num_segments,
+    IN DAT_LMR_TRIPLET *local_iov,
+    IN DAT_DTO_COOKIE user_cookie,
+    IN DAT_COMPLETION_FLAGS completion_flags )
+{
+    DAPL_EP *ep_ptr;
+    DAPL_COOKIE *cookie;
+    DAT_RETURN dat_status;
+
+    dapl_dbg_log (DAPL_DBG_TYPE_API,
+                  "dapl_ep_post_recv_immed(%p, %d, %p, %P, %x)\n",
+                  ep_handle,
+                  num_segments,
+                  local_iov,
+                  user_cookie.as_64,
+                  completion_flags);
+    DAPL_CNTR (DCNT_POST_RECV_IMMED);
+
+    if ( DAPL_BAD_HANDLE (ep_handle, DAPL_MAGIC_EP) )
+    {
+        dat_status = DAT_ERROR (DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP);
+        goto bail;
+    }
+
+    ep_ptr = (DAPL_EP *) ep_handle;
+
+    /*
+     * Synchronization ok since this buffer is only used for receive
+     * requests, which aren't allowed to race with each other.
+ */ + dat_status = dapls_dto_cookie_alloc (&ep_ptr->recv_buffer, + DAPL_DTO_TYPE_RECV_IMMED, + user_cookie, + &cookie); + if ( DAT_SUCCESS != dat_status) + { + goto bail; + } + + /* + * Take reference before posting to avoid race conditions with + * completions + */ + dapl_os_atomic_inc (&ep_ptr->recv_count); + + /* + * Invoke provider specific routine to post DTO + */ + dat_status = dapls_ib_post_recv_immed (ep_ptr, cookie, num_segments, local_iov); + + if ( dat_status != DAT_SUCCESS ) + { + dapl_os_atomic_dec (&ep_ptr->recv_count); + dapls_cookie_dealloc (&ep_ptr->recv_buffer, cookie); + } + +bail: + dapl_dbg_log (DAPL_DBG_TYPE_RTN, + "dapl_ep_post_recv_immed () returns 0x%x\n", + dat_status); + + return dat_status; +} Index: dapl/common/dapl_ep_post_rdma_write.c =================================================================== --- dapl/common/dapl_ep_post_rdma_write.c (revision 5065) +++ dapl/common/dapl_ep_post_rdma_write.c (working copy) @@ -78,7 +78,7 @@ dapl_ep_post_rdma_write ( DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n", + "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n", ep_handle, num_segments, local_iov, @@ -92,6 +92,9 @@ dapl_ep_post_rdma_write ( local_iov, user_cookie, remote_iov, +#if DAT_IMMEDIATE_DATA + 0, 0, +#endif completion_flags, DAPL_DTO_TYPE_RDMA_WRITE, OP_RDMA_WRITE); Index: dapl/common/dapl_ep_post_rdma_write_immed.c =================================================================== --- dapl/common/dapl_ep_post_rdma_write_immed.c (revision 0) +++ dapl/common/dapl_ep_post_rdma_write_immed.c (revision 0) @@ -0,0 +1,113 @@ +/* + * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. 
+ * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/********************************************************************** + * + * MODULE: dapl_ep_post_rdma_write_immed.c + * + * PURPOSE: Endpoint management + * Description: Interfaces in this file are completely described in + * the DAPL 1.1 API, Chapter 6, section 5 + * + * $Id:$ + **********************************************************************/ + +#include "dapl_ep_util.h" + +/* + * dapl_ep_post_rdma_write_immed + * + * DAPL Requirements Version xxx, 6.5.13 + * + * Request the xfer of all data specified by the local_iov over the + * connection of ep handle Endpoint into the remote_iov + * + * Input: + * ep_handle + * num_segments + * local_iov + * user_cookie + * remote_iov + * immed_data + * dto_flags + * completion_flags + * + * Output: + * None. 
+ * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * DAT_INVALID_STATE + * DAT_LENGTH_ERROR + * DAT_PROTECTION_VIOLATION + * DAT_PRIVILEGES_VIOLATION + */ +DAT_RETURN +dapl_ep_post_rdma_write_immed ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT num_segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, + IN DAT_COMPLETION_FLAGS completion_flags ) +{ + DAT_RETURN dat_status; + + dapl_dbg_log (DAPL_DBG_TYPE_API, + "dapl_ep_post_rdma_write_immed (%p, %d, %p, %llx, %p, %x, %x, %x)\n", + ep_handle, + num_segments, + local_iov, + user_cookie.as_64, + remote_iov, + immed_data, + dto_flags, + completion_flags); + DAPL_CNTR(DCNT_POST_RDMA_WRITE_IMMED); + + dat_status = dapl_ep_post_send_req(ep_handle, + num_segments, + local_iov, + user_cookie, + remote_iov, + immed_data, + dto_flags, + completion_flags, + DAPL_DTO_TYPE_RDMA_WRITE_IMMED, + OP_RDMA_WRITE_IMMED); + + dapl_dbg_log (DAPL_DBG_TYPE_RTN, + "dapl_ep_post_rdma_write_immed () returns 0x%x", + dat_status); + + + return dat_status; +} Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 5065) +++ dapl/common/dapl_ia_query.c (working copy) @@ -167,6 +167,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Have provider set their own. + */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_set_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. 
Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 5065) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAT_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ @@ -272,6 +287,8 @@ dapls_ib_wait_object_wait ( #include "dapl_dummy_dto.h" #elif OPENIB #include "dapl_ib_dto.h" +#elif DET +#include "dapl_det_dto.h" #endif Index: dapl/common/dapl_ep_post_send.c =================================================================== --- dapl/common/dapl_ep_post_send.c (revision 5065) +++ dapl/common/dapl_ep_post_send.c (working copy) @@ -75,7 +75,7 @@ dapl_ep_post_send ( DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_ep_post_send (%p, %d, %p, %p, %x)\n", + "dapl_ep_post_send (%p, %d, %p, %P, %x)\n", ep_handle, num_segments, local_iov, @@ -88,6 +88,9 @@ dapl_ep_post_send ( local_iov, user_cookie, &remote_iov, +#if DAT_IMMEDIATE_DATA + 0, 0, +#endif completion_flags, DAPL_DTO_TYPE_SEND, OP_SEND); Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 5065) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,19 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAT_IMMEDIATE_DATA + /* dat-2.0 */ + &dapl_ep_post_send_immed, + &dapl_ep_post_recv_immed, + &dapl_ep_post_rdma_write_immed, +#endif + +#ifdef DAT_EXTENSIONS + /* 
dat-2.0 */ + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_ep_post_send_immed.c =================================================================== --- dapl/common/dapl_ep_post_send_immed.c (revision 0) +++ dapl/common/dapl_ep_post_send_immed.c (revision 0) @@ -0,0 +1,109 @@ +/* + * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/********************************************************************** + * + * MODULE: dapl_ep_post_send_immed.c + * + * PURPOSE: Endpoint management + * Description: Interfaces in this file are completely described in + * the DAPL 1.1 API, Chapter 6, section 5 + * + * $Id:$ + **********************************************************************/ + +#include "dapl_ep_util.h" + +/* + * dapl_ep_post_send_immed + * + * DAPL Requirements Version xxx, 6.5.10 + * + * Request a transfer of all the data from the local_iov over + * the connection of the ep handle Endpoint to the remote side. 
+ * + * Input: + * ep_handle + * num_segments + * local_iov + * user_cookie + * immed_data + * dto_flags + * completion_flags + * + * Output: + * None + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * DAT_INVALID_STATE + * DAT_PROTECTION_VIOLATION + * DAT_PRIVILEGES_VIOLATION + */ +DAT_RETURN +dapl_ep_post_send_immed ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT num_segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, + IN DAT_COMPLETION_FLAGS completion_flags ) +{ + DAT_RMR_TRIPLET remote_iov = {0,0,0,0}; + DAT_RETURN dat_status; + + dapl_dbg_log (DAPL_DBG_TYPE_API, + "dapl_ep_post_send_immed (%p, %d, %p, %llx, %x, %x, %x)\n", + ep_handle, + num_segments, + local_iov, + user_cookie.as_64, + immed_data, + dto_flags, + completion_flags); + DAPL_CNTR(DCNT_POST_SEND_IMMED); + + dat_status = dapl_ep_post_send_req(ep_handle, + num_segments, + local_iov, + user_cookie, + &remote_iov, + immed_data, + dto_flags, + completion_flags, + DAPL_DTO_TYPE_SEND_IMMED, + OP_SEND_IMMED); + + dapl_dbg_log (DAPL_DBG_TYPE_RTN, + "dapl_ep_post_send_immed () returns 0x%x\n", + dat_status); + + + return dat_status; +} Index: dapl/common/dapl_ep_util.c =================================================================== --- dapl/common/dapl_ep_util.c (revision 5065) +++ dapl/common/dapl_ep_util.c (working copy) @@ -367,9 +367,13 @@ dapl_ep_post_send_req ( IN DAT_LMR_TRIPLET *local_iov, IN DAT_DTO_COOKIE user_cookie, IN const DAT_RMR_TRIPLET *remote_iov, +#ifdef DAT_IMMEDIATE_DATA + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, +#endif IN DAT_COMPLETION_FLAGS completion_flags, IN DAPL_DTO_TYPE dto_type, - IN int op_type) + IN int op_type ) { DAPL_EP *ep_ptr; DAPL_COOKIE *cookie; @@ -412,6 +416,10 @@ dapl_ep_post_send_req ( num_segments, local_iov, remote_iov, +#ifdef DAT_IMMEDIATE_DATA + immed_data, + dto_flags, +#endif completion_flags ); if ( dat_status != 
DAT_SUCCESS ) Index: dapl/common/dapl_ep_util.h =================================================================== --- dapl/common/dapl_ep_util.h (revision 5065) +++ dapl/common/dapl_ep_util.h (working copy) @@ -67,6 +67,10 @@ dapl_ep_post_send_req ( IN DAT_LMR_TRIPLET *local_iov, IN DAT_DTO_COOKIE user_cookie, IN const DAT_RMR_TRIPLET *remote_iov, +#ifdef DAT_IMMEDIATE_DATA + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, +#endif IN DAT_COMPLETION_FLAGS completion_flags, IN DAPL_DTO_TYPE dto_type, IN int op_type ); Index: dapl/common/dapl_ep_post_rdma_read.c =================================================================== --- dapl/common/dapl_ep_post_rdma_read.c (revision 5065) +++ dapl/common/dapl_ep_post_rdma_read.c (working copy) @@ -93,6 +93,9 @@ dapl_ep_post_rdma_read ( local_iov, user_cookie, remote_iov, +#if DAT_IMMEDIATE_DATA + 0, 0, +#endif completion_flags, DAPL_DTO_TYPE_RDMA_READ, OP_RDMA_READ); Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 5065) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,21 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMMED", + "OP_SEND", + "OP_SEND_IMMED", + "OP_RDMA_READ", + "OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMMED", + "OP_RECEIVE_RDMA_IMMED", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +524,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1030,7 +1046,20 @@ dapli_evd_cqe_to_event ( { DAPL_COOKIE_BUFFER *buffer; +#ifdef DAT_EXTENSIONS + if ( DAPL_DTO_TYPE_EXTENSION == cookie->val.dto.type ) + { + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); + break; + } +#endif + +#if DAT_IMMEDIATE_DATA + if ( DAPL_DTO_TYPE_RECV == cookie->val.dto.type || 
+ DAPL_DTO_TYPE_RECV_IMMED == cookie->val.dto.type ) +#else if ( DAPL_DTO_TYPE_RECV == cookie->val.dto.type ) +#endif { dapl_os_atomic_dec (&ep_ptr->recv_count); buffer = &ep_ptr->recv_buffer; @@ -1048,6 +1077,16 @@ dapli_evd_cqe_to_event ( cookie->val.dto.cookie; event_ptr->event_data.dto_completion_event_data.status = dto_status; +#ifdef DAT_IMMEDIATE_DATA + event_ptr->event_data.dto_completion_event_data.operation = + DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr); + + if ( DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr) == DAT_RECEIVE_IMMED || + DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr) == DAT_RECEIVE_RDMA_IMMED) + event_ptr->event_data.dto_completion_event_data.immed_data = + DAPL_GET_CQE_IMMED_DATA(cqe_ptr); +#endif + #ifdef DAPL_DBG if (dto_status == DAT_DTO_SUCCESS) { @@ -1055,18 +1094,42 @@ dapli_evd_cqe_to_event ( ibtype = DAPL_GET_CQE_OPTYPE (cqe_ptr); - dapl_os_assert ((ibtype == OP_SEND && + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " dapli_evd_cqe_to_event: OP type ib=%d, cookie=%d)\n", + ibtype, cookie->val.dto.type); +#ifdef DAT_EXTENSIONS + if (cookie->val.dto.type != DAPL_DTO_TYPE_EXTENSION) +#endif + dapl_os_assert ((ibtype == OP_SEND && cookie->val.dto.type == DAPL_DTO_TYPE_SEND) - || (ibtype == OP_RECEIVE && - cookie->val.dto.type == DAPL_DTO_TYPE_RECV) - || (ibtype == OP_RDMA_WRITE && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE) - || (ibtype == OP_RDMA_READ && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_READ)); +#ifdef DAT_IMMEDIATE_DATA + || (ibtype == OP_RECEIVE && + (cookie->val.dto.type == DAPL_DTO_TYPE_RECV || + cookie->val.dto.type == DAPL_DTO_TYPE_RECV_IMMED)) + || (ibtype == OP_RECEIVE_IMMED && + cookie->val.dto.type == DAPL_DTO_TYPE_RECV_IMMED) + || (ibtype == OP_RECEIVE_RDMA_IMMED && + cookie->val.dto.type == DAPL_DTO_TYPE_RECV_IMMED) + || (ibtype == OP_SEND_IMMED && + cookie->val.dto.type == DAPL_DTO_TYPE_SEND_IMMED) + || (ibtype == OP_RDMA_WRITE_IMMED && + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE_IMMED) +#else + || (ibtype == OP_RECEIVE && + 
cookie->val.dto.type == DAPL_DTO_TYPE_RECV) +#endif + || (ibtype == OP_RDMA_WRITE && + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE) + || (ibtype == OP_RDMA_READ && + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_READ)); } #endif /* DAPL_DBG */ if ( cookie->val.dto.type == DAPL_DTO_TYPE_SEND || +#ifdef DAT_IMMEDIATE_DATA + cookie->val.dto.type == DAPL_DTO_TYPE_SEND_IMMED || + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE_IMMED || +#endif cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE ) { /* Get size from DTO; CQE value may be off. */ @@ -1113,6 +1176,7 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/common/dapl_debug.c =================================================================== --- dapl/common/dapl_debug.c (revision 5065) +++ dapl/common/dapl_debug.c (working copy) @@ -86,6 +86,11 @@ char *dapl_dbg_counter_names[] = { "dapl_evd_not_found", "dapls_timer_set", "dapls_timer_cancel", +#ifdef DAT_IMMEDIATE_DATA + "dapls_ep_post_send_immed", + "dapls_ep_post_recv_immed", + "dapls_ep_post_rdma_write_immed", +#endif }; void dapl_dump_cntr( int cntr ) Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 5065) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - DTO operations and CQE macros + * The OpenIB uCMA provider - DTO operations and CQE macros * **************************************************************************** * Source Control System Information @@ -119,7 +119,6 @@ dapls_ib_post_recv ( return DAT_SUCCESS; } - /* * dapls_ib_post_send * @@ -133,6 +132,10 @@ dapls_ib_post_send ( IN DAT_COUNT segments, IN DAT_LMR_TRIPLET *local_iov, IN const DAT_RMR_TRIPLET *remote_iov, +#ifdef DAT_IMMEDIATE_DATA + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, +#endif IN 
DAT_COMPLETION_FLAGS completion_flags) { dapl_dbg_log(DAPL_DBG_TYPE_EP, @@ -191,8 +194,12 @@ dapls_ib_post_send ( if (cookie != NULL) cookie->val.dto.size = total_len; - - if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { + + if ((op_type == OP_RDMA_WRITE) || +#ifdef DAT_IMMEDIATE_DATA + (op_type == OP_RDMA_WRITE_IMMED) || +#endif + (op_type == OP_RDMA_READ)) { wr.wr.rdma.remote_addr = remote_iov->target_address; wr.wr.rdma.rkey = remote_iov->rmr_context; dapl_dbg_log(DAPL_DBG_TYPE_EP, @@ -200,6 +207,14 @@ dapls_ib_post_send ( wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); } +#ifdef DAT_IMMEDIATE_DATA + if ((op_type == OP_SEND_IMMED) || (op_type == OP_RDMA_WRITE_IMMED)) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: IMMED=0x%x\n", immed_data); + wr.imm_data = immed_data; + } +#endif + /* inline data for send or write ops */ if ((total_len <= ibt_ptr->max_inline_send) && ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) @@ -224,6 +239,182 @@ dapls_ib_post_send ( return DAT_SUCCESS; } +#ifdef DAT_IMMEDIATE_DATA + +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_recv_immed ( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov ) +{ + /* Nothing more to do, IB already provides space in descriptor */ + return (dapls_ib_post_recv( ep_ptr, cookie, segments, local_iov)); +} + +/* map Work Completions to DAPL WR operations */ +STATIC _INLINE_ DAT_DTOS dapls_cqe_dtos_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + case IBV_WC_SEND: + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (DAT_SEND_IMMED); + else + return (DAT_SEND); + case IBV_WC_RDMA_WRITE: + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (DAT_RDMA_WRITE_IMMED); + else + return (DAT_RDMA_WRITE); + case IBV_WC_RDMA_READ: + return (DAT_RDMA_READ); + case IBV_WC_BIND_MW: + return (DAT_BIND_MW); + case IBV_WC_RECV: + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (DAT_RECEIVE_IMMED); + else + return (DAT_RECEIVE); +#ifdef DAT_EXTENSIONS 
+ case IBV_WC_COMP_SWAP: + return (DAT_EXTENSION); + case IBV_WC_FETCH_ADD: + return (DAT_EXTENSION); +#endif + case IBV_WC_RECV_RDMA_WITH_IMM: + return (DAT_RECEIVE_RDMA_IMMED); + default: + return (DAT_INVALID); + } +} +#define DAPL_GET_CQE_DTOS_OPTYPE(cqe_p) dapls_cqe_dtos_opcode(cqe_p) + +#endif + +#ifdef DAT_EXTENSIONS +/* + * dapls_ib_post_ext_send + * + * Provider specific extended Post SEND function for atomics + * OP_COMP_AND_SWAP and OP_FETCH_AND_ADD + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_ext_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT64 compare_add, + IN DAT_UINT64 swap, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs " + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p 
len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + switch (op_type) { + case OP_COMP_AND_SWAP: + /* OP_COMP_AND_SWAP has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_COMP_AND_SWAP=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, swap, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.swap = swap; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + case OP_FETCH_AND_ADD: + /* OP_FETCH_AND_ADD has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_FETCH_AND_ADD=%lx," + " rkey 0x%x va %#016Lx\n", + compare_add, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + default: + break; + } + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? 
IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} +#endif + STATIC _INLINE_ DAT_RETURN dapls_ib_optional_prv_dat( IN DAPL_CR *cr_ptr, @@ -233,13 +424,17 @@ dapls_ib_optional_prv_dat( return DAT_SUCCESS; } +/* map Work Completions to DAPL WR operations */ STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) { switch (cqe_p->opcode) { case IBV_WC_SEND: return (OP_SEND); case IBV_WC_RDMA_WRITE: - return (OP_RDMA_WRITE); + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (OP_RDMA_WRITE_IMMED); + else + return (OP_RDMA_WRITE); case IBV_WC_RDMA_READ: return (OP_RDMA_READ); case IBV_WC_COMP_SWAP: @@ -249,14 +444,18 @@ STATIC _INLINE_ int dapls_cqe_opcode(ib_ case IBV_WC_BIND_MW: return (OP_BIND_MW); case IBV_WC_RECV: - return (OP_RECEIVE); + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (OP_RECEIVE_IMMED); + else + return (OP_RECEIVE); case IBV_WC_RECV_RDMA_WITH_IMM: - return (OP_RECEIVE_IMM); + return (OP_RECEIVE_RDMA_IMMED); default: return (OP_INVALID); } } + #define DAPL_GET_CQE_OPTYPE(cqe_p) dapls_cqe_opcode(cqe_p) #define DAPL_GET_CQE_WRID(cqe_p) ((ib_work_completion_t*)cqe_p)->wr_id #define DAPL_GET_CQE_STATUS(cqe_p) ((ib_work_completion_t*)cqe_p)->status Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 5065) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - init, open, close, utilities, work thread + * The OpenIB uCMA provider - init, open, close, utilities, work thread * **************************************************************************** * Source Control System 
Information @@ -64,7 +64,6 @@ static const char rcsid[] = "$Id: $"; #include /* for struct ifreq */ #include /* for ARPHRD_INFINIBAND */ - int g_dapl_loopback_connection = 0; int g_ib_pipe[2]; ib_thread_state_t g_ib_thread_state = 0; @@ -342,6 +341,7 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ + write(g_ib_pipe[1], "w", sizeof "w"); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, &remain); @@ -727,7 +727,7 @@ void dapli_thread(void *arg) int ret,idx,fds; char rbuf[2]; - dapl_dbg_log (DAPL_DBG_TYPE_CM, + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n", getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd()); @@ -767,7 +767,7 @@ void dapli_thread(void *arg) ufds[idx].revents = 0; uhca[idx] = hca; - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d" " pipe=%d cm=%d cq=d\n", getpid(), hca, ufds[idx-1].fd, @@ -783,14 +783,14 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); ret = poll(ufds, fds, -1); if (ret <= 0) { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d): ERR %s poll\n", getpid(),strerror(errno)); dapl_os_lock(&g_hca_lock); continue; } - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_event: " " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n", getpid(), ufds[idx-1].revents, ufds[0].revents, @@ -834,3 +834,54 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +/* + * dapls_set_provider_specific_attr + * + * Input: + * attr_ptr Pointer provider attributes + * + * Output: + * none + * + * Returns: + * void + */ +DAT_NAMED_ATTR ib_attrs[] = { + +#ifdef DAT_EXTENSIONS + { + DAT_EXTENSION_ATTR, + DAT_EXTENSION_ATTR_TRUE + }, + { + DAT_EXTENSION_ATTR_VERSION, + DAT_EXTENSION_ATTR_VERSION_VALUE + }, + { + 
DAT_EXTENSION_ATTR_FETCH_AND_ADD, + DAT_EXTENSION_ATTR_TRUE + }, + { + DAT_EXTENSION_ATTR_CMP_AND_SWAP, + DAT_EXTENSION_ATTR_TRUE + }, +#else + { + "DAT_EXTENSION_INTERFACE", + "FALSE" + }, +#endif +}; + +#define SPEC_ATTR_SIZE(x) ( sizeof(x)/sizeof(DAT_NAMED_ATTR) ) + +void dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 5065) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_mem.c + * MODULE: dapl_ib_mem.c * - * PURPOSE: Intel DET APIs: Memory windows, registration, + * PURPOSE: OpenIB uCMA provider Memory windows, registration, * and protection domain * * $Id: $ @@ -65,6 +65,10 @@ dapls_convert_privileges(IN DAT_MEM_PRIV { int access = 0; + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " dapls_convert_privileges: 0x%x\n", + privileges ); + /* * if (DAT_MEM_PRIV_LOCAL_READ_FLAG & privileges) do nothing */ @@ -72,12 +76,13 @@ dapls_convert_privileges(IN DAT_MEM_PRIV access |= IBV_ACCESS_LOCAL_WRITE; if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) access |= IBV_ACCESS_REMOTE_WRITE; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) access |= IBV_ACCESS_REMOTE_READ; + +#ifdef DAT_EXTENSIONS + if (DAT_MEM_PRIV_EXT_REMOTE_ATOMIC & privileges) + access |= IBV_ACCESS_REMOTE_ATOMIC; +#endif return access; } Index: dapl/openib_cma/dapl_ib_extensions.c =================================================================== --- dapl/openib_cma/dapl_ib_extensions.c 
(revision 0) +++ dapl/openib_cma/dapl_ib_extensions.c (revision 0) @@ -0,0 +1,374 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ +/********************************************************************** + * + * MODULE: dapl_ib_extensions.c + * + * PURPOSE: Extensions routines for OpenIB uCMA provider + * + * $Id: $ + **********************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_ib_util.h" +#include "dapl_ep_util.h" +#include "dapl_cookie.h" +#include + +DAT_RETURN +dapli_post_cmp_and_swap( + IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 cmp_val, + IN DAT_UINT64 swap_val, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS flags ); + +DAT_RETURN +dapli_post_fetch_and_add( + IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 add_val, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS flags ); + + +/* + * dapl_extensions + * + * Process extension requests + * + * Input: + * ext_type, + * ... + * + * Output: + * Depends.... + * + * Returns: + * DAT_SUCCESS + * DAT_NOT_IMPLEMENTED + * ..... 
+ * + */ + +DAT_RETURN +dapl_extensions(IN DAT_HANDLE dat_handle, + IN DAT_DTO_EXTENSION_OP ext_op, + IN va_list args) +{ + DAT_EP_HANDLE ep; + DAT_LMR_TRIPLET *lmr_p; + DAT_DTO_COOKIE cookie; + const DAT_RMR_TRIPLET *rmr_p; + DAT_UINT64 dat_uint64a, dat_uint64b; + DAT_COMPLETION_FLAGS comp_flags; + + DAT_RETURN status = DAT_NOT_IMPLEMENTED; + + dapl_dbg_log(DAPL_DBG_TYPE_API, + "dapl_extensions(hdl %p operation %d, ...)\n", + dat_handle, ext_op); + + switch ((int)ext_op) + { + + case DAT_DTO_EXTENSION_CMP_AND_SWAP: + dapl_dbg_log(DAPL_DBG_TYPE_RTN, + " CMP_AND_SWAP extension call\n"); + + ep = dat_handle; /* ep_handle */ + dat_uint64a = va_arg( args, DAT_UINT64); /* cmp_value */ + dat_uint64b = va_arg( args, DAT_UINT64); /* swap_value */ + lmr_p = va_arg( args, DAT_LMR_TRIPLET*); + cookie = va_arg( args, DAT_DTO_COOKIE); + rmr_p = va_arg( args, const DAT_RMR_TRIPLET*); + comp_flags = va_arg( args, DAT_COMPLETION_FLAGS); + + status = dapli_post_cmp_and_swap(ep, + dat_uint64a, + dat_uint64b, + lmr_p, + cookie, + rmr_p, + comp_flags ); + break; + + case DAT_DTO_EXTENSION_FETCH_AND_ADD: + dapl_dbg_log(DAPL_DBG_TYPE_RTN, + " FETCH_AND_ADD extension call\n"); + + ep = dat_handle; /* ep_handle */ + dat_uint64a = va_arg( args, DAT_UINT64); /* add value */ + lmr_p = va_arg( args, DAT_LMR_TRIPLET*); + cookie = va_arg( args, DAT_DTO_COOKIE); + rmr_p = va_arg( args, const DAT_RMR_TRIPLET*); + comp_flags = va_arg( args, DAT_COMPLETION_FLAGS); + + status = dapli_post_fetch_and_add(ep, + dat_uint64a, + lmr_p, + cookie, + rmr_p, + comp_flags ); + break; + + default: + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "unsupported extension(%d)\n", (int)ext_op); + } + + return(status); +} + + +DAT_RETURN +dapli_post_cmp_and_swap(IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 cmp_val, + IN DAT_UINT64 swap_val, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS flags ) +{ + DAPL_EP *ep_ptr; + ib_qp_handle_t qp_ptr; + 
DAPL_COOKIE *cookie = NULL; /* stays NULL when completion is suppressed */ + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log(DAPL_DBG_TYPE_API, + " post_cmp_and_swap: ep %p cmp_val %d " + "swap_val %d cookie 0x%x, r_iov %p, flags 0x%x\n", + ep_handle, (unsigned)cmp_val, (unsigned)swap_val, + (unsigned)user_cookie.as_64, remote_iov, flags); + + if (DAPL_BAD_HANDLE(ep_handle, DAPL_MAGIC_EP)) + return(DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP)); + + if ((NULL == remote_iov) || (NULL == local_iov)) + return DAT_INVALID_PARAMETER; + + ep_ptr = (DAPL_EP *) ep_handle; + qp_ptr = ep_ptr->qp_handle; + + /* + * Synchronization ok since this buffer is only used for send + * requests, which aren't allowed to race with each other. + * only if completion is expected + */ + if (!(DAT_COMPLETION_SUPPRESS_FLAG & flags)) { + + dat_status = dapls_dto_cookie_alloc( + &ep_ptr->req_buffer, + DAPL_DTO_TYPE_EXTENSION, + user_cookie, + &cookie ); + + if ( dat_status != DAT_SUCCESS ) + goto bail; + + /* + * Take reference before posting to avoid race conditions with + * completions + */ + dapl_os_atomic_inc(&ep_ptr->req_count); + } + + /* + * Invoke provider specific routine to post DTO + */ + dat_status = dapls_ib_post_ext_send(ep_ptr, + OP_COMP_AND_SWAP, + cookie, + 1, + local_iov, + remote_iov, + cmp_val, /* compare or add */ + swap_val, /* swap */ + flags); + + if (dat_status != DAT_SUCCESS) { + if ( cookie != NULL ) { + dapl_os_atomic_dec(&ep_ptr->req_count); + dapls_cookie_dealloc(&ep_ptr->req_buffer, cookie); + } + } + +bail: + return dat_status; + +} + +DAT_RETURN +dapli_post_fetch_and_add(IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 add_val, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS flags ) +{ + DAPL_EP *ep_ptr; + DAPL_COOKIE *cookie = NULL; /* stays NULL when completion is suppressed */ + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log (DAPL_DBG_TYPE_API, + " post_fetch_and_add: ep %p add_val %d local_iov " + "%p cookie 0x%x, r_iov %p, flags 0x%x\n", + ep_handle, 
(unsigned)add_val, local_iov, + (unsigned)user_cookie.as_64, remote_iov, flags); + + if (DAPL_BAD_HANDLE(ep_handle, DAPL_MAGIC_EP)) + return(DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP)); + + if ((NULL == remote_iov) || (NULL == local_iov)) + return DAT_INVALID_PARAMETER; + + ep_ptr = (DAPL_EP *) ep_handle; + + /* + * Synchronization ok since this buffer is only used for send + * requests, which aren't allowed to race with each other. + * only if completion is expected + */ + if (!(DAT_COMPLETION_SUPPRESS_FLAG & flags)) { + + dat_status = dapls_dto_cookie_alloc( + &ep_ptr->req_buffer, + DAPL_DTO_TYPE_EXTENSION, + user_cookie, + &cookie); + + if (dat_status != DAT_SUCCESS) + goto bail; + + /* + * Take reference before posting to avoid race conditions with + * completions + */ + dapl_os_atomic_inc(&ep_ptr->req_count); + } + + /* + * Invoke provider specific routine to post DTO + */ + dat_status = dapls_ib_post_ext_send(ep_ptr, + OP_FETCH_AND_ADD, + cookie, + 1, + local_iov, + remote_iov, + add_val, /* compare or add */ + 0, /* swap */ + flags); + + if (dat_status != DAT_SUCCESS) { + if (cookie != NULL ) { + dapl_os_atomic_dec (&ep_ptr->req_count); + dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); + } + } + +bail: + return dat_status; +} + +/* + * New provider routine to process extended DTO events + */ +void +dapls_cqe_to_event_extension(IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + IN DAT_EVENT *event_ptr) +{ + uint32_t ibtype; + DAPL_COOKIE_BUFFER *buffer; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event_ptr->event_data.dto_completion_event_data; + + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: event_ptr %p dto_event %p\n", + event_ptr, dto_event); + + + if ( DAPL_DTO_TYPE_RECV == cookie->val.dto.type || + DAPL_DTO_TYPE_RECV_IMMED == cookie->val.dto.type ) { + dapl_os_atomic_dec (&ep_ptr->recv_count); + buffer = &ep_ptr->recv_buffer; + } + else { + dapl_os_atomic_dec (&ep_ptr->req_count); + buffer = 
&ep_ptr->req_buffer; + } + + /* update DTO event data and then the extension */ + event_ptr->event_number = DAT_DTO_COMPLETION_EVENT; + dto_event->operation = DAT_EXTENSION; + dto_event->ep_handle = cookie->ep; + dto_event->user_cookie = cookie->val.dto.cookie; + dto_event->status = dapls_ib_get_dto_status(cqe_ptr); + + if (dto_event->status != DAT_DTO_SUCCESS ) + return; + + /* get operation type from CQ work completion entry */ + ibtype = DAPL_GET_CQE_OPTYPE(cqe_ptr); + + switch (ibtype) { + case OP_COMP_AND_SWAP: + dapl_dbg_log (DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: COMP_AND_SWAP_RESP\n"); + /* original data is returned in LMR provided with post */ + dto_event->extension.type = DAT_DTO_EXTENSION_CMP_AND_SWAP; + break; + + case OP_FETCH_AND_ADD: + dapl_dbg_log (DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: FETCH_AND_ADD_RESP\n"); + /* original data is returned in LMR provided with post */ + dto_event->extension.type = DAT_DTO_EXTENSION_FETCH_AND_ADD; + break; + + default: + dapl_dbg_log(DAPL_DBG_TYPE_DTO_COMP_ERR, + "Extension completion ERROR: unknown op = 0x%x\n", + ibtype); + } + + if (cookie->val.dto.type == DAPL_DTO_TYPE_SEND || + cookie->val.dto.type == DAPL_DTO_TYPE_SEND_IMMED || + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE_IMMED || + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE ) + /* Get size from DTO; CQE value may be off. 
*/ + dto_event->transfered_length = cookie->val.dto.size; + else + dto_event->transfered_length = DAPL_GET_CQE_BYTESNUM(cqe_ptr); + + dapls_cookie_dealloc(buffer, cookie); +} Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 5065) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_qp.c + * MODULE: dapl_ib_qp.c * - * PURPOSE: QP routines for access to DET Verbs + * PURPOSE: OpenIB uCMA QP routines * * $Id: $ **********************************************************************/ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 5065) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - definitions, prototypes, + * The OpenIB uCMA provider - definitions, prototypes, * **************************************************************************** * Source Control System Information @@ -123,15 +123,16 @@ typedef struct ibv_comp_channel *ib_wait /* DTO OPs, ordered for DAPL ENUM definitions */ #define OP_RDMA_WRITE IBV_WR_RDMA_WRITE -#define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define OP_RDMA_WRITE_IMMED IBV_WR_RDMA_WRITE_WITH_IMM #define OP_SEND IBV_WR_SEND -#define OP_SEND_IMM IBV_WR_SEND_WITH_IMM +#define OP_SEND_IMMED IBV_WR_SEND_WITH_IMM #define OP_RDMA_READ IBV_WR_RDMA_READ #define OP_COMP_AND_SWAP IBV_WR_ATOMIC_CMP_AND_SWP #define OP_FETCH_AND_ADD IBV_WR_ATOMIC_FETCH_AND_ADD -#define OP_RECEIVE 7 /* internal op */ -#define OP_RECEIVE_IMM 8 /* internel op */ -#define OP_BIND_MW 9 /* internal op */ +#define OP_RECEIVE 0x7 /* internal op */ +#define OP_RECEIVE_IMMED 0x8 /* internal op */ +#define OP_RECEIVE_RDMA_IMMED 0x9 /* internal op */ +#define OP_BIND_MW 0xa /* internal op */ #define
OP_INVALID 0xff /* Definitions to map QP state */ @@ -295,7 +296,8 @@ dapl_convert_errno( IN int err, IN const if (!err) return DAT_SUCCESS; #if DAPL_DBG - if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + if ((err != EAGAIN) && (err != ETIME) && + (err != ETIMEDOUT) && (err != EINTR)) dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); #endif Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 5065) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - completion queue + * The OpenIB uCMA provider - completion queue * **************************************************************************** * Source Control System Information @@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; - status = poll(&cq_fd, 1, timeout_ms); + /* restart syscall */ + while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 ) + if (errno == EINTR) + continue; /* returned event */ if (status > 0) { @@ -511,6 +514,8 @@ dapls_ib_wait_object_wait(IN ib_wait_obj /* timeout */ } else if (status == 0) status = ETIMEDOUT; + else + status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", Index: dat/include/dat/udat.h =================================================================== --- dat/include/dat/udat.h (revision 5065) +++ dat/include/dat/udat.h (working copy) @@ -1,31 +1,51 @@ /* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2004, Network Appliance, Inc. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. 
The license is also - * available from the Open Source Initiative, see + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0". The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. The license is also available from - * the Open Source Initiative, see + * + * OR + * + * 2) under the terms of the "The BSD License". The license is also available + * from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The - * license is also available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation + * notice, either one of the license notices in the documentation * and/or other materials provided with the distribution. + * + * Neither the name of Network Appliance, Inc. 
nor the names of other DAT + * Collaborative contributors may be used to endorse or promote + * products derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND + * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. */ /**************************************************************** Index: dat/include/dat/dat_redirection.h =================================================================== --- dat/include/dat/dat_redirection.h (revision 5065) +++ dat/include/dat/dat_redirection.h (working copy) @@ -59,10 +59,10 @@ typedef struct dat_provider DAT_PROVIDER * This would allow a good compiler to avoid indirection overhead when * making function calls. 
*/ - #define DAT_HANDLE_TO_PROVIDER(handle) (*(DAT_PROVIDER **)(handle)) #endif + #define DAT_IA_QUERY(ia, evd, ia_msk, ia_ptr, p_msk, p_ptr) \ (*DAT_HANDLE_TO_PROVIDER (ia)->ia_query_func) (\ (ia), \ @@ -395,6 +395,45 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#ifdef DAT_IMMEDIATE_DATA +#define DAT_EP_POST_SEND_IMMED(ep, size, lbuf, cookie, immed, dflags, flags) \ + (*DAT_HANDLE_TO_PROVIDER (ep)->ep_post_send_immed_func) (\ + (ep), \ + (size), \ + (lbuf), \ + (cookie), \ + (immed), \ + (dflags), \ + (flags)) + +#define DAT_EP_POST_RECV_IMMED(ep, size, lbuf, cookie, flags) \ + (*DAT_HANDLE_TO_PROVIDER (ep)->ep_post_recv_immed_func) (\ + (ep), \ + (size), \ + (lbuf), \ + (cookie), \ + (flags)) + +#define DAT_EP_POST_RDMA_WRITE_IMMED(ep, size, lbuf, cookie, rbuf, immed, dflags, flags) \ + (*DAT_HANDLE_TO_PROVIDER (ep)->ep_post_rdma_write_immed_func) (\ + (ep), \ + (size), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (immed), \ + (dflags), \ + (flags)) +#endif + +#ifdef DAT_EXTENSIONS +#define DAT_EXTENSION(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->extension_func) (\ + (handle), \ + (op), \ + (args)) +#endif + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +759,41 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +#ifdef DAT_IMMEDIATE_DATA +typedef DAT_RETURN (*DAT_EP_POST_SEND_IMMED_FUNC) ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ + +typedef DAT_RETURN (*DAT_EP_POST_RECV_IMMED_FUNC) ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ 
+ +typedef DAT_RETURN (*DAT_EP_POST_RDMA_WRITE_IMMED_FUNC) ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN const DAT_RMR_TRIPLET *,/* remote_iov */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ +#endif + +#ifdef DAT_EXTENSIONS +#include +typedef DAT_RETURN (*DAT_EXTENSION_FUNC) ( + IN DAT_HANDLE, /* dat handle */ + IN DAT_DTO_EXTENSION_OP, /* extension operation */ + IN va_list ); /* va_list */ +#endif + + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 5065) +++ dat/include/dat/dat.h (working copy) @@ -119,6 +119,27 @@ typedef DAT_HANDLE DAT_RMR_HANDLE; typedef DAT_HANDLE DAT_RSP_HANDLE; typedef DAT_HANDLE DAT_SRQ_HANDLE; +/* PROTOTYPE: immediate data and extensions */ +#ifdef DAT_IMMEDIATE_DATA +typedef enum dat_dtos +{ + DAT_SEND, + DAT_SEND_IMMED, + DAT_RDMA_WRITE, + DAT_RDMA_WRITE_IMMED, + DAT_RDMA_READ, + DAT_RECEIVE, + DAT_RECEIVE_WITH_INVALIDATE, + DAT_RECEIVE_IMMED, + DAT_RECEIVE_RDMA_IMMED, + DAT_BIND_MW, +#ifdef DAT_EXTENSIONS + DAT_EXTENSION, +#endif + DAT_INVALID +} DAT_DTOS; +#endif + /* dat NULL handles */ #define DAT_HANDLE_NULL ((DAT_HANDLE)NULL) @@ -176,6 +197,15 @@ typedef enum dat_psp_flags DAT_PSP_PROVIDER_FLAG = 0x01 /* Provider creates an Endpoint */ } DAT_PSP_FLAGS; +#ifdef DAT_IMMEDIATE_DATA +typedef enum dat_dto_flags +{ + DAT_DTO_IMMED_FLAG = 0x1, + DAT_DTO_IMMED_CONFIRM_FLAG = 0x2 + +} DAT_DTO_FLAGS; +#endif + /* * Memory Buffers * @@ -259,7 +289,6 @@ typedef struct dat_rmr_triplet */ /* Memory privileges */ - typedef enum dat_mem_priv_flags { DAT_MEM_PRIV_NONE_FLAG = 0x00, @@ -267,7 +296,11 @@ typedef enum dat_mem_priv_flags DAT_MEM_PRIV_REMOTE_READ_FLAG = 0x02, DAT_MEM_PRIV_LOCAL_WRITE_FLAG = 0x10, DAT_MEM_PRIV_REMOTE_WRITE_FLAG = 
0x20, - DAT_MEM_PRIV_ALL_FLAG = 0x33 + DAT_MEM_PRIV_MW_BIND_FLAG = 0x40, + DAT_MEM_PRIV_ALL_FLAG = 0x73, +#ifdef DAT_EXTENSIONS + DAT_MEM_PRIV_EXTENSION = 0x10000 +#endif } DAT_MEM_PRIV_FLAGS; /* For backward compatibility with DAT-1.0, memory privileges values are @@ -712,6 +745,10 @@ typedef enum dat_dto_completion_status /* Completion group structs (six total) */ +#ifdef DAT_EXTENSIONS +#include +#endif + /* DTO completion event data */ /* transfered_length is not defined if status is not DAT_SUCCESS */ typedef struct dat_dto_completion_event_data @@ -719,7 +756,15 @@ typedef struct dat_dto_completion_event_ DAT_EP_HANDLE ep_handle; DAT_DTO_COOKIE user_cookie; DAT_DTO_COMPLETION_STATUS status; - DAT_VLEN transfered_length; + DAT_VLEN transfered_length; +#ifdef DAT_IMMEDIATE_DATA + DAT_DTOS operation; + DAT_RMR_CONTEXT rmr_context; + DAT_UINT32 immed_data; +#endif +#ifdef DAT_EXTENSIONS + DAT_DTO_EXTENSION_EVENT_DATA extension; +#endif } DAT_DTO_COMPLETION_EVENT_DATA; /* RMR bind completion event data */ @@ -854,11 +899,11 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, + } DAT_EVENT_NUMBER; /* Union for event Data */ - typedef union dat_event_data { DAT_DTO_COMPLETION_EVENT_DATA dto_completion_event_data; @@ -1222,6 +1267,41 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_IMMEDIATE_DATA +extern DAT_RETURN dat_ep_post_send_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ + +extern DAT_RETURN dat_ep_post_recv_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN 
DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ + +extern DAT_RETURN dat_ep_post_rdma_write_immed ( + IN DAT_EP_HANDLE, /* ep_handle */ + IN DAT_COUNT, /* num_segments */ + IN DAT_LMR_TRIPLET *, /* local_iov */ + IN DAT_DTO_COOKIE, /* user_cookie */ + IN const DAT_RMR_TRIPLET *,/* remote_iov */ + IN DAT_UINT32, /* immediate data */ + IN DAT_DTO_FLAGS, /* dto_flags */ + IN DAT_COMPLETION_FLAGS ); /* completion_flags */ +#endif + +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_DTO_EXTENSION_OP, + IN ... ); +#endif + /* * DAT registry functions. * Index: dat/include/dat/dat_error.h =================================================================== --- dat/include/dat/dat_error.h (revision 5065) +++ dat/include/dat/dat_error.h (working copy) @@ -1,31 +1,51 @@ /* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2004, Network Appliance, Inc. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0". The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. The license is also available from - * the Open Source Initiative, see + * + * OR + * + * 2) under the terms of the "The BSD License". The license is also available + * from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. 
- * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The - * license is also available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation + * notice, either one of the license notices in the documentation * and/or other materials provided with the distribution. + * + * Neither the name of Network Appliance, Inc. nor the names of other DAT + * Collaborative contributors may be used to endorse or promote + * products derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND + * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL + * THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. */ /*********************************************************** @@ -47,17 +67,15 @@ /* * - * All return codes are actually a 3-way tuple: - * - * type: DAT_RETURN_CLASS DAT_RETURN_TYPE DAT_RETURN_SUBTYPE - * bits: 31-30 29-16 15-0 + * All return codes are actually a 4-way tuple: * - * 3 2 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - * | C | DAT_RETURN_TYPE | DAT_RETURN_SUBTYPE | - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + * type: CLASS RETURN_TYPE EXTENSION_SUBTYPE SUBTYPE + * bits: 31-30 29-16 15-8 7-0 * + * +-------------------------------------------------------------------------+ + * |3130 | 2928272625242322212019181716 | 15141312111009080 | 706054003020100| + * |CLAS | DAT_TYPE_STATUS | EXTENSION_SUBTYPE | SUBTYPE | + * +-------------------------------------------------------------------------+ */ /* @@ -70,8 +88,13 @@ * DAT Error bits */ #define DAT_TYPE_MASK 0x3fff0000 /* mask for DAT_TYPE_STATUS bits */ -#define DAT_SUBTYPE_MASK 0x0000FFFF /* mask for DAT_SUBTYPE_STATUS bits */ +#define DAT_SUBTYPE_MASK 0x000000FF /* mask for DAT_SUBTYPE_STATUS bits */ +#ifdef DAT_EXTENSIONS +/* Mask and macro for new extension subtype bits */ +#define DAT_EXTENSION_SUBTYPE_MASK 0x0000FF00 /* mask for DAT_EXTENSION_SUBTYPE_STATUS bits */ +#define DAT_GET_EXTENSION_SUBTYPE(status) ((DAT_UINT32)(status) & 
DAT_EXTENSION_SUBTYPE_MASK) +#endif /* * Determining the success of an operation is best done with a macro; * each of these returns a boolean value. Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 5065) +++ dat/include/dat/udat_redirection.h (working copy) @@ -199,13 +199,12 @@ typedef DAT_RETURN (*DAT_EVD_SET_UNWAITA typedef DAT_RETURN (*DAT_EVD_CLEAR_UNWAITABLE_FUNC) ( IN DAT_EVD_HANDLE); /* evd_handle */ - #include struct dat_provider { const char * device_name; - DAT_PVOID extension; + DAT_PVOID extension; DAT_IA_OPEN_FUNC ia_open_func; DAT_IA_QUERY_FUNC ia_query_func; @@ -294,6 +293,19 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + +#ifdef DAT_IMMEDIATE_DATA + /* udat-2.0 immediate data */ + DAT_EP_POST_SEND_IMMED_FUNC ep_post_send_immed_func; + DAT_EP_POST_RECV_IMMED_FUNC ep_post_recv_immed_func; + DAT_EP_POST_RDMA_WRITE_IMMED_FUNC ep_post_rdma_write_immed_func; +#endif + +#ifdef DAT_EXTENSIONS + /* udat-2.0 extensions */ + DAT_EXTENSION_FUNC extension_func; +#endif + }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_extensions.h =================================================================== --- dat/include/dat/dat_extensions.h (revision 0) +++ dat/include/dat/dat_extensions.h (revision 0) @@ -0,0 +1,210 @@ +/* + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. 
The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +/********************************************************************** + * + * HEADER: dat_extensions.h + * + * PURPOSE: defines the extensions to the DAT API for uDAPL. + * + * Description: Header file for "uDAPL: User Direct Access Programming + * Library, Version: 1.2" + * + * Mapping rules: + * All global symbols are prepended with "DAT_" or "dat_" + * All DAT objects have an 'api' tag which, such as 'ep' or 'lmr' + * The method table is in the provider definition structure. 
+ * + * + **********************************************************************/ +#ifndef _DAT_EXTENSIONS_H_ + +extern int dat_extensions; + +/* + * Provider specific attribute strings for extension support + * returned with dat_ia_query() and + * DAT_PROVIDER_ATTR_MASK == DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR + * + * DAT_NAMED_ATTR name == extended operations and version, + * value == TRUE if extended operation is supported + * version_value = version number of extension API + */ +#define DAT_EXTENSION_ATTR "DAT_EXTENSION_INTERFACE" +#define DAT_EXTENSION_ATTR_VERSION "DAT_EXTENSION_VERSION" +#define DAT_EXTENSION_ATTR_FETCH_AND_ADD "DAT_EXTENSION_FETCH_AND_ADD" +#define DAT_EXTENSION_ATTR_CMP_AND_SWAP "DAT_EXTENSION_CMP_AND_SWAP" +#define DAT_EXTENSION_ATTR_TRUE "TRUE" +#define DAT_EXTENSION_ATTR_FALSE "FALSE" +#define DAT_EXTENSION_ATTR_VERSION_VALUE "2.0.1" + + +/* + * DTO Extension OPERATIONS supported + */ +typedef enum dat_dto_extension_op +{ + DAT_DTO_EXTENSION_FETCH_AND_ADD, + DAT_DTO_EXTENSION_CMP_AND_SWAP + +} DAT_DTO_EXTENSION_OP; + + +/* + * Definitions for extension subtype RETURN codes + * + * All DAT return codes are now a 4-way tuple with an 8-bit + * EXTENSION_SUBTYPE reserved to cover specific extension subtypes: + * + * type: CLASS RETURN_TYPE EXTENSION_SUBTYPE SUBTYPE + * bits: 31-30 29-16 15-8 7-0 + * + * +-------------------------------------------------------------------------+ + * |3130 | 2928272625242322212019181716 | 15141312111009080 | 706054003020100| + * |CLAS | DAT_TYPE_STATUS | EXTENSION_SUBTYPE | SUBTYPE | + * +-------------------------------------------------------------------------+ + */ +typedef enum dat_return_extension_subtype +{ + /* NEW extension subtypes */ + DAT_EXTENSION_ERR_1 = DAT_SUBTYPE_MASK+1, + DAT_EXTENSION_ERR_2, + DAT_EXTENSION_ERR_3 + +} DAT_RETURN_EXTENSION_SUBTYPE; + +/* DAT_RETURN extension error to string */ +static __inline__ DAT_RETURN +dat_strerror_extension ( + IN DAT_RETURN value, + OUT const char 
**message ) +{ + switch( DAT_GET_EXTENSION_SUBTYPE (value) ) { + case 0: + *message = " "; + return DAT_SUCCESS; + case DAT_EXTENSION_ERR_1: + *message = "DAT_EXTENSION_ERR_1"; + return DAT_SUCCESS; + case DAT_EXTENSION_ERR_2: + *message = "DAT_EXTENSION_ERR_2"; + return DAT_SUCCESS; + case DAT_EXTENSION_ERR_3: + *message = "DAT_EXTENSION_ERR_3"; + return DAT_SUCCESS; + default: + *message = "unknown extension error"; + return DAT_INVALID_PARAMETER; + + } +} + +/* + * Definition for memory privilege extension flags. + * New privileges required for new atomic DTO type extensions. + * New Bit definitions MUST start at DAT_MEM_PRIV_EXTENSION + */ +typedef enum dat_mem_priv_extension_flags +{ + DAT_MEM_PRIV_EXT_START = DAT_MEM_PRIV_EXTENSION, + DAT_MEM_PRIV_EXT_REMOTE_ATOMIC = (DAT_MEM_PRIV_EXTENSION << 1), + +} DAT_MEM_PRIV_EXTENSION_FLAGS; + + +/* + * DTO Extension event TYPES, DTO completion + */ +typedef enum dat_dto_extension_status +{ + DAT_DTO_EXTENSION_SUCCESS, + DAT_DTO_EXTENSION_ERR_GENERAL + +} DAT_DTO_EXTENSION_STATUS; + + +/* + * DTO Extension completion event DATA types + */ +typedef struct dat_extension_dto_data +{ + DAT_UINT64 as_64; + +} DAT_DTO_EXTENSION_DATA; + +typedef struct dat_dto_extension_event_data +{ + DAT_DTO_EXTENSION_OP type; + DAT_DTO_EXTENSION_STATUS status; + union { + DAT_DTO_EXTENSION_DATA data; + } val; + +} DAT_DTO_EXTENSION_EVENT_DATA; + + +/* + * Extended API with redirection via DAT extension function + */ + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Fetch and Add operation. The add_value is added to the 64 bit + * value stored at the remote memory location specified in remote_iov + * and the result is stored in the local_iov.
+ */ +#define dat_ep_post_fetch_and_add(ep, add_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_DTO_EXTENSION_FETCH_AND_ADD, \ + (add_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Compare and Swap operation. The cmp_value is compared to the 64 bit + * value stored at the remote memory location specified in remote_iov. + * If the two values are equal, the 64 bit swap_value is stored in + * the remote memory location. In all cases, the original 64 bit + * value stored in the remote memory location is copied to the local_iov. + */ +#define dat_ep_post_cmp_and_swap(ep, cmp_val, swap_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_DTO_EXTENSION_CMP_AND_SWAP, \ + (cmp_val), \ + (swap_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +#endif /* _DAT_EXTENSIONS_H_ */ + Index: dat/common/dat_api.c =================================================================== --- dat/common/dat_api.c (revision 5065) +++ dat/common/dat_api.c (working copy) @@ -2,27 +2,27 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * + * * 2) under the terms of the "The BSD License" a copy of which is in the file * LICENSE2.txt in the root directory. The license is also available from * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a * copy of which is in the file LICENSE3.txt in the root directory. 
The * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -35,7 +35,7 @@ * PURPOSE: DAT Provider and Consumer registry functions. * Also provide small integers for IA_HANDLES * - * $Id: dat_api.c,v 1.10 2005/05/20 22:25:31 jlentini Exp $ + * $Id: dat_api.c,v 1.5 2005/02/17 19:36:23 jlentini Exp $ **********************************************************************/ #include "dat_osd.h" @@ -70,15 +70,16 @@ dats_handle_vector_init ( void ) { DAT_RETURN dat_status; int i; + int status; dat_status = DAT_SUCCESS; g_hv.handle_max = DAT_HANDLE_ENTRY_STEP; - dat_status = dat_os_lock_init (&g_hv.handle_lock); - if ( DAT_SUCCESS != dat_status ) + status = dat_os_lock_init (&g_hv.handle_lock); + if ( DAT_SUCCESS != status ) { - return dat_status; + return status; } g_hv.handle_array = dat_os_alloc (sizeof(void *) * DAT_HANDLE_ENTRY_STEP); @@ -88,7 +89,7 @@ dats_handle_vector_init ( void ) goto bail; } - for (i = 0; i < g_hv.handle_max; i++) + for (i = g_hv.handle_max; i < g_hv.handle_max; i++) { g_hv.handle_array[i] = NULL; } @@ -112,11 +113,7 @@ dats_set_ia_handle ( void **h; dat_os_lock (&g_hv.handle_lock); - - /* - * Don't give out handle zero since that is DAT_HANDLE_NULL! 
- */ - for (i = 1; i < g_hv.handle_max; i++) + for (i = 0; i < g_hv.handle_max; i++) { if (g_hv.handle_array[i] == NULL) { @@ -1142,6 +1139,105 @@ DAT_RETURN dat_srq_set_lw( low_watermark); } +#ifdef DAT_IMMEDIATE_DATA +DAT_RETURN dat_ep_post_rdma_write_immed ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT num_segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + if (ep_handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + return DAT_EP_POST_RDMA_WRITE_IMMED(ep_handle, + num_segments, + local_iov, + user_cookie, + remote_iov, + immed_data, + dto_flags, + completion_flags); +} + +DAT_RETURN dat_ep_post_send_immed ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT num_segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN DAT_UINT32 immed_data, + IN DAT_DTO_FLAGS dto_flags, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + if (ep_handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + return DAT_EP_POST_SEND_IMMED (ep_handle, + num_segments, + local_iov, + user_cookie, + immed_data, + dto_flags, + completion_flags); +} + +DAT_RETURN dat_ep_post_recv_immed ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT num_segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + if (ep_handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + return DAT_EP_POST_RECV_IMMED (ep_handle, + num_segments, + local_iov, + user_cookie, + completion_flags); +} +#endif + +#ifdef DAT_EXTENSIONS +DAT_RETURN dat_extension( + IN DAT_HANDLE handle, + IN DAT_DTO_EXTENSION_OP ext_op, + IN ... 
) + +{ + DAT_RETURN status; + va_list args; + + if (handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + + /* verify provider extension support */ + if (!dat_extensions) + { + return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0); + } + + va_start(args, ext_op); + + status = DAT_EXTENSION(handle, + ext_op, + args); + va_end(args); + + return status; +} +#endif + + /* * Local variables: * c-indent-level: 4 Index: dat/udat/Makefile =================================================================== --- dat/udat/Makefile (revision 5065) +++ dat/udat/Makefile (working copy) @@ -112,6 +112,13 @@ CFLAGS32 = -m32 endif # +# Prototype 2.0 DAT extensions and immediate data +# +CFLAGS += -DDAT_EXTENSIONS +CFLAGS += -DDAT_IMMEDIATE_DATA + + +# # LD definitions # Index: dat/udat/udat.c =================================================================== --- dat/udat/udat.c (revision 5065) +++ dat/udat/udat.c (working copy) @@ -2,27 +2,27 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * + * * 2) under the terms of the "The BSD License" a copy of which is in the file * LICENSE2.txt in the root directory. The license is also available from * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. 
The * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -34,7 +34,7 @@ * * PURPOSE: DAT Provider and Consumer registry functions. * - * $Id: udat.c,v 1.22 2005/03/24 05:58:35 jlentini Exp $ + * $Id: udat.c,v 1.20 2005/02/11 20:17:05 jlentini Exp $ **********************************************************************/ #include @@ -66,6 +66,10 @@ udat_check_state ( void ); * * *********************************************************************/ +/* + * Use a global to get an unresolved when run with pre-extension library + */ +int dat_extensions = 0; /* * @@ -226,17 +230,48 @@ dat_ia_openv ( return dat_status; } - dat_status = (*ia_open_func) (name, - async_event_qlen, - async_event_handle, - ia_handle); + dat_status = (*ia_open_func) (name, + async_event_qlen, + async_event_handle, + ia_handle); + + /* + * See if provider supports extensions + */ if (dat_status == DAT_SUCCESS) { + DAT_PROVIDER_ATTR p_attr; + int i; + return_handle = dats_set_ia_handle (*ia_handle); if (return_handle >= 0) { *ia_handle = (DAT_IA_HANDLE)return_handle; - } + } + + if ( dat_ia_query( *ia_handle, + NULL, + 0, + NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &p_attr ) == DAT_SUCCESS ) + { + for ( i = 0; i < p_attr.num_provider_specific_attr; i++ ) + { + if ( (strcmp( p_attr.provider_specific_attr[i].name, + "DAT_EXTENSION_INTERFACE" ) == 0) && + (strcmp( p_attr.provider_specific_attr[i].value, + "TRUE" ) == 0) ) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API, + "DAT Registry: dat_ia_open () " + "DAPL Extension Interface 
supported!\n"); + + dat_extensions = 1; + break; + } + } + } } return dat_status; -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Wed Jan 18 16:38:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 16:38:18 -0800 Subject: [openib-general] Re: [PATCH] iWARP Include File Changes In-Reply-To: <1136487265.10878.17.camel@trinity.austin.ammasso.com> References: <1136487265.10878.17.camel@trinity.austin.ammasso.com> Message-ID: <43CEDF7A.8090708@ichips.intel.com> Tom Tucker wrote: > enum ib_device_cap_flags { > @@ -86,6 +87,14 @@ > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > IB_DEVICE_SRQ_RESIZE = (1<<13), > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), > + IB_DEVICE_ZERO_STAG = (1<<16), > + IB_DEVICE_SEND_W_INV = (1<<17), > + IB_DEVICE_MW = (1<<18), > + IB_DEVICE_FMR = (1<<19), > + IB_DEVICE_SRQ = (1<<20), > + IB_DEVICE_ARP = (1<<21), > + IB_DEVICE_LLP = (1<<22), > }; Does this change imply that devices need to set these capabilities? I.e. should mthca and the other drivers be updated to set the device capabilities correctly? - Sean From bos at pathscale.com Wed Jan 18 16:43:31 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 16:43:31 -0800 Subject: [openib-general] RFC: ipath ioctls and their replacements Message-ID: <1137631411.4757.218.camel@serpentine.pathscale.com> When I posted the last round of ipath driver code for review, people objected to the number of ioctls we had. I'd like to get feedback on what would be acceptable replacements. We have four kinds of ioctl right now: * Interfacing with userspace * Infiniband subnet management * Flash/EEPROM management * Diagnostics There are currently 36 ioctls in total. I think that I can reduce this number dramatically, but we're having some contentious internal debate about whether and how some of the ioctls should be replaced. I'd like to see what's most likely to get accepted. 
Obviously, we'd prefer the number to be zero, but I don't think we can do that without submitting a driver that isn't very useful. Unless I indicate otherwise, I cannot think of clean replacements for the ioctls listed below, and would appreciate suggestions. For user access: Opening the /dev/ipath special file assigns an appropriate free unit (chip) and port (context on a chip) to a user process. Think of it as similar to /dev/ptmx for ttys, except there isn't a devpts-like filesystem behind it. Once a process has opened /dev/ipath, it needs to find out which unit and port it has opened, so that it can access other attributes in /sys. To do this, we provide a GETPORT ioctl. USERINIT and BASEINFO work with mmap to set up direct access to the hardware for user processes. We intend to turn these into a single ioctl, USERINIT. This copies a substantial amount of information to and from userspace. RCVCTRL enables/disables receipt of packets. SET_PKEY sets a partition key, essentially telling hardware which packets are interesting to userspace. UPDM_TID and FREE_TID are used for RDMA context management. WAIT waits for incoming packets, and can clearly be replaced by file_ops->poll. GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by files in sysfs. For subnet management: GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID, GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced by files in sysfs. SET_LINKSTATE changes the link state. SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management packets. I *think* they could be replaced by read and write methods on a new special file, although the semantics aren't a super-clean match. For EEPROM/flash management: READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't see a standard way of doing this in the kernel; many drivers provide their own private ioctls, some on dedicated special files. 
I think that using read and write instead would be okay (with a small qualm about semantics), but this idea makes an influential coworker barf violently. I can't see how we could use the ethtool flash interface: the low-level driver doesn't look like a regular net device, and we support partial updates of the flash. For diagnostics: DIAGENTER and DIAGLEAVE put the driver into and out of diag mode. These could be replaced by open/close of a special file. DIAGREAD and DIAGWRITE perform direct accesses to the device's PCI memory space. I think these could be replaced by read and write, but they are again subject to the make-coworker-barf problem. HTREAD and HTREAD perform direct accesses to the device's PCI config space. Same disagreement problem as DIAGREAD and DIAGWRITE. SEND_DIAG_PKT can be replaced with whatever sends and receives subnet management packets, as above. DIAG_RD_I2C is synonymous with READ_EEPROM, and will go away. Depending on how you look at it, we can slim our list of ioctls down to somewhere between 6 and 10. This isn't zero, but it's not 36, either. What do people think? References: <309a667c0601120557h2bcec18fu6aa13d8930ccba4c@mail.gmail.com> Message-ID: <43CEE0D5.3000109@ichips.intel.com> Devesh Sharma wrote: > I have some queries regarding the significance of the function > ib_register_mad_agent() > > A) What this function dose? > B) What is the significance of this function in implementing HCA driver? I didn't see a response to this. This function permits a client to send and receive MADs on QP 0 and 1. For example, one of the things that mthca uses it for is to forward traps. - Sean From davem at davemloft.net Wed Jan 18 16:48:39 2006 From: davem at davemloft.net (David S. 
Miller) Date: Wed, 18 Jan 2006 16:48:39 -0800 (PST) Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137631411.4757.218.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <20060118.164839.74431051.davem@davemloft.net> From: Bryan O'Sullivan Date: Wed, 18 Jan 2006 16:43:31 -0800 > Obviously, we'd prefer the number to be zero, but I don't think we > can do that without submitting a driver that isn't very useful. You can use an interface such as netlink for device configuration. It can do better type checking, can be used by generic tools, and some day soon will be transferable over the wire so that one can perform remote configuration changes. Let's let ioctl()'s go the way of the cave man. It's one of the worst designed interfaces under UNIX :) From greg at kroah.com Wed Jan 18 16:53:16 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 16:53:16 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137631411.4757.218.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <20060119005316.GA26884@kroah.com> On Wed, Jan 18, 2006 at 04:43:31PM -0800, Bryan O'Sullivan wrote: > For EEPROM/flash management: > > READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't > see a standard way of doing this in the kernel; many drivers > provide their own private ioctls, some on dedicated special > files. I think that using read and write instead would be okay > (with a small qualm about semantics), but this idea makes an > influential coworker barf violently. I can't see how we could > use the ethtool flash interface: the low-level driver doesn't > look like a regular net device, and we support partial updates > of the flash. Use the firmware subsystem for this. It uses sysfs so no ioctl is needed at all.
thanks, greg k-h From mshefty at ichips.intel.com Wed Jan 18 16:54:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jan 2006 16:54:52 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1136578777.14108.6.camel@trinity.austin.ammasso.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> Message-ID: <43CEE35C.7050707@ichips.intel.com> Tom Tucker wrote: > Enclosed is a combined include file and core patch for iWARP support in CMA. This > patch includes changes per your last review. Unless anyone has an objection, I will commit these changes within the next day or so. - Sean From ralphc at pathscale.com Wed Jan 18 17:13:24 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 18 Jan 2006 17:13:24 -0800 Subject: [openib-general] [PATCH] Add -m option to ping pong programs to set path MTU Message-ID: <1137633204.4520.397.camel@brick.internal.keyresearch.com> This patch adds a new -m option to the ping pong programs which have a path MTU parameter to ib_modify_qp(). 
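[Editorial sketch, not part of the posted patch.] The -m handling described above comes down to mapping a byte count given on the command line onto one of the five IB path MTU codes and rejecting every other value. A minimal C sketch of that mapping follows. It assumes a local stand-in enum (pp_mtu) and two hypothetical helpers (pp_mtu_from_bytes, pp_parse_mtu) so that it builds without libibverbs; the patch itself uses the IBV_MTU_* constants and an inline switch in each program's main(). The check for trailing characters is an addition of this sketch, since the patch calls strtol(optarg, NULL, 0) directly.

```c
#include <stdlib.h>

/*
 * Local stand-in for the path MTU codes so the sketch builds without
 * libibverbs. The names are hypothetical; real code would use the
 * IBV_MTU_* constants from <infiniband/verbs.h>.
 */
enum pp_mtu {
    PP_MTU_INVALID = -1,
    PP_MTU_256     = 1,
    PP_MTU_512     = 2,
    PP_MTU_1024    = 3,
    PP_MTU_2048    = 4,
    PP_MTU_4096    = 5
};

/* Map an MTU in bytes to its code; anything else is invalid. */
enum pp_mtu pp_mtu_from_bytes(long bytes)
{
    switch (bytes) {
    case 256:  return PP_MTU_256;
    case 512:  return PP_MTU_512;
    case 1024: return PP_MTU_1024;
    case 2048: return PP_MTU_2048;
    case 4096: return PP_MTU_4096;
    default:   return PP_MTU_INVALID;
    }
}

/*
 * Parse a command-line argument such as "2048" or "0x800".
 * Empty strings and trailing characters are rejected (an addition of
 * this sketch, not something the posted patch does).
 */
enum pp_mtu pp_parse_mtu(const char *arg)
{
    char *end;
    long bytes = strtol(arg, &end, 0);

    if (end == arg || *end != '\0')
        return PP_MTU_INVALID;
    return pp_mtu_from_bytes(bytes);
}
```

A caller would print the usage text and exit when pp_parse_mtu() returns PP_MTU_INVALID, which matches the default: branch of the switch in the patch.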
Signed-off-by: Ralph Campbell Index: libibverbs/examples/rc_pingpong.c =================================================================== --- libibverbs/examples/rc_pingpong.c (revision 5065) +++ libibverbs/examples/rc_pingpong.c (working copy) @@ -59,6 +59,7 @@ }; static int page_size; +static int path_mtu; struct pingpong_context { struct ibv_context *context; @@ -94,7 +95,7 @@ { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, - .path_mtu = IBV_MTU_1024, + .path_mtu = path_mtu, .dest_qp_num = dest->qpn, .rq_psn = dest->psn, .max_dest_rd_atomic = 1, @@ -440,6 +441,7 @@ printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -m, --mtu= path MTU (default 1024)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); @@ -458,6 +460,7 @@ int port = 18515; int ib_port = 1; int size = 4096; + int mtu = 1024; int rx_depth = 500; int iters = 1000; int use_event = 0; @@ -474,13 +477,14 @@ { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "mtu", .has_arg = 1, .val = 'm' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:r:n:e", long_options, NULL); if (c == -1) break; @@ -509,6 +513,10 @@ size = strtol(optarg, NULL, 0); break; + case 'm': + mtu = strtol(optarg, NULL, 0); + break; + case 'r': rx_depth = strtol(optarg, NULL, 0); break; @@ -534,6 +542,32 @@ return 1; } + switch (mtu) { + case 256: + path_mtu = IBV_MTU_256; + break; + + case 512: + 
path_mtu = IBV_MTU_512; + break; + + case 1024: + path_mtu = IBV_MTU_1024; + break; + + case 2048: + path_mtu = IBV_MTU_2048; + break; + + case 4096: + path_mtu = IBV_MTU_4096; + break; + + default: + usage(argv[0]); + return 1; + } + page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); Index: libibverbs/examples/uc_pingpong.c =================================================================== --- libibverbs/examples/uc_pingpong.c (revision 5065) +++ libibverbs/examples/uc_pingpong.c (working copy) @@ -59,6 +59,7 @@ }; static int page_size; +static int path_mtu; struct pingpong_context { struct ibv_context *context; @@ -94,7 +95,7 @@ { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, - .path_mtu = IBV_MTU_1024, + .path_mtu = path_mtu, .dest_qp_num = dest->qpn, .rq_psn = dest->psn, .ah_attr = { @@ -428,6 +429,7 @@ printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -m, --mtu= path MTU (default 1024)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); @@ -446,6 +448,7 @@ int port = 18515; int ib_port = 1; int size = 4096; + int mtu = 1024; int rx_depth = 500; int iters = 1000; int use_event = 0; @@ -462,13 +465,14 @@ { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "mtu", .has_arg = 1, .val = 'm' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:r:n:e", long_options, NULL); if (c == -1) break; @@ -497,6 +501,10 @@ 
size = strtol(optarg, NULL, 0); break; + case 'm': + mtu = strtol(optarg, NULL, 0); + break; + case 'r': rx_depth = strtol(optarg, NULL, 0); break; @@ -522,6 +530,32 @@ return 1; } + switch (mtu) { + case 256: + path_mtu = IBV_MTU_256; + break; + + case 512: + path_mtu = IBV_MTU_512; + break; + + case 1024: + path_mtu = IBV_MTU_1024; + break; + + case 2048: + path_mtu = IBV_MTU_2048; + break; + + case 4096: + path_mtu = IBV_MTU_4096; + break; + + default: + usage(argv[0]); + return 1; + } + page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); Index: libibverbs/examples/srq_pingpong.c =================================================================== --- libibverbs/examples/srq_pingpong.c (revision 5065) +++ libibverbs/examples/srq_pingpong.c (working copy) @@ -61,6 +61,7 @@ }; static int page_size; +static int path_mtu; struct pingpong_context { struct ibv_context *context; @@ -102,7 +103,7 @@ for (i = 0; i < ctx->num_qp; ++i) { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, - .path_mtu = IBV_MTU_1024, + .path_mtu = path_mtu, .dest_qp_num = dest[i].qpn, .rq_psn = dest[i].psn, .max_dest_rd_atomic = 1, @@ -501,6 +502,7 @@ printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -m, --mtu= path MTU (default 1024)\n"); printf(" -q, --num-qp= number of QPs to use (default 16)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges per QP(default 1000)\n"); @@ -521,6 +523,7 @@ int port = 18515; int ib_port = 1; int size = 4096; + int mtu = 1024; int num_qp = 16; int rx_depth = 500; int iters = 1000; @@ -540,6 +543,7 @@ { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "mtu", .has_arg = 1, .val = 'm' }, { .name = 
"num-qp", .has_arg = 1, .val = 'q' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, @@ -547,7 +551,7 @@ { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:q:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:q:r:n:e", long_options, NULL); if (c == -1) break; @@ -576,6 +580,10 @@ size = strtol(optarg, NULL, 0); break; + case 'm': + mtu = strtol(optarg, NULL, 0); + break; + case 'q': num_qp = strtol(optarg, NULL, 0); break; @@ -605,6 +613,32 @@ return 1; } + switch (mtu) { + case 256: + path_mtu = IBV_MTU_256; + break; + + case 512: + path_mtu = IBV_MTU_512; + break; + + case 1024: + path_mtu = IBV_MTU_1024; + break; + + case 2048: + path_mtu = IBV_MTU_2048; + break; + + case 4096: + path_mtu = IBV_MTU_4096; + break; + + default: + usage(argv[0]); + return 1; + } + if (num_qp > rx_depth) { fprintf(stderr, "rx_depth %d is too small for %d QPs -- " "must have at least one receive per QP.\n", -- Ralph Campbell From bos at pathscale.com Wed Jan 18 17:14:16 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 17:14:16 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060118.164839.74431051.davem@davemloft.net> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060118.164839.74431051.davem@davemloft.net> Message-ID: <1137633256.4757.225.camel@serpentine.pathscale.com> On Wed, 2006-01-18 at 16:48 -0800, David S. Miller wrote: > You can use an interface such a netlink for device configuration. > It can do better type checking, can be used by generic tools, and > some day soon will be transferable over the wire so that one can > perform remote configuration changes. That looks doable, but to my eyes, the netlink interface looks both more cumbersome and less reliable than ioctl. 
At least it apparently lets us do arbitrarily peculiar things :-) References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060118.164839.74431051.davem@davemloft.net> <1137633256.4757.225.camel@serpentine.pathscale.com> Message-ID: <20060118.171716.04998471.davem@davemloft.net> From: Bryan O'Sullivan Date: Wed, 18 Jan 2006 17:14:16 -0800 > That looks doable, but to my eyes, the netlink interface looks both > more cumbersome and less reliable than ioctl. At least it > apparently lets us do arbitrarily peculiar things :-) It's going to give you strict typing, and extensible attributes for the configuration attributes you define. So if you determine later "oh we need to add this knob for changing X" you can do that without breaking the existing interface. With ioctl() that is usually impossible or unreasonably hard to accomplish. Try not to get discouraged, give it a shot :) From bos at pathscale.com Wed Jan 18 17:17:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 17:17:20 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060119005316.GA26884@kroah.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119005316.GA26884@kroah.com> Message-ID: <1137633441.4757.228.camel@serpentine.pathscale.com> On Wed, 2006-01-18 at 16:53 -0800, Greg KH wrote: > Use the firmware subsystem for this. It uses sysfs so ioctl needed at > all. OK. Would I be correct in thinking that drivers/firmware/dcdbas.c is a reasonable model implementation to follow? References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC09@taurus.voltaire.com> Message-ID: <7b2fa1820601181717ne0051admfb9f15a793102170@mail.gmail.com> On 1/18/06, Hal Rosenstock wrote: > > Multicast verbs would be used to send data from user space. Is that what > you are looking for ? Yes! I need some examples, according to which I could use the multicast in my own applications. Would you give some suggestions? 
And could not the multicast be used in kernel space? Thanks very much! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences From greg at kroah.com Wed Jan 18 18:57:41 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 18:57:41 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137631411.4757.218.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <20060119025741.GC15706@kroah.com> On Wed, Jan 18, 2006 at 04:43:31PM -0800, Bryan O'Sullivan wrote: > Opening the /dev/ipath special file assigns an appropriate free > unit (chip) and port (context on a chip) to a user process. Shouldn't you just open the proper chip device and port device itself? That drops one ioctl. > Think of it as similar to /dev/ptmx for ttys, except there isn't > a devpts-like filesystem behind it. Once a process has > opened /dev/ipath, it needs to find out which unit and port it > has opened, so that it can access other attributes in /sys. To > do this, we provide a GETPORT ioctl. > USERINIT and BASEINFO work with mmap to set up direct access to > the hardware for user processes. We intend to turn these into a > single ioctl, USERINIT. This copies a substantial amount of > information to and from userspace. Why not just use mmap? What's the special needs? > RCVCTRL enables/disables receipt of packets. sysfs file. > SET_PKEY sets a partition key, essentially telling hardware > which packets are interesting to userspace. sysfs file. > UPDM_TID and FREE_TID are used for RDMA context management. sysfs files. > WAIT waits for incoming packets, and can clearly be replaced by > file_ops->poll. Use poll. > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by > files in sysfs. good.
> For subnet management: > > GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID, > GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced > by files in sysfs. > > SET_LINKSTATE changes the link state. > > SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management > packets. I *think* they could be replaced by read and write > methods on a new special file, although the semantics aren't a > super-clean match. Use netlink for subnet stuff. > For diagnostics: > > DIAGENTER and DIAGLEAVE put the driver into and out of diag > mode. These could be replaced by open/close of a special file. Use debugfs. > DIAGREAD and DIAGWRITE perform direct accesses to the device's > PCI memory space. I think these could be replaced by read and > write, but they are again subject to the make-coworker-barf > problem. Use debugfs. > HTREAD and HTREAD perform direct accesses to the device's PCI > config space. Same disagreement problem as DIAGREAD and > DIAGWRITE. Use the pci sysfs config files, don't duplicate existing functionality. > SEND_DIAG_PKT can be replaced with whatever sends and receives > subnet management packets, as above. netlink or debugfs. Hope this helps, greg k-h From greg at kroah.com Wed Jan 18 18:54:26 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 18:54:26 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137633441.4757.228.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119005316.GA26884@kroah.com> <1137633441.4757.228.camel@serpentine.pathscale.com> Message-ID: <20060119025426.GB15706@kroah.com> On Wed, Jan 18, 2006 at 05:17:20PM -0800, Bryan O'Sullivan wrote: > On Wed, 2006-01-18 at 16:53 -0800, Greg KH wrote: > > > Use the firmware subsystem for this. It uses sysfs so ioctl needed at > > all. > > OK. Would I be correct in thinking that drivers/firmware/dcdbas.c is a > reasonable model implementation to follow? No. 
Pick a driver that has a backing device, like the wireless drivers that use it. That Dell bios driver has had more looney extensions than I can shake a stick at... thanks, greg k-h From akpm at osdl.org Wed Jan 18 19:49:11 2006 From: akpm at osdl.org (Andrew Morton) Date: Wed, 18 Jan 2006 19:49:11 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060119025741.GC15706@kroah.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> Message-ID: <20060118194911.4da86c22.akpm@osdl.org> Greg KH wrote: > Sorry for sticking my head in a beehive, but. Stand back and look at it: > Shouldn't you just open the proper chip device and port device itself? > Why not just use mmap? What's the special needs? > sysfs file. > Use poll. > Use netlink for subnet stuff. > Use debugfs. > Use the pci sysfs config files, don't duplicate existing functionality. > netlink or debugfs. For a driver-bodging interface design, this is simply nutty. And it makes the driver developer learn a pile of extra stuff and it introduces lots of linkages everywhere and heaven knows what the driver's userspace interface description ends up looking like. ioctl() would have to be pretty darn bad to be worse than all this random stuff. Just saying... From greg at kroah.com Wed Jan 18 20:03:18 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 20:03:18 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060118194911.4da86c22.akpm@osdl.org> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <20060118194911.4da86c22.akpm@osdl.org> Message-ID: <20060119040318.GA17121@kroah.com> On Wed, Jan 18, 2006 at 07:49:11PM -0800, Andrew Morton wrote: > Greg KH wrote: > > > > Sorry for sticking my head in a beehive, but. Stand back and look at it: > > > Shouldn't you just open the proper chip device and port device itself? 
> > Why not just use mmap? What's the special needs? > > sysfs file. > > Use poll. > > Use netlink for subnet stuff. > > Use debugfs. > > Use the pci sysfs config files, don't duplicate existing functionality. > > netlink or debugfs. > > For a driver-bodging interface design, this is simply nutty. One can rightfully argue that they are doing some huge messy things, and deserve the extra mess if they persist in trying to do it. > And it makes the driver developer learn a pile of extra stuff and it > introduces lots of linkages everywhere and heaven knows what the driver's > userspace interface description ends up looking like. > > ioctl() would have to be pretty darn bad to be worse than all this random > stuff. It is. It's giving any driver writer the ability to pretty much create as many different and new and incompatible system calls directly into the kernel, making their driver "just a little different" from every other type of driver. Do you really feel confident in allowing this? I sure do not. But if they use the interfaces that are present in the kernel (sysfs, debugfs, netlink, firmware interface), their driver will automatically work with the already-written userspace tools and their driver will usually not contain nasty bugs that show up on 64->32bit issues, and security problems where every user can mess with things they should not (like lots of ioctls have been known to have in the past.) We are trying very hard here to make it easier on both the users and the driver writers (that's why we wrote that infrastructure in the first place.) 
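[Editorial sketch, not from the thread.] One common source of the 64->32bit problems mentioned above is an ioctl argument structure whose layout depends on the ABI: a raw pointer or a long is 4 bytes for a 32-bit process and 8 bytes for a 64-bit kernel, so the two sides disagree about field offsets. The C fragment below contrasts such a structure with a fixed-width one. All names in it (demo_args_fragile, demo_args_fixed and the two size helpers) are hypothetical and are not taken from the ipath driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * ABI-dependent layout: the pointer and the long change size between
 * 32-bit and 64-bit builds, so a 32-bit caller and a 64-bit kernel
 * would see different offsets and a different total size.
 */
struct demo_args_fragile {
    void *buf;
    long  len;
    int   flags;
};

/*
 * Fixed-width layout: the user pointer travels as a 64-bit integer and
 * every field has an explicit size, so the layout is the same for
 * 32-bit and 64-bit callers.
 */
struct demo_args_fixed {
    uint64_t buf;    /* user address, widened to 64 bits */
    uint32_t len;
    uint32_t flags;
};

/* Helpers that report the sizes the compiler chose for each layout. */
size_t demo_fragile_size(void)
{
    return sizeof(struct demo_args_fragile);
}

size_t demo_fixed_size(void)
{
    return sizeof(struct demo_args_fixed);
}
```

Building this once with -m32 and once with -m64 and comparing demo_fragile_size() shows the mismatch, while demo_fixed_size() stays the same in both builds.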
thanks, greg k-h From devesh28 at gmail.com Wed Jan 18 20:08:53 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 19 Jan 2006 09:38:53 +0530 Subject: [openib-general] Functioning of ib_register_mad_agent() In-Reply-To: <43CEE0D5.3000109@ichips.intel.com> References: <309a667c0601120557h2bcec18fu6aa13d8930ccba4c@mail.gmail.com> <43CEE0D5.3000109@ichips.intel.com> Message-ID: <309a667c0601182008q1a319f5aq47fa97a8e8e6e2bd@mail.gmail.com> Hi Sean, Thanks for replying, Hal and me had a good discussion on this and now the concept of this function is clear to me. Devesh On 1/19/06, Sean Hefty wrote: > Devesh Sharma wrote: > > I have some queries regarding the significance of the function > > ib_register_mad_agent() > > > > A) What this function dose? > > B) What is the significance of this function in implementing HCA driver? > > I didn't see a response to this. > > This function permits a client to send and receive MADs on QP 0 and 1. For > example, one of the things that mthca uses it for is to forward traps. 
> > - Sean > From bos at pathscale.com Wed Jan 18 21:02:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 21:02:37 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> Message-ID: <1137646957.25584.17.camel@localhost.localdomain> On Wed, 2006-01-18 at 18:57 -0800, Greg KH wrote: > Shouldn't you just open the proper chip device and port device itself? > That drops one ioctl. There isn't usually a "right" chip device and port. On a NUMA system, you want to open the chip that is topologically closest to you, but failing that, you want to open something that will at least work. You may *also* want to be able to open a specific unit/port pair, but that would not be the normal mode of operation. The reason for doing this through a single open syscall, instead of making userland try each appropriate device in turn, is the same as why /dev/ptmx exists: it guarantees that userland can't do something stupid or racy.
The driver checks all units and ports under a single mutex, so it doesn't have to retry to see if something got closed behind its back, for example. > Why not just use mmap? What's the special needs? mmap just maps the hardware MMIO area into user memory. The ioctl (or netlink message, or whatever it's going to be) does quite a lot more, such as tell the chip where user buffers are. > > RCVCTRL enables/disables receipt of packets. > > sysfs file. > > > SET_PKEY sets a partition key, essentially telling hardware > > which packets are interesting to userspace. > > sysfs file. > > > UPDM_TID and FREE_TID are used for RDMA context management. > > sysfs files. Really? Not netlink messages for these? It is rightly only the process that has a unit/port open that should be able to modify these; can I enforce that through sysfs without jumping through too many hoops? > Use netlink for subnet stuff. OK. > > For diagnostics: > Use debugfs. Ah, yes. > Use the pci sysfs config files, don't duplicate existing functionality. OK. > Hope this helps, Yes, it does. There's such a profusion of disconnected interfaces in 2.6 for driver authors to get their heads around, it is a big help to get some directions through the thicket. Thanks, From bos at pathscale.com Wed Jan 18 21:17:01 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 21:17:01 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060118.171716.04998471.davem@davemloft.net> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060118.164839.74431051.davem@davemloft.net> <1137633256.4757.225.camel@serpentine.pathscale.com> <20060118.171716.04998471.davem@davemloft.net> Message-ID: <1137647821.25584.33.camel@localhost.localdomain> On Wed, 2006-01-18 at 17:17 -0800, David S. Miller wrote: > It's going to give you strict typing, and extensible attributes for > the configuration attributes you define. 
So if you determine later > "oh we need to add this knob for changing X" you can do that without > breaking the existing interface. Wow. OK, that is not immediately obvious from reading the code. The only modules in drivers/ that seem to use netlink are iscsi, connector, and w1. It's more extensive in net/, I see. > Try not to get discouraged, give it a shot :) It's not obvious what chunk of the tree is a good example to follow. Just look what happened when I suggested to Greg that I use the Dell firmware loader as an example :-) The closest approximation I can find to documentation is something Neil Horman wrote over a year ago: http://people.redhat.com/nhorman/papers/netlink.pdf And a "this module does a particularly natty job that all coders would do well to emulate" pointer would be most welcome. I notice that libnetlink appears to have disappeared without a trace, along with Alexey. From nacc at us.ibm.com Wed Jan 18 21:25:53 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 18 Jan 2006 21:25:53 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135945593.4331.1109.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> Message-ID: <20060119052553.GT3257@us.ibm.com> On 30.12.2005 [07:26:34 -0500], Hal Rosenstock wrote: > Hi Nish, > > On Fri, 2005-12-30 at 00:04, Nishanth Aravamudan wrote: > > On 29.12.2005 [23:29:10 -0500], Hal Rosenstock wrote: > > > Hi Nish, > > > > > > On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > > > > Hi, > > > > > > > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > > > to: > > > > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this
function) > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > > > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > > > make[1]: *** [drivers/infiniband] Error 2 > > > > make: *** [drivers] Error 2 > > > > > > There is an iscsi patch required for this as iser requires an open-iscsi > > > version which is subsequent to what is in 2.6.15-rc7-git3. I'm not sure > > > the best way to handle this yet as the build is different for 2.6.14 > > > which does not contain open-iscsi. > > > > Where can I find this patch? I can temporarily add it to the build-path > > for the svn-based builds, until a better solution is found. > > I am attaching the patch for this. Note that this patch is for > 2.6.15-rc and not 2.6.14 variants. It has been tested with > 2.6.15-rc6. Please let me know if it works for you. Thanks. 
Trying to run the compilation tests against 2.6.16-rc1-git1 and am getting this: CC [M] drivers/infiniband/ulp/iser/iscsi_iser.o drivers/infiniband/ulp/iser/iscsi_iser.c:1573: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iscsi_iser.c:1574: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iscsi_iser.c:1575: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iscsi_iser.c:1577: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iscsi_iser.c:1579: error: unknown field `get_param' specified in initializer drivers/infiniband/ulp/iser/iscsi_iser.c:1579: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_init': drivers/infiniband/ulp/iser/iscsi_iser.c:1886: warning: assignment makes integer from pointer without a cast make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 make[2]: *** [drivers/infiniband/ulp/iser] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 when using the patch you sent me. Should I not apply the patch to 2.6.16-rc1 and on? Or are these new issues. This is with svn 5065. Thanks, Nish From greg at kroah.com Wed Jan 18 21:39:40 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 21:39:40 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137646957.25584.17.camel@localhost.localdomain> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <1137646957.25584.17.camel@localhost.localdomain> Message-ID: <20060119053940.GB21467@kroah.com> On Wed, Jan 18, 2006 at 09:02:37PM -0800, Bryan O'Sullivan wrote: > On Wed, 2006-01-18 at 18:57 -0800, Greg KH wrote: > > > Shouldn't you just open the proper chip device and port device itself? > > That drops one ioctl. > > There isn't usually a "right" chip device and port. 
On a NUMA system, > you want to open the chip that is topologically closest to you, but > failing that, you want to open something that will at least work. You > may *also* want to be able to open a specific unit/port pair, but that > would not be the normal mode of operation. > > The reason for doing this through a single open syscall, instead of > making userland try each appropriate device in turn, is the same as > why /dev/ptmx exists: it guarantees that userland can't do something > stupid or racy. The driver checks all units and ports under a single > mutex, so it doesn't have to retry to see if something got closed behind > its back, for example. Ok, that's fair enough. But if you want to do something like ptys, then why not just have your own filesystem for this driver? > > Why not just use mmap? What's the special needs? > > mmap just maps the hardware MMIO area into user memory. The ioctl (or > netlink message, or whatever it's going to be) does quite a lot more, > such as tell the chip where user buffers are. Ok. > > > UPDM_TID and FREE_TID are used for RDMA context management. > > > > sysfs files. > > Really? Not netlink messages for these? It is rightly only the process > that has a unit/port open that should be able to modify these; can I > enforce that through sysfs without jumping through too many hoops? I really don't know your application enough to be sure. If you want to use netlink, that's fine too. > Yes, it does. There's such a profusion of disconnected interfaces in > 2.6 for driver authors to get their heads around, it is a big help to > get some directions through the thicket. Well, for 99% of the drivers, there is no problem, as there is already a specified and documented way to interact (like network, tty, block, etc.) 
You are just making your own type of special interface up as you go, so the complexity is also there (this complexity would normally be in some core code, which I am hoping that your code will turn into for other devices of the same type, right?) thanks, greg k-h From greg at kroah.com Wed Jan 18 21:43:31 2006 From: greg at kroah.com (Greg KH) Date: Wed, 18 Jan 2006 21:43:31 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137647821.25584.33.camel@localhost.localdomain> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060118.164839.74431051.davem@davemloft.net> <1137633256.4757.225.camel@serpentine.pathscale.com> <20060118.171716.04998471.davem@davemloft.net> <1137647821.25584.33.camel@localhost.localdomain> Message-ID: <20060119054331.GC21467@kroah.com> On Wed, Jan 18, 2006 at 09:17:01PM -0800, Bryan O'Sullivan wrote: > On Wed, 2006-01-18 at 17:17 -0800, David S. Miller wrote: > > > It's going to give you strict typing, and extensible attributes for > > the configuration attributes you define. So if you determine later > > "oh we need to add this knob for changing X" you can do that without > > breaking the existing interface. > > Wow. OK, that is not immediately obvious from reading the code. The > only modules in drivers/ that seem to use netlink are iscsi, connector, > and w1. It's more extensive in net/, I see. The attribute stuff is pretty new, and I do not think any code in drivers/ uses it yet. But it is well documented in include/net/netlink.h, have you looked at that? > > Try not to get discouraged, give it a shot :) > > It's not obvious what chunk of the tree is a good example to follow. > Just look what happened when I suggested to Greg that I use the Dell > firmware loader as an example :-) Well, it is good that you asked; far too many people do not. And others wonder why we are so insistent on everyone doing things properly in all parts of the kernel; this is the reason.
Which reminds me to go back and look at that dell driver again... thanks, greg k-h From bos at pathscale.com Wed Jan 18 21:53:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 18 Jan 2006 21:53:08 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060119053940.GB21467@kroah.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <1137646957.25584.17.camel@localhost.localdomain> <20060119053940.GB21467@kroah.com> Message-ID: <1137649988.25584.67.camel@localhost.localdomain> On Wed, 2006-01-18 at 21:39 -0800, Greg KH wrote: > Ok, that's fair enough. But if you want to do something like ptys, then > why not just have your own filesystem for this driver? If you think it's appropriate to implement a new filesystem to replace a single ioctl that returns two integers, we can probably do that, but more realistically, the GETPORT ioctl can probably live a long and untroubled life as another netlink message. > You are just making your own type of special interface up as you > go, so the complexity is also there (this complexity would normally be > in some core code, which I am hoping that your code will turn into for > other devices of the same type, right?) The most important chunk of likely common code I can see at the moment is the stuff for bodging user page mappings that we got hammered over already. The drivers/infiniband/ tree already has code that does something like this, and a few other not-yet-in-tree network drivers that support RDMA have similar needs, too. (Bryan O'Sullivan's message of "Wed, 18 Jan 2006 16:43:31 -0800") References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: "Bryan O'Sullivan" writes: > When I posted the last round of ipath driver code for review, people > objected to the number of ioctls we had. I'd like to get feedback on > what would be acceptable replacements. 
Roland you know the RDMA model best, are things so tied to the current crop of infiniband protocols that what the ipath code wants to do is not covered? They clearly need subsystem support and what they are trying to do either isn't covered or they don't see how to use what is there. Do the infiniband verbs not allow dealing with a unreliable datagram protocol? > We have four kinds of ioctl right now: > > * Interfacing with userspace > * Infiniband subnet management > * Flash/EEPROM management > * Diagnostics > > There are currently 36 ioctls in total. I think that I can reduce this > number dramatically, but we're having some contentious internal debate > about whether and how some of the ioctls should be replaced. I'd like > to see what's most likely to get accepted. Obviously, we'd prefer the > number to be zero, but I don't think we can do that without submitting a > driver that isn't very useful. > > Unless I indicate otherwise, I cannot think of clean replacements for > the ioctls listed below, and would appreciate suggestions. > > For user access: > > Opening the /dev/ipath special file assigns an appropriate free > unit (chip) and port (context on a chip) to a user process. > Think of it as similar to /dev/ptmx for ttys, except there isn't > a devpts-like filesystem behind it. Once a process has > opened /dev/ipath, it needs to find out which unit and port it > has opened, so that it can access other attributes in /sys. To > do this, we provide a GETPORT ioctl. We need some generic subsystem support to do this. If the kernel ib/rdma support is not enough to do this we need to build something. Dealing with NUMA affinity should not be something drivers need to invent. > USERINIT and BASEINFO work with mmap to set up direct access to > the hardware for user processes. We intend to turn these into a > single ioctl, USERINIT. This copies a substantial amount of > information to and from userspace. 
I'm not certain, but the concept sounds generic even if the information is not. This sounds like a job for the ib/rdma/kernel-bypass networking subsystem. > RCVCTRL enables/disables receipt of packets. Again, this is a generic problem, and the generic interfaces are broken if you can't do this. I know the Linux network stack already provides this. > SET_PKEY sets a partition key, essentially telling hardware > which packets are interesting to userspace. I'm pretty certain this is something that should be set at open time. > UPDM_TID and FREE_TID are used for RDMA context management. > > WAIT waits for incoming packets, and can clearly be replaced by > file_ops->poll. > > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by > files in sysfs. This whole section just cries out for a network/rdma/ib/kernel-bypass layer that any interesting network driver can use. A device driver should not need to invent the interfaces for this kind of functionality. > For subnet management: > > GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID, > GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced > by files in sysfs. > > SET_LINKSTATE changes the link state. > > SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management > packets. I *think* they could be replaced by read and write > methods on a new special file, although the semantics aren't a > super-clean match. The Infiniband stack is there; use it. If the Infiniband stack is too ugly to use or it is missing features, then we need to fix it. So please complain about why you are having a hard time using the in-kernel Infiniband stack for this. > For EEPROM/flash management: > > READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't > see a standard way of doing this in the kernel; many drivers > provide their own private ioctls, some on dedicated special > files.
I think that using read and write instead would be okay > (with a small qualm about semantics), but this idea makes an > influential coworker barf violently. I can't see how we could > use the ethtool flash interface: the low-level driver doesn't > look like a regular net device, and we support partial updates > of the flash. There are a couple of choices here. Off the top of my head: having your driver support an i2c device, having your driver export an mtd device, and ethtool are the most standard. Partly it depends on what you are trying to do. Partial updates are not a problem. Just keep a cached copy and only write to those bytes that have changed. > For diagnostics: > > DIAGENTER and DIAGLEAVE put the driver into and out of diag > mode. These could be replaced by open/close of a special file. This one does sound global to a device and a trivial parameter. sysfs does sound like the proper interface here. That makes it script controllable etc. > DIAGREAD and DIAGWRITE perform direct accesses to the device's > PCI memory space. I think these could be replaced by read and > write, but they are again subject to the make-coworker-barf > problem. mmap(/dev/mem) There is also an interface in /proc or /sys (I forget which) that lets you select the individual BAR for a PCI device. You don't need to do anything in your driver to support this. > HTREAD and HTREAD perform direct accesses to the device's PCI > config space. Same disagreement problem as DIAGREAD and > DIAGWRITE. Again, this is generic functionality already provided by the kernel; no need to implement anything. lspci/setpci already handle this quite well. > SEND_DIAG_PKT can be replaced with whatever sends and receives > subnet management packets, as above. > > DIAG_RD_I2C is synonymous with READ_EEPROM, and will go away. > > Depending on how you look at it, we can slim our list of ioctls down to > somewhere between 6 and 10. This isn't zero, but it's not 36, either. > What do people think? It's getting there.
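On the partial flash update suggestion above ("keep a cached copy and only write to those bytes that have changed"), a minimal sketch of that idea follows. It assumes a byte-writable part; real flash with erase blocks would need more care. All names here (flash_update_changed, write_byte_fn, fake_flash) are hypothetical and exist only for illustration, and the in-memory fake device stands in for actual hardware access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: write one byte to the device at offset off. */
typedef int (*write_byte_fn)(size_t off, unsigned char val, void *ctx);

/*
 * Write only the bytes of new_img that differ from the cached copy,
 * updating the cache as each write succeeds. Returns the number of
 * bytes written, or -1 if a device write fails.
 */
long flash_update_changed(unsigned char *cache, const unsigned char *new_img,
			  size_t len, write_byte_fn write_byte, void *ctx)
{
	long written = 0;
	size_t i;

	assert(cache != NULL && new_img != NULL && write_byte != NULL);

	for (i = 0; i < len; i++) {
		if (cache[i] == new_img[i])
			continue;	/* unchanged: skip the device write */
		if (write_byte(i, new_img[i], ctx) != 0)
			return -1;
		cache[i] = new_img[i];
		written++;
	}
	return written;
}

/* In-memory stand-in for a device, used only to exercise the sketch. */
struct fake_flash {
	unsigned char bytes[8];
	int writes;
};

int fake_write_byte(size_t off, unsigned char val, void *ctx)
{
	struct fake_flash *dev = ctx;

	if (off >= sizeof(dev->bytes))
		return -1;
	dev->bytes[off] = val;
	dev->writes++;
	return 0;
}
```

Calling flash_update_changed() a second time with the same image performs no device writes at all, since the cache then matches the image byte for byte.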
:) Eric From davem at davemloft.net Thu Jan 19 00:39:30 2006 From: davem at davemloft.net (David S. Miller) Date: Thu, 19 Jan 2006 00:39:30 -0800 (PST) Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <20060119.003930.117351070.davem@davemloft.net> From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 19 Jan 2006 01:25:39 -0700 > mmap(/dev/mem) > There is also an interface in /proc or /sys I forget which > that lets you select the individual bar for a pci device. > You don't need to do anything, in your driver to support this. Yes, please use /proc/bus/pci/* device file mmap()s, or even better, the PCI ones under /sys, which work too. I think libpci even has some help for this. From ogerlitz at voltaire.com Thu Jan 19 01:02:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 19 Jan 2006 11:02:51 +0200 (IST) Subject: [openib-general] [PATCH] iser: merged almost all .h files into iscsi_iser.h Message-ID: committed to r5070 Or.
merged all the .h files except iser_socket.h into iscsi_iser.h changed iscsi_iser.h license to be the openib one (dual GPL/BSD) Signed-off-by: Or Gerlitz $ svndiff -r 5061:5070 ulp/iser | diffstat iscsi_iser.c | 1 iscsi_iser.h | 374 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- iser.h | 176 ------------------------- iser_conn.c | 10 - iser_conn.h | 64 --------- iser_dto.c | 9 - iser_dto.h | 64 --------- iser_initiator.c | 10 - iser_initiator.h | 46 ------ iser_memory.c | 8 - iser_memory.h | 83 ------------ iser_mod.c | 7 - iser_socket.c | 9 - iser_socket.h | 11 - iser_task.c | 9 - iser_task.h | 48 ------- iser_verbs.c | 10 - iser_verbs.h | 126 ------------------ 18 files changed, 385 insertions(+), 680 deletions(-) From ogerlitz at voltaire.com Thu Jan 19 01:02:37 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 19 Jan 2006 11:02:37 +0200 Subject: [openib-general] Re: [CMA][PATCH] port byte order fix In-Reply-To: <43CE7D4F.5030806@ichips.intel.com> References: <43CE7D4F.5030806@ichips.intel.com> Message-ID: <43CF55AD.4080404@voltaire.com> >> The CMA appears to assume that a struct sockaddr_in's sin_port value >> will be in host byte order. This is incorrect. Sean Hefty wrote: > Thanks - I'm fairly certain that the CMA will need some updates before > final acceptance into the kernel. I'll add this in beforehand. Sean, when you commit this change please change ulp/iser/iser_socket.c since this code indeed sets it for the CMA in host order (it is htons on something which is network order... confusing, anyway you will delete it). Or. 
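The byte-order confusion described above can be sketched in userspace as follows. This is only an illustration under the usual sockets convention that sin_port is stored in network byte order; the helper names set_dst_port and get_dst_port are hypothetical and are not part of the CMA or iSER code. The rule it demonstrates is to convert exactly once when storing, and convert back only when a host-order value is needed, for example for logging.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill a sockaddr_in; the port argument is in host byte order. */
void set_dst_port(struct sockaddr_in *dst, uint16_t host_port)
{
	assert(dst != NULL);

	memset(dst, 0, sizeof(*dst));
	dst->sin_family = AF_INET;
	dst->sin_port = htons(host_port);	/* convert exactly once */
}

/* Read the port back in host byte order, e.g. for a log message. */
uint16_t get_dst_port(const struct sockaddr_in *dst)
{
	assert(dst != NULL);

	return ntohs(dst->sin_port);
}
```

Applying htons() a second time to a field that is already in network order swaps the bytes back on a little-endian host, so the field ends up holding a host-order value again. That is the effect of the line the patch removes.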
Index: iser_socket.c =================================================================== --- iser_socket.c (revision 5070) +++ iser_socket.c (working copy) @@ -165,7 +165,6 @@ int iser_sock_connect(struct socket *soc dst_addr->sin_addr.s_addr, NIPQUAD(dst_addr->sin_addr), dst_addr->sin_port, dst_addr->sin_port); - dst_addr->sin_port = htons(dst_addr->sin_port); iser_err("ip = %d.%d.%d.%d, port = %d\n", NIPQUAD(dst_addr->sin_addr), dst_addr->sin_port); From yael at mellanox.co.il Thu Jan 19 04:08:46 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 19 Jan 2006 14:08:46 +0200 Subject: [openib-general] [PATCH] Opensm - duplicated guids handling Message-ID: <5z1wz42ze9.fsf@mtl066.yok.mtl.com> Hi Hal, We've noticed that currently, if we have 2 HCAs with duplicated GUIDs connected back-to-back, opensm gets stuck. The reason is that in the osm_vendor_set_sm() function, the second call trying to open /dev/infiniband/issm%d gets stuck, since this file is already open. The following patch fixes 2 things: 1. In osm_node_info_rcv.c - we've added handling that exits when duplicated GUIDs are found (unless a flag is set otherwise). This patch adds the same exit code to the case where the nodes are connected back-to-back. 2. In osm_vendor_ibumad.c - add a static variable to avoid trying to open the /dev/infiniband/issm%d file twice during the run of opensm.
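The second fix (a static variable that prevents a second open of the same device file) can be sketched in isolation like this. It is a simplified illustration, not the logic of the patch itself: the function name open_once and its return codes are hypothetical, and unlike the patch, this version records the flag only after a successful open. It is also not thread-safe as written.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path at most once per process. Returns the new file descriptor
 * on the first successful call, -1 if open() fails, and -2 on any call
 * made after a successful open.
 */
int open_once(const char *path)
{
	static int already_open;	/* zero-initialized: not opened yet */
	int fd;

	assert(path != NULL);

	if (already_open)
		return -2;	/* refuse instead of blocking on a second open */

	fd = open(path, O_RDONLY);
	if (fd >= 0)
		already_open = 1;
	return fd;
}
```

Returning a distinct code for the repeated call lets the caller log an error, as the patch does, rather than hanging in a second open().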
Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4951) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -1142,8 +1142,11 @@ osm_vendor_set_sm( osm_umad_bind_info_t *p_bind = (osm_umad_bind_info_t *)h_bind; osm_vendor_t *p_vend = p_bind->p_vend; char issmstring[24]; + static boolean_t osm_vendor_set_sm_indicator = FALSE; OSM_LOG_ENTER( p_vend->p_log, osm_vendor_set_sm ); + if (is_sm_val == FALSE || osm_vendor_set_sm_indicator == FALSE) + { sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id); if (TRUE == is_sm_val) { p_vend->issmfd = open(issmstring, 0); @@ -1162,6 +1165,15 @@ osm_vendor_set_sm( " mask failed: errno %d\n", errno); p_vend->issmfd = -1; } + if ( osm_vendor_set_sm_indicator == FALSE ) + osm_vendor_set_sm_indicator = TRUE; + } + else + { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_set_sm: ERR 5436: " + "Trying to set IS_SM capability mask again\n"); + } OSM_LOG_EXIT( p_vend->p_log ); } Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 4951) +++ opensm/osm_node_info_rcv.c (working copy) @@ -229,6 +229,14 @@ __osm_ni_rcv_set_links( osm_dump_dr_path(p_rcv->p_log, osm_physp_get_dr_path_ptr(p_physp), OSM_LOG_ERROR); + + osm_log( p_rcv->p_log, OSM_LOG_SYS, + "Errors on subnet. Duplicate GUID found " + "by link from a port to itself. 
" + "See osm log for more details\n"); + + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) + exit( 1 ); } else { From halr at voltaire.com Thu Jan 19 04:35:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jan 2006 07:35:04 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20060119052553.GT3257@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20060119052553.GT3257@us.ibm.com> Message-ID: <1137673940.4338.3575.camel@hal.voltaire.com> On Thu, 2006-01-19 at 00:25, Nishanth Aravamudan wrote: > Trying to run the compilation tests against 2.6.16-rc1-git1 and am > getting this: > > CC [M] drivers/infiniband/ulp/iser/iscsi_iser.o > drivers/infiniband/ulp/iser/iscsi_iser.c:1573: warning: initialization from incompatible pointer type > drivers/infiniband/ulp/iser/iscsi_iser.c:1574: warning: initialization from incompatible pointer type > drivers/infiniband/ulp/iser/iscsi_iser.c:1575: warning: initialization from incompatible pointer type > drivers/infiniband/ulp/iser/iscsi_iser.c:1577: warning: initialization from incompatible pointer type > drivers/infiniband/ulp/iser/iscsi_iser.c:1579: error: unknown field `get_param' specified in initializer > drivers/infiniband/ulp/iser/iscsi_iser.c:1579: warning: initialization from incompatible pointer type > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_init': > drivers/infiniband/ulp/iser/iscsi_iser.c:1886: warning: assignment makes integer from pointer without a cast > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 > > when using the patch you sent me. Should I not apply the patch to > 2.6.16-rc1 and on? Or are these new issues. This is with svn 5065. 
These are new issues as struct iscsi_transport changed in 2.6.16-rc1 as follows: /** * struct iscsi_transport - iSCSI Transport template * @@ -48,23 +54,31 @@ struct iscsi_transport { char *name; unsigned int caps; struct scsi_host_template *host_template; + /* LLD session/scsi_host data size */ int hostdata_size; + /* LLD iscsi_host data size */ + int ihostdata_size; + /* LLD connection data size */ + int conndata_size; int max_lun; unsigned int max_conn; unsigned int max_cmd_len; - iscsi_sessionh_t (*create_session) (uint32_t initial_cmdsn, - struct Scsi_Host *shost); - void (*destroy_session) (iscsi_sessionh_t session); - iscsi_connh_t (*create_conn) (iscsi_sessionh_t session, uint32_t cid); + struct Scsi_Host *(*create_session) (struct scsi_transport_template *t, + uint32_t initial_cmdsn); + void (*destroy_session) (struct Scsi_Host *shost); + struct iscsi_cls_conn *(*create_conn) (struct Scsi_Host *shost, + uint32_t cid); int (*bind_conn) (iscsi_sessionh_t session, iscsi_connh_t conn, uint32_t transport_fd, int is_leading); int (*start_conn) (iscsi_connh_t conn); void (*stop_conn) (iscsi_connh_t conn, int flag); - void (*destroy_conn) (iscsi_connh_t conn); + void (*destroy_conn) (struct iscsi_cls_conn *conn); int (*set_param) (iscsi_connh_t conn, enum iscsi_param param, uint32_t value); - int (*get_param) (iscsi_connh_t conn, enum iscsi_param param, - uint32_t *value); + int (*get_conn_param) (void *conndata, enum iscsi_param param, + uint32_t *value); + int (*get_session_param) (struct Scsi_Host *shost, + enum iscsi_param param, uint32_t *value); int (*send_pdu) (iscsi_connh_t conn, struct iscsi_hdr *hdr, char *data, uint32_t data_size); void (*get_stats) (iscsi_connh_t conn, struct iscsi_stats *stats); @@ -73,7 +87,7 @@ struct iscsi_transport { /* * transport registration upcalls */ -extern int iscsi_register_transport(struct iscsi_transport *tt); +extern struct scsi_transport_template *iscsi_register_transport(struct iscsi_transport *tt); extern int 
iscsi_unregister_transport(struct iscsi_transport *tt); and the complaints are on the create_session, destroy_session, create_conn, destroy_conn, and get_param members of the struct and on the call to iscsi_register_transport. More on this later... -- Hal From ogerlitz at voltaire.com Thu Jan 19 06:18:47 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 19 Jan 2006 16:18:47 +0200 (IST) Subject: [openib-general] patch to ulp/iser such that it would work with latest open-iscsi Message-ID: Nish, For now, please use this patch for your compilation against latest kernels containing latest (post official 2.6.15) open-iscsi Or.
Common subdirectories: iser-2.6.15-open-iscsi/.svn and iser-latest-open-iscsi/.svn diff -up iser-2.6.15-open-iscsi/iscsi_iser.c iser-latest-open-iscsi/iscsi_iser.c --- iser-2.6.15-open-iscsi/iscsi_iser.c 2006-01-19 16:55:05.000000000 +0200 +++ iser-latest-open-iscsi/iscsi_iser.c 2006-01-19 16:50:12.000000000 +0200 @@ -980,17 +980,15 @@ dout_alloc_fail: return -ENOMEM; } -static iscsi_sessionh_t iscsi_iser_session_create(uint32_t initial_cmdsn, - struct Scsi_Host *host) +static int +iscsi_iser_session_create(struct Scsi_Host *shost, uint32_t initial_cmdsn) { - struct iscsi_iser_session *session = NULL; + struct iscsi_iser_session *session = iscsi_hostdata(shost->hostdata); int cmd_i; - session = iscsi_hostdata(host->hostdata); memset(session, 0, sizeof(struct iscsi_iser_session)); - session->host = host; - session->id = host->host_no; + session->host = shost; session->state = ISCSI_STATE_LOGGED_IN; session->mgmtpool_max = ISCSI_ISER_MGMT_CMDS_MAX; session->cmds_max = ISCSI_ISER_XMIT_CMDS_MAX; @@ -1038,7 +1036,7 @@ static iscsi_sessionh_t iscsi_iser_sessi if (iscsi_iser_dout_pool_alloc(session)) goto dout_alloc_fail; - return iscsi_handle(session); + return 0; dout_alloc_fail: for (cmd_i = 0; cmd_i < session->mgmtpool_max; cmd_i++) @@ -1048,14 +1046,14 @@ immdata_alloc_fail: mgmtpool_alloc_fail: iscsi_iser_pool_free(&session->cmdpool, (void**)session->cmds); cmdpool_alloc_fail: - return iscsi_handle(NULL); + return -ENOMEM; } -static void iscsi_iser_session_destroy(iscsi_sessionh_t sessionh) +static void iscsi_iser_session_destroy(struct Scsi_Host *shost) { + struct iscsi_iser_session *session = iscsi_hostdata(shost->hostdata); int cmd_i; struct iscsi_iser_data_task *dtask, *n; - struct iscsi_iser_session *session = iscsi_ptr(sessionh); debug_iser("%s: enter\n", __FUNCTION__); @@ -1093,16 +1091,11 @@ static void iscsi_iser_xmitworker(void * up(&conn->xmitsema); } -static iscsi_connh_t iscsi_iser_conn_create(iscsi_sessionh_t sessionh, - uint32_t conn_idx) +static int 
+iscsi_iser_conn_create(struct Scsi_Host *shost, void *conndata, uint32_t conn_idx) { - struct iscsi_iser_session *session = iscsi_ptr(sessionh); - struct iscsi_iser_conn *conn = NULL; - - conn = kzalloc(sizeof *conn, GFP_KERNEL); - if (conn == NULL) { - goto conn_alloc_fail; - } + struct iscsi_iser_session *session = iscsi_hostdata(shost->hostdata); + struct iscsi_iser_conn *conn = conndata; /* Init the connection */ conn->c_stage = ISCSI_CONN_INITIAL_STAGE; @@ -1151,7 +1144,7 @@ static iscsi_connh_t iscsi_iser_conn_cre atomic_set(&conn->post_send_buf_count, 0); init_waitqueue_head(&conn->disconnect_wait_q); - return iscsi_handle(conn); + return 0; login_mtask_alloc_fail: kfifo_free(conn->mgmtqueue); @@ -1160,9 +1153,7 @@ mgmtqueue_alloc_fail: immqueue_alloc_fail: kfifo_free(conn->xmitqueue); xmitqueue_alloc_fail: - kfree(conn); -conn_alloc_fail: - return iscsi_handle(NULL); + return -ENOMEM; } static int iscsi_iser_conn_bind(iscsi_sessionh_t sessionh, iscsi_connh_t connh, @@ -1223,9 +1214,9 @@ static int iscsi_iser_conn_bind(iscsi_se return 0; } -static void iscsi_iser_conn_destroy(iscsi_connh_t connh) +static void iscsi_iser_conn_destroy(void *data) { - struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_conn *conn = data; struct iscsi_iser_session *session = conn->session; unsigned long flags; @@ -1292,8 +1283,6 @@ static void iscsi_iser_conn_destroy(iscs kfifo_free(conn->immqueue); kfifo_free(conn->mgmtqueue); - kfree(conn); - debug_iser("%s: exit\n", __FUNCTION__); } @@ -1379,23 +1368,13 @@ static int iscsi_iser_conn_set_param(isc return 0; } -static int iscsi_iser_conn_get_param(iscsi_connh_t connh, +static int iscsi_iser_session_get_param(struct Scsi_Host *shost, enum iscsi_param param, uint32_t *value) { - struct iscsi_iser_conn *conn = iscsi_ptr(connh); - struct iscsi_iser_session *session = conn->session; + struct iscsi_iser_session *session = iscsi_hostdata(shost->hostdata); switch (param) { - case ISCSI_PARAM_MAX_XMIT_DLENGTH: - *value 
= conn->max_xmit_dlength; - break; - case ISCSI_PARAM_HDRDGST_EN: - *value = 0; - break; - case ISCSI_PARAM_DATADGST_EN: - *value = 0; - break; case ISCSI_PARAM_INITIAL_R2T_EN: *value = session->initial_r2t_en; break; @@ -1429,12 +1408,6 @@ static int iscsi_iser_conn_get_param(isc case ISCSI_PARAM_RDMAEXTENSIONS: *value = 1; break; - /*case ISCSI_PARAM_TARGET_RECV_DLENGTH: - *value = conn->target_recv_dlength; - break; - case ISCSI_PARAM_INITIATOR_RECV_DLENGTH: - *value = conn->initiator_recv_dlength; - break;*/ default: return ISCSI_ERR_PARAM_NOT_FOUND; } @@ -1442,6 +1415,28 @@ static int iscsi_iser_conn_get_param(isc return 0; } +static int +iscsi_iser_conn_get_param(void *data, enum iscsi_param param, uint32_t *value) +{ + struct iscsi_iser_conn *conn = data; + + switch (param) { + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + *value = conn->max_xmit_dlength; + break; + case ISCSI_PARAM_HDRDGST_EN: + *value = 0; + break; + case ISCSI_PARAM_DATADGST_EN: + *value = 0; + break; + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + static int iscsi_iser_conn_start(iscsi_connh_t connh) { struct iscsi_iser_conn *conn = iscsi_ptr(connh); @@ -1568,6 +1563,7 @@ static struct iscsi_transport iscsi_iser .rdma = 1, .host_template = &iscsi_iser_sht, .hostdata_size = sizeof(struct iscsi_iser_session), + .conndata_size = sizeof(struct iscsi_iser_conn), .max_lun = ISCSI_ISER_MAX_LUN, .max_cmd_len = ISCSI_ISER_MAX_CMD_LEN, .create_session = iscsi_iser_session_create, @@ -1576,7 +1572,8 @@ static struct iscsi_transport iscsi_iser .bind_conn = iscsi_iser_conn_bind, .destroy_conn = iscsi_iser_conn_destroy, .set_param = iscsi_iser_conn_set_param, - .get_param = iscsi_iser_conn_get_param, + .get_conn_param = iscsi_iser_conn_get_param, + .get_session_param = iscsi_iser_session_get_param, .start_conn = iscsi_iser_conn_start, .stop_conn = iscsi_iser_conn_stop, .send_pdu = iscsi_iser_conn_send_pdu, @@ -1872,8 +1869,6 @@ iscsi_iser_hdr_recv(struct iscsi_iser_co int 
iscsi_iser_init(void) { - int error; - if (iscsi_max_lun < 1) { printk(KERN_ERR "Invalid max_lun value of %u\n", iscsi_max_lun); return -EINVAL; @@ -1883,11 +1878,9 @@ int iscsi_iser_init(void) if (iscsi_iser_slabs_create()) return -ENOMEM; - error = iscsi_register_transport(&iscsi_iser_transport); - if (error) { + if (!iscsi_register_transport(&iscsi_iser_transport)) { printk(KERN_ERR "iscsi_register_transport failed\n"); - iscsi_iser_slabs_destroy(); - return error; + iscsi_iser_slabs_destroy(); } return 0; } From halr at voltaire.com Thu Jan 19 06:07:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jan 2006 09:07:13 -0500 Subject: [openib-general] Re: [PATCH] osm: lib vendor race cause OpenSM crashes In-Reply-To: <86vewh8r58.fsf@mtl066.yok.mtl.com> References: <86vewh8r58.fsf@mtl066.yok.mtl.com> Message-ID: <1137679632.4338.4191.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-01-18 at 10:58, Eitan Zahavi wrote: > Hi Hal > > We have found a race in OpenSM that can cause an active madw be > returned during the transaction. This is a fatal high priority > bug as it very likely cause a crash. Good catch. Thanks. Applied. Patch was rejected when applied so I did it by hand. Please reverify change. 
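[Editorial sketch, not part of the original thread.] The race fixed by the quoted patch below comes from reading a shared table entry after the lock that protects it has been released. A minimal user-space illustration of the corrected pattern follows, assuming a pthread mutex in place of cl_spinlock; the names match_table, table_put and table_take are hypothetical and are not taken from OpenSM.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TBL_MAX 8

/* Hypothetical stand-in for a TID match table entry. */
struct match_entry {
    uint64_t tid;   /* 0 marks a free slot */
    void *v;        /* payload tied to the outstanding transaction */
};

struct match_table {
    pthread_mutex_t lock;
    struct match_entry tbl[TBL_MAX];
};

static void table_init(struct match_table *t)
{
    assert(t != NULL);
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < TBL_MAX; i++) {
        t->tbl[i].tid = 0;
        t->tbl[i].v = NULL;
    }
}

/* Store a payload under a nonzero tid; returns 0 on success, -1 if full. */
static int table_put(struct match_table *t, uint64_t tid, void *v)
{
    int rc = -1;

    assert(tid != 0);
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < TBL_MAX; i++) {
        if (t->tbl[i].tid == 0) {
            t->tbl[i].tid = tid;
            t->tbl[i].v = v;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Remove the entry for tid and return its payload, or NULL if absent. */
static void *table_take(struct match_table *t, uint64_t tid)
{
    void *res = NULL;

    assert(tid != 0);
    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < TBL_MAX; i++) {
        if (t->tbl[i].tid == tid) {
            t->tbl[i].tid = 0;      /* slot may be reused once we unlock */
            res = t->tbl[i].v;      /* so copy the payload while still locked */
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return res;                     /* never touch t->tbl[i] after unlocking */
}
```

The point of the sketch is only the ordering: once the slot is marked free and the lock is dropped, another thread may overwrite it, so the value to return has to be captured in a local first.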
-- Hal > Eitan > > Signed-off-by: Eitan Zahavi > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 5009) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -138,14 +138,16 @@ get_madw(osm_vendor_t *p_vend, ib_net64_ > { > umad_match_t *m, *e; > ib_net64_t mtid = (*tid & 0xffffffff00000000llu); > + osm_madw_t *res; > > cl_spinlock_acquire( &p_vend->match_tbl_lock ); > for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) { > if (m->tid == mtid) { > m->tid = 0; > *tid = mtid; > + res = m->v; > cl_spinlock_release( &p_vend->match_tbl_lock ); > - return m->v; > + return res; > } > } > > > From halr at voltaire.com Thu Jan 19 06:51:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jan 2006 09:51:13 -0500 Subject: [openib-general] Re: [PATCH] Opensm - duplicated guids handling In-Reply-To: <5z1wz42ze9.fsf@mtl066.yok.mtl.com> References: <5z1wz42ze9.fsf@mtl066.yok.mtl.com> Message-ID: <1137682270.4338.4454.camel@hal.voltaire.com> Hi Yael, On Thu, 2006-01-19 at 07:08, Yael Kalka wrote: > Hi Hal, > > We've noticed that currently if we have 2 hcas with duplicated guids I renew my comment about duplicated GUIDs. This is a pretty fundamental thing that MUST not be violated per the IBA spec. I understand there are processes in place that make the duplication more error prone than it should be. If we go down this path, there are other things that fall into this category and I believe this to be a slippery slope. I am still willing to go ahead with this patch or some variant of it. Some questions embedded in the patch. > connected back-2-back, opensm gets stuck. Not sure I quite understand the configuration. Are the two HCAs with the duplicated guids in the same machine and connected back to back ? Is that the case you are referring to ? 
> The reason for that is that > in osm_vendor_set_sm() function - the second call trying to open the > /dev/infiniband/issm%id is stuck, since this file is already open. > The following patch fixes 2 things - > 1. In osm_node_info_rcv.c - we've added a case that on cases of > duplicated guids - exit (unless a flag is set otherwise). Add this > exiting code also to the case where the nodes are connected back-2-back. > 2. In osm_vendor_ibumad.c - add a static variable to avoid trying to > open /dev/inifiniband/issm%d file twice during the run of opensm. The problem is that the second open hangs, right ? So rather than the changes to osm_vendor_ibumad.c below change the flags on the open from 0 to O_NONBLOCK ? Does that work for you ? If so, I will commit that approach with the change below to osm_node_info_rcv.c. Please let me know. Thanks. -- Hal > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4951) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -1142,8 +1142,11 @@ osm_vendor_set_sm( > osm_umad_bind_info_t *p_bind = (osm_umad_bind_info_t *)h_bind; > osm_vendor_t *p_vend = p_bind->p_vend; > char issmstring[24]; > + static boolean_t osm_vendor_set_sm_indicator = FALSE; > > OSM_LOG_ENTER( p_vend->p_log, osm_vendor_set_sm ); > + if (is_sm_val == FALSE || osm_vendor_set_sm_indicator == FALSE) I may have a comment on this based on the answer to the below. 
> + { > sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id); > if (TRUE == is_sm_val) { > p_vend->issmfd = open(issmstring, 0); > @@ -1162,6 +1165,15 @@ osm_vendor_set_sm( > " mask failed: errno %d\n", errno); > p_vend->issmfd = -1; > } > + if ( osm_vendor_set_sm_indicator == FALSE ) > + osm_vendor_set_sm_indicator = TRUE; > + } > + else > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_set_sm: ERR 5436: " > + "Trying to set IS_SM capability mask again\n"); > + } > OSM_LOG_EXIT( p_vend->p_log ); > } Does osm_vendor_set_sm_indicator ever needs to be reset to FALSE ? > Index: opensm/osm_node_info_rcv.c > =================================================================== > --- opensm/osm_node_info_rcv.c (revision 4951) > +++ opensm/osm_node_info_rcv.c (working copy) > @@ -229,6 +229,14 @@ __osm_ni_rcv_set_links( > osm_dump_dr_path(p_rcv->p_log, > osm_physp_get_dr_path_ptr(p_physp), > OSM_LOG_ERROR); > + > + osm_log( p_rcv->p_log, OSM_LOG_SYS, > + "Errors on subnet. Duplicate GUID found " > + "by link from a port to itself. " > + "See osm log for more details\n"); > + > + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) > + exit( 1 ); > } > else > { > From mst at mellanox.co.il Thu Jan 19 07:51:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jan 2006 17:51:02 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MADtransactions with same TID. In-Reply-To: <20060118130536.GA24415@mellanox.co.il> References: <20060118130536.GA24415@mellanox.co.il> Message-ID: <20060119155102.GB22260@mellanox.co.il> Quoting r. Jack Morgenstein : > Subject: [PATCH] ib_mad: prevent duplicate outstanding MADtransactions with same TID. > > Prevent multiple outstanding MAD transactions with the same TID. > Could happen if duplicate requests are posted. > > Signed-off-by: Jack Morgenstein This patch was detecting an ack or nack as a duplicate of an rmpp request. We only need to do checks for rmpp data packets. 
Here's a fixed patch: --- Prevent issuing multiple MAD transactions with the same TID. Could happen if duplicate requests are posted. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/core/mad.c =================================================================== --- latest.orig/drivers/infiniband/core/mad.c 2006-01-16 18:19:55.000000000 +0200 +++ latest/drivers/infiniband/core/mad.c 2006-01-19 11:41:42.000000000 +0200 @@ -907,6 +907,20 @@ return ret; } +static inline int is_rmpp_data(struct ib_mad *mad) +{ + /* check if has rmpp header */ + if (mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_ADM && + (mad->mad_hdr.mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START || + mad->mad_hdr.mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + + return ((ib_get_rmpp_flags(&((struct ib_rmpp_mad *)mad)->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE) && + ((struct ib_rmpp_mad *)mad)->rmpp_hdr.rmpp_type == + IB_MGMT_RMPP_TYPE_DATA); +} + /* * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client @@ -964,6 +979,13 @@ /* Reference MAD agent until send completes */ atomic_inc(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (is_rmpp_data(send_buf->mad) && + ib_find_send_mad(mad_agent_priv, mad_send_wr->tid)) { + /* Duplicate send request */ + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + return -EBUSY; + } list_add_tail(&mad_send_wr->agent_list, &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); -- MST From bos at pathscale.com Thu Jan 19 08:29:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 08:29:18 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <1137688158.3693.29.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 01:25 -0700, 
Eric W. Biederman wrote: > Do the infiniband verbs not allow dealing with a unreliable datagram > protocol? Eric, I think you are misunderstanding what we are actually trying to do. We already implement IB verbs and the various IB networking protocols in our drivers, at a layer that is not at all related to the one that is currently festooned with ioctls. The ioctl discussion pertains to lower-level direct user access to the hardware, for a protocol that bypasses the entire IB stack and just happens to send UD-compliant datagrams over the wire. I'm actually pretty satisfied with the feedback I've already gotten from Greg K-H and davem. > We need some generic subsystem support to do this. I am more than happy to put together generic support, provided I see other drivers that could take advantage of it being considered for submission. Right now, I do not - in general - see this happening. I know that some other drivers need to do user page pinning, and I'm happy to try to find a generic solution that is common to IB and drivers unrelated to IB. > > RCVCTRL enables/disables receipt of packets. > > Again this is a generic problem, and the generic interfaces are broken > if you can't do this. The SIOCSIFFLAGS ioctl, which I assume is the generic interface you refer to (it's the one used by iproute, at any rate), has poor overlap with what we need (it supports a pile of stuff that we don't care about, and we require a pile of stuff it doesn't support), and I don't feel inclined to try using it in any case. > > SET_PKEY sets a partition key, essentially telling hardware > > which packets are interesting to userspace. > > I'm pretty certain this should be something that should be set > at open time. It might be possible to make it fit into whatever replaces USERINIT, or else we can use a netlink message of its own. > > UPDM_TID and FREE_TID are used for RDMA context management. > > > > WAIT waits for incoming packets, and can clearly be replaced by > > file_ops->poll. 
> > > > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by > > files in sysfs. > > This whole section just cries out for a network/rdma/ib/kernel-by-pass > layer that is that any interesting network driver can use. No, it doesn't. Our chip's approach to remote memory access doesn't even slightly resemble that of other comparable chips. In addition, our counters are entirely device-specific, and I'm already planning to move them to sysfs. The sysfs move gets them out of ioctl-land, and there's no point in trying to do anything beyond that. > Infiniband stack, it's there use it. No. If you're running a full IB stack, we provide the usual IB subnet management facilities, and you can run OpenSM to manage your subnet. If you're *not*, which is the case I'm concerned with here, it makes no sense to replicate the byzantine IB management interfaces in order to do a handful of simple things that aren't even tied to the higher-level IB protocols. > There are a couple of choices here. Yes, we'll use the firmware interface, as Greg suggested. > There is also an interface in /proc or /sys I forget which > that let's you select the individual bar for a pci device. Yes, we'll use that. Thanks for your comments. References: <43CE7D4F.5030806@ichips.intel.com> <43CF55AD.4080404@voltaire.com> Message-ID: <43CFD3BE.30904@ichips.intel.com> Or Gerlitz wrote: > Sean, when you commit this change please change ulp/iser/iser_socket.c > since this code indeed sets it for the CMA in host order (it is htons > on something which is network order... confusing, anyway you will > delete it). Thanks - committed. 
- Sean From nacc at us.ibm.com Thu Jan 19 10:09:39 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 19 Jan 2006 10:09:39 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1137673940.4338.3575.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20060119052553.GT3257@us.ibm.com> <1137673940.4338.3575.camel@hal.voltaire.com> Message-ID: <20060119180939.GU3257@us.ibm.com> On 19.01.2006 [07:35:04 -0500], Hal Rosenstock wrote: > On Thu, 2006-01-19 at 00:25, Nishanth Aravamudan wrote: > > Trying to run the compilation tests against 2.6.16-rc1-git1 and am > > getting this: > > > > CC [M] drivers/infiniband/ulp/iser/iscsi_iser.o > > drivers/infiniband/ulp/iser/iscsi_iser.c:1573: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/iser/iscsi_iser.c:1574: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/iser/iscsi_iser.c:1575: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/iser/iscsi_iser.c:1577: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/iser/iscsi_iser.c:1579: error: unknown field `get_param' specified in initializer > > drivers/infiniband/ulp/iser/iscsi_iser.c:1579: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_init': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1886: warning: assignment makes integer from pointer without a cast > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > make[1]: *** [drivers/infiniband] Error 2 > > make: *** [drivers] Error 2 > > > > when using the patch you sent me. Should I not apply the patch to > > 2.6.16-rc1 and on? Or are these new issues. 
This is with svn 5065. > > These are new issues as struct iscsi_transport changed in 2.6.16-rc1 as > follows: /me sighs... So, in the short run, am I best off disabling ISER (until someone has a patch that should fix the build ready?). With ISER enabled, as long as it fails to build, no kernels will be tested... Thanks, Nish From nacc at us.ibm.com Thu Jan 19 10:11:47 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 19 Jan 2006 10:11:47 -0800 Subject: [openib-general] Re: patch to ulp/iser such that it would work with latest open-iscsi In-Reply-To: References: Message-ID: <20060119181147.GW3257@us.ibm.com> On 19.01.2006 [16:18:47 +0200], Or Gerlitz wrote: > Nish, > > For now, please use this patch for your compilation against latest kernels > containing latest (post official 2.6.15) open-iscsi That's what I get for responding to e-mails before scanning my whole inbox :) I will try this patch out and report the results. Thanks, Nish From mshefty at ichips.intel.com Thu Jan 19 10:20:13 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 10:20:13 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1136578777.14108.6.camel@trinity.austin.ammasso.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> Message-ID: <43CFD85D.2010905@ichips.intel.com> Tom Tucker wrote: > Enclosed is a combined include file and core patch for iWARP support in CMA. This > patch includes changes per your last review. It was brought up that there's still not an iWarp driver in the OpenIB trunk. Is one ready to be merged? Thinking about this more, I'm not sure that we want to merge these changes into the trunk without one. (I can see arguments both ways.) Comments? Looking at the patch, it seems that the biggest issue that you would have with merging changes is how the cm_id is stored. I can at least fold that change into the CMA to help with maintenance if an iWarp driver isn't ready yet. 
- Sean From mst at mellanox.co.il Thu Jan 19 10:21:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jan 2006 20:21:10 +0200 Subject: [openib-general] [PATCH] use addr on vanilla 2.6.15 Message-ID: <20060119182109.GA30442@mellanox.co.il> I find the following useful: it lets me load ib_at on 2.6.14 without kernel patches. --- ip_dev_find isnt exported to modules starting with 2.6.14 Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/core/at.c =================================================================== --- linux-kernel/drivers/infiniband/core/at.c (revision 5066) +++ linux-kernel/drivers/infiniband/core/at.c (working copy) @@ -60,6 +60,30 @@ MODULE_AUTHOR("Shahar Frank"); MODULE_DESCRIPTION("InfiniBand address translation"); MODULE_LICENSE("Dual BSD/GPL"); +static int xxx_ip_dev_find(u32 addr) +{ + struct net_device *dev; + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) { + for (ifap = &in_dev->ifa_list; (ifa = *ifap); + ifap = &ifa->ifa_next) { + if (addr == ifa->ifa_address) { + dev_hold(dev); + goto found; + } + } + } +found: + read_unlock(&dev_base_lock); + return dev; +} + +#define ip_dev_find xxx_ip_dev_find + static struct ib_at_dev ib_at_devs[IB_AT_MAX_DEV]; static void free_ats_req(void *async); -- MST From ebiederm at xmission.com Thu Jan 19 10:20:48 2006 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 19 Jan 2006 11:20:48 -0700 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137688158.3693.29.camel@serpentine.pathscale.com> (Bryan O'Sullivan's message of "Thu, 19 Jan 2006 08:29:18 -0800") References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> Message-ID: "Bryan O'Sullivan" writes: > On Thu, 2006-01-19 at 01:25 -0700, Eric W. 
Biederman wrote:
>
>> Do the infiniband verbs not allow dealing with a unreliable datagram
>> protocol?
>
> Eric, I think you are misunderstanding what we are actually trying to
> do. We already implement IB verbs and the various IB networking
> protocols in our drivers, at a layer that is not at all related to the
> one that is currently festooned with ioctls.
>
> The ioctl discussion pertains to lower-level direct user access to the
> hardware, for a protocol that bypasses the entire IB stack and just
> happens to send UD-compliant datagrams over the wire.

I'm surprised. I didn't think your native datagrams were compliant above the link level with any of the IB protocols in the kernel.

In any case that is not what I am saying. I am saying that if the IB/rdma/networking layer does not do a good job of supporting you, the failure is there. Your driver looks ugly because there is not a sufficiently good helper layer. For high-performance, non-IP-targeted networking cards you aren't doing anything terribly exotic. Could you please detail why the IB/rdma (or whatever) helper layer is insufficient to do what you need? If it is byzantine and heavyweight, that concern needs to be addressed. I agree the normal software stack is pretty tall.

> I'm actually pretty satisfied with the feedback I've already gotten from
> Greg K-H and davem.
>
>> We need some generic subsystem support to do this.
>
> I am more than happy to put together generic support, provided I see
> other drivers that could take advantage of it being considered for
> submission. Right now, I do not - in general - see this happening.

Right now it largely seems to be a chicken-and-egg problem. There is a large portion of the HPC community that doesn't believe they are interesting to the rest of the world, or that the rest of the world is interesting to them, so they do their own thing, leading to support problems.
There are other drivers for Linux right now, that the vendors are not too concerned about closed source that potentially code. I can think of at least 3 other networking fabrics out there. Heck, the kernel already has a Myrinet driver in it. Currently it only supports

I also know there is another InfiniBand adapter that only provides raw packet access like yours does.

I'm sick and tired of drivers having to invent all of the user-space glue elements for HPC.

> I know that some other drivers need to do user page pinning, and I'm
> happy to try to find a generic solution that is common to IB and drivers
> unrelated to IB.

Which is the RDMA thing. And looking at the code and I don't see how

>> > RCVCTRL enables/disables receipt of packets.
>>
>> Again this is a generic problem, and the generic interfaces are broken
>> if you can't do this.
>
> The SIOCSIFFLAGS ioctl, which I assume is the generic interface you
> refer to (it's the one used by iproute, at any rate), has poor overlap
> with what we need (it supports a pile of stuff that we don't care about,
> and we require a pile of stuff it doesn't support), and I don't feel
> inclined to try using it in any case.

But SIOCSIFFLAGS is not implemented by a driver. It is implemented by the networking subsystem. It requires a network device to make sense in any case.

>> > SET_PKEY sets a partition key, essentially telling hardware
>> > which packets are interesting to userspace.
>>
>> I'm pretty certain this should be something that should be set
>> at open time.
>
> It might be possible to make it fit into whatever replaces USERINIT, or
> else we can use a netlink message of its own.
>
>> > UPDM_TID and FREE_TID are used for RDMA context management.
>> >
>> > WAIT waits for incoming packets, and can clearly be replaced by
>> > file_ops->poll.
>> >
>> > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
>> > files in sysfs.
>>
>> This whole section just cries out for a network/rdma/ib/kernel-by-pass
>> layer that is that any interesting network driver can use.
>
> No, it doesn't. Our chip's approach to remote memory access doesn't
> even slightly resemble that of other comparable chips. In addition, our
> counters are entirely device-specific, and I'm already planning to move
> them to sysfs. The sysfs move gets them out of ioctl-land, and there's
> no point in trying to do anything beyond that.

Agreed, counters and sysfs are a good match. But the generic networking layer already has support for counters that are different for every device. That helper really needs to export those counters to sysfs as well as to ethtool, but the support already exists for more typical networking.

The problem actually gets pretty simple when you need to design an interface to support generic kernel-bypass over arbitrary protocols. There are so few things in common that those that are stick out.

>> Infiniband stack, it's there use it.
>
> No. If you're running a full IB stack, we provide the usual IB subnet
> management facilities, and you can run OpenSM to manage your subnet. If
> you're *not*, which is the case I'm concerned with here, it makes no
> sense to replicate the byzantine IB management interfaces in order to do
> a handful of simple things that aren't even tied to the higher-level IB
> protocols.

Is it the stack that is byzantine? Or the interface to it? What I am thinking, ultimately, is that there should be something about as simple as af_packet in the kernel (but at the IB/rdma layer) that gives you the help you need.

>> There are a couple of choices here.
>
> Yes, we'll use the firmware interface, as Greg suggested.

I will have to look. That one doesn't sound familiar... Do we really have 4 wheels in the kernel?

>> There is also an interface in /proc or /sys I forget which
>> that let's you select the individual bar for a pci device.
>
> Yes, we'll use that.
> > Thanks for your comments. Welcome, and thanks for your patience with this process. Eric From mst at mellanox.co.il Thu Jan 19 10:22:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jan 2006 20:22:28 +0200 Subject: [openib-general] [PATCH] ib_addr on vanilla 2.6.15 Message-ID: <20060119182228.GA30448@mellanox.co.il> I find the following useful: it let me load ib_addr on vanilla 2.6.15 --- ip_dev_find isnt exported to modules starting with 2.6.14 Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/core/addr.c =================================================================== --- linux-kernel/drivers/infiniband/core/addr.c (revision 5066) +++ linux-kernel/drivers/infiniband/core/addr.c (working copy) @@ -40,6 +40,30 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("IB Address Translation"); MODULE_LICENSE("Dual BSD/GPL"); +static int xxx_ip_dev_find(u32 addr) +{ + struct net_device *dev; + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) { + for (ifap = &in_dev->ifa_list; (ifa = *ifap); + ifap = &ifa->ifa_next) { + if (addr == ifa->ifa_address) { + dev_hold(dev); + goto found; + } + } + } +found: + read_unlock(&dev_base_lock); + return dev; +} + +#define ip_dev_find xxx_ip_dev_find + struct addr_req { struct list_head list; struct sockaddr src_addr; -- MST From mst at mellanox.co.il Thu Jan 19 10:23:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jan 2006 20:23:05 +0200 Subject: [openib-general] [PATCH] sdp on vanilla 2.6.15 Message-ID: <20060119182305.GA30454@mellanox.co.il> I find the following useful: it lets me load sdp on vanilla 2.6.15 --- ip_dev_find isnt exported to modules starting with 2.6.14 Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c (revision 4817) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c (working copy) @@ -36,6 +36,30 @@ #include "ipoib.h" #include "sdp_main.h" +static int xxx_ip_dev_find(u32 addr) +{ + struct net_device *dev; + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) { + for (ifap = &in_dev->ifa_list; (ifa = *ifap); + ifap = &ifa->ifa_next) { + if (addr == ifa->ifa_address) { + dev_hold(dev); + goto found; + } + } + } +found: + read_unlock(&dev_base_lock); + return dev; +} + +#define ip_dev_find xxx_ip_dev_find + #define SDP_LINK_F_VALID 0x01 /* valid path info record. */ #define SDP_LINK_F_ARP 0x02 /* arp request in progress. */ #define SDP_LINK_F_PATH 0x04 /* arp request in progress. */ -- MST From bos at pathscale.com Thu Jan 19 10:50:11 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 10:50:11 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> Message-ID: <1137696611.3693.63.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 11:20 -0700, Eric W. Biederman wrote: > For high performance > non-IP targeted networking cards you aren't doing anything terribly > exotic. True. > Could you please detail why you can't use the IB/rdma > whatever helper layer, is insufficient to do what you need. There really isn't an RDMA helper layer. The fact that the IB headers live in include/rdma is, as best as I can tell, an artefact of Roland being accommodating to someone's suggestion when he was going through the same process with the IB tree as we are now with our driver. 
> Right now it largely seems to be a chicken and the egg problem. > There is a large portion of the HPC community that doesn't believe > they are interesting to the rest of the world or that the rest of > the world is interesting to them so they do they own thing leading > to support problems. I can't solve that problem. If other vendors don't want to pony up their driver source and take the same kinds of slings and arrows I'm doing, I'm not going to do the work to provide them with a generic set of abstractions to use in their out-of-tree or proprietary drivers. > Which is the RDMA thing. And looking at the code and I don't see how Your sentence ends in the middle. > >> Again this is a generic problem, and the generic interfaces are broken > >> if you can't do this. > But SIOCSIFFLAGS is not implemented by a driver. I can't square these two statements. Can you indicate what you might have been talking about, if not SIOCSIFFLAGS? > That helper really needs to export those counters > to sysfs as well as ethtool but the support already exists for more > typical networking. I know about the ethtool interfaces, but we implement only a tiny fraction of the stuff that is relevant to ethtool at this level of abstraction. > Is it the stack that is byzantine? Or the interface too it. Both. References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> Message-ID: <43CFDF5F.5060409@ichips.intel.com> Eric W. Biederman wrote: >>No. If you're running a full IB stack, we provide the usual IB subnet >>management facilities, and you can run OpenSM to manage your subnet. If >>you're *not*, which is the case I'm concerned with here, it makes no >>sense to replicate the byzantine IB management interfaces in order to do >>a handful of simple things that aren't even tied to the higher-level IB >>protocols. > > Is it the stack that is byzantine? Or the interface too it. 
> What I thinking untimately is there should be something about as > simple as af_packet in the kernel (but at the IB/rdma) layer that > gives you the help you need. I'm not familiar with the driver, but would the lower level verbs interfaces work for this? Could you just post whatever datagrams that you want directly to your management QPs? - Sean From bos at pathscale.com Thu Jan 19 10:55:01 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 10:55:01 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <43CFDF5F.5060409@ichips.intel.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <43CFDF5F.5060409@ichips.intel.com> Message-ID: <1137696901.3693.66.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 10:50 -0800, Sean Hefty wrote: > I'm not familiar with the driver, but would the lower level verbs interfaces > work for this? Could you just post whatever datagrams that you want directly to > your management QPs? Our lowest-level driver works in the absence of any IB support being compiled into the kernel, so in that situation, there are no QPs or any other management infrastructure present at all. All of that stuff lives in a higher layer, in which situation the cut-down subnet management agent doesn't get used, and something like OpenSM is more appropriate. References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43CFD85D.2010905@ichips.intel.com> Message-ID: <1137697323.2744.9.camel@stevo-desktop> On Thu, 2006-01-19 at 10:20 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > Enclosed is a combined include file and core patch for iWARP support in CMA. This > > patch includes changes per your last review. > > It was brought up that there's still not an iWarp driver in the OpenIB trunk. > Is one ready to be merged? Thinking about this more, I'm not sure that we want > to merge these changes into the trunk without one. 
(I can see arguments both > ways.) Comments? The Ammasso rnic code could be merged in at this point. But the company Ammasso is no more, so I question whether we want it in the main trunk? > Looking at the patch, it seems that the biggest issue that you would have with > merging changes is how the cm_id is stored. I can at least fold that change > into the CMA to help with maintenance if an iWarp driver isn't ready yet. > Getting the core changes in now will help avoid having to keep merging trunk code back into the iwarp branch. It will also expose the iwarp changes to a larger audience for review and improvement. My 2 cents. Steve. From tom at opengridcomputing.com Thu Jan 19 11:02:54 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 19 Jan 2006 13:02:54 -0600 Subject: [openib-general] Re: [PATCH] iWARP Include File Changes In-Reply-To: <43CEDF7A.8090708@ichips.intel.com> References: <1136487265.10878.17.camel@trinity.austin.ammasso.com> <43CEDF7A.8090708@ichips.intel.com> Message-ID: <1137697374.3632.8.camel@trinity.ogc.int> Sean: I don't think mthca needs to set these unless it supports FMR or SRQ. When the device does, it should set them. The bits aren't currently used by any of the core code or ULP. However... This change is intended to allow an application (and the core for that matter) to determine if a device has an optional capability. This will be important going forward when attempting to optimize a ULP and making sure that optional features are not relied upon (e.g. SRQ), but are available if they can be used to good effect. I think the mthca device would set FMR, and ultimately SRQ when that is ready.
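A minimal sketch of the capability check described above, assuming a ULP that prefers SRQ but must not rely on it. Only the two bit values are taken from the quoted patch; `sketch_device_attr`, `sketch_has_cap` and `sketch_pick_recv_path` are hypothetical names used for illustration and are not part of ib_verbs.h.

```c
#include <stdbool.h>

/* Two of the capability bits proposed in the quoted patch.
 * The values are copied from that patch; the rest of this file is a
 * hypothetical illustration, not the OpenIB API. */
enum ib_device_cap_flags {
	IB_DEVICE_FMR = (1 << 19),
	IB_DEVICE_SRQ = (1 << 20),
};

/* Hypothetical stand-in for the attributes a driver would report. */
struct sketch_device_attr {
	int device_cap_flags;
};

/* True if the device advertises the given optional capability bit. */
static bool sketch_has_cap(const struct sketch_device_attr *attr, int cap)
{
	return (attr->device_cap_flags & cap) != 0;
}

/* Use the optional feature when it is advertised, and fall back to the
 * mandatory mechanism otherwise, so the ULP never depends on SRQ. */
static const char *sketch_pick_recv_path(const struct sketch_device_attr *attr)
{
	if (sketch_has_cap(attr, IB_DEVICE_SRQ))
		return "shared receive queue";
	return "per-QP receive queues";
}
```

A driver that supports only FMR would set just that bit, and a ULP written this way would then take the per-QP path without any special casing for that device.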
On Wed, 2006-01-18 at 16:38 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > enum ib_device_cap_flags { > > @@ -86,6 +87,14 @@ > > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > > IB_DEVICE_SRQ_RESIZE = (1<<13), > > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > > + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), > > + IB_DEVICE_ZERO_STAG = (1<<16), > > + IB_DEVICE_SEND_W_INV = (1<<17), > > + IB_DEVICE_MW = (1<<18), > > + IB_DEVICE_FMR = (1<<19), > > + IB_DEVICE_SRQ = (1<<20), > > + IB_DEVICE_ARP = (1<<21), > > + IB_DEVICE_LLP = (1<<22), > > }; > > Does this change imply that devices need to set these capabilities? I.e. should > mthca and the other drivers be updated to set the device capabilities correctly? > > - Sean From dford at netapp.com Thu Jan 19 11:07:03 2006 From: dford at netapp.com (David Ford) Date: Thu, 19 Jan 2006 14:07:03 -0500 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1137697323.2744.9.camel@stevo-desktop> References: <43CFD85D.2010905@ichips.intel.com> <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43CFD85D.2010905@ichips.intel.com> Message-ID: <5.1.1.5.2.20060119140446.01e6b1f8@exnane01.nane.netapp.com> At 02:02 PM 1/19/2006, Steve Wise wrote: >On Thu, 2006-01-19 at 10:20 -0800, Sean Hefty wrote: > > Tom Tucker wrote: > > > Enclosed is a combined include file and core patch for iWARP support > in CMA. This > > > patch includes changes per your last review. > > > > It was brought up that there's still not an iWarp driver in the OpenIB > trunk. > > Is one ready to be merged? Thinking about this more, I'm not sure that > we want > > to merge these changes into the trunk without one. (I can see > arguments both > > ways.) Comments? > >The Ammasso rnic code could be merged in at this point. But the company >Ammasso is no more, so I question whether we want it in the main trunk? I'd like to see it. A running implementation of OpenIB on iWARP is a very useful thing to have right now, assuming a reasonable supply of Ammasso cards. 
-- Dave > > Looking at the patch, it seems that the biggest issue that you would > have with > > merging changes is how the cm_id is stored. I can at least fold that > change > > into the CMA to help with maintenance if an iWarp driver isn't ready yet. > > > >Getting the core changes in now will help avoid having to keep merging >trunk code back into the iwarp branch. It will also expose the iwarp >changes to a larger audience for review and improvement. > >My 2 cents. > > >Steve. > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Thu Jan 19 11:11:51 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 19 Jan 2006 13:11:51 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <43CFD85D.2010905@ichips.intel.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43CFD85D.2010905@ichips.intel.com> Message-ID: <1137697911.3632.18.camel@trinity.ogc.int> On Thu, 2006-01-19 at 10:20 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > Enclosed is a combined include file and core patch for iWARP support in CMA. This > > patch includes changes per your last review. > > It was brought up that there's still not an iWarp driver in the OpenIB trunk. > Is one ready to be merged? Thinking about this more, I'm not sure that we want > to merge these changes into the trunk without one. (I can see arguments both > ways.) Comments? I have a few comments: 1. We can place the AMSO1100 driver into the trunk if we want, but I can't imagine that this will ever hit the kernel because Ammasso (RIP) is dead. 2. There will almost certainly be multiple iWARP drivers by end of 1Q'06. The developers of these drivers would be better served if they were pulling from the trunk instead of a branch that gets old by the hour. 3. 
Getting the core support in early will give us time to tweak, tune, and test. I think this is a very good thing. > > Looking at the patch, it seems that the biggest issue that you would have with > merging changes is how the cm_id is stored. I can at least fold that change > into the CMA to help with maintenance if an iWarp driver isn't ready yet. > > - Sean I think we need the ib_verbs.h changes as well for item 2 above. From halr at voltaire.com Thu Jan 19 11:08:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jan 2006 14:08:40 -0500 Subject: [openib-general] Re: [PATCH] iWARP Include File Changes In-Reply-To: <1137697374.3632.8.camel@trinity.ogc.int> References: <1136487265.10878.17.camel@trinity.austin.ammasso.com> <43CEDF7A.8090708@ichips.intel.com> <1137697374.3632.8.camel@trinity.ogc.int> Message-ID: <1137697481.4338.6003.camel@hal.voltaire.com> On Thu, 2006-01-19 at 14:02, Tom Tucker wrote: > Sean: > > I don't think mtcha needs to set these unless it supports FMR or SRQ. > When the device does, it should set them. The bits aren't currently > usedby any of the core code or ULP. However... > > This change is intended to allow an application (and the core for that > matter) to determine if a device has an optional capability. This will > be important going forward when attempting to optimize a ULP and making > sure that optional features are not relied upon (e.g. SRQ), but are > available if they can be used to good effect. > > I think the mtcha device would set FMR, and ultimately SRQ when that is > ready. Might there be 2 different FMR capabilities ? I may be wrong but I think the one mthca offers now is a pre (IBA 1.2) standard mode (and perhaps API). 
-- Hal > > On Wed, 2006-01-18 at 16:38 -0800, Sean Hefty wrote: > > Tom Tucker wrote: > > > enum ib_device_cap_flags { > > > @@ -86,6 +87,14 @@ > > > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > > > IB_DEVICE_SRQ_RESIZE = (1<<13), > > > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > > > + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), > > > + IB_DEVICE_ZERO_STAG = (1<<16), > > > + IB_DEVICE_SEND_W_INV = (1<<17), > > > + IB_DEVICE_MW = (1<<18), > > > + IB_DEVICE_FMR = (1<<19), > > > + IB_DEVICE_SRQ = (1<<20), > > > + IB_DEVICE_ARP = (1<<21), > > > + IB_DEVICE_LLP = (1<<22), > > > }; > > > > Does this change imply that devices need to set these capabilities? I.e. should > > mthca and the other drivers be updated to set the device capabilities correctly? > > > > - Sean From mshefty at ichips.intel.com Thu Jan 19 11:17:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 11:17:44 -0800 Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <1137616052.4757.85.camel@serpentine.pathscale.com> References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> <1137616052.4757.85.camel@serpentine.pathscale.com> Message-ID: <43CFE5D8.8000307@ichips.intel.com> Bryan O'Sullivan wrote: >>the dual license text needs a bit of clarification I suspect to make >>explicit that the "or BSD" part only applies when used entirely outside >>the linux kernel. (that already is the case, just it's not explicit. >>Making that explicit would be good). > > One appropriate way to do that would be to mark all IB-related exported > symbols as EXPORT_SYMBOL_GPL. It's been brought up that when the OpenIB code is accepted into the kernel, it has been accepted under the GPL license.
The kernel developers have asked to make this explicit in the code, either by modifying the license or exporting the symbols as GPL only. Are there any issues with changing the EXPORT_SYMBOL statements to EXPORT_SYMBOL_GPL *for code submitted upstream*? Code maintained in the OpenIB tree would remain unchanged, so that only the code shipped with the kernel would add "_GPL" to exported symbols. - Sean From bos at pathscale.com Thu Jan 19 11:20:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 11:20:20 -0800 Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <43CFE5D8.8000307@ichips.intel.com> References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> <1137616052.4757.85.camel@serpentine.pathscale.com> <43CFE5D8.8000307@ichips.intel.com> Message-ID: <1137698420.3693.76.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 11:17 -0800, Sean Hefty wrote: > It's been brought up that when the OpenIB code is accepted into the kernel, it > has been accepted under the GPL license. At least one kernel developer has asked me privately to get the boilerplate language on source files clarified, to make it clear that the dual BSD/GPL licensing is GPL-only for the Linux kernel, and the user's choice of either in other cases. > The kernel developers have asked to > make this explicit in the code, either by modifying the license or exporting the > symbols as GPL only. I think we'd be well-served to do both. References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43CFD85D.2010905@ichips.intel.com> <1137697911.3632.18.camel@trinity.ogc.int> Message-ID: <43CFE8B9.2080307@ichips.intel.com> Tom Tucker wrote: > 3. Getting the core support in early will give us time to tweak, tune, > and test. I think this is a very good thing.
I'm not disagreeing, but my concern is that if we start applying changes to the trunk to support branches we can end up maintaining a large amount of code that might not even be used. The maintainers end up needing to track branches to know if they're active, and if a branch ever falls off, then a fair amount of work needs to go back into cleaning up the code. I don't think that most open source projects go this route. - Sean From mshefty at ichips.intel.com Thu Jan 19 11:34:09 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 11:34:09 -0800 Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <1137698420.3693.76.camel@serpentine.pathscale.com> References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> <1137616052.4757.85.camel@serpentine.pathscale.com> <43CFE5D8.8000307@ichips.intel.com> <1137698420.3693.76.camel@serpentine.pathscale.com> Message-ID: <43CFE9B1.6080701@ichips.intel.com> Bryan O'Sullivan wrote: > At least one kernel developer has asked me privately to get the > boilerplate language on source files clarified, to make it clear that > the dual BSD/GPL licensing is GPL-only for the Linux kernel, and the > user's choice of either in other cases. Here was the suggestion that I received as a change to the license: licenses. You may choose to be licensed under the terms of the GNU General Public License (GPL) Version 2, available from the file COPYING in the main directory of this source tree, or, when using this code outside the context of the linux kernel or other GPL works, the OpenIB.org BSD license below The differences between this version and the one used by OpenIB is slight, but it does clarify that the version in Linux is GPL, which is the case. 
- Sean From tom at opengridcomputing.com Thu Jan 19 11:44:45 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 19 Jan 2006 13:44:45 -0600 Subject: [openib-general] Re: [PATCH] iWARP Include File Changes In-Reply-To: <1137697481.4338.6003.camel@hal.voltaire.com> References: <1136487265.10878.17.camel@trinity.austin.ammasso.com> <43CEDF7A.8090708@ichips.intel.com> <1137697374.3632.8.camel@trinity.ogc.int> <1137697481.4338.6003.camel@hal.voltaire.com> Message-ID: <1137699885.3632.24.camel@trinity.ogc.int> Hal: Sounds like we need another bit.... That's what they are there for. I don't know that this one would be exposed past the core though... On Thu, 2006-01-19 at 14:08 -0500, Hal Rosenstock wrote: > On Thu, 2006-01-19 at 14:02, Tom Tucker wrote: > > Sean: > > > > I don't think mtcha needs to set these unless it supports FMR or SRQ. > > When the device does, it should set them. The bits aren't currently > > usedby any of the core code or ULP. However... > > > > This change is intended to allow an application (and the core for that > > matter) to determine if a device has an optional capability. This will > > be important going forward when attempting to optimize a ULP and making > > sure that optional features are not relied upon (e.g. SRQ), but are > > available if they can be used to good effect. > > > > I think the mtcha device would set FMR, and ultimately SRQ when that is > > ready. > > Might there be 2 different FMR capabilities ? I may be wrong but I think > the one mthca offers now is a pre (IBA 1.2) standard mode (and perhaps > API). 
> > -- Hal > > > > > On Wed, 2006-01-18 at 16:38 -0800, Sean Hefty wrote: > > > Tom Tucker wrote: > > > > enum ib_device_cap_flags { > > > > @@ -86,6 +87,14 @@ > > > > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > > > > IB_DEVICE_SRQ_RESIZE = (1<<13), > > > > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > > > > + IB_DEVICE_IN_ORD_PLCMNT = (1<<15), > > > > + IB_DEVICE_ZERO_STAG = (1<<16), > > > > + IB_DEVICE_SEND_W_INV = (1<<17), > > > > + IB_DEVICE_MW = (1<<18), > > > > + IB_DEVICE_FMR = (1<<19), > > > > + IB_DEVICE_SRQ = (1<<20), > > > > + IB_DEVICE_ARP = (1<<21), > > > > + IB_DEVICE_LLP = (1<<22), > > > > }; > > > > > > Does this change imply that devices need to set these capabilities? I.e. should > > > mthca and the other drivers be updated to set the device capabilities correctly? > > > > > > - Sean > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bos at pathscale.com Thu Jan 19 11:49:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 11:49:18 -0800 Subject: [openib-general] Re: [PATCH 2/5] [RFC] Infiniband: connection abstraction In-Reply-To: <43CFE9B1.6080701@ichips.intel.com> References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> <1137616052.4757.85.camel@serpentine.pathscale.com> <43CFE5D8.8000307@ichips.intel.com> <1137698420.3693.76.camel@serpentine.pathscale.com> <43CFE9B1.6080701@ichips.intel.com> Message-ID: <1137700158.3693.78.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 11:34 -0800, Sean Hefty wrote: > Here was the suggestion that I received as a change to the license: > > licenses. 
You may choose to be licensed under the terms of the GNU General > Public License (GPL) Version 2, available from the file COPYING in the main > directory of this source tree, or, when using this code outside the context of > the linux kernel or other GPL works, the OpenIB.org BSD license below > > The differences between this version and the one used by OpenIB is slight, but > it does clarify that the version in Linux is GPL, which is the case. That seems reasonable to me. Arjan? References: <20060117153838.3dc2cd2e@dxpl.pdx.osdl.net> <1137568107.3005.69.camel@laptopd505.fenrus.org> <1137616052.4757.85.camel@serpentine.pathscale.com> <43CFE5D8.8000307@ichips.intel.com> <1137698420.3693.76.camel@serpentine.pathscale.com> <43CFE9B1.6080701@ichips.intel.com> <1137700158.3693.78.camel@serpentine.pathscale.com> Message-ID: <1137700556.2993.38.camel@laptopd505.fenrus.org> On Thu, 2006-01-19 at 11:49 -0800, Bryan O'Sullivan wrote: > On Thu, 2006-01-19 at 11:34 -0800, Sean Hefty wrote: > > > Here was the suggestion that I received as a change to the license: > > > > licenses. You may choose to be licensed under the terms of the GNU General > > Public License (GPL) Version 2, available from the file COPYING in the main > > directory of this source tree, or, when using this code outside the context of > > the linux kernel or other GPL works, the OpenIB.org BSD license below > > > > The differences between this version and the one used by OpenIB is slight, but > > it does clarify that the version in Linux is GPL, which is the case. > > That seems reasonable to me. Arjan? To me it's a very good clarification of the situation and avoids a situation where people get the wrong idea. So I'm all for it. 
From nacc at us.ibm.com Thu Jan 19 11:56:20 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 19 Jan 2006 11:56:20 -0800 Subject: [openib-general] Re: patch to ulp/iser such that it would work with latest open-iscsi In-Reply-To: References: Message-ID: <20060119195620.GX3257@us.ibm.com> On 19.01.2006 [16:18:47 +0200], Or Gerlitz wrote: > Nish, > > For now, please use this patch for your compilation against latest > kernels containing latest (post official 2.6.15) open-iscsi Could you redo this patch so it applies to the kernel tree? Thanks, Nish From tom at opengridcomputing.com Thu Jan 19 12:29:16 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 19 Jan 2006 14:29:16 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <43CFE8B9.2080307@ichips.intel.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43CFD85D.2010905@ichips.intel.com> <1137697911.3632.18.camel@trinity.ogc.int> <43CFE8B9.2080307@ichips.intel.com> Message-ID: <1137702556.3632.41.camel@trinity.ogc.int> On Thu, 2006-01-19 at 11:30 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > 3. Getting the core support in early will give us time to tweak, tune, > > and test. I think this is a very good thing. > > I'm not disagreeing, but my concern is that if we start applying changes to the > trunk to support branches we can end up maintaining a large amount of code that > might not even be used. The maintainers end up needing to track branches to > know if they're active, and if a branch ever falls off, then a fair amount of > work needs to go back into cleaning up the code. I don't think that most open > source projects go this route. > > - Sean I think your points are very valid. However, I'd like to propose that a few things make these particular changes different: - I (hopefully "We") are certain that iWARP devices _will_ be present, i.e. 
the changes won't amount to dead code, - We are certain that there are developers today that rely on these changes, - Merging in the trunk now helps them get there sooner since they don't have to track changes to the trunk in a separate branch. Thanks, Tom From ebiederm at xmission.com Thu Jan 19 12:29:17 2006 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 19 Jan 2006 13:29:17 -0700 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137696611.3693.63.camel@serpentine.pathscale.com> (Bryan O'Sullivan's message of "Thu, 19 Jan 2006 10:50:11 -0800") References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <1137696611.3693.63.camel@serpentine.pathscale.com> Message-ID: "Bryan O'Sullivan" writes: > On Thu, 2006-01-19 at 11:20 -0700, Eric W. Biederman wrote: > >> For high performance >> non-IP targeted networking cards you aren't doing anything terribly >> exotic. > > True. > >> Could you please detail why you can't use the IB/rdma >> whatever helper layer, is insufficient to do what you need. > > There really isn't an RDMA helper layer. The fact that the IB headers > live in include/rdma is, as best as I can tell, an artefact of Roland > being accommodating to someone's suggestion when he was going through > the same process with the IB tree as we are now with our driver. The fact that this didn't go farther is part of my complaint, and part of what needs to be refactored. >> Right now it largely seems to be a chicken and the egg problem. >> There is a large portion of the HPC community that doesn't believe >> they are interesting to the rest of the world or that the rest of >> the world is interesting to them so they do they own thing leading >> to support problems. > > I can't solve that problem. 
If other vendors don't want to pony up > their driver source and take the same kinds of slings and arrows I'm > doing, I'm not going to do the work to provide them with a generic set > of abstractions to use in their out-of-tree or proprietary drivers. Agreed. Part of the problem is the IB layer is insufficient, or at least you perceive it that way. At that level if you can express your problems we can get the IB layer fixed. As for other drivers I know I can get modifiable source, and I know I can get user pressure to hook it into a standard interface if there is one. Most of this should have been sorted out with getting a solid infiniband layer into the kernel. Since it didn't, I at least want to get the interface right for next time. >> Which is the RDMA thing. And looking at the code and I don't see how > > Your sentence ends in the middle. Sorry. I was in the middle of noticing how incomplete the RDMA layer was. I guess that just makes my sentence >> >> Again this is a generic problem, and the generic interfaces are broken >> >> if you can't do this. > >> But SIOCSIFFLAGS is not implemented by a driver. > > I can't square these two statements. Can you indicate what you might > have been talking about, if not SIOCSIFFLAGS? I was saying that this functionality sounds like something that should be part of a generic layer. The IFF_UP bit from SIOCSIFFLAGS seems to behave exactly how you want. But this maps to the network driver methods open and close. No driver implements SIOCSIFFLAGS. Basically my point was that the helper layers appear insufficient to your needs. >> Is it the stack that is byzantine? Or the interface too it. > > Both. This is my other point. Your driver puts packets on infiniband. Your hardware potentially supports more IB protocols than the driver for Mellanox's hardware. The IB stack does not serve you well. Except for not being a member of the IB verbs camp there is nothing your hardware does that is exotic enough for the IB layer to fall down.
All of the kernel-bypass is used for other protocols to IB. So right now it looks like 2 things are going on. 1) The IB stack poorly supports your driver. - IB stack problem. If you could help point out what is wrong with the IB stack that would be great. 2) ipath doesn't seem to want to use the IB stack as a helper layer for its fast path protocol. My sympathies are with you about the IB stack, it integrates rather badly with the networking layer, so likely has other issues. But you at least have to be willing to budge a little or these hard problems can't be fixed. For those who need the buzz words to understand what is going on the ipath hardware largely does stateless offload for IB while the mellanox hardware does whole protocol offload. Which would mean if this was a normal network driver ipath good mellanox bad. So something is broken if our generic layers don't support the kind of hardware the linux kernel developers profess to prefer. Eric From ebiederm at xmission.com Thu Jan 19 12:31:34 2006 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 19 Jan 2006 13:31:34 -0700 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137696901.3693.66.camel@serpentine.pathscale.com> (Bryan O'Sullivan's message of "Thu, 19 Jan 2006 10:55:01 -0800") References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <43CFDF5F.5060409@ichips.intel.com> <1137696901.3693.66.camel@serpentine.pathscale.com> Message-ID: "Bryan O'Sullivan" writes: > On Thu, 2006-01-19 at 10:50 -0800, Sean Hefty wrote: > >> I'm not familiar with the driver, but would the lower level verbs interfaces >> work for this? Could you just post whatever datagrams that you want directly to >> your management QPs? > > Our lowest-level driver works in the absence of any IB support being > compiled into the kernel, so in that situation, there are no QPs or any > other management infrastructure present at all.
All of that stuff lives > in a higher layer, in which situation the cut-down subnet management > agent doesn't get used, and something like OpenSM is more appropriate. Ok this is one piece of the puzzle. At your lowest level your hardware does not have QPs but it does have something similar to isolate a userspace process, correct? Which sounds like one problem with the IB layer is that it assumes QPs instead of a slight abstraction of that concept. Eric From swise at opengridcomputing.com Thu Jan 19 12:47:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 19 Jan 2006 14:47:57 -0600 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <1137696611.3693.63.camel@serpentine.pathscale.com> Message-ID: <1137703677.2744.22.camel@stevo-desktop> > For those who need the buzz words to understand what is going > on the ipath hardware largely does stateless offload for IB while > the mellanox hardware does whole protocol offload. Which would > mean if this was a normal network driver ipath good mellanox bad. > Are you sure about this? I would think if ipath does IB RC service in hardware, it's nowhere near stateless offload. I don't think this is a fair comparison. Steve.
From mshefty at ichips.intel.com Thu Jan 19 13:08:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 13:08:27 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137696901.3693.66.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <43CFDF5F.5060409@ichips.intel.com> <1137696901.3693.66.camel@serpentine.pathscale.com> Message-ID: <43CFFFCB.7060002@ichips.intel.com> Bryan O'Sullivan wrote: > Our lowest-level driver works in the absence of any IB support being > compiled into the kernel, so in that situation, there are no QPs or any > other management infrastructure present at all. All of that stuff lives > in a higher layer, in which situation the cut-down subnet management > agent doesn't get used, and something like OpenSM is more appropriate. I'm struggling to understand what your card does then. From this, it sounds like a standard network card that just happens to use IB physicals. Do you just send raw packets? How is the LRH formatted by your card? I.e. what's setting up the dlid, slid, vl, etc.? Can your card interoperate with other IB devices on the network when running in this mode? - Sean From pw at osc.edu Thu Jan 19 13:50:25 2006 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 19 Jan 2006 16:50:25 -0500 Subject: [openib-general] respect CFLAGS in OSM Message-ID: <20060119215025.GA2620@osc.edu> I do something like: CFLAGS=-g ./configure ... to build a debug tree from openib svn. Some places override this CFLAGS setting, though, applying optimization even though I explicitly do not want it. This patch fixes that. These apply to OSM below gen2/trunk/src/userspace/. 
Signed-off-by: Pete Wyckoff

Index: management/osm/libvendor/Makefile.am
===================================================================
--- management/osm/libvendor/Makefile.am (revision 5098)
+++ management/osm/libvendor/Makefile.am (working copy)
@@ -3,8 +3,6 @@
 if DEBUG
 DBGFLAGS = -ggdb -D_DEBUG_
-else
-DBGFLAGS = -g -O2
 endif
 INCLUDES = $(OSMV_INCLUDES)
Index: management/osm/complib/Makefile.am
===================================================================
--- management/osm/complib/Makefile.am (revision 5098)
+++ management/osm/complib/Makefile.am (working copy)
@@ -5,8 +5,6 @@
 if DEBUG
 DBGFLAGS = -ggdb -D_DEBUG_
-else
-DBGFLAGS = -g -O2
 endif
 libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
Index: management/osm/opensm/Makefile.am
===================================================================
--- management/osm/opensm/Makefile.am (revision 5098)
+++ management/osm/opensm/Makefile.am (working copy)
@@ -5,8 +5,6 @@
 if DEBUG
 DBGFLAGS = -ggdb -D_DEBUG_
-else
-DBGFLAGS = -g -O2
 endif
 libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1

From bos at pathscale.com Thu Jan 19 13:52:05 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 19 Jan 2006 13:52:05 -0800
Subject: [openib-general] Re: RFC: ipath ioctls and their replacements
In-Reply-To: <43CFFFCB.7060002@ichips.intel.com>
References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <43CFDF5F.5060409@ichips.intel.com> <1137696901.3693.66.camel@serpentine.pathscale.com> <43CFFFCB.7060002@ichips.intel.com>
Message-ID: <1137707525.3693.95.camel@serpentine.pathscale.com>

On Thu, 2006-01-19 at 13:08 -0800, Sean Hefty wrote:
> I'm struggling to understand what your card does then. From this, it sounds
> like a standard network card that just happens to use IB physicals.
It has typical features of a standard network card, while also supporting direct user access to the hardware. We eschew the offload-as-much-as-possible approach that other vendors take. > Do you just send raw packets? We certainly can do that. The hardware doesn't need to do much more, in fact. > How is the LRH formatted by your card? I.e. what's setting > up the dlid, slid, vl, etc.? This is all done in software. The low-level driver and hardware fill out enough of the IB UD protocol headers to put packets on the wire that an IB switch will route. The higher-level layer is responsible for the full IB protocol suite and the driver-side interfaces to the various OpenIB userspace APIs. > Can your card interoperate with other IB devices > on the network when running in this mode? Yes. It can do both the low-level wonkery and regular IB at the same time. References: <20060119215025.GA2620@osc.edu> Message-ID: <20060119215321.GB2620@osc.edu> Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and ibdm below gen2/utils/src/linux-user/. The third chunk below avoids a configure warning for a file "osm_build_id.h" that appears nowhere in my source or build tree. 
Signed-off-by: Pete Wyckoff

Index: ibis/src/Makefile.am
===================================================================
--- ibis/src/Makefile.am (revision 5098)
+++ ibis/src/Makefile.am (working copy)
@@ -38,7 +38,7 @@
 if DEBUG
 DBG = -O0 -g -Wall -Werror
 else
-DBG = -O2 -Wall
+DBG = -Wall
 endif
 AM_CFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing -fPIC
Index: ibdm/datamodel/Makefile.am
===================================================================
--- ibdm/datamodel/Makefile.am (revision 5098)
+++ ibdm/datamodel/Makefile.am (working copy)
@@ -60,8 +60,6 @@
 # Support debug mode through config variable
 if DEBUG
 DBG = -O0 -g
-else
-DBG = -O2
 endif
 # We have a special mode where we know the package will be eventually moved
Index: ibis/config/osm.m4
===================================================================
--- ibis/config/osm.m4 (revision 5098)
+++ ibis/config/osm.m4 (working copy)
@@ -156,6 +156,8 @@
 AM_CONDITIONAL(OSM_VENDOR_SIM, test $OSM_VENDOR = sim)
 AM_CONDITIONAL(OSM_BUILD_OPENIB, test $OSM_BUILD = openib)
+if test -f $osm_include_dir/opensm/osm_build_id.h; then
+
 dnl validate the defined path - so the build id header is there
 AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,,
 AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h]))
@@ -168,6 +170,9 @@
 else
 osm_debug_flags=
 fi
+else
+ osm_debug_flags=
+fi
 OSM_CFLAGS="-I$osm_include_dir $osm_extra_includes $osm_debug_flags $osm_vendor_sel -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1"

From bos at pathscale.com Thu Jan 19 13:53:43 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 19 Jan 2006 13:53:43 -0800
Subject: [openib-general] Re: RFC: ipath ioctls and their replacements
In-Reply-To:
References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <43CFDF5F.5060409@ichips.intel.com> <1137696901.3693.66.camel@serpentine.pathscale.com>
Message-ID: <1137707623.3693.97.camel@serpentine.pathscale.com>
On Thu, 2006-01-19 at 13:31 -0700, Eric W. Biederman wrote: > Ok this is one piece of the puzzle. At your lowest level your hardware > does not have QP's but it does have something similar to isolate a userspace > process correct? Right. We implement almost none of the IB protocols in hardware. References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1137688158.3693.29.camel@serpentine.pathscale.com> <1137696611.3693.63.camel@serpentine.pathscale.com> Message-ID: <1137708791.3693.111.camel@serpentine.pathscale.com> On Thu, 2006-01-19 at 13:29 -0700, Eric W. Biederman wrote: > Agreed. Part of the problem is the IB layer is insufficient, or > at least you perceive it that way. At that level if you can express > your problems we can get the IB layer fixed. Our low-level driver is not IB, doesn't implement IB, and doesn't care about IB. Our upper-level driver implements IB, and interfaces to the existing IB tree. > Except not being a member of the IB verbs camp there is nothing > your hardware does that is exotic enough for the IB layer to > fall down. We implement IB verbs just fine, both in the kernel and userspace. > 1) The IB stack poorly supports your driver. > - IB stack problem. If you could help point out what > is wrong with the IB stack that would be great. I have no issue with it. We already act as a provider to it, in our higher-layer driver code. We have some user page pinning code that is clearly similar in purpose, and that I want to refactor in a helpful way. We have UD and RC protocol engines that could profitably be moved out of our driver and into the IB layer at some future point in time, should some other device ever come along that could use them. > For those who need the buzz words to understand what is going > on the ipath hardware largely does stateless offload for IB while > the mellanox hardware does whole protocol offload. Our hardware actually does no offload whatsoever. 
That's why we are (a) fast (b) flexible and (c) somewhat big and unusual compared to other IB drivers. References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <1137646957.25584.17.camel@localhost.localdomain> <20060119053940.GB21467@kroah.com> <1137649988.25584.67.camel@localhost.localdomain> Message-ID: <20060119225716.GB27689@kroah.com> On Wed, Jan 18, 2006 at 09:53:08PM -0800, Bryan O'Sullivan wrote: > On Wed, 2006-01-18 at 21:39 -0800, Greg KH wrote: > > > Ok, that's fair enough. But if you want to do something like ptys, then > > why not just have your own filesystem for this driver? > > If you think it's appropriate to implement a new filesystem to replace a > single ioctl that returns two integers, we can probably do that, but > more realistically, the GETPORT ioctl can probably live a long and > untroubled life as another netlink message. Well it only takes about 250 lines to make a new fs these days, but a single netlink message would probably be smaller :) > > You are just making your own type of special interface up as you > > go, so the complexity is also there (this complexity would normally be > > in some core code, which I am hoping that your code will turn into for > > other devices of the same type, right?) > > The most important chunk of likely common code I can see at the moment > is the stuff for bodging user page mappings that we got hammered over > already. The drivers/infiniband/ tree already has code that does > something like this, and a few other not-yet-in-tree network drivers > that support RDMA have similar needs, too. 
The RDMA-loving people need to get together and hammer out a proposal that the network people can laugh at and shoot down all at once :) Ok, maybe not shoot down, but they do need to get together and come up with some kind of solution, add-hok implementations in a bunch of different drivers, in a bunch of different ways is not the proper thing to do, no matter _how_ different the hardware works at the lower levels. thanks, greg k-h From bos at pathscale.com Thu Jan 19 15:44:19 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 19 Jan 2006 15:44:19 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060119225716.GB27689@kroah.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <1137646957.25584.17.camel@localhost.localdomain> <20060119053940.GB21467@kroah.com> <1137649988.25584.67.camel@localhost.localdomain> <20060119225716.GB27689@kroah.com> Message-ID: <1137714259.22241.12.camel@localhost.localdomain> On Thu, 2006-01-19 at 14:57 -0800, Greg KH wrote: > The RDMA-loving people need to get together and hammer out a proposal > that the network people can laugh at and shoot down all at once :) We are not really in the RDMA camp. Our facility looks more like "when this kind of message comes in, be sure that it shows up at this point in my address space", which does not match RDMA semantics. Also, RDMA's mother smells of elderberries, in my personal opinion. References: <1137631411.4757.218.camel@serpentine.pathscale.com> <20060119025741.GC15706@kroah.com> <1137646957.25584.17.camel@localhost.localdomain> <20060119053940.GB21467@kroah.com> <1137649988.25584.67.camel@localhost.localdomain> <20060119225716.GB27689@kroah.com> <1137714259.22241.12.camel@localhost.localdomain> Message-ID: <43D02888.30605@ichips.intel.com> Bryan O'Sullivan wrote: > We are not really in the RDMA camp. 
Our facility looks more like "when > this kind of message comes in, be sure that it shows up at this point in > my address space", which does not match RDMA semantics. A lot of people mean QP-like semantics when they talk about "RDMA", rather than the RDMA operation itself. I.e. pre-posted receive buffers associated with a particular user-space process. That aside, conceptually, I see little difference between RDMA semantics versus the facility that you describe. The main difference is the complexity of the header and the checks done against it. - Sean From friedman at ucla.edu Thu Jan 19 18:47:03 2006 From: friedman at ucla.edu (Scott A. Friedman) Date: Thu, 19 Jan 2006 18:47:03 -0800 Subject: [openib-general] Advice Message-ID: <43D04F27.6010302@ucla.edu> Hi I am pretty new to IB and would like clear up a couple of questions I have 1. What usermode tools should I use if I want to use the modules that are part of the current kernels (e.g. Fedora). The OpenIB subversion code, the IBGD stuff or something else. 2. Is there any reason to *not* use the kernel modules that are now included with the kernel.org or Fedora kernels? If so, which to use? 3. How stable is the subversion code? Is it just for testers? Thanks for your comments. Scott From sean.hefty at intel.com Thu Jan 19 20:38:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 20:38:01 -0800 Subject: [openib-general] Advice In-Reply-To: <43D04F27.6010302@ucla.edu> Message-ID: >1. What usermode tools should I use if I want to use the modules that >are part of the current kernels (e.g. Fedora). The OpenIB subversion >code, the IBGD stuff or something else. Not sure what IBGD is. The OpenIB svn code matches with that shipping in the current kernels. >2. Is there any reason to *not* use the kernel modules that are now >included with the kernel.org or Fedora kernels? If so, which to use? Several modules in svn are not yet available in the kernel, so it depends on what you are trying to do. >3. 
How stable is the subversion code? Is it just for testers? In general, the tip of svn gen2/trunk is stable. Bugs are usually fixed within a relatively short time (couple of hours to a day). The drawback is that it targets the latest kernel release; however, backport patches are available. - Sean From friedman at ucla.edu Thu Jan 19 20:55:39 2006 From: friedman at ucla.edu (Scott A. Friedman) Date: Thu, 19 Jan 2006 20:55:39 -0800 Subject: [openib-general] Advice In-Reply-To: References: Message-ID: <43D06D4B.6020803@ucla.edu> Sean Hefty wrote: >> 1. What usermode tools should I use if I want to use the modules that >> are part of the current kernels (e.g. Fedora). The OpenIB subversion >> code, the IBGD stuff or something else. > > Not sure what IBGD is. The OpenIB svn code matches with that shipping in the > current kernels. > Sorry, I mean the Mellanox Gold distribution thing. Using this is kinda a problem for me since I need to use a recent kernel - to support some non Infiniband stuff. Just so I understand. When you say 'svn code matches' it means a) the kernel modules are the same (or close enough). b) the usermode stuff will/should work without a problem. Could you be a little more specific? Just curious how this works until things stabilize more. >> 2. Is there any reason to *not* use the kernel modules that are now >> included with the kernel.org or Fedora kernels? If so, which to use? > > Several modules in svn are not yet available in the kernel, so it depends on > what you are trying to do. > Mainly, I am interested in the verb layer for know so my needs are not that great. However, is there a reason to compile and use the svn kernel modules in place of the supplied ones? Or, as you mentioned above - are fixes/patches picked up by the distributions in between stable kernel releases. >> 3. How stable is the subversion code? Is it just for testers? > > In general, the tip of svn gen2/trunk is stable. 
Bugs are usually fixed within > a relatively short time (couple of hours to a day). The drawback is that it > targets the latest kernel release; however, backport patches are available. > Hmm, this is what I am looking for so that is actually a good thing for me. Except for having to rebuild the kernel to use the svn driver code. I suppose that I could just stick with building the usermode stuff and use the supplied kernel modules since I am not using any of the missing stuff. I also noticed that a person from redhat has provided some rpm modules of the usermode libraries. Is anyone using these? Is this a reasonable way to go? I suppose it is just a convenience. Thanks again, Scott From sean.hefty at intel.com Thu Jan 19 21:11:33 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Jan 2006 21:11:33 -0800 Subject: [openib-general] Advice In-Reply-To: <43D06D4B.6020803@ucla.edu> Message-ID: >Just so I understand. When you say 'svn code matches' it means a) the >kernel modules are the same (or close enough). b) the usermode stuff >will/should work without a problem. Could you be a little more specific? >Just curious how this works until things stabilize more. The IB code included in kernel.org is from the svn gen2/trunk. The code is developed in OpenIB, then pushed upstream to kernel.org. I.e. linux-2.6.15/drivers/infiniband is a subset of gen2/trunk/src/linux-kernel/infiniband. The code in svn is a little more up to date. The usermode portion in gen2/trunk/src/userspace should work with the code in the kernel. >Mainly, I am interested in the verb layer for know so my needs are not >that great. However, is there a reason to compile and use the svn kernel >modules in place of the supplied ones? Or, as you mentioned above - are >fixes/patches picked up by the distributions in between stable kernel >releases. >Hmm, this is what I am looking for so that is actually a good thing for >me. Except for having to rebuild the kernel to use the svn driver code. 
>I suppose that I could just stick with building the usermode stuff and >use the supplied kernel modules since I am not using any of the missing >stuff. Note that it's fairly easy to drop in the svn code into the kernel if you want to go that route. You just need to link linux/drivers/infiniband to the svn infiniband directory, and remove linux/include/rdma. It doesn't sound like you'll need to do this though. >I also noticed that a person from redhat has provided some rpm modules >of the usermode libraries. Is anyone using these? Is this a reasonable >way to go? I suppose it is just a convenience. That I don't know the answer to. - Sean From halr at voltaire.com Thu Jan 19 21:16:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jan 2006 00:16:33 -0500 Subject: [openib-general] Advice In-Reply-To: <43D06D4B.6020803@ucla.edu> References: <43D06D4B.6020803@ucla.edu> Message-ID: <1137734191.4338.8953.camel@hal.voltaire.com> On Thu, 2006-01-19 at 23:55, Scott A. Friedman wrote: > Sean Hefty wrote: > >> 1. What usermode tools should I use if I want to use the modules that > >> are part of the current kernels (e.g. Fedora). The OpenIB subversion > >> code, the IBGD stuff or something else. > > > > Not sure what IBGD is. The OpenIB svn code matches with that shipping in the > > current kernels. > > > > Sorry, I mean the Mellanox Gold distribution thing. Using this is kinda > a problem for me since I need to use a recent kernel - to support some > non Infiniband stuff. You don't necessarily need to use IBGD. There is overlap. It depends on what you need/use out of this. Mellanox is working on an IB Gold 2 based on OpenIB too. > Just so I understand. When you say 'svn code matches' it means a) the > kernel modules are the same (or close enough). b) the usermode stuff > will/should work without a problem. Could you be a little more specific? > Just curious how this works until things stabilize more. 
I think Sean meant (correct me if I'm wrong here) svn code that matches the kernel in use (2.6.x) at a certain version (e.g. say 4507 for 2.6.9). This includes both kernel and userspace changes to the same OpenIB svn version. These are periodically updated but lag the head of the tree for obvious reasons. > >> 2. Is there any reason to *not* use the kernel modules that are now > >> included with the kernel.org or Fedora kernels? If so, which to use? > > > > Several modules in svn are not yet available in the kernel, so it depends on > > what you are trying to do. > > > > Mainly, I am interested in the verb layer for know so my needs are not > that great. However, is there a reason to compile and use the svn kernel > modules in place of the supplied ones? Or, as you mentioned above - are > fixes/patches picked up by the distributions in between stable kernel > releases. Just the verbs ? I think the answer depends on what you are specifically using in the verbs. It has not been released as 1.0 yet but close. Roland is best to comment on this. > >> 3. How stable is the subversion code? Is it just for testers? > > > > In general, the tip of svn gen2/trunk is stable. Bugs are usually fixed within > > a relatively short time (couple of hours to a day). The drawback is that it > > targets the latest kernel release; however, backport patches are available. > > > > Hmm, this is what I am looking for so that is actually a good thing for > me. Except for having to rebuild the kernel to use the svn driver code. > I suppose that I could just stick with building the usermode stuff and > use the supplied kernel modules since I am not using any of the missing > stuff. Not always.. Sometimes there are related kernel changes. > I also noticed that a person from redhat has provided some rpm modules > of the usermode libraries. Is anyone using these? Is this a reasonable > way to go? I suppose it is just a convenience. Some people are using and testing these packages. 
-- Hal > Thanks again, > Scott > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From friedman at ucla.edu Thu Jan 19 21:35:54 2006 From: friedman at ucla.edu (Scott A. Friedman) Date: Thu, 19 Jan 2006 21:35:54 -0800 Subject: [openib-general] Advice In-Reply-To: <1137734191.4338.8953.camel@hal.voltaire.com> References: <43D06D4B.6020803@ucla.edu> <1137734191.4338.8953.camel@hal.voltaire.com> Message-ID: <43D076BA.8010109@ucla.edu> Hal Rosenstock wrote: > On Thu, 2006-01-19 at 23:55, Scott A. Friedman wrote: >> Sean Hefty wrote: >>>> 1. What usermode tools should I use if I want to use the modules that >>>> are part of the current kernels (e.g. Fedora). The OpenIB subversion >>>> code, the IBGD stuff or something else. >>> Not sure what IBGD is. The OpenIB svn code matches with that shipping in the >>> current kernels. >>> >> Sorry, I mean the Mellanox Gold distribution thing. Using this is kinda >> a problem for me since I need to use a recent kernel - to support some >> non Infiniband stuff. > > You don't necessarily need to use IBGD. There is overlap. It depends on > what you need/use out of this. Mellanox is working on an IB Gold 2 based > on OpenIB too. > I will keep an eye out for it. However, they tend to only support particular kernel versions which doesn't work for us. >> Just so I understand. When you say 'svn code matches' it means a) the >> kernel modules are the same (or close enough). b) the usermode stuff >> will/should work without a problem. Could you be a little more specific? >> Just curious how this works until things stabilize more. > > I think Sean meant (correct me if I'm wrong here) svn code that matches > the kernel in use (2.6.x) at a certain version (e.g. say 4507 for > 2.6.9). 
This includes both kernel and userspace changes to the same > OpenIB svn version. These are periodically updated but lag the head of > the tree for obvious reasons. > Got it - thanks. >>>> 2. Is there any reason to *not* use the kernel modules that are now >>>> included with the kernel.org or Fedora kernels? If so, which to use? >>> Several modules in svn are not yet available in the kernel, so it depends on >>> what you are trying to do. >>> >> Mainly, I am interested in the verb layer for know so my needs are not >> that great. However, is there a reason to compile and use the svn kernel >> modules in place of the supplied ones? Or, as you mentioned above - are >> fixes/patches picked up by the distributions in between stable kernel >> releases. > > Just the verbs ? I think the answer depends on what you are specifically > using in the verbs. It has not been released as 1.0 yet but close. > Roland is best to comment on this. > Well, we are porting some code written for VI and the verbs seem to be the most direct way of getting things going. I suppose later moving to a higher level would make sense (uDAPL?) >>>> 3. How stable is the subversion code? Is it just for testers? >>> In general, the tip of svn gen2/trunk is stable. Bugs are usually fixed within >>> a relatively short time (couple of hours to a day). The drawback is that it >>> targets the latest kernel release; however, backport patches are available. >>> >> Hmm, this is what I am looking for so that is actually a good thing for >> me. Except for having to rebuild the kernel to use the svn driver code. >> I suppose that I could just stick with building the usermode stuff and >> use the supplied kernel modules since I am not using any of the missing >> stuff. > > Not always.. Sometimes there are related kernel changes. > >> I also noticed that a person from redhat has provided some rpm modules >> of the usermode libraries. Is anyone using these? Is this a reasonable >> way to go? 
I suppose it is just a convenience. > > Some people are using and testing these packages. > > -- Hal > >> Thanks again, >> Scott >> Sounds like just building the svn code is the easiest. Thanks for all your comments. Scott From bill.boas at gmail.com Thu Jan 19 21:45:47 2006 From: bill.boas at gmail.com (Bill Boas) Date: Thu, 19 Jan 2006 21:45:47 -0800 Subject: [openib-general] Sonoma Registrations as of Jan19 - please register NOW if you are planning to join us. In-Reply-To: <19a929370601192142s329fd18aq78deb655d81c54e3@mail.gmail.com> References: <19a929370601192142s329fd18aq78deb655d81c54e3@mail.gmail.com> Message-ID: <19a929370601192145g22a74e8en77342a1a40468a73@mail.gmail.com> Dear Members, I am forwarding today's list of the people who have registered for the Sonoma Workshop. 86 people is a very good showing so far, but there are many members of our community who we hope will join and who have not yet registered. If you are one of them, please register as soon as you can. Karla Nutt is helping get the word out about this workshop and you may have received an email from her. She is sending out a reminder to register to those who have not done so yet. Here is what the reminder says - please forward it on to anyone you know who you think should join us in Sonoma but has not registered yet. Thank you. " Please forgive us if we have emailed you before, but we're just trying to make sure everyone we know has the opportunity to attend if they can. For those who are using, or going to use, OpenIB Release 1.0 code in the next weeks or months, you will learn up-to-date capabilities, when RedHat and SUSE will ship it in their distributions, when it will be available for Windows Cluster Server, and how we are going about interoperability verification and the provision of support to customers.
For those who plan to consider Infiniband or RDMA over Ethernet and wish to use open source with Linux or Windows the workshop is where the developers will listen to your needs and respond with what they think can be in the next integrated release 2.0 containing both and the improvements requested at this workshop. You'll learn about extension over thousands of kilometers, virtualization with Xen and how to use the software for direct access to filesystems, databases and storage. Please join us by following the links below: For information about the Lodge but not reservations, see http://www.thelodgeatsonoma.com/sonoma-valley.html . Register for $395 at http://www.acteva.com/go/rdma. All participants, including speakers, must register to help with the cost of the Workshop and Alliance work. Registration fees will not be refunded for cancellations after January 20, 2006. Hosted by OpenIB, the full RDMA community is invited, including customers and members of the iWARP Consortium, OpenRDMA, RDMA Consortium, Interconnect Software Consortium, RNIC-PI Working Group and other members of Open Group Forums as well as customers interested in enabling RDMA networks to operate not only in the datacenter or on campus but also over long distance terrestrial or satellite IP networks. There will also be the Annual General Meeting of the members of the Alliance during which the members will confirm the integration of iWARP and OpenIB software stack with a new name for the stack, the Alliance and accompanying revisions to our by-laws. The Workshop is open to all who are interested in the development and use of this open source software. The Workshop agenda is available at the wiki, at https://openib.org/tiki/tiki-index.php?page=Sonoma2006Agenda. Room reservations at the Lodge: - General group room rate, priority code OPAOPAA - http://marriott.com/property/propertypage/sfols?groupCode=opaopaa&app=resvlink - For US Government badge holders, priority code OPAOPAG. 
- http://marriott.com/property/propertypage/sfols?groupCode=opaopag&app=resvlink -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: registration as of Jan 19 text.csv Type: application/octet-stream Size: 7664 bytes Desc: not available URL:
From devesh28 at gmail.com Fri Jan 20 07:41:16 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Fri, 20 Jan 2006 21:11:16 +0530 Subject: [openib-general] Question On mad.c Message-ID: <309a667c0601200741q75fcfbf4g63d2972561439201@mail.gmail.com> Hi all, In mad.c while calling ib_post_receive() operation spin_lock_irqsave(&recv_queue->lock, flags); post = (++recv_queue->count < recv_queue->max_active); list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); spin_unlock_irqrestore(&recv_queue->lock, flags); ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); This is in while loop till "post" variable remains true, value of max_active is 512 So loop will go 512 times. If the qp on which this posting is going on dose not supports 512 recevie descriptors posting then what will happen?
Although during qp creation max_recv supported will be returned but loop is independent of this. Is it the requirement that for QP0 and QP1 this is (512) the at least supported receive descriptors? please any body clerify my confusion. Devesh From ogerlitz at voltaire.com Fri Jan 20 08:08:09 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Fri, 20 Jan 2006 18:08:09 +0200 Subject: [openib-general] Re: patch to ulp/iser such that it would work withlatest open-iscsi Message-ID: The patch was not applied to the svn, so no need to redo Or ________________________________ From: openib-general-bounces at openib.org on behalf of Nishanth Aravamudan Sent: Thu 1/19/2006 9:56 PM To: Or Gerlitz Cc: openib-general at openib.org Subject: [openib-general] Re: patch to ulp/iser such that it would work withlatest open-iscsi On 19.01.2006 [16:18:47 +0200], Or Gerlitz wrote: > Nish, > > For now, please use this patch for your compilation against latest > kernels containing latest (post official 2.6.15) open-iscsi Could you redo this patch so it applies to the kernel tree? Thanks, Nish _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Jan 20 08:09:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jan 2006 11:09:58 -0500 Subject: [openib-general] Re: [patch] userspace/management: ARGBEGIN() -> getopt() conversion for diags In-Reply-To: <20060118142853.GA19642@sashak.voltaire.com> References: <20060118142853.GA19642@sashak.voltaire.com> Message-ID: <1137773397.4338.12483.camel@hal.voltaire.com> Hi Sasha, On Wed, 2006-01-18 at 09:28, Sasha Khapyorsky wrote: > Hi Hal, > > Diag utils are converted to getopt(). It is just basically tested, > so please report bugs (if any). > > Sasha. 
> > > This converts diag utils to more standard getopt() using instead of > AGRBEGIN() buggy macros. Unused now ARGBEGIN() related code is > removed from libibcommon. Thanks! Applied with some minor cosmetic changes. -- Hal > Signed-off-by: Sasha Khapyorsky From mshefty at ichips.intel.com Fri Jan 20 09:28:54 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 20 Jan 2006 09:28:54 -0800 Subject: [openib-general] Question On mad.c In-Reply-To: <309a667c0601200741q75fcfbf4g63d2972561439201@mail.gmail.com> References: <309a667c0601200741q75fcfbf4g63d2972561439201@mail.gmail.com> Message-ID: <43D11DD6.7040709@ichips.intel.com> Devesh Sharma wrote: > In mad.c while calling ib_post_receive() operation > > spin_lock_irqsave(&recv_queue->lock, flags); > > post = (++recv_queue->count < recv_queue->max_active); > > list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); > spin_unlock_irqrestore(&recv_queue->lock, flags); > ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); > > This is in while loop till "post" variable remains true, value of > max_active is 512 So loop will go 512 times. > > If the qp on which this posting is going on dose not supports 512 > recevie descriptors posting then what will happen? > Although during qp creation max_recv supported will be returned but > loop is independent of this. The QP is created with a size of IB_MAD_QP_RECV_SIZE (512). If the hardware cannot support this size of a QP, then the create QP call will fail. I.e. the hardware can provide a QP that is larger, but not smaller. The code cannot adjust to using a larger size without resizing the corresponding CQ, which is not yet supported. 
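To make the arithmetic in the question concrete, here is a small user-space sketch of the posting loop. It is not the code in mad.c: the names recv_queue_sim, post_recv_sim and fill_recv_queue are hypothetical, locking and the list handling are left out, ib_post_recv() is replaced by a counter, and both the do/while shape and the undo-on-failure step are assumptions based on the quoted fragment. Starting from an empty queue, the loop posts exactly max_active receives, which is why the QP has to be created with at least that many receive entries.

```c
/* Hypothetical stand-in for the MAD receive queue bookkeeping. */
struct recv_queue_sim {
    int count;       /* receives currently accounted as posted */
    int max_active;  /* target queue depth (512 in the thread) */
};

/*
 * Stand-in for ib_post_recv(): succeeds while the simulated QP
 * still has room for another receive, fails otherwise.
 */
static int post_recv_sim(int *posted, int qp_depth)
{
    if (*posted >= qp_depth)
        return -1;
    (*posted)++;
    return 0;
}

/*
 * Post receives until the queue accounting reaches max_active or a
 * post fails. Returns the number of receives actually posted.
 */
static int fill_recv_queue(struct recv_queue_sim *q, int qp_depth)
{
    int posted = 0;
    int post;

    do {
        /* Same test as in the quoted fragment. */
        post = (++q->count < q->max_active);
        if (post_recv_sim(&posted, qp_depth)) {
            q->count--;  /* assumed: undo the accounting on failure */
            break;
        }
    } while (post);

    return posted;
}
```

Calling fill_recv_queue() on a queue with count 0 and max_active 512 against a simulated QP depth of 512 posts 512 receives. A smaller simulated depth only shows where a post would start failing; per the reply above, the real code never gets there because QP creation fails first.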
- Sean From mshefty at ichips.intel.com Fri Jan 20 12:05:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 20 Jan 2006 12:05:05 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1136578777.14108.6.camel@trinity.austin.ammasso.com> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> Message-ID: <43D14271.6050507@ichips.intel.com> Tom Tucker wrote: > Enclosed is a combined include file and core patch for iWARP support in CMA. This > patch includes changes per your last review. I applied the portion of your patch that modifies the CMA to use a union to store the underlying cm_id's, with a couple of changes - see below. I haven't seen anyone object to merging the other changes. Roland, Hal - any opinion? - Sean > @@ -624,9 +676,20 @@ > state = cma_exch(id_priv, CMA_DESTROYING); > cma_cancel_operation(id_priv, state); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - ib_destroy_cm_id(id_priv->cm_id); > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { > > + switch (id->device->node_type) { > + case IB_NODE_RNIC: > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > + default: > + ib_destroy_cm_id(id_priv->cm_id.ib); > + break; > + } > + > + id_priv->cm_id.ib = NULL; > + } > + > if (id_priv->cma_dev) { > down(&mutex); > cma_detach_from_dev(id_priv); I rearranged this to check for a cma_dev first, then switch on the node type. We'll never have a cm_id if a device hasn't been assigned. This allows us to check the correct union member. From eitan at mellanox.co.il Fri Jan 20 13:15:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 20 Jan 2006 23:15:37 +0200 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing file warning Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B58D@mtlexch01.mtl.com> Hi Pete, Looks fine so far but I will look at this further early next week. Thanks Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Pete Wyckoff > Sent: Thursday, January 19, 2006 11:53 PM > To: openib-general at openib.org > Subject: [openib-general] respect CFLAGS in ibis/ibdm,fix missing file warning > > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > ibdm below gen2/utils/src/linux-user/. > > The third chunk below avoids a configure warning for a file > "osm_build_id.h" that appears nowhere in my source or build tree. > > Signed-off-by: Pete Wyckoff > > Index: ibis/src/Makefile.am > =================================================================== > --- ibis/src/Makefile.am (revision 5098) > +++ ibis/src/Makefile.am (working copy) > @@ -38,7 +38,7 @@ > if DEBUG > DBG = -O0 -g -Wall -Werror > else > -DBG = -O2 -Wall > +DBG = -Wall > endif > > AM_CFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing - > fPIC > Index: ibdm/datamodel/Makefile.am > =================================================================== > --- ibdm/datamodel/Makefile.am (revision 5098) > +++ ibdm/datamodel/Makefile.am (working copy) > @@ -60,8 +60,6 @@ > # Support debug mode through config variable > if DEBUG > DBG = -O0 -g > -else > -DBG = -O2 > endif > > # We have a special mode where we know the package will be eventually moved > Index: ibis/config/osm.m4 > =================================================================== > --- ibis/config/osm.m4 (revision 5098) > +++ ibis/config/osm.m4 (working copy) > @@ -156,6 +156,8 @@ > AM_CONDITIONAL(OSM_VENDOR_SIM, test $OSM_VENDOR = sim) > AM_CONDITIONAL(OSM_BUILD_OPENIB, test $OSM_BUILD = openib) > > +if test -f $osm_include_dir/opensm/osm_build_id.h; then > + > dnl validate the defined path - so the build id header is there > AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,, > AC_MSG_ERROR([OSM: could not find > $with_osm/include/opensm/osm_build_id.h])) > @@ -168,6 
+170,9 @@ > else > osm_debug_flags= > fi > +else > + osm_debug_flags= > +fi > > OSM_CFLAGS="-I$osm_include_dir $osm_extra_includes $osm_debug_flags > $osm_vendor_sel -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1" > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Jan 20 13:19:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jan 2006 16:19:03 -0500 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing file warning In-Reply-To: <20060119215321.GB2620@osc.edu> References: <20060119215025.GA2620@osc.edu> <20060119215321.GB2620@osc.edu> Message-ID: <1137791899.4338.14618.camel@hal.voltaire.com> On Thu, 2006-01-19 at 16:53, Pete Wyckoff wrote: > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > ibdm below gen2/utils/src/linux-user/. > > The third chunk below avoids a configure warning for a file > "osm_build_id.h" that appears nowhere in my source or build tree. osm_build_id.h is a biproduct of the OpenSM build. -- Hal From pw at osc.edu Fri Jan 20 14:56:05 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 20 Jan 2006 17:56:05 -0500 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing file warning In-Reply-To: <1137791899.4338.14618.camel@hal.voltaire.com> References: <20060119215025.GA2620@osc.edu> <20060119215321.GB2620@osc.edu> <1137791899.4338.14618.camel@hal.voltaire.com> Message-ID: <20060120225605.GA4332@osc.edu> halr at voltaire.com wrote on Fri, 20 Jan 2006 16:19 -0500: > On Thu, 2006-01-19 at 16:53, Pete Wyckoff wrote: > > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > > ibdm below gen2/utils/src/linux-user/. > > > > The third chunk below avoids a configure warning for a file > > "osm_build_id.h" that appears nowhere in my source or build tree. 
> > osm_build_id.h is a biproduct of the OpenSM build. Oh, I see now. I was building in osm/complib, osm/libvendor and osm/opensm directly in my build script. Missed the fact that the top level osm/Makefile.am created a file too. Thanks for pointing that out. In this case, it seems a bit odd that compiling ibis properly requires knowing if osm was built with DEBUG, but it's not a big deal. pkgconfig (http://pkgconfig.freedesktop.org/wiki/) is one way some distributions manage these interdependencies, if you're interested. -- Pete From halr at voltaire.com Fri Jan 20 15:15:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jan 2006 18:15:58 -0500 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing file warning In-Reply-To: <20060120225605.GA4332@osc.edu> References: <20060119215025.GA2620@osc.edu> <20060119215321.GB2620@osc.edu> <1137791899.4338.14618.camel@hal.voltaire.com> <20060120225605.GA4332@osc.edu> Message-ID: <1137798689.4338.15307.camel@hal.voltaire.com> On Fri, 2006-01-20 at 17:56, Pete Wyckoff wrote: > halr at voltaire.com wrote on Fri, 20 Jan 2006 16:19 -0500: > > On Thu, 2006-01-19 at 16:53, Pete Wyckoff wrote: > > > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > > > ibdm below gen2/utils/src/linux-user/. > > > > > > The third chunk below avoids a configure warning for a file > > > "osm_build_id.h" that appears nowhere in my source or build tree. > > > > osm_build_id.h is a biproduct of the OpenSM build. > > Oh, I see now. I was building in osm/complib, osm/libvendor and > osm/opensm directly in my build script. Missed the fact that the > top level osm/Makefile.am created a file too. Thanks for pointing > that out. > > In this case, it seems a bit odd that compiling ibis properly > requires knowing if osm was built with DEBUG, but it's not a big > deal. pkgconfig (http://pkgconfig.freedesktop.org/wiki/) is one > way some distributions manage these interdependencies, if you're > interested. 
Yes, that's a question for Eitan. -- Hal > -- Pete From greg at kroah.com Fri Jan 20 16:08:19 2006 From: greg at kroah.com (Greg KH) Date: Fri, 20 Jan 2006 16:08:19 -0800 Subject: [openib-general] [PATCH] fix IB with latest versions of udev Message-ID: <20060121000819.GA26967@kroah.com> Here's a patch that will remove a few lines of code from the IB core, and let it work properly with userspace programs that are only watching the netlink socket for events, instead of mucking around in sysfs (like the latest versions of udev do.) I've only compile tested it as I have no IB hardware here. If you want, I can forward this on to Linus in my driver tree, or you can send it yourselves. thanks, greg k-h --------------------- From: Greg Kroah-Hartman Subject: IB: fix up major/minor sysfs interface for IB core Current IB code doesn't work with userspace programs that listen only to the kernel event netlink socket as it is trying to create its own dev interface. This small patch fixes this problem, and removes some unneeded code as the driver core handles this logic for you automatically. 
Signed-off-by: Greg Kroah-Hartman --- drivers/infiniband/core/ucm.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) --- gregkh-2.6.orig/drivers/infiniband/core/ucm.c +++ gregkh-2.6/drivers/infiniband/core/ucm.c @@ -1319,15 +1319,6 @@ static struct class ucm_class = { .release = ib_ucm_release_class_dev }; -static ssize_t show_dev(struct class_device *class_dev, char *buf) -{ - struct ib_ucm_device *dev; - - dev = container_of(class_dev, struct ib_ucm_device, class_dev); - return print_dev_t(buf, dev->dev.dev); -} -static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); - static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct ib_ucm_device *dev; @@ -1364,15 +1355,13 @@ static void ib_ucm_add_one(struct ib_dev ucm_dev->class_dev.class = &ucm_class; ucm_dev->class_dev.dev = device->dma_device; + ucm_dev->class_dev.devt = ucm_dev->dev.dev; snprintf(ucm_dev->class_dev.class_id, BUS_ID_SIZE, "ucm%d", ucm_dev->devnum); if (class_device_register(&ucm_dev->class_dev)) goto err_cdev; if (class_device_create_file(&ucm_dev->class_dev, - &class_device_attr_dev)) - goto err_class; - if (class_device_create_file(&ucm_dev->class_dev, &class_device_attr_ibdev)) goto err_class; From mshefty at ichips.intel.com Fri Jan 20 16:28:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 20 Jan 2006 16:28:39 -0800 Subject: [openib-general] Re: [PATCH] fix IB with latest versions of udev In-Reply-To: <20060121000819.GA26967@kroah.com> References: <20060121000819.GA26967@kroah.com> Message-ID: <43D18037.9040503@ichips.intel.com> Greg KH wrote: > If you want, I can forward this on to Linus in my driver tree, or you > can send it yourselves. Thanks, Greg. Please go ahead and forward this patch. 
- Sean From greg at kroah.com Fri Jan 20 16:45:24 2006 From: greg at kroah.com (Greg KH) Date: Fri, 20 Jan 2006 16:45:24 -0800 Subject: [openib-general] Re: [PATCH] fix IB with latest versions of udev In-Reply-To: <43D18037.9040503@ichips.intel.com> References: <20060121000819.GA26967@kroah.com> <43D18037.9040503@ichips.intel.com> Message-ID: <20060121004524.GA26233@kroah.com> On Fri, Jan 20, 2006 at 04:28:39PM -0800, Sean Hefty wrote: > Greg KH wrote: > >If you want, I can forward this on to Linus in my driver tree, or you > >can send it yourselves. > > Thanks, Greg. Please go ahead and forward this patch. Great, will do, thanks. Care to give me an "Acked-by:" line to add to it? greg k-h From mshefty at ichips.intel.com Fri Jan 20 16:46:42 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 20 Jan 2006 16:46:42 -0800 Subject: [openib-general] Re: [PATCH] fix IB with latest versions of udev In-Reply-To: <20060121004524.GA26233@kroah.com> References: <20060121000819.GA26967@kroah.com> <43D18037.9040503@ichips.intel.com> <20060121004524.GA26233@kroah.com> Message-ID: <43D18472.1080906@ichips.intel.com> Greg KH wrote: > Care to give me an "Acked-by:" line to add to it? Please apply. Acked-by: Sean Hefty From rdreier at cisco.com Fri Jan 20 16:48:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jan 2006 16:48:38 -0800 Subject: [openib-general] [PATCH] Add -m option to ping pong programs to set path MTU In-Reply-To: <1137633204.4520.397.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Wed, 18 Jan 2006 17:13:24 -0800") References: <1137633204.4520.397.camel@brick.internal.keyresearch.com> Message-ID: Thanks, I applied my own version of this. Please make sure that the svn tree still works for you. (This patch was the one that pushed me over the edge with code duplication in the pingpong examples, so I put the MTU switch statement into a new pingpong.c file...) - R. 
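The thread does not show the switch statement Roland mentions. As a rough sketch of what a shared helper of that kind might look like: the enum and function names below are hypothetical, the enum is a local stand-in rather than the libibverbs type, and it is assumed that the -m option takes the path MTU in bytes. The five byte values are the MTU sizes InfiniBand defines.

```c
/*
 * Local stand-in for the verbs MTU enumeration. The names and the
 * numeric values are assumptions of this sketch.
 */
enum sketch_mtu {
    SKETCH_MTU_256 = 1,
    SKETCH_MTU_512,
    SKETCH_MTU_1024,
    SKETCH_MTU_2048,
    SKETCH_MTU_4096
};

/*
 * Map an MTU given in bytes (e.g. parsed from a command-line option)
 * to the enumeration. Returns -1 for any size that is not a valid
 * InfiniBand MTU, so the caller can print a usage message.
 */
static int sketch_mtu_to_enum(int mtu)
{
    switch (mtu) {
    case 256:  return SKETCH_MTU_256;
    case 512:  return SKETCH_MTU_512;
    case 1024: return SKETCH_MTU_1024;
    case 2048: return SKETCH_MTU_2048;
    case 4096: return SKETCH_MTU_4096;
    default:   return -1;
    }
}
```

Keeping the mapping in one file means each example program only has to parse the option and check the return value for -1, instead of carrying its own copy of the switch.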
From rdreier at cisco.com Fri Jan 20 15:12:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jan 2006 15:12:58 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <43D14271.6050507@ichips.intel.com> (Sean Hefty's message of "Fri, 20 Jan 2006 12:05:05 -0800") References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43D14271.6050507@ichips.intel.com> Message-ID: Sean> I haven't seen anyone object to merging the other changes. Sean> Roland, Hal - any opinion? I don't see much urgency in merging it now. When svn diverges from what's upstream in the kernel, it makes my life harder because I have to figure out which patches belong upstream and sometimes merge things by hand (when they hit the divergent regions). Also I can't say I'm thrilled by adding > + struct iw_cm_verbs *iwcm; to struct ib_device -- we still really haven't answered the issue of how iWARP connections interact with the host network stack, we've just pushed it off into low-level driver code where we can't see it. Finally (a minor point), there's a lot of stuff like > + const void* pdata, u8 pdata_len) It's more idiomatic in the kernel to say "void *pdata" -- in other words, the space comes before the *, not after it. - R. From rdreier at cisco.com Fri Jan 20 15:14:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jan 2006 15:14:04 -0800 Subject: [openib-general] Re: [PATCH] srptools on FC4 In-Reply-To: <20060118152605.GE22260@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jan 2006 17:26:05 +0200") References: <20060118152605.GE22260@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Fri Jan 20 20:40:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jan 2006 20:40:40 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: (Eric W. 
Biederman's message of "Thu, 19 Jan 2006 01:25:39 -0700") References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: Eric> Roland you know the RDMA model best, are things so tied to Eric> the current crop of infiniband protocols that what the ipath Eric> code wants to do is not covered? Eric> They clearly need subsystem support and what they are trying Eric> to do either isn't covered or they don't see how to use what Eric> is there. Do the infiniband verbs not allow dealing with a Eric> unreliable datagram protocol? I think this has been answered already, but the issue is really that the PathScale hardware does not implement RDMA or even any of the other connection-oriented abstractions that the RDMA layer is designed for. The hardware has only much lower-level capabilities: basically, it can send and receive packets on an IB link. With those capabilities it is possible to implement IB transports in software -- so for example RDMA read operations are simulated by having the CPU on the receiver copy data to send the response. However, that implementation is not going to make good use of the IB midlayer, which really operates at the abstraction level above the IB transport. It's also possible to use the PathScale hardware to directly implement MPI on top of a protocol optimized specifically for MPI, without using IB verbs semantics or an IB transport on the wire. But clearly the userspace interface needed for doing this is not going to match up very well with a userspace interface for IB verbs (which is at a different abstraction level). - R.
From panda at cse.ohio-state.edu Fri Jan 20 20:55:46 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri, 20 Jan 2006 23:55:46 -0500 (EST) Subject: [openib-general] Announcing the creation of `mvapich-discuss' mailing list Message-ID: <200601210455.k0L4tlW5017729@xi.cse.ohio-state.edu> Based on many requests, the MVAPICH team is pleased to announce the creation of a public `mvapich-discuss' mailing list. This mailing list is aimed for the users, vendors and developers of MVAPICH and MVAPICH2 projects to discuss all issues (user installation/build problems, performance problems, features, patches and general questions) related to all different versions (VAPI, Gen2 and uDAPL) of MVAPICH and MVAPICH2. All interested users, vendors and developers of MVAPICH and MVAPICH2 are invited to join this discussion mailing list. More details are available from the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ Thanks, MVAPICH Team at OSU/NBCL From nacc at us.ibm.com Fri Jan 20 22:35:58 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 20 Jan 2006 22:35:58 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060109063300.GA19748@mellanox.co.il> References: <20060109045948.GH2064@us.ibm.com> <20060109063300.GA19748@mellanox.co.il> Message-ID: <20060121063558.GB13458@us.ibm.com> On 09.01.2006 [08:33:01 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Just like with the results I posted earlier, all the perftest results > > are seriously wrong for 32-bit clients (with both 32-bit and 64-bit > > servers). I am not sure who else to notify beyond the general list (is > > there a corresponding MAINTAINERS files like in the kernel proper for > > the OpenIB code?) > > That would be me - sorry about the delay, I'll take a look at this. Any luck figuring this out? 
Thanks, Nish From nacc at us.ibm.com Sat Jan 21 00:19:20 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sat, 21 Jan 2006 00:19:20 -0800 Subject: [openib-general] Re: patch to ulp/iser such that it would work withlatest open-iscsi In-Reply-To: References: Message-ID: <20060121081920.GC8402@us.ibm.com> On 20.01.2006 [18:08:09 +0200], Or Gerlitz wrote: > The patch was not applied to the svn, so no need to redo Ok, for now, just so I can get numbers, I'm going to have to disable ISER.
If you have a patch you'd like me to test (against 2.6.16-rc1-git3 as of right now), just send it my way. Thanks, Nish From tom at opengridcomputing.com Sat Jan 21 06:42:07 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sat, 21 Jan 2006 08:42:07 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43D14271.6050507@ichips.intel.com> Message-ID: <1137854528.7683.13.camel@strider.opengridcomputing.com> On Fri, 2006-01-20 at 15:12 -0800, Roland Dreier wrote: > Sean> I haven't seen anyone object to merging the other changes. > Sean> Roland, Hal - any opinion? > > I don't see much urgency in merging it now. When svn diverges from > what's upstream in the kernel, it makes my life harder because I have > to figure out which patches belong upstream and sometimes merge things > by hand (when they hit the divergent regions). The easy solution here is not to diverge. Unless the iWARP support regresses IB functionality, it does no harm and creates a single software core for both iWARP and IB developers to bring new drivers to market. > > Also I can't say I'm thrilled by adding > > > + struct iw_cm_verbs *iwcm; I agree there are more elegant approaches, however, the design criteria was to minimize changes to ib_verbs and the risk of IB functional regression. I think this approach accomplishes that goal. > > to struct ib_device -- we still really haven't answered the issue of > how iWARP connections interact with the host network stack, we've just > pushed it off into low-level driver code where we can't see it. The implementation not withstanding, we have answered the integration question: - No transport level connection state sharing - No migration of host established connections to RDMA mode. RDMA connection management is integrated with the host stack to the same degree that IB CM is integrated. 
> > Finally (a minor point), there's a lot of stuff like > > > + const void* pdata, u8 pdata_len) > > It's more idiomatic in the kernel to say "void *pdata" -- in other > words, the space comes before the *, not after it. > > - R. From sean.hefty at intel.com Sat Jan 21 09:13:55 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 21 Jan 2006 09:13:55 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1137854528.7683.13.camel@strider.opengridcomputing.com> Message-ID: >> I don't see much urgency in merging it now. When svn diverges from >> what's upstream in the kernel, it makes my life harder because I have >> to figure out which patches belong upstream and sometimes merge things >> by hand (when they hit the divergent regions). > >The easy solution here is not to diverge.
Unless the iWARP support >regresses IB functionality, it does no harm and creates a single >software core for both iWARP and IB developers to bring new drivers to >market. Until iWARP is integrated with the kernel, however, the code will diverge. And I agree with Roland, merging diverged code upstream is a pain. I'm definitely willing to re-organize the code to make it easier to maintain the code out of the tree. Also, if we can isolate the IB/iWARP code into separate files, then it's not a big issue pushing changes upstream. - Sean From halr at voltaire.com Sat Jan 21 09:28:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jan 2006 12:28:23 -0500 Subject: [openib-general] [multicast]examples using multicast In-Reply-To: <7b2fa1820601181717ne0051admfb9f15a793102170@mail.gmail.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC09@taurus.voltaire.com> <7b2fa1820601181717ne0051admfb9f15a793102170@mail.gmail.com> Message-ID: <1137864501.4338.21750.camel@hal.voltaire.com> Hi Ian, On Wed, 2006-01-18 at 20:17, Ian Jiang wrote: > On 1/18/06, Hal Rosenstock wrote: > Multicast verbs would be used to send data from user space. Is > that what you are looking for ? > Yes! Sorry for the slow response. I've been swamped this week and am still digging out. > I need some examples, according to which I could use the multicast in > my own applications. > Would you give some suggestions? You would need to create the UD QP, create/join the group, and then attach the QP to the group via ibv_attach_mcast. I don't think there is example code for this. You can put it together from some other pieces available though. The biggest hole is the lack of a user space SA client with MCMemberRecord support. However, there are several choices here: build your own requests (as srptools does for PathRecords), use osm_vendor_ibumad_sa.c (API in management/osm/include/vendor/osm_vendor_sa_api.h and some example code), or wait for the real SA client.
> And could not the multicast be used in kernel space? Multicast can be used in either kernel (e.g. IPoIB is already doing this) or userspace. Not sure exactly what you mean. -- Hal > Thanks very much! > > -- > Ian Jiang > ianjiang.ict at gmail.com > > Laboratory of Spatial Information Technology > Division of System Architecture > Institute of Computing Technology > Chinese Academy of Sciences > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Sat Jan 21 11:53:19 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 21 Jan 2006 21:53:19 +0200 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing filewarning Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B593@mtlexch01.mtl.com> Hi Pete, Thanks for the pkconfig pointer. I will look it up. The reason for the dependency is that libosmcom which is a set of utilities used by OpenSM and ibis was designed such that the interface for debug and non debug is not identical while the library name is the same. I think that it makes more sense to either make the APIs the same or use a different library name. But maybe pkconfig will solve that elegantly. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Saturday, January 21, 2006 1:16 AM > To: Pete Wyckoff > Cc: openib-general at openib.org > Subject: Re: [openib-general] respect CFLAGS in ibis/ibdm, fix missing filewarning > > On Fri, 2006-01-20 at 17:56, Pete Wyckoff wrote: > > halr at voltaire.com wrote on Fri, 20 Jan 2006 16:19 -0500: > > > On Thu, 2006-01-19 at 16:53, Pete Wyckoff wrote: > > > > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > > > > ibdm below gen2/utils/src/linux-user/. > > > > > > > > The third chunk below avoids a configure warning for a file > > > > "osm_build_id.h" that appears nowhere in my source or build tree. > > > > > > osm_build_id.h is a biproduct of the OpenSM build. > > > > Oh, I see now. I was building in osm/complib, osm/libvendor and > > osm/opensm directly in my build script. Missed the fact that the > > top level osm/Makefile.am created a file too. Thanks for pointing > > that out. > > > > In this case, it seems a bit odd that compiling ibis properly > > requires knowing if osm was built with DEBUG, but it's not a big > > deal. pkgconfig (http://pkgconfig.freedesktop.org/wiki/) is one > > way some distributions manage these interdependencies, if you're > > interested. > > Yes, that's a question for Eitan. 
> > -- Hal > > > -- Pete > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Sat Jan 21 12:47:15 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sat, 21 Jan 2006 14:47:15 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: Message-ID: <1137876435.7119.17.camel@strider.opengridcomputing.com> On Sat, 2006-01-21 at 09:13 -0800, Sean Hefty wrote: > >> I don't see much urgency in merging it now. When svn diverges from > >> what's upstream in the kernel, it makes my life harder because I have > >> to figure out which patches belong upstream and sometimes merge things > >> by hand (when they hit the divergent regions). > > > >The easy solution here is not to diverge. Unless the iWARP support > >regresses IB functionality, it does no harm and creates a single > >software core for both iWARP and IB developers to bring new drivers to > >market. > > Until iWarp is integrated with the kernel, It thought the approach was branch --> trunk --> kernel. What am I missing here? > the code will diverge however. And I > agree with Roland, merging diverged code upstream is a pain. No argument here.
Merging code downstream is a pain too ;-) > I'm definitely > willing to re-organize the code to make it easier to maintain the code out of > the tree. Also, if we can isolate the IB/iWarp code into separate files, then > it's not a big issue pushing changes upstream. Making the code more modular is a good idea anyway. The provider and CM are already in separate files. At some point, though there is a single API and these files will have code for both transports (e.g. ib_verbs.h). One way to modularize the CMA is to have transport CM's register with the CMA and force all calls through function pointers ala verbs. > - Sean From rdreier at cisco.com Sat Jan 21 14:00:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 14:00:56 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix sgid for port 2 mad In-Reply-To: <20060118091313.GY22260@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jan 2006 11:13:13 +0200") References: <20060118091313.GY22260@mellanox.co.il> Message-ID: Thanks, applied. From rolandd at cisco.com Sat Jan 21 14:03:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 22:03:10 +0000 Subject: [openib-general] [git patch review 1/5] IPoIB: Make sure path is fully initialized before using it Message-ID: <1137880990999-28a2de7670074e8b@cisco.com> The SA path record query completion can initialize path->pathrec.dlid before IPoIB's callback runs and initializes path->ah, so we must test ah rather than dlid. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) 47f7a0714b67b904a3a36e2f2d85904e8064219b diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index fd3f5c8..c3b5f79 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -505,7 +505,7 @@ static void neigh_add_path(struct sk_buf list_add_tail(&neigh->list, &path->neigh_list); - if (path->pathrec.dlid) { + if (path->ah) { kref_get(&path->ah->ref); neigh->ah = path->ah; @@ -591,7 +591,7 @@ static void unicast_arp_send(struct sk_b return; } - if (path->pathrec.dlid) { + if (path->ah) { ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); -- 1.1.3 From rolandd at cisco.com Sat Jan 21 14:03:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 22:03:10 +0000 Subject: [openib-general] [git patch review 5/5] IB/mthca: Use correct GID in MADs sent on port 2 In-Reply-To: <1137880990999-7f911bca79a83d08@cisco.com> Message-ID: <1137880990999-4d027d1b419c13b2@cisco.com> mthca_create_ah() includes the port number in the GID index. The reverse needs to be done in mthca_read_ah(). Noted by Hal Rosenstock. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_av.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) f9e61929e5e1dacc2afefbde6abc3e6571ca2887 diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index a14eed0..a19e0ed 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -184,7 +184,7 @@ int mthca_read_ah(struct mthca_dev *dev, ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); ib_get_cached_gid(&dev->ib_dev, be32_to_cpu(ah->av->port_pd) >> 24, - ah->av->gid_index, + ah->av->gid_index % dev->limits.gid_table_len, &header->grh.source_gid); memcpy(header->grh.destination_gid.raw, ah->av->dgid, 16); -- 1.1.3 From rolandd at cisco.com Sat Jan 21 14:03:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 22:03:10 +0000 Subject: [openib-general] [git patch review 2/5] IB/uverbs: Flush scheduled work before unloading module In-Reply-To: <1137880990999-28a2de7670074e8b@cisco.com> Message-ID: <1137880990999-7ca1217bcd8a8383@cisco.com> uverbs might schedule work to clean up when a file is closed. Make sure that this work runs before allowing module text to go away. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs_main.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) cc76e33ec98ee2acab2d10828d31588d1b10f274 diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 96ea79b..903f85a 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -902,6 +902,7 @@ static void __exit ib_uverbs_cleanup(voi unregister_filesystem(&uverbs_event_fs); class_destroy(uverbs_class); unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES); + flush_scheduled_work(); idr_destroy(&ib_uverbs_pd_idr); idr_destroy(&ib_uverbs_mr_idr); idr_destroy(&ib_uverbs_mw_idr); -- 1.1.3 From rolandd at cisco.com Sat Jan 21 14:03:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 22:03:10 +0000 Subject: [openib-general] [git patch review 4/5] IPoIB: Lock accesses to multicast packet queues In-Reply-To: <1137880990999-449ff8b55b88bcaa@cisco.com> Message-ID: <1137880990999-7f911bca79a83d08@cisco.com> Avoid corrupting mcast->pkt_queue by serializing access with priv->tx_lock. Also, update dropped packet statistics to count multicast packets removed from pkt_queue as dropped. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 25 +++++++++++++++++++++--- 1 files changed, 22 insertions(+), 3 deletions(-) b36f170b617a7cd147b694dabf504e856a50ee9d diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 98039da..ccaa0c3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -97,6 +97,7 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; + int tx_dropped = 0; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -123,8 +124,14 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) + while (!skb_queue_empty(&mcast->pkt_queue)) { + ++tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } + + spin_lock_irqsave(&priv->tx_lock, flags); + priv->stats.tx_dropped += tx_dropped; + spin_unlock_irqrestore(&priv->tx_lock, flags); kfree(mcast); } @@ -276,8 +283,10 @@ static int ipoib_mcast_join_finish(struc } /* actually send any queued packets */ + spin_lock_irq(&priv->tx_lock); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + spin_unlock_irq(&priv->tx_lock); skb->dev = dev; @@ -288,7 +297,9 @@ static int ipoib_mcast_join_finish(struc if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + spin_lock_irq(&priv->tx_lock); } + spin_unlock_irq(&priv->tx_lock); return 0; } @@ -300,6 +311,7 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -310,8 +322,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), 
status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) + spin_lock_irq(&priv->tx_lock); + while (!skb_queue_empty(&mcast->pkt_queue)) { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); + } + spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); @@ -687,6 +703,7 @@ void ipoib_mcast_send(struct net_device if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -700,8 +717,10 @@ void ipoib_mcast_send(struct net_device if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); - else + else { + ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); + } if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " -- 1.1.3 From rolandd at cisco.com Sat Jan 21 14:03:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 22:03:10 +0000 Subject: [openib-general] [git patch review 3/5] IB/sa_query: Flush scheduled work before unloading module In-Reply-To: <1137880990999-7ca1217bcd8a8383@cisco.com> Message-ID: <1137880990999-449ff8b55b88bcaa@cisco.com> sa_query schedules work on IB asynchronous events. After unregistering the async event handler, make sure that this work has completed before releasing the IB device (and possibly allowing the sa_query module text to go away). Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/sa_query.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) 0f47ae0b3ec35dc5f4723f2e0ad0f6f3f55e9bcd diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index acda7d6..501cc05 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -956,6 +956,8 @@ static void ib_sa_remove_one(struct ib_d ib_unregister_event_handler(&sa_dev->event_handler); + flush_scheduled_work(); + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); -- 1.1.3 From swise at opengridcomputing.com Sat Jan 21 15:14:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 21 Jan 2006 17:14:44 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP References: <1137876435.7119.17.camel@strider.opengridcomputing.com> Message-ID: <001001c61ee0$786943f0$020010ac@haggard> Is it possible to just go ahead and push the core iwarp stuff into kernel.org? ----- Original Message ----- From: "Tom Tucker" To: "Sean Hefty" Cc: "Roland Dreier" ; Sent: Saturday, January 21, 2006 2:47 PM Subject: RE: [openib-general] Re: [PATCH] CMA and iWARP > On Sat, 2006-01-21 at 09:13 -0800, Sean Hefty wrote: >> >> I don't see much urgency in merging it now. When svn diverges >> >> from >> >> what's upstream in the kernel, it makes my life harder because I >> >> have >> >> to figure out which patches belong upstream and sometimes merge >> >> things >> >> by hand (when they hit the divergent regions). >> > >> >The easy solution here is not to diverge. Unless the iWARP support >> >regresses IB functionality, it does no harm and creates a single >> >software core for both iWARP and IB developers to bring new drivers >> >to >> >market. >> >> Until iWarp is integrated with the kernel, > > It thought the approach was branch --> trunk --> kernel. What am I > missing here? 
> >> the code will diverge however. And I >> agree with Roland, merging diverged code upstream is a pain. > > No argument here. Merging code downstream is a pain too ;-) > >> I'm definitely >> willing to re-organize the code to make it easier to maintain the >> code out of >> the tree. Also, if we can isolate the IB/iWarp code into separate >> files, then >> it's not a big issue pushing changes upstream. > > Making the code more modular is a good idea anyway. The provider and > CM > are already in separate files. At some point, though there is a single > API and these files will have code for both transports (e.g. > ib_verbs.h). One way to modularize the CMA is to have transport CM's > register with the CMA and force all calls through function pointers > ala > verbs. > >> - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Sat Jan 21 15:37:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 21 Jan 2006 15:37:12 -0800 Subject: [openib-general] [PATCH 3/5] [RFC] Infiniband: connection abstraction In-Reply-To: (Sean Hefty's message of "Tue, 17 Jan 2006 15:28:17 -0800") References: Message-ID: BTW, it's probably worth highlighting these parts of this patch: First off, ip_dev_find() is exported again: > --- linux-2.6.git/net/ipv4/fib_frontend.c 2006-01-16 10:28:29.000000000 -0800 > +++ linux-2.6.ib/net/ipv4/fib_frontend.c 2006-01-16 16:14:24.000000000 -0800 > @@ -666,4 +666,5 @@ void __init ip_fib_init(void) > } > > EXPORT_SYMBOL(inet_addr_type); > +EXPORT_SYMBOL(ip_dev_find); > EXPORT_SYMBOL(ip_rt_ioctl); And then it's used here: > +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) > +{ > + struct net_device *dev; > + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; > + int ret; > + > + dev = 
ip_dev_find(ip); > + if (!dev) > + return -EADDRNOTAVAIL; > + > + ret = copy_addr(dev_addr, dev, NULL); > + dev_put(dev); > + return ret; > +} And also here to find the local device to use when connecting to a loopback address: > +static int addr_resolve_local(struct sockaddr_in *src_in, > + struct sockaddr_in *dst_in, > + struct rdma_dev_addr *addr) > +{ > + struct net_device *dev; > + u32 src_ip = src_in->sin_addr.s_addr; > + u32 dst_ip = dst_in->sin_addr.s_addr; > + int ret; > + > + dev = ip_dev_find(dst_ip); > + if (!dev) > + return -EADDRNOTAVAIL; > + > + if (!src_ip) { > + src_in->sin_family = dst_in->sin_family; > + src_in->sin_addr.s_addr = dst_ip; > + ret = copy_addr(addr, dev, dev->dev_addr); > + } else { > + ret = rdma_translate_ip((struct sockaddr *)src_in, addr); > + if (!ret) > + memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); > + } > + > + dev_put(dev); > + return ret; > +} I don't really have an opinion one way or another about this usage, but I think it's a good idea to make sure that this stuff doesn't get lost in all the other code, since it is (re)exporting a function that is currently private to networking. - R. From dotanb at mellanox.co.il Sat Jan 21 22:31:28 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 22 Jan 2006 08:31:28 +0200 Subject: [openib-general] [multicast]examples using multicast Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3012D6992@mtlexch01.mtl.com> Hi all. > On Wed, 2006-01-18 at 20:17, Ian Jiang wrote: > > On 1/18/06, Hal Rosenstock wrote: > > Multicast verbs would be used to send data from > user space. Is > > that what you are looking for ? > > Yes! a small example of a how to use multicast in IB can be found in: /trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test just email me if you have any question. 
Dotan From yael at mellanox.co.il Sun Jan 22 00:29:51 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 22 Jan 2006 10:29:51 +0200 Subject: [openib-general] RE: [PATCH] Opensm - duplicated guids handling Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD46@mtlexch01.mtl.com> Hi Hal, The configuration is 2 HCAs with duplicated GUIDs on 2 different machines, which are connected back-2-back. Regarding the duplicated guids issue itself - as you said, it is a fundamental violation. The problem is that we've had cases where there was such a violation, and since OpenSM didn't give a clear enough error message - time was wasted trying to debug why OpenSM doesn't configure the subnet correctly. I have done testing to make sure that the different cases of duplication of guids are handled, both on subnets with switches, and on back-2-back machines. This was the one remaining problem. Using O_NONBLOCK works fine for me. I will send a patch separately with this fix instead of the original one. Thanks, Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, January 19, 2006 4:51 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - duplicated guids handling Hi Yael, On Thu, 2006-01-19 at 07:08, Yael Kalka wrote: > Hi Hal, > > We've noticed that currently if we have 2 hcas with duplicated guids I renew my comment about duplicated GUIDs. This is a pretty fundamental thing that MUST not be violated per the IBA spec. I understand there are processes in place that make the duplication more error prone than it should be. If we go down this path, there are other things that fall into this category and I believe this to be a slippery slope. I am still willing to go ahead with this patch or some variant of it. Some questions embedded in the patch. > connected back-2-back, opensm gets stuck. Not sure I quite understand the configuration.
Are the two HCAs with the duplicated guids in the same machine and connected back to back ? Is that the case you are referring to ? > The reason for that is that > in osm_vendor_set_sm() function - the second call trying to open the > /dev/infiniband/issm%id is stuck, since this file is already open. > The following patch fixes 2 things - > 1. In osm_node_info_rcv.c - we've added a case that on cases of > duplicated guids - exit (unless a flag is set otherwise). Add this > exiting code also to the case where the nodes are connected back-2-back. > 2. In osm_vendor_ibumad.c - add a static variable to avoid trying to > open /dev/inifiniband/issm%d file twice during the run of opensm. The problem is that the second open hangs, right ? So rather than the changes to osm_vendor_ibumad.c below change the flags on the open from 0 to O_NONBLOCK ? Does that work for you ? If so, I will commit that approach with the change below to osm_node_info_rcv.c. Please let me know. Thanks. -- Hal > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4951) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -1142,8 +1142,11 @@ osm_vendor_set_sm( > osm_umad_bind_info_t *p_bind = (osm_umad_bind_info_t *)h_bind; > osm_vendor_t *p_vend = p_bind->p_vend; > char issmstring[24]; > + static boolean_t osm_vendor_set_sm_indicator = FALSE; > > OSM_LOG_ENTER( p_vend->p_log, osm_vendor_set_sm ); > + if (is_sm_val == FALSE || osm_vendor_set_sm_indicator == FALSE) I may have a comment on this based on the answer to the below. 
> + { > sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id); > if (TRUE == is_sm_val) { > p_vend->issmfd = open(issmstring, 0); > @@ -1162,6 +1165,15 @@ osm_vendor_set_sm( > " mask failed: errno %d\n", errno); > p_vend->issmfd = -1; > } > + if ( osm_vendor_set_sm_indicator == FALSE ) > + osm_vendor_set_sm_indicator = TRUE; > + } > + else > + { > + osm_log(p_vend->p_log, OSM_LOG_ERROR, > + "osm_vendor_set_sm: ERR 5436: " > + "Trying to set IS_SM capability mask again\n"); > + } > OSM_LOG_EXIT( p_vend->p_log ); > } Does osm_vendor_set_sm_indicator ever needs to be reset to FALSE ? > Index: opensm/osm_node_info_rcv.c > =================================================================== > --- opensm/osm_node_info_rcv.c (revision 4951) > +++ opensm/osm_node_info_rcv.c (working copy) > @@ -229,6 +229,14 @@ __osm_ni_rcv_set_links( > osm_dump_dr_path(p_rcv->p_log, > osm_physp_get_dr_path_ptr(p_physp), > OSM_LOG_ERROR); > + > + osm_log( p_rcv->p_log, OSM_LOG_SYS, > + "Errors on subnet. Duplicate GUID found " > + "by link from a port to itself. " > + "See osm log for more details\n"); > + > + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) > + exit( 1 ); > } > else > { > From yael at mellanox.co.il Sun Jan 22 00:26:12 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 22 Jan 2006 10:26:12 +0200 Subject: [openib-general] [PATCH] Opensm - duplicated guids handling - new Message-ID: <5zzmlo1xej.fsf@mtl066.yok.mtl.com> Hi Hal, Here is a patch using O_NONBLOCK instead of the static variable. 
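[Editorial sketch, not part of the original message.] The point of passing O_NONBLOCK is that open() returns an error instead of hanging when the open cannot complete right away. The issm device itself is not reproduced here; as a stand-in, this sketch assumes a POSIX FIFO with no reader, where a blocking write-side open would wait and a non-blocking one must fail with ENXIO. The helper names and the scratch path are hypothetical. Note also that O_RDONLY is 0 on Linux, so open(path, O_NONBLOCK) as in the patch amounts to O_RDONLY | O_NONBLOCK.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` for writing without blocking.
 * Returns the descriptor on success, or -errno on failure.
 */
static int open_writer_nonblock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_NONBLOCK);
    return fd >= 0 ? fd : -errno;
}

/*
 * Creates a scratch FIFO with no reader attached and tries to open it for
 * writing. A blocking open() would wait here until a reader appears; with
 * O_NONBLOCK, POSIX requires open() to fail immediately with ENXIO.
 * Returns the result of open_writer_nonblock(), or -errno if setup fails.
 * The FIFO only stands in for a device whose open can block; it does not
 * model the issm driver.
 */
static int demo_nonblocking_open(void)
{
    char dir[] = "/tmp/nbopenXXXXXX";
    char fifo[64];
    int rc;

    if (mkdtemp(dir) == NULL)
        return -errno;
    snprintf(fifo, sizeof(fifo), "%s/fifo", dir);
    if (mkfifo(fifo, 0600) != 0) {
        rc = -errno;
        rmdir(dir);
        return rc;
    }

    rc = open_writer_nonblock(fifo);
    if (rc >= 0)
        close(rc);

    /* Clean up the scratch FIFO and its directory. */
    unlink(fifo);
    rmdir(dir);
    return rc;
}
```

Calling demo_nonblocking_open() from a small main() and printing strerror() of the negated result is enough to see the failure come back immediately rather than the process hanging.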
Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4951) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -1146,7 +1146,7 @@ osm_vendor_set_sm( OSM_LOG_ENTER( p_vend->p_log, osm_vendor_set_sm ); sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id); if (TRUE == is_sm_val) { - p_vend->issmfd = open(issmstring, 0); + p_vend->issmfd = open(issmstring, O_NONBLOCK); if (p_vend->issmfd < 0) { osm_log(p_vend->p_log, OSM_LOG_ERROR, "osm_vendor_set_sm: ERR 5431: " Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 4951) +++ opensm/osm_node_info_rcv.c (working copy) @@ -229,6 +229,14 @@ __osm_ni_rcv_set_links( osm_dump_dr_path(p_rcv->p_log, osm_physp_get_dr_path_ptr(p_physp), OSM_LOG_ERROR); + + osm_log( p_rcv->p_log, OSM_LOG_SYS, + "Errors on subnet. Duplicate GUID found " + "by link from a port to itself. " + "See osm log for more details\n"); + + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) + exit( 1 ); } else { From yael at mellanox.co.il Sun Jan 22 04:29:18 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 22 Jan 2006 14:29:18 +0200 Subject: [openib-general] [PATCH] Opensm - running with console option Message-ID: <5zy8181m5d.fsf@mtl066.yok.mtl.com> Hi Hal, I've noticed that when running opensm with --console, it exits with a message that option `-console' requires an argument. But I saw in the main.c that such argument isn't used. The following patch removes this dependency. 
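[Editorial sketch, not part of the original message.] The behaviour described above comes from the has_arg field of struct option: 1 (required_argument) makes the parser reject a bare --console, while 0 (no_argument) accepts it. The sketch below assumes plain getopt_long() with an empty short-option string; opensm's real option table and parsing loop are larger, and parse_console() is a hypothetical helper written only for this illustration.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Parses argv against a single long option, "console", whose has_arg field
 * is supplied by the caller (no_argument or required_argument).
 * Returns 'q' if --console was accepted, '?' if getopt_long() rejected it,
 * and -1 if no option was present.
 */
static int parse_console(int has_arg, int argc, char *argv[])
{
    const struct option long_opts[] = {
        { "console", has_arg, NULL, 'q' },
        { NULL, 0, NULL, 0 }
    };

    assert(argv != NULL);
    optind = 1;   /* restart scanning so the helper can be called repeatedly */
    opterr = 0;   /* keep getopt_long() from printing its own diagnostics */

    return getopt_long(argc, argv, "", long_opts, NULL);
}
```

With argv = { "opensm", "--console" }, calling the helper with required_argument yields '?', which corresponds to the "requires an argument" message, and calling it with no_argument yields 'q'.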
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/main.c =================================================================== --- opensm/main.c (revision 4951) +++ opensm/main.c (working copy) @@ -489,7 +489,7 @@ main( { "log_file", 1, NULL, 'f'}, { "erase_log_file",0, NULL, 'e'}, { "maxsmps", 1, NULL, 'n'}, - { "console", 1, NULL, 'q'}, + { "console", 0, NULL, 'q'}, { "V", 0, NULL, 'V'}, { "help", 0, NULL, 'h'}, { "once", 0, NULL, 'o'}, From r.badrinath at gmail.com Sun Jan 22 23:46:03 2006 From: r.badrinath at gmail.com (R. Badrinath) Date: Mon, 23 Jan 2006 13:16:03 +0530 Subject: [openib-general] Looking for a Mellanox VAPI example C/C++ code Message-ID: <619061180601222346y315c7fc5y9023052519ee77f3@mail.gmail.com> Hi all, -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.badrinath at gmail.com Sun Jan 22 23:48:25 2006 From: r.badrinath at gmail.com (R. Badrinath) Date: Mon, 23 Jan 2006 13:18:25 +0530 Subject: [openib-general] Looking for a Mellanox VAPI example C/C++ code Message-ID: <619061180601222348x44cf2ea3g3694c03181ce22af@mail.gmail.com> Hi all, I am a new member on this list. I am wondering if anyone has pointers on this, following the earlier posting I saw on the archives (Feb last year). Regards and thanks, -Badri -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sun Jan 22 23:54:08 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 23 Jan 2006 09:54:08 +0200 Subject: [openib-general] Looking for a Mellanox VAPI example C/C++ code Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3012D6CD2@mtlexch01.mtl.com> Hi Badri. Hi all, I am a new member on this list. I am wondering if anyone has pointers on this, following the earlier posting I saw on the archives (Feb last year). Regards and thanks, -Badri In the vapi driver there are some examples that uses this driver. Dotan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Jan 23 04:18:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 07:18:41 -0500 Subject: [openib-general] respect CFLAGS in OSM In-Reply-To: <20060119215025.GA2620@osc.edu> References: <20060119215025.GA2620@osc.edu> Message-ID: <1138018719.4338.34398.camel@hal.voltaire.com> On Thu, 2006-01-19 at 16:50, Pete Wyckoff wrote: > I do something like: > > CFLAGS=-g ./configure ... > > to build a debug tree from openib svn. > > Some places override this CFLAGS setting, though, applying > optimization even though I explicitly do not want it. This patch > fixes that. These apply to OSM below gen2/trunk/src/userspace/. Thanks. Applied. > Signed-off-by: Pete Wyckoff From mst at mellanox.co.il Mon Jan 23 05:19:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 15:19:59 +0200 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060109065012.GJ2064@us.ibm.com> References: <20060109065012.GJ2064@us.ibm.com> Message-ID: <20060123131959.GC24474@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Userspace testing results (many kernels,many svn trees) > > On 09.01.2006 [08:33:01 +0200], Michael S. Tsirkin wrote: > > Quoting r. Nishanth Aravamudan : > > > Just like with the results I posted earlier, all the perftest results > > > are seriously wrong for 32-bit clients (with both 32-bit and 64-bit > > > servers).
I am not sure who else to notify beyond the general list (is > > > there a corresponding MAINTAINERS files like in the kernel proper for > > > the OpenIB code?) > > > > That would be me - sorry about the delay, I'll take a look at this. > > Thanks a lot, Nishanth! > > This work is very much appreciated. > > No worries, hope the problem is not too hard to fix :) OK, I'm going to concentrate on rdma_lat/rdma_bw for now. # file ./rdma_bw ./rdma_bw: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped # ./rdma_bw swlab155 local address: LID 0xd9, QPN 0x24040a, PSN 0x98e7f4 RKey 0xe0003f VAddr 0x000000556db000 remote address: LID 0xd9, QPN 0x240406, PSN 0x952ea0, RKey 0xe00033 VAddr 0x000000556db000 Bandwidth peak (#0 to #999): 879.956 MB/sec Bandwidth average: 879.954 MB/sec Service Demand peak (#0 to #999): 3773 cycles/KB Service Demand Avg : 37 cycles/KB Seems like I can't reproduce the problem. Which distribution and CPU architecture is this, again? -- MST From eitan at mellanox.co.il Mon Jan 23 05:27:13 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jan 2006 15:27:13 +0200 Subject: [openib-general] respect CFLAGS in OSM Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com> Hi Pete, I have looked again at this patch and what it is changing. My understanding is that you found the -g -O2 CFLAGS (provided through the specific target CFLAGS) unneeded. You also think they will interfere with settings you might want to provide from the command line. I have just double checked what I knew to be the rule for autoconf: If the user provides CFLAGS or LDFLAGS from the command line - they are appended to the compile or link flags. The impact on gcc is that the later settings - i.e. those provided by the user - take precedence over the flags provided at the beginning of the command line. So the patch below is actually not needed.
Just to convince you I attach some gcc traces showing that -O0 -O2 acts like -O2 and -O2 -O0 acts like -O0. Bottom line I would like to keep the code as it is without any change such that default installation will use the -O2 mode. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O0 -O2 -version -o /tmp/ccet3OkS.s GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) compiled by GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8). GGC heuristics: --param ggc-min-expand=99 --param ggc-min-heapsize=129317 options passed: -auxbase -O0 -O2 options enabled: -falign-loops -fargument-alias -fbranch-count-reg -fcaller-saves -fcommon -fcprop-registers -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdefer-pop -fdelete-null-pointer-checks -feliminate-unused-debug-types -fexpensive-optimizations -fforce-mem -ffunction-cse -fgcse -fgcse-lm -fguess-branch-probability -fident -fif-conversion -fif-conversion2 -fivopts -fkeep-static-consts -fleading-underscore -floop-optimize -floop-optimize2 -fmath-errno -fmerge-constants -foptimize-register-move -foptimize-sibling-calls -fpcc-struct-return -fpeephole -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -frerun-loop-opt -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsplit-ivs-in-unroller -fstrength-reduce -fstrict-aliasing -fthread-jumps -ftrapping-math -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-lrs -ftree-pre -ftree-sra -ftree-ter -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss -m80387 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 -mno-red-zone -mtls-direct-seg-refs -mtune=i386 -march=i386 
swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O2 -O0 -version -o /tmp/ccet3OkS.s GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) compiled by GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8). GGC heuristics: --param ggc-min-expand=99 --param ggc-min-heapsize=129317 options passed: -auxbase -O2 -O0 options enabled: -falign-loops -fargument-alias -fbranch-count-reg -fcommon -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -fivopts -fkeep-static-consts -fleading-underscore -floop-optimize2 -fmath-errno -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsplit-ivs-in-unroller -ftrapping-math -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss -m80387 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 -mno-red-zone -mtls-direct-seg-refs -mtune=i386 -march=i386 > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Pete Wyckoff > Sent: Thursday, January 19, 2006 11:50 PM > To: openib-general at openib.org > Subject: [openib-general] respect CFLAGS in OSM > > I do something like: > > CFLAGS=-g ./configure ... > > to build a debug tree from openib svn. > > Some places override this CFLAGS setting, though, applying > optimization even though I explicitly do not want it. This patch > fixes that. These apply to OSM below gen2/trunk/src/userspace/. 
> > Signed-off-by: Pete Wyckoff > > Index: management/osm/libvendor/Makefile.am > =================================================================== > --- management/osm/libvendor/Makefile.am (revision 5098) > +++ management/osm/libvendor/Makefile.am (working copy) > @@ -3,8 +3,6 @@ > > if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > -else > -DBGFLAGS = -g -O2 > endif > > INCLUDES = $(OSMV_INCLUDES) > Index: management/osm/complib/Makefile.am > =================================================================== > --- management/osm/complib/Makefile.am (revision 5098) > +++ management/osm/complib/Makefile.am (working copy) > @@ -5,8 +5,6 @@ > > if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > -else > -DBGFLAGS = -g -O2 > endif > > libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 - > D_BSD_SOURCE=1 > Index: management/osm/opensm/Makefile.am > =================================================================== > --- management/osm/opensm/Makefile.am (revision 5098) > +++ management/osm/opensm/Makefile.am (working copy) > @@ -5,8 +5,6 @@ > > if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > -else > -DBGFLAGS = -g -O2 > endif > > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Mon Jan 23 05:29:01 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jan 2006 15:29:01 +0200 Subject: [openib-general] respect CFLAGS in ibis/ibdm, fix missing file warning Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BD@mtlexch01.mtl.com> Hi Pete, Please see my response to the similar patch for OpenSM. I think you can still apply your optimization flags simply by using the CFLAGS at the command line. Please double check and let me know. 
Thanks Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Pete Wyckoff > Sent: Thursday, January 19, 2006 11:53 PM > To: openib-general at openib.org > Subject: [openib-general] respect CFLAGS in ibis/ibdm,fix missing file warning > > Avoid overriding CFLAGS in ibis and ibdm. These apply to ibis and > ibdm below gen2/utils/src/linux-user/. > > The third chunk below avoids a configure warning for a file > "osm_build_id.h" that appears nowhere in my source or build tree. > > Signed-off-by: Pete Wyckoff > > Index: ibis/src/Makefile.am > =================================================================== > --- ibis/src/Makefile.am (revision 5098) > +++ ibis/src/Makefile.am (working copy) > @@ -38,7 +38,7 @@ > if DEBUG > DBG = -O0 -g -Wall -Werror > else > -DBG = -O2 -Wall > +DBG = -Wall > endif > > AM_CFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing - > fPIC > Index: ibdm/datamodel/Makefile.am > =================================================================== > --- ibdm/datamodel/Makefile.am (revision 5098) > +++ ibdm/datamodel/Makefile.am (working copy) > @@ -60,8 +60,6 @@ > # Support debug mode through config variable > if DEBUG > DBG = -O0 -g > -else > -DBG = -O2 > endif > > # We have a special mode where we know the package will be eventually moved > Index: ibis/config/osm.m4 > =================================================================== > --- ibis/config/osm.m4 (revision 5098) > +++ ibis/config/osm.m4 (working copy) > @@ -156,6 +156,8 @@ > AM_CONDITIONAL(OSM_VENDOR_SIM, test $OSM_VENDOR = sim) > AM_CONDITIONAL(OSM_BUILD_OPENIB, test $OSM_BUILD = openib) > > +if test -f $osm_include_dir/opensm/osm_build_id.h; then > + > dnl validate the defined path - so the build id header is there > 
AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,, > AC_MSG_ERROR([OSM: could not find > $with_osm/include/opensm/osm_build_id.h])) > @@ -168,6 +170,9 @@ > else > osm_debug_flags= > fi > +else > + osm_debug_flags= > +fi > > OSM_CFLAGS="-I$osm_include_dir $osm_extra_includes $osm_debug_flags > $osm_vendor_sel -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1" > From halr at voltaire.com Mon Jan 23 05:39:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 08:39:56 -0500 Subject: [openib-general] respect CFLAGS in OSM In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com> Message-ID: <1138023594.4338.34806.camel@hal.voltaire.com> On Mon, 2006-01-23 at 08:27, Eitan Zahavi wrote: > Hi Pete, > > I have looked again at this patch and what it is changing. > My understanding is that you found the -g -O2 CFLAGS (provided through > the specific target CFLAGS) unneeded. You also think they will interfere > with settings you might want to provide from the command line. > > I have just double checked what I knew to be the rule for autoconf: > If the user provides CFLAGS or LDFLAGS from the command line, they are > appended to the compile or link flags. The impact on gcc is that the > later settings, i.e. those provided by the user, take precedence over > the flags provided at the beginning of the command line. So the patch > below is actually not needed. > > Just to convince you I attach some gcc traces showing that -O0 -O2 acts > like -O2 and > -O2 -O0 acts like -O0. Yes, it does override. What about the -g setting ? Should that stay or go ?
-- Hal > Bottom line I would like to keep the code as it is without any change > such that default installation will use the -O2 mode. From eitan at mellanox.co.il Mon Jan 23 06:01:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jan 2006 16:01:23 +0200 Subject:
[openib-general] respect CFLAGS in OSM Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BE@mtlexch01.mtl.com> > > > > Just to convince you I attach some gcc traces showing that -O0 -O2 acts > > like -O2 and > > -O2 -O0 acts like -O0. > > Yes, it does override. What about the -g setting ? Should that stay or > go ? [EZ] If -g is the problem we could easily remove the -g . However I would recommend having OpenSM compile -O2 by default and not rely on the user to provide that. > > -- Hal > > > Bottom line I would like to keep the code as it is without any change > > such that default installation will use the -O2 mode.
From halr at voltaire.com Mon Jan 23 05:52:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 08:52:08 -0500 Subject: [openib-general] respect CFLAGS in OSM In-Reply-To: <1138023594.4338.34806.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com> <1138023594.4338.34806.camel@hal.voltaire.com> Message-ID: <1138024326.4338.34873.camel@hal.voltaire.com> On Mon, 2006-01-23 at 08:39, Hal Rosenstock wrote: > On Mon, 2006-01-23 at 08:27, Eitan Zahavi wrote: > > Hi Pete, > > > > I have looked again at this patch and what it is changing. > > My understanding is that you found the -g -O2 CFLAGS (provided through > > the specific target CFLAGS) unneeded. You also think they will interfere > > with settings you might want to provide from the command line. > > > > I have just double checked what I knew to be the rule for autoconf: > > If the user provides CFLAGS or LDFLAGS from the command line, they are > > appended to the compile or link flags. The impact on gcc is that the > > later settings, i.e. those provided by the user, take precedence over > > the flags provided at the beginning of the command line. So the patch > > below is actually not needed. > > > > Just to convince you I attach some gcc traces showing that -O0 -O2 acts > > like -O2 and > > -O2 -O0 acts like -O0. > > Yes, it does override. Actually I'm not sure about this. I too see both but can't tell which was honored and didn't see the specific text in gcc to indicate the precedence here. Is it first option or last option of the same or something else ? -- Hal > What about the -g setting ? Should that stay or > go ?
From eitan at mellanox.co.il Mon Jan 23 06:10:50 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jan 2006 16:10:50 +0200 Subject: [openib-general] respect CFLAGS in OSM Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5C0@mtlexch01.mtl.com> > > Actually I'm not sure about this. I too see both but can't tell which > was honored and didn't see the specific text in gcc to indicate the > precedence here. Is it first option or last option of the same or > something else ? [EZ] This is exactly why I have added the traces from running gcc with the two options. If you look just below you will see the verbose report from cc1 showing that the last option wins by showing the exact list of optimization options used by the compiler. I have tested that on multiple gcc versions. (The command line itself is reported when you run gcc -v. Then you need to remove the -quiet flag to get the correct level of verbosity.) > > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O0 -O2 > > > -version -o /tmp/ccet3OkS.s > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > compiled by GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8).
> > > GGC heuristics: --param ggc-min-expand=99 --param > > > ggc-min-heapsize=129317 > > > options passed: -auxbase -O0 -O2 > > > options enabled: -falign-loops -fargument-alias -fbranch-count-reg > > > -fcaller-saves -fcommon -fcprop-registers -fcrossjumping > > > -fcse-follow-jumps -fcse-skip-blocks -fdefer-pop > > > -fdelete-null-pointer-checks -feliminate-unused-debug-types > > > -fexpensive-optimizations -fforce-mem -ffunction-cse -fgcse -fgcse-lm > > > -fguess-branch-probability -fident -fif-conversion -fif-conversion2 > > > -fivopts -fkeep-static-consts -fleading-underscore -floop-optimize > > > -floop-optimize2 -fmath-errno -fmerge-constants > > > -foptimize-register-move > > > -foptimize-sibling-calls -fpcc-struct-return -fpeephole -fpeephole2 > > > -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop > > > -frerun-loop-opt -fsched-interblock -fsched-spec > > > -fsched-stalled-insns-dep > > > -fsplit-ivs-in-unroller -fstrength-reduce -fstrict-aliasing > > > -fthread-jumps > > > -ftrapping-math -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce > > > -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im > > > -ftree-loop-ivcanon -ftree-loop-optimize -ftree-lrs -ftree-pre > > > -ftree-sra > > > -ftree-ter -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss > > > -m80387 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 > > > -mno-red-zone -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O2 -O0 > > > -version -o /tmp/ccet3OkS.s > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > compiled by GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8). 
> > > GGC heuristics: --param ggc-min-expand=99 --param > > > ggc-min-heapsize=129317 > > > options passed: -auxbase -O2 -O0 > > > options enabled: -falign-loops -fargument-alias -fbranch-count-reg > > > -fcommon -feliminate-unused-debug-types -ffunction-cse -fgcse-lm > > > -fident > > > -fivopts -fkeep-static-consts -fleading-underscore -floop-optimize2 > > > -fmath-errno -fpcc-struct-return -fpeephole -fsched-interblock > > > -fsched-spec -fsched-stalled-insns-dep -fsplit-ivs-in-unroller > > > -ftrapping-math -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize > > > -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss -m80387 > > > -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 -mno-red-zone > > > -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > From halr at voltaire.com Mon Jan 23 06:03:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 09:03:13 -0500 Subject: [openib-general] Re: [PATCH] Opensm - running with console option In-Reply-To: <5zy8181m5d.fsf@mtl066.yok.mtl.com> References: <5zy8181m5d.fsf@mtl066.yok.mtl.com> Message-ID: <1138024992.4338.34942.camel@hal.voltaire.com> On Sun, 2006-01-22 at 07:29, Yael Kalka wrote: > Hi Hal, > > I've noticed that when running opensm with --console, it exits with a > message that option `-console' requires an argument. But I saw in the > main.c that such argument isn't used. > The following patch removes this dependency. > > Thanks, > Yael Thanks. Applied. > Signed-off-by: Yael Kalka From halr at voltaire.com Mon Jan 23 06:06:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 09:06:41 -0500 Subject: [openib-general] respect CFLAGS in OSM In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5C0@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5C0@mtlexch01.mtl.com> Message-ID: <1138025200.4338.34962.camel@hal.voltaire.com> On Mon, 2006-01-23 at 09:10, Eitan Zahavi wrote: > > > > Actually I'm not sure about this. 
I too see both but can't tell which > > was honored and didn't see the specific text in gcc to indicate the > > precedence here. Is it first option or last option of the same or > > something else ? > [EZ] This is exactly why I have added the traces from running gcc with > the two options. > If you look just below you will see the verbose report from cc1 showing > that the last option wins by showing the exact list of optimization > options used by the compiler. I have tested that on multiple gcc > versions. (The command line itself is reported when you run gcc -v. Then > you need to remove the -quite flag to get the correct level of > verbosity. ) Got it. It's all the extra -f options that are enabled by -O2. I will revert the patch to the OSM makefiles. -- Hal > > > > > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O0 -O2 > > > > -version -o /tmp/ccet3OkS.s > > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > > compiled by GNU C version 4.0.0 20050519 (Red Hat > 4.0.0-8). 
> > > > GGC heuristics: --param ggc-min-expand=99 --param > > > > ggc-min-heapsize=129317 > > > > options passed: -auxbase -O0 -O2 > > > > options enabled: -falign-loops -fargument-alias > -fbranch-count-reg > > > > -fcaller-saves -fcommon -fcprop-registers -fcrossjumping > > > > -fcse-follow-jumps -fcse-skip-blocks -fdefer-pop > > > > -fdelete-null-pointer-checks -feliminate-unused-debug-types > > > > -fexpensive-optimizations -fforce-mem -ffunction-cse -fgcse > -fgcse-lm > > > > -fguess-branch-probability -fident -fif-conversion > -fif-conversion2 > > > > -fivopts -fkeep-static-consts -fleading-underscore > -floop-optimize > > > > -floop-optimize2 -fmath-errno -fmerge-constants > > > > -foptimize-register-move > > > > -foptimize-sibling-calls -fpcc-struct-return -fpeephole > -fpeephole2 > > > > -fregmove -freorder-blocks -freorder-functions > -frerun-cse-after-loop > > > > -frerun-loop-opt -fsched-interblock -fsched-spec > > > > -fsched-stalled-insns-dep > > > > -fsplit-ivs-in-unroller -fstrength-reduce -fstrict-aliasing > > > > -fthread-jumps > > > > -ftrapping-math -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce > > > > -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im > > > > -ftree-loop-ivcanon -ftree-loop-optimize -ftree-lrs -ftree-pre > > > > -ftree-sra > > > > -ftree-ter -funit-at-a-time -fvar-tracking > -fzero-initialized-in-bss > > > > -m80387 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 > > > > -mno-red-zone -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > > > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O2 -O0 > > > > -version -o /tmp/ccet3OkS.s > > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > > compiled by GNU C version 4.0.0 20050519 (Red Hat > 4.0.0-8). 
> > > > GGC heuristics: --param ggc-min-expand=99 --param > > > > ggc-min-heapsize=129317 > > > > options passed: -auxbase -O2 -O0 > > > > options enabled: -falign-loops -fargument-alias > -fbranch-count-reg > > > > -fcommon -feliminate-unused-debug-types -ffunction-cse -fgcse-lm > > > > -fident > > > > -fivopts -fkeep-static-consts -fleading-underscore > -floop-optimize2 > > > > -fmath-errno -fpcc-struct-return -fpeephole -fsched-interblock > > > > -fsched-spec -fsched-stalled-insns-dep -fsplit-ivs-in-unroller > > > > -ftrapping-math -ftree-loop-im -ftree-loop-ivcanon > -ftree-loop-optimize > > > > -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss -m80387 > > > > -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 > -mno-red-zone > > > > -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > > From eitan at mellanox.co.il Mon Jan 23 06:21:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jan 2006 16:21:24 +0200 Subject: [openib-general] respect CFLAGS in OSM Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5C2@mtlexch01.mtl.com> > > On Mon, 2006-01-23 at 09:10, Eitan Zahavi wrote: > > > > > > Actually I'm not sure about this. I too see both but can't tell which > > > was honored and didn't see the specific text in gcc to indicate the > > > precedence here. Is it first option or last option of the same or > > > something else ? > > [EZ] This is exactly why I have added the traces from running gcc with > > the two options. > > If you look just below you will see the verbose report from cc1 showing > > that the last option wins by showing the exact list of optimization > > options used by the compiler. I have tested that on multiple gcc > > versions. (The command line itself is reported when you run gcc -v. Then > > you need to remove the -quite flag to get the correct level of > > verbosity. ) > > Got it. It's all the extra -f options that are enabled by -O2. > > I will revert the patch to the OSM makefiles. 
[EZ] Let's wait for Pete's response. > > -- Hal > > > > > > > > > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O0 -O2 > > > > > -version -o /tmp/ccet3OkS.s > > > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > > > compiled by GNU C version 4.0.0 20050519 (Red Hat > > 4.0.0-8). > > > > > GGC heuristics: --param ggc-min-expand=99 --param > > > > > ggc-min-heapsize=129317 > > > > > options passed: -auxbase -O0 -O2 > > > > > options enabled: -falign-loops -fargument-alias > > -fbranch-count-reg > > > > > -fcaller-saves -fcommon -fcprop-registers -fcrossjumping > > > > > -fcse-follow-jumps -fcse-skip-blocks -fdefer-pop > > > > > -fdelete-null-pointer-checks -feliminate-unused-debug-types > > > > > -fexpensive-optimizations -fforce-mem -ffunction-cse -fgcse > > -fgcse-lm > > > > > -fguess-branch-probability -fident -fif-conversion > > -fif-conversion2 > > > > > -fivopts -fkeep-static-consts -fleading-underscore > > -floop-optimize > > > > > -floop-optimize2 -fmath-errno -fmerge-constants > > > > > -foptimize-register-move > > > > > -foptimize-sibling-calls -fpcc-struct-return -fpeephole > > -fpeephole2 > > > > > -fregmove -freorder-blocks -freorder-functions > > -frerun-cse-after-loop > > > > > -frerun-loop-opt -fsched-interblock -fsched-spec > > > > > -fsched-stalled-insns-dep > > > > > -fsplit-ivs-in-unroller -fstrength-reduce -fstrict-aliasing > > > > > -fthread-jumps > > > > > -ftrapping-math -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce > > > > > -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im > > > > > -ftree-loop-ivcanon -ftree-loop-optimize -ftree-lrs -ftree-pre > > > > > -ftree-sra > > > > > -ftree-ter -funit-at-a-time -fvar-tracking > > -fzero-initialized-in-bss > > > > > -m80387 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 > > > > > -mno-red-zone -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > > > > 
> > > > > > swlab25:/home/eitan/SW/work/examples>/usr/libexec/gcc/i386-redhat-linux/ > > > > > 4.0.0/cc1 getHostName.c getHostName.c -auxbase getHostName -O2 -O0 > > > > > -version -o /tmp/ccet3OkS.s > > > > > GNU C version 4.0.0 20050519 (Red Hat 4.0.0-8) (i386-redhat-linux) > > > > > compiled by GNU C version 4.0.0 20050519 (Red Hat > > 4.0.0-8). > > > > > GGC heuristics: --param ggc-min-expand=99 --param > > > > > ggc-min-heapsize=129317 > > > > > options passed: -auxbase -O2 -O0 > > > > > options enabled: -falign-loops -fargument-alias > > -fbranch-count-reg > > > > > -fcommon -feliminate-unused-debug-types -ffunction-cse -fgcse-lm > > > > > -fident > > > > > -fivopts -fkeep-static-consts -fleading-underscore > > -floop-optimize2 > > > > > -fmath-errno -fpcc-struct-return -fpeephole -fsched-interblock > > > > > -fsched-spec -fsched-stalled-insns-dep -fsplit-ivs-in-unroller > > > > > -ftrapping-math -ftree-loop-im -ftree-loop-ivcanon > > -ftree-loop-optimize > > > > > -funit-at-a-time -fvar-tracking -fzero-initialized-in-bss -m80387 > > > > > -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 > > -mno-red-zone > > > > > -mtls-direct-seg-refs -mtune=i386 -march=i386 > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Jan 23 06:46:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jan 2006 09:46:49 -0500 Subject: [openib-general] Re: [PATCH] Opensm - duplicated guids handling - new In-Reply-To: <5zzmlo1xej.fsf@mtl066.yok.mtl.com> References: <5zzmlo1xej.fsf@mtl066.yok.mtl.com> Message-ID: <1138027608.4338.35211.camel@hal.voltaire.com> On Sun, 2006-01-22 at 03:26, Yael Kalka wrote: > Hi Hal, > > Here is a patch using O_NONBLOCK instead of the static variable. Thanks. Applied. 
> Thanks,
> Yael
>
>
> Signed-off-by: Yael Kalka

From mst at mellanox.co.il Mon Jan 23 07:23:52 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 17:23:52 +0200
Subject: [openib-general] [PATCH] devinfo multiple device support
Message-ID: <20060123152352.GA28827@mellanox.co.il>

For devinfo, it makes sense to report all available devices
if the user did not specify a specific one.

---

Make ibv_devinfo list all IB devices, rather than the first device,
by default.

Signed-off-by: Dotan Barak
Signed-off-by: Michael S. Tsirkin

Index: last_stable/src/userspace/libibverbs/examples/devinfo.c
===================================================================
--- last_stable.orig/src/userspace/libibverbs/examples/devinfo.c	2006-01-22 13:19:22.000000000 +0200
+++ last_stable/src/userspace/libibverbs/examples/devinfo.c	2006-01-22 13:18:23.000000000 +0200
@@ -403,7 +403,10 @@
 			fprintf(stderr, "No IB devices found\n");
 			return -1;
 		}
-		ret |= print_hca_cap(*dev_list, ib_port);
+		while (*dev_list) {
+			ret |= print_hca_cap(*dev_list, ib_port);
+			++dev_list;
+		}
 	}
 
 	if (ib_devname)

-- 
MST

From rdreier at cisco.com Mon Jan 23 07:45:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Jan 2006 07:45:03 -0800
Subject: [openib-general] Re: [PATCH] devinfo multiple device support
In-Reply-To: <20060123152352.GA28827@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 17:23:52 +0200")
References: <20060123152352.GA28827@mellanox.co.il>
Message-ID:

thanks, applied.

From mst at mellanox.co.il Mon Jan 23 07:54:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 17:54:36 +0200
Subject: [openib-general] [PATCH] RFC: mthca handling of signals
Message-ID: <20060123155436.GB26147@mellanox.co.il>

We have run into the following problem: if a task receives a signal
while in the process of e.g.
destroying a resource (which could be because the relevant file was closed)
mthca could bail out from trying to take a command interface semaphore
without performing the appropriate command to tell hardware that the
resource is being destroyed.

As a result we see messages like

ib_mthca 0000:04:00.0: HW2SW_CQ failed (-4)

In this case, hardware could access the resource after the memory has been
freed, possibly causing memory corruption.

A simple solution is to replace down_interruptible by down in command
interface activation.

A more elegant, but much bigger, change would involve making resource
allocation command interruptible, while keeping resource cleanup commands
uninterruptible.

---

It's not safe to cancel a command upon a signal.

Signed-off-by: Michael S. Tsirkin

Index: openib/drivers/infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- openib/drivers/infiniband/hw/mthca/mthca_cmd.c	(revision 5121)
+++ openib/drivers/infiniband/hw/mthca/mthca_cmd.c	(working copy)
@@ -199,8 +199,7 @@ static int mthca_cmd_post(struct mthca_d
 {
 	int err = 0;
 
-	if (down_interruptible(&dev->cmd.hcr_sem))
-		return -EINTR;
+	down(&dev->cmd.hcr_sem);
 
 	if (event) {
 		unsigned long end = jiffies + GO_BIT_TIMEOUT;
@@ -255,8 +254,7 @@ static int mthca_cmd_poll(struct mthca_d
 	int err = 0;
 	unsigned long end;
 
-	if (down_interruptible(&dev->cmd.poll_sem))
-		return -EINTR;
+	down(&dev->cmd.poll_sem);
 
 	err = mthca_cmd_post(dev, in_param,
 			     out_param ? *out_param : 0,
@@ -333,8 +331,7 @@ static int mthca_cmd_wait(struct mthca_d
 	int err = 0;
 	struct mthca_cmd_context *context;
 
-	if (down_interruptible(&dev->cmd.event_sem))
-		return -EINTR;
+	down(&dev->cmd.event_sem);
 
 	spin_lock(&dev->cmd.context_lock);
 	BUG_ON(dev->cmd.free_head < 0);

-- 
MST

From rdreier at cisco.com Mon Jan 23 08:03:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 23 Jan 2006 08:03:03 -0800
Subject: [openib-general] [git pull] InfiniBand fixes for 2.6.16
Message-ID:

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

The pull will get the following changes:

Michael S. Tsirkin:
      IPoIB: Make sure path is fully initialized before using it
      IB/uverbs: Flush scheduled work before unloading module
      IB/sa_query: Flush scheduled work before unloading module
      IPoIB: Lock accesses to multicast packet queues
      IB/mthca: Use correct GID in MADs sent on port 2

 drivers/infiniband/core/sa_query.c             |    2 ++
 drivers/infiniband/core/uverbs_main.c          |    1 +
 drivers/infiniband/hw/mthca/mthca_av.c         |    2 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |    4 ++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   25 +++++++++++++++++++++---
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index acda7d6..501cc05 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -956,6 +956,8 @@ static void ib_sa_remove_one(struct ib_d
 
 	ib_unregister_event_handler(&sa_dev->event_handler);
 
+	flush_scheduled_work();
+
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
 		ib_unregister_mad_agent(sa_dev->port[i].agent);
 		kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 96ea79b..903f85a 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -902,6 +902,7 @@ static void __exit ib_uverbs_cleanup(voi
 	unregister_filesystem(&uverbs_event_fs);
 	class_destroy(uverbs_class);
 	unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES);
+	flush_scheduled_work();
 	idr_destroy(&ib_uverbs_pd_idr);
 	idr_destroy(&ib_uverbs_mr_idr);
 	idr_destroy(&ib_uverbs_mw_idr);
diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c
index a14eed0..a19e0ed 100644
--- a/drivers/infiniband/hw/mthca/mthca_av.c
+++ b/drivers/infiniband/hw/mthca/mthca_av.c
@@ -184,7 +184,7 @@ int mthca_read_ah(struct mthca_dev *dev,
 		ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
 	ib_get_cached_gid(&dev->ib_dev,
 			  be32_to_cpu(ah->av->port_pd) >> 24,
-			  ah->av->gid_index,
+			  ah->av->gid_index % dev->limits.gid_table_len,
 			  &header->grh.source_gid);
 	memcpy(header->grh.destination_gid.raw,
 	       ah->av->dgid, 16);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index fd3f5c8..c3b5f79 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -505,7 +505,7 @@ static void neigh_add_path(struct sk_buf
 
 	list_add_tail(&neigh->list, &path->neigh_list);
 
-	if (path->pathrec.dlid) {
+	if (path->ah) {
 		kref_get(&path->ah->ref);
 		neigh->ah = path->ah;
 
@@ -591,7 +591,7 @@ static void unicast_arp_send(struct sk_b
 		return;
 	}
 
-	if (path->pathrec.dlid) {
+	if (path->ah) {
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 98039da..ccaa0c3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -97,6 +97,7 @@ static void ipoib_mcast_free(struct ipoi
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh, *tmp;
 	unsigned long flags;
+	int tx_dropped = 0;
 
 	ipoib_dbg_mcast(netdev_priv(dev),
 			"deleting multicast group " IPOIB_GID_FMT "\n",
@@ -123,8 +124,14 @@ static void ipoib_mcast_free(struct ipoi
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);
 
-	while (!skb_queue_empty(&mcast->pkt_queue))
+	while (!skb_queue_empty(&mcast->pkt_queue)) {
+		++tx_dropped;
 		dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+	}
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	priv->stats.tx_dropped += tx_dropped;
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	kfree(mcast);
 }
@@ -276,8 +283,10 @@ static int ipoib_mcast_join_finish(struc
 	}
 
 	/* actually send any queued packets */
+	spin_lock_irq(&priv->tx_lock);
 	while (!skb_queue_empty(&mcast->pkt_queue)) {
 		struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+		spin_unlock_irq(&priv->tx_lock);
 
 		skb->dev = dev;
 
@@ -288,7 +297,9 @@ static int ipoib_mcast_join_finish(struc
 
 		if (dev_queue_xmit(skb))
 			ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n");
+		spin_lock_irq(&priv->tx_lock);
 	}
+	spin_unlock_irq(&priv->tx_lock);
 
 	return 0;
 }
@@ -300,6 +311,7 @@ ipoib_mcast_sendonly_join_complete(int s
 {
 	struct ipoib_mcast *mcast = mcast_ptr;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	if (!status)
 		ipoib_mcast_join_finish(mcast, mcmember);
@@ -310,8 +322,12 @@ ipoib_mcast_sendonly_join_complete(int s
 					IPOIB_GID_ARG(mcast->mcmember.mgid), status);
 
 		/* Flush out any queued packets */
-		while (!skb_queue_empty(&mcast->pkt_queue))
+		spin_lock_irq(&priv->tx_lock);
+		while (!skb_queue_empty(&mcast->pkt_queue)) {
+			++priv->stats.tx_dropped;
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+		}
+		spin_unlock_irq(&priv->tx_lock);
 
 		/* Clear the busy flag so we try again */
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -687,6 +703,7 @@ void ipoib_mcast_send(struct net_device
 		if (!mcast) {
 			ipoib_warn(priv, "unable to allocate memory for "
 				   "multicast structure\n");
+			++priv->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			goto out;
 		}
@@ -700,8 +717,10 @@ void ipoib_mcast_send(struct net_device
 	if (!mcast->ah) {
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
-		else
+		else {
+			++priv->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
+		}
 
 		if (mcast->query)
 			ipoib_dbg_mcast(priv, "no address vector, "

From pw at osc.edu Mon Jan 23 08:07:19 2006
From: pw at osc.edu (Pete Wyckoff)
Date: Mon, 23 Jan 2006 11:07:19 -0500
Subject: [openib-general] respect CFLAGS in OSM
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B5BC@mtlexch01.mtl.com>
Message-ID: <20060123160719.GB11313@osc.edu>

eitan at mellanox.co.il wrote on Mon, 23 Jan 2006 15:27 +0200:
> I have looked again at this patch and what it is changing.
> My understanding is that you found the -g -O2 CFLAGS (provided through
> the specific target CFLAGS) unneeded. You also think they will interfere
> with settings you might want to provide from the command line.
>
> I have just double checked what I knew to be the rule for autoconf:
> If the user provides CFLAGS or LDFLAGS from the command line - they are
> appended to the compile or link flags. The impact on gcc is that the
> later settings - i.e. those provided by the user take precedence over
> the flags provided at the beginning of the command line. So the patch
> below is actually not needed.
>
> Just to convince you I attach some gcc traces showing that -O0 -O2 acts
> like -O2 and
> -O2 -O0 acts like -O0.
>
> Bottom line I would like to keep the code as it is without any change
> such that default installation will use the -O2 mode.

Yes, later "-O" options do override previous ones. I didn't think
of explicitly disabling optimization with -O0 in my build script.
The implication is that you want me to do the following to compile a
debugging version of osm:

    CFLAGS="-g -O0" ./configure ...

rather than what I expected to work:

    CFLAGS="-g" ./configure ...

I can deal with that, and it's not a big enough concern to me now
that you've pointed out this work-around. Feel free to ignore my
suggestion, then: it's your code.

My surprise comes from having become accustomed to autoconf-based
programs that always use the user-specified CFLAGS exactly if given.
AC_PROG_CC sets it to "-g -O2" by default for gcc only if no CFLAGS
was set by the user. If I don't want -g or don't want -O2, usually
it is easy to make that happen without editing files.

		-- Pete

From mst at mellanox.co.il Mon Jan 23 08:34:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:34:15 +0200
Subject: [openib-general] [PATCH] uar size != 8M
Message-ID: <20060123163415.GC26147@mellanox.co.il>

There are some cards around that have UAR size different from 8M.
As a sanity check, compare it against the device reported limit instead.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.15/drivers/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-2.6.15.orig/drivers/infiniband/hw/mthca/mthca_main.c	2006-01-12 12:05:35.000000000 +0200
+++ linux-2.6.15/drivers/infiniband/hw/mthca/mthca_main.c	2006-01-23 21:06:47.000000000 +0200
@@ -155,6 +155,13 @@ static int __devinit mthca_dev_lim(struc
 		return -ENODEV;
 	}
 
+	if (dev_lim->uar_size > pci_resource_len(mdev->pdev, 2)) {
+		mthca_err(mdev, "HCA reported UAR size of 0x%x bigger than "
+			  "PCI resource 2 size of 0x%lx, aborting.\n",
+			  dev_lim->uar_size, pci_resource_len(mdev->pdev, 2));
+		return -ENODEV;
+	}
+
 	mdev->limits.num_ports = dev_lim->num_ports;
 	mdev->limits.vl_cap = dev_lim->max_vl;
 	mdev->limits.mtu_cap = dev_lim->max_mtu;
@@ -976,8 +983,7 @@ static int __devinit mthca_init_one(stru
 		err = -ENODEV;
 		goto err_disable_pdev;
 	}
-	if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) ||
-	    pci_resource_len(pdev, 2) != 1 << 23) {
+	if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM)) {
 		dev_err(&pdev->dev, "Missing UAR, aborting.\n");
 		err = -ENODEV;
 		goto err_disable_pdev;

-- 
MST

From mst at mellanox.co.il Mon Jan 23 08:37:37 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:37:37 +0200
Subject: [openib-general] Re: Repost [PATCH 1 of 3] move destructor to struct neigh_parms
In-Reply-To: <20060118211926.GE31280@mellanox.co.il>
References: <20060118211926.GE31280@mellanox.co.il>
Message-ID: <20060123163737.GD26147@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Move destructor from neigh_ops (which is shared between devices)
> to neigh_parms which is not, so that multiple drivers can set
> it safely.

I posted this on netdev and lkml on Wednesday, but there was no response.
I guess this patch is non-controversial since no one else in kernel uses
the destructor. Do you think we can push this into mainline?

-- 
MST

From nacc at us.ibm.com Mon Jan 23 08:40:21 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Mon, 23 Jan 2006 08:40:21 -0800
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123131959.GC24474@mellanox.co.il>
References: <20060109065012.GJ2064@us.ibm.com> <20060123131959.GC24474@mellanox.co.il>
Message-ID: <20060123164021.GB5074@us.ibm.com>

On 23.01.2006 [15:19:59 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan :
> > Subject: Re: Userspace testing results (many kernels,many svn trees)
> >
> > On 09.01.2006 [08:33:01 +0200], Michael S. Tsirkin wrote:
> > > Quoting r. Nishanth Aravamudan :
> > > > Just like with the results I posted earlier, all the perftest results
> > > > are seriously wrong for 32-bit clients (with both 32-bit and 64-bit
> > > > servers). I am not sure who else to notify beyond the general list (is
> > > > there a corresponding MAINTAINERS files like in the kernel proper for
> > > > the OpenIB code?)
> > >
> > > That would be me - sorry about the delay, I'll take a look at this.
> > > Thanks a lot, Nishanth!
> > > This work is very much appreciated.
> >
> > No worries, hope the problem is not too hard to fix :)
>
> OK, I'm going to concentrate on rdma_lat/rdma_bw for now.
>
> # file ./rdma_bw
> ./rdma_bw: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped
>
> # ./rdma_bw swlab155
> local address: LID 0xd9, QPN 0x24040a, PSN 0x98e7f4 RKey 0xe0003f VAddr 0x000000556db000
> remote address: LID 0xd9, QPN 0x240406, PSN 0x952ea0, RKey 0xe00033 VAddr 0x000000556db000
> Bandwidth peak (#0 to #999): 879.956 MB/sec
> Bandwidth average: 879.954 MB/sec
> Service Demand peak (#0 to #999): 3773 cycles/KB
> Service Demand Avg : 37 cycles/KB
>
> Seems like I can't reproduce the problem. Which distribution
> and CPU architecture is this, again?

The machines are running ppc64 kernels. They are both SLES9-SP2
userspace setups, with the infiniband components being updated to the
current svn.

Bah, I just noticed that perftests doesn't even build right now (svn
5130).

on ppc32 machines:

/usr/local/autobench/var/tmp/gen2-trunk/userspace/perftest
patching file Makefile
gcc -m32 -m32 -I/usr/local/autobench/var/tmp/out/ppc32/include -Wall -O2 -g -D_GNU_SOURCE -m32 -L/lib -L/usr/local/autobench/var/tmp/out/ppc32/lib rdma_lat.c get_clock.c -libverbs -o rdma_lat
In file included from rdma_lat.c:57:
get_clock.h:50:27: operator '||' has no right operand
get_clock.h:73:2: warning: #warning get_cycles not implemented for this architecture: attempt asm/timex.h
rdma_lat.c:517: error: parse error before "get_median"
rdma_lat.c:517: error: parse error before "cycles_t"
rdma_lat.c:518: warning: return type defaults to `int'
rdma_lat.c: In function `get_median':
rdma_lat.c:519: error: `n' undeclared (first use in this function)
rdma_lat.c:519: error: (Each undeclared identifier is reported only once
rdma_lat.c:519: error: for each function it appears in.)
rdma_lat.c:520: error: `delta' undeclared (first use in this function)
rdma_lat.c: In function `cycles_compare':
rdma_lat.c:527: error: syntax error before '*' token
rdma_lat.c:528: error: syntax error before '*' token
rdma_lat.c:529: error: `a' undeclared (first use in this function)
rdma_lat.c:529: error: `b' undeclared (first use in this function)
rdma_lat.c: At top level:
rdma_lat.c:535: error: parse error before "cycles_t"
rdma_lat.c: In function `print_report':
rdma_lat.c:538: error: `cycles_t' undeclared (first use in this function)
rdma_lat.c:538: error: parse error before "median"
rdma_lat.c:541: error: `delta' undeclared (first use in this function)
rdma_lat.c:541: error: `iters' undeclared (first use in this function)
rdma_lat.c:549: error: `tstamp' undeclared (first use in this function)
rdma_lat.c:552: error: `options' undeclared (first use in this function)
rdma_lat.c:574: error: `median' undeclared (first use in this function)
rdma_lat.c: In function `main':
rdma_lat.c:605: error: `cycles_t' undeclared (first use in this function)
rdma_lat.c:605: error: `tstamp' undeclared (first use in this function)
rdma_lat.c:747: warning: implicit declaration of function `get_cycles'
make: *** [rdma_lat] Error 1

on ppc64 machines:

gcc -m64 -m64 -I/usr/local/autobench/var/tmp/out/ppc64/include -Wall -O2 -g -D_GNU_SOURCE -m64 -L/lib64 -L/usr/local/autobench/var/tmp/out/ppc64/lib rdma_lat.c get_clock.c -libverbs -o rdma_lat
In file included from rdma_lat.c:57:
get_clock.h:50:27: operator '||' has no right operand
get_clock.h:73:2: warning: #warning get_cycles not implemented for this architecture: attempt asm/timex.h
make: *** [rdma_lat] Error 1

Once we get that sorted out, I can re-verify that perftest is/isn't
broken still.

Thanks,
Nish

From mst at mellanox.co.il Mon Jan 23 08:44:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:44:10 +0200
Subject: [openib-general] Re: ipoib_mcast_send.patch
In-Reply-To:
References:
Message-ID: <20060123164410.GA26724@mellanox.co.il>

Quoting r. Roland Dreier :
> Also should we count dropped packets here?

Right. And since it seems that we can't get by with just one bit,
here's the original patch again, with dropped packet counter fixed.

---

Fix the following race scenario:
Device is up. Port event or set mcast list triggers ipoib_mcast_stop_thread,
this cancels the query and waits on mcast "done" completion.
Completion is called and "done" is set.
Meanwhile, ipoib_mcast_send arrives and starts a new query,
re-initializing "done".

Further, there's an additional issue that I saw in testing:
ipoib_mcast_send may get called when priv->broadcast is NULL
(e.g. if the device was downed and then upped internally because
of a port event).
If this happens and the sendonly join request gets completed before
priv->broadcast is set, we get an oops.

----

Do not send multicasts if mcast thread is stopped or if
priv->broadcast is not set.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-01-23 21:24:10.000000000 +0200
+++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-01-23 21:25:19.000000000 +0200
@@ -600,6 +600,10 @@ int ipoib_mcast_start_thread(struct net_
 		queue_work(ipoib_workqueue, &priv->mcast_task);
 	mutex_unlock(&mcast_mutex);
 
+	spin_lock_irq(&priv->lock);
+	set_bit(IPOIB_MCAST_STARTED, &priv->flags);
+	spin_unlock_irq(&priv->lock);
+
 	return 0;
 }
 
@@ -610,6 +614,10 @@ int ipoib_mcast_stop_thread(struct net_d
 
 	ipoib_dbg_mcast(priv, "stopping multicast thread\n");
 
+	spin_lock_irq(&priv->lock);
+	clear_bit(IPOIB_MCAST_STARTED, &priv->flags);
+	spin_unlock_irq(&priv->lock);
+
 	mutex_lock(&mcast_mutex);
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 	cancel_delayed_work(&priv->mcast_task);
@@ -692,6 +700,12 @@ void ipoib_mcast_send(struct net_device
 	 */
 	spin_lock(&priv->lock);
 
+	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
+		++priv->stats.tx_dropped;
+		dev_kfree_skb_any(skb);
+		goto unlock;
+	}
+
 	mcast = __ipoib_mcast_find(dev, mgid);
 	if (!mcast) {
 		/* Let's create a new send only group now */
@@ -753,6 +767,7 @@ out:
 		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
 	}
 
+unlock:
 	spin_unlock(&priv->lock);
 }
 
Index: linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2006-01-23 21:24:10.000000000 +0200
+++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib.h	2006-01-23 21:24:46.000000000 +0200
@@ -85,6 +85,7 @@ enum {
 	IPOIB_FLAG_SUBINTERFACE = 4,
 	IPOIB_MCAST_RUN = 5,
 	IPOIB_STOP_REAPER = 6,
+	IPOIB_MCAST_STARTED = 7,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 

-- 
MST

From mst at mellanox.co.il Mon Jan 23 08:46:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:46:42 +0200
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123164021.GB5074@us.ibm.com>
References: <20060123164021.GB5074@us.ibm.com>
Message-ID: <20060123164642.GB26724@mellanox.co.il>

Quoting r. Nishanth Aravamudan :
> Bah, I just noticed that perftests doesn't even build right now (svn
> 5130).

Fixed, thanks.

-- 
MST

From nacc at us.ibm.com Mon Jan 23 08:49:33 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Mon, 23 Jan 2006 08:49:33 -0800
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123164642.GB26724@mellanox.co.il>
References: <20060123164021.GB5074@us.ibm.com> <20060123164642.GB26724@mellanox.co.il>
Message-ID: <20060123164933.GC5074@us.ibm.com>

On 23.01.2006 [18:46:42 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan :
> > Bah, I just noticed that perftests doesn't even build right now (svn
> > 5130).
>
> Fixed, thanks.

Great! What svn revision is it fixed in? 5159 is running the tests now,
but I can cancel and resubmit if it was fixed after that.

Thanks,
Nish

From mst at mellanox.co.il Mon Jan 23 08:53:19 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:53:19 +0200
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123164021.GB5074@us.ibm.com>
References: <20060123164021.GB5074@us.ibm.com>
Message-ID: <20060123165319.GD26724@mellanox.co.il>

Quoting Nishanth Aravamudan :
> The machines are running ppc64 kernels.

BTW, does this configuration (ppc32 on ppc64) support the mftb instruction?

-- 
MST

From mst at mellanox.co.il Mon Jan 23 08:53:58 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jan 2006 18:53:58 +0200
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123164933.GC5074@us.ibm.com>
References: <20060123164933.GC5074@us.ibm.com>
Message-ID: <20060123165358.GE26724@mellanox.co.il>

Quoting r. Nishanth Aravamudan :
> Subject: Re: Re: Userspace testing results (many kernels,many svn trees)
>
> On 23.01.2006 [18:46:42 +0200], Michael S. Tsirkin wrote:
> > Quoting r. Nishanth Aravamudan :
> > > Bah, I just noticed that perftests doesn't even build right now (svn
> > > 5130).
> >
> > Fixed, thanks.
>
> Great! What svn revision is it fixed in? 5159 is running the tests now,
> but I can cancel and resubmit if it was fixed after that.
>
> Thanks,
> Nish
>

This one:
------------------------------------------------------------------------
r5162 | mst | 2006-01-23 18:51:34 +0200 (Mon, 23 Jan 2006) | 3 lines

typo fix

-- 
MST

From nacc at us.ibm.com Mon Jan 23 08:55:51 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Mon, 23 Jan 2006 08:55:51 -0800
Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060123165319.GD26724@mellanox.co.il>
References: <20060123164021.GB5074@us.ibm.com> <20060123165319.GD26724@mellanox.co.il>
Message-ID: <20060123165551.GE5074@us.ibm.com>

On 23.01.2006 [18:53:19 +0200], Michael S. Tsirkin wrote:
> Quoting Nishanth Aravamudan :
> > The machines are running ppc64 kernels.
>
> BTW, does this configuration (ppc32 on ppc64) support the mftb instruction?

Good question :) How would I go about checking?
Thanks, Nish From nacc at us.ibm.com Mon Jan 23 09:00:47 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 09:00:47 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123165358.GE26724@mellanox.co.il> References: <20060123164933.GC5074@us.ibm.com> <20060123165358.GE26724@mellanox.co.il> Message-ID: <20060123170047.GF5074@us.ibm.com> On 23.01.2006 [18:53:58 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > > > On 23.01.2006 [18:46:42 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Nishanth Aravamudan : > > > > Bah, I just noticed that perftests doesn't even build right now (svn > > > > 5130). > > > > > > Fixed, thanks. > > > > Great! What svn revision is it fixed in? 5159 is running the tests now, > > but I can cancel and resubmit if it was fixed after that. > > > > Thanks, > > Nish > > > > This one: > > ------------------------------------------------------------------------ > r5162 | mst | 2006-01-23 18:51:34 +0200 (Mon, 23 Jan 2006) | 3 lines > > typo fix Ok, running the tests again with 5162 -- will take some time to get to a set of sizes (32/64) that we are interested in, but will post again once they finish. Thanks, Nish From rdreier at cisco.com Mon Jan 23 09:05:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 09:05:25 -0800 Subject: [openib-general] Re: Repost [PATCH 1 of 3] move destructor to struct neigh_parms In-Reply-To: <20060123163737.GD26147@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 18:37:37 +0200") References: <20060118211926.GE31280@mellanox.co.il> <20060123163737.GD26147@mellanox.co.il> Message-ID: Michael> I posted this on netdev and lkml on Wednesday, but there Michael> was no response. I guess this patch is non-controversial Michael> since no one else in kernel uses the destructor. 
Michael> Do you think we can push this into mainline? Let me resend to Dave Miller directly and then it should be OK. - R. From rdreier at cisco.com Mon Jan 23 09:06:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 09:06:39 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123165319.GD26724@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 18:53:19 +0200") References: <20060123164021.GB5074@us.ibm.com> <20060123165319.GD26724@mellanox.co.il> Message-ID: Michael> BTW, does this configuration (ppc32 on ppc64) support the Michael> mftb instruction? I have some ppc64 machines around -- let me do some perftest checking. - R. From mst at mellanox.co.il Mon Jan 23 09:11:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 19:11:01 +0200 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123165551.GE5074@us.ibm.com> References: <20060123165551.GE5074@us.ibm.com> Message-ID: <20060123171101.GF26724@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Userspace testing results (many kernels,many svn trees) > > On 23.01.2006 [18:53:19 +0200], Michael S. Tsirkin wrote: > > Quoting Nishanth Aravamudan : > > > The machines are running ppc64 kernels. > > > > BTW, does this configuration (ppc32 on ppc64) support the mftb instruction? > > Good question :) > > How would I go about checking? > > Thanks, > Nish This seems to imply we are OK: http://www-128.ibm.com/developerworks/eserver/articles/archguide.html Apple used to have some good ppc documentation but I couldn't locate it at the moment. 
-- MST From rdreier at cisco.com Mon Jan 23 09:12:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 09:12:10 -0800 Subject: [openib-general] [ANNOUNCE] libibverbs 1.0-rc5 released Message-ID: I just tagged a 1.0-rc5 release of libibverbs and pushed it out to the relevant channels, which means that it should appear on http://openib.org/downloads/ shortly. Once the Fedora project build system is fixed, I will also kick off builds for Fedora Extras, so binary packages for Fedora 4 and 5 will appear on standard Fedora mirrors in a few days. Changes since 1.0-rc4 include: - Change from dlist-based ibv_get_devices() API to simpler array-based ibv_get_device_list() API. This breaks source compatibility with 1.0-rc4. - Add Sean Hefty's user/kernel marshalling functions. - Lots of fixes for pingpong examples. See the ChangeLog in the package for full details. I think we're getting close to a full 1.0 release with a frozen API and ABI. My plans for getting there are the following: - As of now, the only changes to the API that I will accept are pure additions that do not break source compatibility with existing consumer or provider code. Any exceptions to this will have to be extremely well justified. - In about 3-4 weeks, I'll release 1.0-rc6. This release will add support for resizing CQs and (if the code appears in time) query QP and query SRQ support from Dotan Barak at Mellanox. - Once 1.0-rc6 is out, I'll consider the API and ABI absolutely frozen. In other words, any binary applications or device provider libraries built against 1.0-rc6 will continue to work against any later release in the 1.0 series. This means that any changes that affect source or binary compatibility with existing consumer or provider code will have to have end-of-the-world type consequences for me to accept them. - Bug fixes are of course always welcome at any point in the development cycle. 
- Depending on how testing feedback looks, 3-4 weeks after 1.0-rc6, I'll release either a full 1.0, or if we're unlucky, 1.0-rc7. - Once 1.0 is out, I'll continue to accept changes that don't affect compatibility, and continue to do 1.0.1, 1.0.2, etc. releases as necessary. - When API or ABI breaking changes become necessary, I'll kick off a 1.1 release series and we can go nuts again. From mst at mellanox.co.il Mon Jan 23 09:13:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 19:13:15 +0200 Subject: [openib-general] Re: Repost [PATCH 1 of 3] move destructor tostruct neigh_parms In-Reply-To: References: Message-ID: <20060123171315.GG26724@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Repost [PATCH 1 of 3] move destructor tostruct neigh_parms > > Michael> I posted this on netdev and lkml on Wednesday, but there > Michael> was no response. I guess this patch is non-controversial > Michael> since no one else in kernel uses the destructor. > > Michael> Do you think we can push this into mainline? > > Let me resend to Dave Miller directly and then it should be OK. > > - R. Ok. I guess we'll want a trunk patch with an ifdef, to use until 2.6.16 is released? -- MST From mst at mellanox.co.il Mon Jan 23 09:16:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 19:16:32 +0200 Subject: [openib-general] ANNOUNCE: mstflint update Message-ID: <20060123171632.GH26724@mellanox.co.il> Hi! I have updated mstflint tool with code from mellanox MFT 1.0.1 package. mstflint is a stand-alone firmware burning tool for Mellanox manufactured HCA cards. Some success has been reported with cards from Topspin/Cisco. See the README file under src/userspace/mstflint for more info. 
Changes: * bug fixes * more flash types supported * new flag -skip_is to allow safe firmware updates even if new firmware includes initial sector update (by updating all firmware except the IS) * portability cleanups: use stdio instead of low level file descriptor operations This code has been tested on x86 and x86_64, on PCI-X and PCI-Express cards. I'd appreciate feedback and testing on other platforms. You can help by testing this tool even if you don't have IB cards, or if you don't want to burn firmware, as described below. Thanks, MST ----------------------------------------------------------------------- To build: cd src/userspace/mstflint make ----------------------------------------------------------------------- Ways to test the tool (without accessing device): # ./mstflint -i ~/fw-25218-5_1_0-rc22-lion-cub.bin q Image type: Failsafe I.S. Version: 1 Chip Revision: A0 GUID Des: Node Port1 Port2 Sys image GUIDs: 0002c9000100d050 0002c9000100d051 0002c9000100d052 0002c9000100d050 Board ID: (MT_0140000001) VSD: PSID: MT_0140000001 # ./mstflint -i ~/fw-25218-5_1_0-rc22-lion-cub.bin v Failsafe image: Invariant /0x00000028-0x0000095f (0x000938)/ (BOOT2) - OK Primary Image /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK /0x00030028-0x000308af (0x000888)/ (BOOT2) - OK /0x000308b0-0x00035ae7 (0x005238)/ (BOOT2) - OK /0x00035ae8-0x000380cf (0x0025e8)/ (Configuration) - OK /0x000380d0-0x00038103 (0x000034)/ (GUID) - OK /0x00038104-0x00046b87 (0x00ea84)/ (DDR) - OK /0x00046b88-0x0004fb1b (0x008f94)/ (DDR) - OK /0x0004fb1c-0x00067ca3 (0x018188)/ (DDR) - OK /0x00067ca4-0x0007fc53 (0x017fb0)/ (DDR) - OK /0x0007fc54-0x0008245b (0x002808)/ (DDR) - OK /0x0008245c-0x000839c3 (0x001568)/ (DDR) - OK /0x000839c4-0x000839d7 (0x000014)/ (Configuration) - OK /0x000839d8-0x00083a1b (0x000044)/ (Jump addresses) - OK /0x00083a1c-0x0008e1cb (0x00a7b0)/ (EMT Service) - OK Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK /0x00090028-0x000908af (0x000888)/ 
(BOOT2) - OK /0x000908b0-0x00095ae7 (0x005238)/ (BOOT2) - OK /0x00095ae8-0x000980cf (0x0025e8)/ (Configuration) - OK /0x000980d0-0x00098103 (0x000034)/ (GUID) - OK /0x00098104-0x000a6b87 (0x00ea84)/ (DDR) - OK /0x000a6b88-0x000afb1b (0x008f94)/ (DDR) - OK /0x000afb1c-0x000c7ca3 (0x018188)/ (DDR) - OK /0x000c7ca4-0x000dfc53 (0x017fb0)/ (DDR) - OK /0x000dfc54-0x000e245b (0x002808)/ (DDR) - OK /0x000e245c-0x000e39c3 (0x001568)/ (DDR) - OK /0x000e39c4-0x000e39d7 (0x000014)/ (Configuration) - OK /0x000e39d8-0x000e3a1b (0x000044)/ (Jump addresses) - OK /0x000e3a1c-0x000ee1cb (0x00a7b0)/ (EMT Service) - OK FW Image verification succeeded. Image is OK. ------------------------------------------------------------------------ More ways to test the tool (read flash but do not burn anything to the device): # ./mstflint -d /sys/class/infiniband/mthca0/device/resource0 q Image type: Failsafe I.S. Version: 1 Chip Revision: A0 GUID Des: Node Port1 Port2 Sys image GUIDs: 0a66b4900002c901 0a66b4910002c901 0a66b4920002c901 0a66b4930002c901 Board ID: (MT_00A0000001) VSD: PSID: MT_00A0000001 # ./mstflint -d /sys/class/infiniband/mthca0/device/resource0 v Failsafe image: Invariant /0x00000028-0x0000095f (0x000938)/ (BOOT2) - OK Primary Image /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK /0x00030028-0x000308af (0x000888)/ (BOOT2) - OK /0x000308b0-0x00034baf (0x004300)/ (BOOT2) - OK /0x00034bb0-0x00035a93 (0x000ee4)/ (Configuration) - OK /0x00035a94-0x00035ac7 (0x000034)/ (GUID) - OK /0x00035ac8-0x0003e2bb (0x0087f4)/ (DDR) - OK /0x0003e2bc-0x0004c2e3 (0x00e028)/ (DDR) - OK /0x0004c2e4-0x0004f213 (0x002f30)/ (DDR) - OK /0x0004f214-0x00050907 (0x0016f4)/ (DDR) - OK /0x00050908-0x000693e7 (0x018ae0)/ (DDR) - OK /0x000693e8-0x0007d307 (0x013f20)/ (DDR) - OK /0x0007d308-0x0007d31b (0x000014)/ (Configuration) - OK /0x0007d31c-0x0007d35f (0x000044)/ (Jump addresses) - OK /0x0007d360-0x0007d3eb (0x00008c)/ (FW Configuration) - OK Secondary Pointer Sector /0x00020000/ - invalid 
signature (00000000) FW Image verification succeeded. Image is OK. -- MST From xma at us.ibm.com Mon Jan 23 09:30:58 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 23 Jan 2006 09:30:58 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123165551.GE5074@us.ibm.com> Message-ID: PPC Architecture defines a 64-bit Time Base register (two 32-bit registers in 32-bit mode). The link below is the documentation. http://www-128.ibm.com/developerworks/eserver/articles/archguide.html Book II: PowerPC Virtual Environment Architecture Page 30 Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Mon Jan 23 09:44:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 19:44:25 +0200 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: References: Message-ID: <20060123174425.GL26724@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [openib-general] Re: Userspace testing results (many kernels, many svn trees) > > > PPC Architecture defines a 64-bit Time Base register (two 32-bit registers in 32-bit > mode). The link below is the documentation. > > http://www-128.ibm.com/developerworks/eserver/articles/archguide.html > Book II: PowerPC Virtual Environment Architecture Page 30 > > > Thanks > Shirley Ma > IBM Linux Technology Center So am I right when I say that mftb returns a 32 bit register on ppc32 and a 64 bit register on ppc64? 
If true, it seems that this line typedef unsigned long long cycles_t; should be replaced by typedef unsigned long cycles_t; -- MST From xma at us.ibm.com Mon Jan 23 09:49:15 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 23 Jan 2006 09:49:15 -0800 Subject: [openib-general] Re: Repost [PATCH 1 of 3] move destructor to struct neigh_parms In-Reply-To: <20060123163737.GD26147@mellanox.co.il> Message-ID: You can resend the patch to Dave Miller (network maintainer) and ask him for input directly. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From xma at us.ibm.com Mon Jan 23 09:50:11 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 23 Jan 2006 09:50:11 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123174425.GL26724@mellanox.co.il> Message-ID: >If true, it seems that this line >typedef unsigned long long cycles_t; >should be replaced by >typedef unsigned long cycles_t; Yes. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Mon Jan 23 09:55:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 19:55:05 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: References: Message-ID: <20060123175505.GN26724@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: Re: Userspace testing results (many kernels, many svn trees) > > > >If true, it seems that this line > >typedef unsigned long long cycles_t; > >should be replaced by > >typedef unsigned long cycles_t; > > Yes. OK, I did it this way. # svn ci get_clock.h Sending get_clock.h Transmitting file data . Committed revision 5163. 
You might want to try this rev out. -- MST From xma at us.ibm.com Mon Jan 23 10:09:01 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 23 Jan 2006 10:09:01 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: Message-ID: I read the documentation again; I think I was wrong. It always returns 64 bits. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Mon Jan 23 10:13:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 20:13:50 +0200 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: References: Message-ID: <20060123181350.GO26724@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [openib-general] Re: Userspace testing results (many kernels, many svn trees) > > > I read the documentation again; I think I was wrong. It always returns 64 bits. > > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638 > I reverted it; if this change is needed, you know what to do. -- MST From caitlinb at broadcom.com Mon Jan 23 10:15:55 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 23 Jan 2006 10:15:55 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Sean> I haven't seen anyone object to merging the other changes. > Sean> Roland, Hal - any opinion? > > I don't see much urgency in merging it now. When svn > diverges from what's upstream in the kernel, it makes my life > harder because I have to figure out which patches belong > upstream and sometimes merge things by hand (when they hit > the divergent regions). 
> > Also I can't say I'm thrilled by adding > > > + struct iw_cm_verbs *iwcm; > > to struct ib_device -- we still really haven't answered the > issue of how iWARP connections interact with the host network > stack, we've just pushed it off into low-level driver code > where we can't see it. > My understanding is that this is a partial answer, not a deferred one. What is fully addressed is how the iWARP device accepts or initiates a connection that will start the transition to MPA mode immediately. In that mode, there is no integration with the host network TCP layer required -- only co-ordination with the host network IP layer. Linking to the net device does that. Deferred entry into MPA mode is indeed a complex issue, which is why Tom did not propose an immediate solution. The initial model is IB compatible, deals with the needs of virtually all RDMA applications, and is very simple. Locking down this first step, to enable transport neutral application development, is important. It should not have to wait for the entire stack integration problem to be solved. The latter is a complex issue that includes things like compliance with netfilter rules that need to apply to *all* connections with IP semantics, including SDP. That complexity will take some time to work out properly. Meanwhile we can establish a very workable baseline. From nacc at us.ibm.com Mon Jan 23 10:24:12 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 10:24:12 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123175505.GN26724@mellanox.co.il> References: <20060123175505.GN26724@mellanox.co.il> Message-ID: <20060123182412.GH5074@us.ibm.com> On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > Quoting r. 
Shirley Ma : > > Subject: Re: Re: Userspace testing results (many kernels, many svn trees) > > > > > > >If true, it seems that this line > > >typedef unsigned long long cycles_t; > > >should be replaced by > > >typedef unsigned long cycles_t; > > > > Yes. > > OK, I did it this way. > # svn ci get_clock.h > Sending get_clock.h > Transmitting file data . > Committed revision 5163. > > You might want to try this rev out. heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, then run 5163. Thanks, Nish From ralphc at pathscale.com Mon Jan 23 10:41:09 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 23 Jan 2006 10:41:09 -0800 Subject: [openib-general] [PATCH] Add -m option to ping pong programs to set path MTU In-Reply-To: References: <1137633204.4520.397.camel@brick.internal.keyresearch.com> Message-ID: <1138041669.4520.439.camel@brick.internal.keyresearch.com> Looks fine to me. On Fri, 2006-01-20 at 16:48 -0800, Roland Dreier wrote: > Thanks, I applied my own version of this. Please make sure that the > svn tree still works for you. > > (This patch was the one that pushed me over the edge with code > duplication in the pingpong examples, so I put the MTU switch > statement into a new pingpong.c file...) > > - R. -- Ralph Campbell From mshefty at ichips.intel.com Mon Jan 23 10:43:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 23 Jan 2006 10:43:40 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <001001c61ee0$786943f0$020010ac@haggard> References: <1137876435.7119.17.camel@strider.opengridcomputing.com> <001001c61ee0$786943f0$020010ac@haggard> Message-ID: <43D523DC.6030208@ichips.intel.com> Steve Wise wrote: > Is it possible to just go ahead and push the core iwarp stuff into > kernel.org? I don't think that makes sense or would be accepted. This would be asking for changes to the kernel code that nothing else in the kernel would use. 
- Sean From caitlinb at broadcom.com Mon Jan 23 10:50:44 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 23 Jan 2006 10:50:44 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve Wise wrote: >> Is it possible to just go ahead and push the core iwarp stuff into >> kernel.org? > > I don't think that makes sense or would be accepted. This > would be asking for changes to the kernel code that nothing else in > the kernel would use. > But there are solid precedents for creating an extensible/flexible interface even if there is only a single user of that interface currently. For example, the scsi_transport_iscsi defines a very general interface to an iscsi transport layer even though iscsi_tcp is the only instantiation of that transport. From swise at opengridcomputing.com Mon Jan 23 10:58:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 23 Jan 2006 12:58:06 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1138042686.21945.17.camel@stevo-desktop> On Mon, 2006-01-23 at 10:50 -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Steve Wise wrote: > >> Is it possible to just go ahead and push the core iwarp stuff into > >> kernel.org? > > > > I don't think that makes sense or would be accepted. This > > would be asking for changes to the kernel code that nothing else in > > the kernel would use. > > > > But there are solid precedents for creating an extensible/flexible > interface even if there is only a single user of that interface > currently. 
For example, the scsi_transport_iscsi defines a very > general interface to an iscsi transport layer even though iscsi_tcp > is the only instantiation of that transport. And the name change for the include directory from include/infiniband to include/rdma was also a precursor to full RDMA support for more than just IB. But I'm no expert on what the criteria is for pushing openib change sets into kernel.org, so I can only voice my opinion (and ask lots of questions :)... Roland, can you help clarify this? Also, what _is_ the criteria in your mind for pulling iwarp support into: 1) openib trunk, and 2) kernel.org? Thanks, Steve. From rdreier at cisco.com Mon Jan 23 11:04:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 11:04:12 -0800 Subject: [openib-general] Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: (Shirley Ma's message of "Mon, 23 Jan 2006 10:09:01 -0800") References: Message-ID: Shirley> I read the documentation again, I think I was wrong. It Shirley> returns 64 bit always. I don't think that's correct. How could mftb return a 64-bit value on a CPU with 32-bit registers? The best thing to look at is the definition of get_tb() in the kernel tree. I think that should give something that always works. - R. From rdreier at cisco.com Mon Jan 23 11:07:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 11:07:45 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138042686.21945.17.camel@stevo-desktop> (Steve Wise's message of "Mon, 23 Jan 2006 12:58:06 -0600") References: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> <1138042686.21945.17.camel@stevo-desktop> Message-ID: Steve> Roland, can you help clarify this? Steve> Also, what _is_ the criteria in your mind for pulling iwarp Steve> support into: 1) openib trunk, and 2) kernel.org? It's a judgement call that involves everyone. 
One thing that would really help me feel more comfortable would be seeing a driver for some other piece of hardware (NetEffect?). I don't want to push a design based on one piece of hardware and then find out that all the new hardware wants something else. Maybe hearing Chelsio's opinion would be helpful as well, as they have been dealing with similar issues from a different direction. - R. From rdreier at cisco.com Mon Jan 23 11:08:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 11:08:32 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Mon, 23 Jan 2006 10:50:44 -0800") References: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Caitlin> But there are solid precedents for creating an Caitlin> extensible/flexible interface even if there is only a Caitlin> single user of that interface currently. For example, the Caitlin> scsi_transport_iscsi defines a very general interface to Caitlin> an iscsi transport layer even though iscsi_tcp is the Caitlin> only instantiation of that transport. Sure, but we're talking about creating an interface with zero users. And even when there is one user, we want to avoid premature and incorrect abstraction. - R. From rdreier at cisco.com Mon Jan 23 11:10:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 11:10:15 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Mon, 23 Jan 2006 10:15:55 -0800") References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Caitlin> What is fully addressed is how the iWARP device accepts Caitlin> or initiates a connection that will start the transition Caitlin> to MPA mode immediately. 
Caitlin> In that mode, there is no integration with the host Caitlin> network TCP layer required -- only co-ordination with the Caitlin> host network IP layer. Linking to the net device does Caitlin> that. I disagree. The proposed interface says that accepting a connection is completely the responsibility of the low-level driver, with no hooks for central policy, bypassing packet filtering, etc, etc. That's possibly a legitimate decision but it is a pretty major decision as well. - R. From rdreier at cisco.com Mon Jan 23 11:11:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 11:11:45 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1137854528.7683.13.camel@strider.opengridcomputing.com> (Tom Tucker's message of "Sat, 21 Jan 2006 08:42:07 -0600") References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43D14271.6050507@ichips.intel.com> <1137854528.7683.13.camel@strider.opengridcomputing.com> Message-ID: Tom> I agree there are more elegant approaches, however, the Tom> design criteria was to minimize changes to ib_verbs and the Tom> risk of IB functional regression. I think this approach Tom> accomplishes that goal. What would the more elegant approach be? I don't think minimizing changes is really the dimension to optimize on. The luxury of Linux development is that we can choose the best solution, even if it means breaking the world (although of course the costs of churn in terms of risk and effort do need to be weighed). - R. 
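On the mftb question from the "Userspace testing results" messages above: a 32-bit PowerPC process cannot receive all 64 bits of the Time Base in a single register, so the usual approach is to read the upper half, then the lower half, then the upper half again, and retry if the upper half changed in between. The sketch below shows only that retry logic in portable C, so it can be compiled anywhere. The names read_timebase64, tb_read_fn and the fake_* helpers are hypothetical and are not taken from get_clock.h or from the kernel; on real ppc32 hardware the two accessors would be assumed to wrap the mftbu and mftb instructions.

```c
#include <stdint.h>

/* Hypothetical accessor type: returns one 32-bit half of the Time Base.
 * On ppc32 these would wrap mftbu (upper) and mftb (lower); they are
 * function pointers here so the logic can be exercised on any machine. */
typedef uint32_t (*tb_read_fn)(void);

/* Combine two 32-bit reads into a consistent 64-bit value.
 * If the lower half wraps between the two reads, the upper half changes,
 * the comparison fails, and the whole read is repeated. */
static uint64_t read_timebase64(tb_read_fn read_upper, tb_read_fn read_lower)
{
	uint32_t hi, lo, hi2;

	do {
		hi  = read_upper();
		lo  = read_lower();
		hi2 = read_upper();
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for illustration only: it advances by one tick on
 * every access and starts one tick before the lower half wraps. */
static uint64_t fake_tb = 0x00000001FFFFFFFFULL;

static uint32_t fake_read_upper(void)
{
	uint32_t v = (uint32_t)(fake_tb >> 32);
	fake_tb++;
	return v;
}

static uint32_t fake_read_lower(void)
{
	uint32_t v = (uint32_t)fake_tb;
	fake_tb++;
	return v;
}
```

With the simulated counter starting at 0x00000001FFFFFFFF, read_timebase64(fake_read_upper, fake_read_lower) retries once and returns 0x0000000200000003, whereas a single upper/lower read pair would have produced 0x0000000100000000, which is off by almost 2^32 ticks. This is also why a 64-bit cycles_t remains usable on ppc32 as long as the two halves are combined this way.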
From tom at opengridcomputing.com Mon Jan 23 11:24:55 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 23 Jan 2006 13:24:55 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <54AD0F12E08D1541B826BE97C98F99F11C32E9@NT-SJCA-0751.brcm.ad.broadcom.com> <1138042686.21945.17.camel@stevo-desktop> Message-ID: <1138044295.18972.21.camel@trinity.ogc.int> On Mon, 2006-01-23 at 11:07 -0800, Roland Dreier wrote: > Steve> Roland, can you help clarify this? > > Steve> Also, what _is_ the criteria in your mind for pulling iwarp > Steve> support into: 1) openib trunk, and 2) kernel.org? > > It's a judgement call that involves everyone. One thing that would > really help me feel more comfortable would be seeing a driver for some > other piece of hardware (NetEffect?). I don't want to push a design > based on one piece of hardware and then find out that all the new > hardware wants something else. Maybe hearing Chelsio's opinion would > be helpful as well, as they have been dealing with similar issues from > a different direction. I think these are very good points. I'll see if I can't get some other folks to weigh in on this. > - R. 
From tom at opengridcomputing.com Mon Jan 23 12:17:43 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 23 Jan 2006 14:17:43 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43D14271.6050507@ichips.intel.com> <1137854528.7683.13.camel@strider.opengridcomputing.com> Message-ID: <1138047463.19878.40.camel@trinity.ogc.int> On Mon, 2006-01-23 at 11:11 -0800, Roland Dreier wrote: > Tom> I agree there are more elegant approaches, however, the > Tom> design criteria was to minimize changes to ib_verbs and the > Tom> risk of IB functional regression. I think this approach > Tom> accomplishes that goal. > > What would the more elegant approach be? Phase III > > I don't think minimizing changes is really the dimension to optimize > on. The luxury of Linux development is that we can choose the best > solution, even if it means breaking the world (although of course the > costs of churn in terms of risk and effort do need to be weighed). The discussions from back at IDF advocated a phased approach. From my recollection: Phase I - iWARP device driver that mapped RNIC events and DTO to IB events and DTOs. Very small change required to core in the form of a new node type. [done] Phase II - Transport independent connection management. This milestone was to begin merge with trunk since it required more significant core changes. [done] Phase III - Transport neutral naming, pluggable transports, etc... Sonoma is a great place to dig into these discussions. Phases I and II are complete in the branch, were demonstrated at SC'05, and have now been submitted as a patch to the trunk. > > - R.
From mshefty at ichips.intel.com Mon Jan 23 12:31:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 23 Jan 2006 12:31:41 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138047463.19878.40.camel@trinity.ogc.int> References: <1136578777.14108.6.camel@trinity.austin.ammasso.com> <43D14271.6050507@ichips.intel.com> <1137854528.7683.13.camel@strider.opengridcomputing.com> <1138047463.19878.40.camel@trinity.ogc.int> Message-ID: <43D53D2D.1060309@ichips.intel.com> Tom Tucker wrote: > Phases I and II are complete in the branch, were demonstrated at SC'05, > and have now been submitted as a patch to the trunk. To add to this, the patch makes fairly minor changes (a couple to only a few lines) to ib_verbs, ib_mad, ib_addr, and ib_cm. Some work was done to ib_addr and rdma_cm to simplify iWarp support. The largest changes in the patch were adding iWarp specific files, and modifications to the rdma_cm. I wanted to defer merging the iWarp changes into the rdma_cm until we could get feedback from the kernel developers first, which we've now had the chance to receive. - Sean From mst at mellanox.co.il Mon Jan 23 12:34:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 22:34:48 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123182412.GH5074@us.ibm.com> References: <20060123182412.GH5074@us.ibm.com> Message-ID: <20060123203448.GA28891@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > Quoting r. Shirley Ma : > > > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > > > > > > > > >If true, it seems that this line > > > >typedef unsigned long long cycles_t; > > > >should be replaced by > > > >typedef unsigned long cycles_t; > > > > > > Yes. > > > > OK, I did it this way. 
> > # svn ci get_clock.h > > Sending get_clock.h > > Transmitting file data . > > Committed revision 5163. > > > > You might want to try this rev out. > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, then > run 5163. Well, Shirley Ma here said it's a mistake, so I reverted the change - no need to re-run it. -- MST From nacc at us.ibm.com Mon Jan 23 13:14:54 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 13:14:54 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123182412.GH5074@us.ibm.com> References: <20060123175505.GN26724@mellanox.co.il> <20060123182412.GH5074@us.ibm.com> Message-ID: <20060123211454.GM5074@us.ibm.com> On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > Quoting r. Shirley Ma : > > > Subject: Re: Re: Userspace testing results (many kernels, many svn trees) > > > > > > > > > >If true, it seems that this line > > > >typedef unsigned long long cycles_t; > > > >should be replaced by > > > >typedef unsigned long cycles_t; > > > > > > Yes. > > > > OK, I did it this way. > > # svn ci get_clock.h > > Sending get_clock.h > > Transmitting file data . > > Committed revision 5163. > > > > You might want to try this rev out. > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > then run 5163. Looks like 5162/5163 is fine building wise.
Here is what I got for rdma_lat for a 32-bit server and 32-bit client: loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 Latency typical: 3.25746e+09 usec Latency best : 3.19975e+09 usec Latency worst : 4.21767e+10 usec and for rdma_bw: loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec Bandwidth average: 4.3446e-07 MB/sec Service Demand peak (#0 to #999): 17301 cycles/KB Service Demand Avg : 0 cycles/KB So it's still present... Thanks, Nish From mst at mellanox.co.il Mon Jan 23 13:22:45 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 23:22:45 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123211454.GM5074@us.ibm.com> References: <20060123211454.GM5074@us.ibm.com> Message-ID: <20060123212245.GI28891@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Shirley Ma : > > > > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > > > > > > > > > > > >If true, it seems that this line > > > > >typedef unsigned long long cycles_t; > > > > >should be replaced by > > > > >typedef unsigned long cycles_t; > > > > > > > > Yes. > > > > > > OK, I did it this way. > > > # svn ci get_clock.h > > > Sending get_clock.h > > > Transmitting file data . > > > Committed revision 5163. > > > > > > You might want to try this rev out. > > > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > > then run 5163. 
> > Looks like 5162/5163 is fine building wise. Here is what I got for > rdma_lat for a 32-bit server and 32-bit client: > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > Latency typical: 3.25746e+09 usec > Latency best : 3.19975e+09 usec > Latency worst : 4.21767e+10 usec > > and for rdma_bw: > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > Bandwidth average: 4.3446e-07 MB/sec > Service Demand peak (#0 to #999): 17301 cycles/KB > Service Demand Avg : 0 cycles/KB > > So it's still present... > > Thanks, > Nish Hmm. Could you please try running e.g. rdma_lat with -H to get all the results? -- MST From nacc at us.ibm.com Mon Jan 23 13:26:57 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 13:26:57 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123212245.GI28891@mellanox.co.il> References: <20060123211454.GM5074@us.ibm.com> <20060123212245.GI28891@mellanox.co.il> Message-ID: <20060123212657.GN5074@us.ibm.com> On 23.01.2006 [23:22:45 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > > > On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > > > Quoting r. 
Shirley Ma : > > > > > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > > > > > > > > > > > > > > >If true, it seems that this line > > > > > >typedef unsigned long long cycles_t; > > > > > >should be replaced by > > > > > >typedef unsigned long cycles_t; > > > > > > > > > > Yes. > > > > > > > > OK, I did it this way. > > > > # svn ci get_clock.h > > > > Sending get_clock.h > > > > Transmitting file data . > > > > Committed revision 5163. > > > > > > > > You might want to try this rev out. > > > > > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > > > then run 5163. > > > > Looks like 5162/5163 is fine building wise. Here is what I got for > > rdma_lat for a 32-bit server and 32-bit client: > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > > Latency typical: 3.25746e+09 usec > > Latency best : 3.19975e+09 usec > > Latency worst : 4.21767e+10 usec > > > > and for rdma_bw: > > > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > > Bandwidth average: 4.3446e-07 MB/sec > > Service Demand peak (#0 to #999): 17301 cycles/KB > > Service Demand Avg : 0 cycles/KB > > > > So it's still present... > > > > Thanks, > > Nish > > Hmm. Could you please try running e.g. rdma_lat with -H to get all the results? Sure, let me modify the script I use to run the jobs and resubmit the job. Thanks, Nish From rdreier at cisco.com Mon Jan 23 13:27:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 13:27:32 -0800 Subject: [openib-general] [PATCH] net: Move destructor from neigh->ops to neigh_params Message-ID: This is a resend of a patch written by Michael S. 
Tsirkin . I'd like to get an ACK or NAK of it from Dave and other networking people, so that we can either merge it upstream or try a different approach. There definitely is a problem with neighbour destructors that IP-over-IB is running into. It would be good to know what the design was behind putting the destructor method in neigh->ops in the first place. Dave, if you want to merge this directly, that's fine. Or I'm fine with merging this through the IB tree if you'd prefer (if you want me to do that, let me know if you think it's 2.6.16 material). Thanks, Roland struct neigh_ops currently has a destructor field, which no in-kernel drivers outside of infiniband use. The infiniband/ulp/ipoib in-tree driver stashes some info in the neighbour structure (the results of the second-stage lookup from ARP results to real link-level path), and it uses neigh->ops->destructor to get a callback so it can clean up this extra info when a neighbour is freed. We've run into problems with this: since the destructor is in an ops field that is shared between neighbours that may belong to different net devices, there's no way to set/clear it safely. The following patch moves this field to neigh_parms where it can be safely set, together with its twin neigh_setup. Two additional patches in the patch series update ipoib to use this new interface. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 6fa9ae1..b0666d6 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); diff --git a/net/core/neighbour.c b/net/core/neighbour.c index e68700f..3489e23 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index fd3f5c8..9588124 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? 
I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. - */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } From rdreier at cisco.com Mon Jan 23 13:43:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 13:43:28 -0800 Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals In-Reply-To: <20060123155436.GB26147@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 17:54:36 +0200") References: <20060123155436.GB26147@mellanox.co.il> Message-ID: This is a good catch, and I think this is the right solution. One question: does the mutex use in mthca_multicast_detach() suffer from the same problem? It seems that we might call to detach from a multicast group during process exit with a signal pending, and end up failing there when we shouldn't. - R. From iod00d at hp.com Mon Jan 23 13:55:16 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 23 Jan 2006 13:55:16 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20060123215516.GG29214@esmail.cup.hp.com> On Mon, Jan 23, 2006 at 11:10:15AM -0800, Roland Dreier wrote: > I disagree. The proposed interface says that accepting a connection > is completely the responsibility of the low-level driver, with no > hooks for central policy, bypassing packet filtering, etc, etc. > That's possibly a legitimate decision but it is a pretty major > decision as well. Yes, but we need to start somewhere. Until someone submits a driver that does all the things you mention, it makes sense to move forward with what has been proposed to date. 
Isn't there a chance the proposed interface is the right approach anyway? Or does RDMA infrastructure need to directly meddle with NETFILTER? grant From mst at mellanox.co.il Mon Jan 23 13:57:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jan 2006 23:57:11 +0200 Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals In-Reply-To: References: Message-ID: <20060123215711.GN28891@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals > > This is a good catch, and I think this is the right solution. One > question: does the mutex use in mthca_multicast_detach() suffer from > the same problem? It seems that we might call to detach from a > multicast group during process exit with a signal pending, and end up > failing there when we shouldn't. Looks like it's the same problem, although I didn't see it failing here. -- MST From davem at davemloft.net Mon Jan 23 13:54:43 2006 From: davem at davemloft.net (David S. Miller) Date: Mon, 23 Jan 2006 13:54:43 -0800 (PST) Subject: [openib-general] Re: [PATCH] net: Move destructor from neigh->ops to neigh_params In-Reply-To: References: Message-ID: <20060123.135443.117589637.davem@davemloft.net> From: Roland Dreier Date: Mon, 23 Jan 2006 13:27:32 -0800 > I'd like to get an ACK or NAK of it from Dave Dave is in New Zealand at linux.conf.au, don't expect him to be too active for at least a week... From tom at opengridcomputing.com Mon Jan 23 14:32:12 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 23 Jan 2006 16:32:12 -0600 Subject: [openib-general] [PATCH] RFC: net: Neighbour, Route and Redirect Notifiers Message-ID: <1138055532.19878.63.camel@trinity.ogc.int> Enclosed is a patch for notifying registered modules of neighbour updates, route changes, and redirect events. I'm submitting this first to this list for commentary since I am presuming that OpenIB needs this functionality.
Something like this will be needed for iWARP support. The issue is that for an IP network, the next hop to the remote peer can change dynamically. This patch provides a mechanism for the provider/core to be notified so that it can change the next hop hardware mac in the RNIC. There are various ways it can change, thus the different events. It could be argued that route change events are not needed. I included it here so that the argument can be had. Also PathMTU change needs to be added. The code that does this, however, is nasty and scattered about. I'll submit it after this has been looked at if people agree that this is a good general approach. Thanks, Signed-off-by: Tom Tucker diff -u -r -x '.*' --new-file linux-2.6.14.5/include/net/netevent.h linux-2.6.14.5.tom/include/net/netevent.h --- linux-2.6.14.5/include/net/netevent.h 1969-12-31 18:00:00.000000000 -0600 +++ linux-2.6.14.5.tom/include/net/netevent.h 2006-01-23 15:41:21.000000000 -0600 @@ -0,0 +1,40 @@ +#ifndef _NET_EVENT_H +#define _NET_EVENT_H + +/* + * Generic netevent notifiers + * + * Authors: + * Tom Tucker + * + * Changes: + */ + +#ifdef __KERNEL__ + +#include + +struct netevent_redirect { + struct dst_entry *old; + struct dst_entry *new; +}; + +struct netevent_route_change { + int event; + struct fib_info *fib_info; +}; + +enum netevent_notif_type { + NETEVENT_NEIGH_UPDATE = 1, /* arg is * struct neighbour */ + NETEVENT_ROUTE_UPDATE, /* arg is * struct netevent_route_change */ + NETEVENT_REDIRECT, /* arg is * struct netevent_redirect */ +}; + +extern int register_netevent_notifier(struct notifier_block *nb); +extern int unregister_netevent_notifier(struct notifier_block *nb); +extern int call_netevent_notifiers(unsigned long val, void *v); + +#endif +#endif + + diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/Makefile linux-2.6.14.5.tom/net/core/Makefile --- linux-2.6.14.5/net/core/Makefile 2005-12-26 18:26:33.000000000 -0600 +++ linux-2.6.14.5.tom/net/core/Makefile 2006-01-16 
13:41:42.000000000 -0600 @@ -7,7 +7,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o -obj-y += dev.o ethtool.o dev_mcast.o dst.o \ +obj-y += dev.o ethtool.o dev_mcast.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o obj-$(CONFIG_XFRM) += flow.o diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/neighbour.c linux-2.6.14.5.tom/net/core/neighbour.c --- linux-2.6.14.5/net/core/neighbour.c 2005-12-26 18:26:33.000000000 -0600 +++ linux-2.6.14.5.tom/net/core/neighbour.c 2006-01-16 13:41:42.000000000 -0600 @@ -30,9 +30,11 @@ #include #include #include +#include #include #include #include +#include #define NEIGH_DEBUG 1 @@ -756,6 +758,7 @@ NEIGH_PRINTK2("neigh %p is suspected.\n", neigh); neigh->nud_state = NUD_STALE; neigh_suspect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); } } else if (state & NUD_DELAY) { if (time_before_eq(now, @@ -763,6 +766,7 @@ NEIGH_PRINTK2("neigh %p is now reachable.\n", neigh); neigh->nud_state = NUD_REACHABLE; neigh_connect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else { NEIGH_PRINTK2("neigh %p is probed.\n", neigh); @@ -781,6 +785,7 @@ neigh->nud_state = NUD_FAILED; notify = 1; + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); NEIGH_CACHE_STAT_INC(neigh->tbl, res_failed); NEIGH_PRINTK2("neigh %p is failed.\n", neigh); @@ -1051,6 +1056,9 @@ (neigh->flags | NTF_ROUTER) : (neigh->flags & ~NTF_ROUTER); } + + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); + write_unlock_bh(&neigh->lock); #ifdef CONFIG_ARPD if (notify && neigh->parms->app_probes) diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/netevent.c linux-2.6.14.5.tom/net/core/netevent.c --- linux-2.6.14.5/net/core/netevent.c 1969-12-31 18:00:00.000000000 -0600 +++ linux-2.6.14.5.tom/net/core/netevent.c 2006-01-16 13:50:25.000000000 -0600 @@ -0,0 +1,69 @@ +/* + * Network event notifiers + * + * Authors: + * Tom Tucker + * + * This program is free 
software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Fixes: + */ + +#include +#include + +static struct notifier_block *netevent_notif_chain; + +/** + * register_netevent_notifier - register a netevent notifier block + * @nb: notifier + * + * Register a notifier to be called when a netevent occurs. + * The notifier passed is linked into the kernel structures and must + * not be reused until it has been unregistered. A negative errno code + * is returned on a failure. + */ +int register_netevent_notifier(struct notifier_block *nb) +{ + int err; + + rtnl_lock(); + err = notifier_chain_register(&netevent_notif_chain, nb); + rtnl_unlock(); + return err; +} + +/** + * netevent_unregister_notifier - unregister a netevent notifier block + * @nb: notifier + * + * Unregister a notifier previously registered by + * register_neigh_notifier(). The notifier is unlinked into the + * kernel structures and may then be reused. A negative errno code + * is returned on a failure. + */ + +int unregister_netevent_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&netevent_notif_chain, nb); +} + +/** + * call_netevent_notifiers - call all netevent notifier blocks + * @val: value passed unmodified to notifier function + * @v: pointer passed unmodified to notifier function + * + * Call all neighbour notifier blocks. Parameters and return value + * are as for notifier_call_chain(). 
+ */ + +int call_netevent_notifiers(unsigned long val, void *v) +{ + return notifier_call_chain(&netevent_notif_chain, val, v); +} + +EXPORT_SYMBOL(register_netevent_notifier); +EXPORT_SYMBOL(unregister_netevent_notifier); diff -u -r -x '.*' --new-file linux-2.6.14.5/net/ipv4/fib_semantics.c linux-2.6.14.5.tom/net/ipv4/fib_semantics.c --- linux-2.6.14.5/net/ipv4/fib_semantics.c 2005-12-26 18:26:33.000000000 -0600 +++ linux-2.6.14.5.tom/net/ipv4/fib_semantics.c 2006-01-23 15:20:53.000000000 -0600 @@ -43,6 +43,7 @@ #include #include #include +#include #include "fib_lookup.h" @@ -276,9 +277,15 @@ struct nlmsghdr *n, struct netlink_skb_parms *req) { struct sk_buff *skb; + struct netevent_route_change rev; + u32 pid = req ? req->pid : n->nlmsg_pid; int size = NLMSG_SPACE(sizeof(struct rtmsg)+256); + rev.event = event; + rev.fib_info = fa->fa_info; + call_netevent_notifiers(NETEVENT_ROUTE_UPDATE, &rev); + skb = alloc_skb(size, GFP_KERNEL); if (!skb) return; diff -u -r -x '.*' --new-file linux-2.6.14.5/net/ipv4/route.c linux-2.6.14.5.tom/net/ipv4/route.c --- linux-2.6.14.5/net/ipv4/route.c 2005-12-26 18:26:33.000000000 -0600 +++ linux-2.6.14.5.tom/net/ipv4/route.c 2006-01-16 13:41:42.000000000 -0600 @@ -103,6 +103,7 @@ #include #include #include +#include #ifdef CONFIG_SYSCTL #include #endif @@ -1118,6 +1119,7 @@ struct rtable *rth, **rthp; u32 skeys[2] = { saddr, 0 }; int ikeys[2] = { dev->ifindex, 0 }; + struct netevent_redirect netevent; tos &= IPTOS_RT_MASK; @@ -1213,6 +1215,10 @@ rt_drop(rt); goto do_next; } + + netevent.old = &rth->u.dst; + netevent.new = &rt->u.dst; + call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); rt_del(hash, rth); if (!rt_intern_hash(hash, rt, &rt)) From mst at mellanox.co.il Mon Jan 23 15:01:32 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 24 Jan 2006 01:01:32 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123211454.GM5074@us.ibm.com> References: <20060123211454.GM5074@us.ibm.com> Message-ID: <20060123230132.GA29917@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Looks like 5162/5163 is fine building wise. Here is what I got for > rdma_lat for a 32-bit server and 32-bit client: > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > Latency typical: 3.25746e+09 usec > Latency best : 3.19975e+09 usec > Latency worst : 4.21767e+10 usec > > and for rdma_bw: > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > Bandwidth average: 4.3446e-07 MB/sec > Service Demand peak (#0 to #999): 17301 cycles/KB > Service Demand Avg : 0 cycles/KB > > So it's still present... > I have just uploaded a simple utility which I called clock_test which measures a clock once a second: this way you'll know whether mtfb is measuring time properly. Update to the latest bits and give it a try. -- MST From mst at mellanox.co.il Mon Jan 23 15:21:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 01:21:42 +0200 Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals In-Reply-To: References: Message-ID: <20060123232142.GB29917@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] RFC: mthca handling of signals > > This is a good catch, and I think this is the right solution. One > question: does the mutex use in mthca_multicast_detach() suffer from > the same problem? 
It seems that we might call to detach from a > multicast group during process exit with a signal pending, and end up > failing there when we shouldn't. > > - R. While this isn't a problem, we probably want to change the one in mcast_attach as well, for consistency. -- MST From nacc at us.ibm.com Mon Jan 23 15:34:38 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 15:34:38 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123212245.GI28891@mellanox.co.il> References: <20060123211454.GM5074@us.ibm.com> <20060123212245.GI28891@mellanox.co.il> Message-ID: <20060123233438.GQ5074@us.ibm.com> On 23.01.2006 [23:22:45 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: Re: Userspace testing results (many kernels, many svn trees) > > > > On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > > > Quoting r. Shirley Ma : > > > > > Subject: Re: Re: Userspace testing results (many kernels, many svn trees) > > > > > > > > > > > > > > > >If true, it seems that this line > > > > > >typedef unsigned long long cycles_t; > > > > > >should be replaced by > > > > > >typedef unsigned long cycles_t; > > > > > > > > > > Yes. > > > > > > > > OK, I did it this way. > > > > # svn ci get_clock.h > > > > Sending get_clock.h > > > > Transmitting file data . > > > > Committed revision 5163. > > > > > > > > You might want to try this rev out. > > > > > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > > > then run 5163. > > > > Looks like 5162/5163 is fine building wise.
Here is what I got for > > rdma_lat for a 32-bit server and 32-bit client: > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > > Latency typical: 3.25746e+09 usec > > Latency best : 3.19975e+09 usec > > Latency worst : 4.21767e+10 usec > > > > and for rdma_bw: > > > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > > Bandwidth average: 4.3446e-07 MB/sec > > Service Demand peak (#0 to #999): 17301 cycles/KB > > Service Demand Avg : 0 cycles/KB > > > > So it's still present... > > > > Thanks, > > Nish > > Hmm. Could you please try running e.g. rdma_lat with -H to get all the results? rdma_lat -H: loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x304b04 RKey 0x2340032 VAddr 0x00000010019001 remote address: LID 0x08 QPN 0x140406 PSN 0xc99ca RKey 0x2340032 VAddr 0x00000010019001 #, usec 1, 3.20378e+09 2, 3.20646e+09 3, 3.21451e+09 4, 3.21586e+09 5, 3.2172e+09 6, 3.21854e+09 7, 3.21854e+09 8, 3.21988e+09 9, 3.21988e+09 10, 3.22123e+09 11, 3.22123e+09 12, 3.22123e+09 13, 3.22123e+09 14, 3.22123e+09 15, 3.22123e+09 16, 3.22257e+09 17, 3.22257e+09 18, 3.22257e+09 19, 3.22257e+09 20, 3.22391e+09 21, 3.22391e+09 22, 3.22391e+09 23, 3.22391e+09 24, 3.22391e+09 25, 3.22391e+09 26, 3.22391e+09 27, 3.22391e+09 28, 3.22391e+09 29, 3.22391e+09 30, 3.22391e+09 31, 3.22525e+09 32, 3.22525e+09 33, 3.22525e+09 34, 3.22525e+09 35, 3.22525e+09 36, 3.22525e+09 37, 3.22525e+09 38, 3.22659e+09 39, 3.22659e+09 40, 3.22659e+09 41, 3.22659e+09 42, 3.22659e+09 43, 3.22659e+09 44, 3.22659e+09 45, 3.22659e+09 46, 3.22794e+09 47, 3.22794e+09 48, 3.22794e+09 49, 3.22794e+09 50, 3.22794e+09 51, 3.22794e+09 52, 3.22794e+09 53, 
3.22794e+09 54, 3.22794e+09 55, 3.22794e+09 56, 3.22794e+09 57, 3.22794e+09 58, 3.22794e+09 59, 3.22794e+09 60, 3.22794e+09 61, 3.22928e+09 62, 3.22928e+09 63, 3.22928e+09 64, 3.22928e+09 65, 3.22928e+09 66, 3.22928e+09 67, 3.22928e+09 68, 3.22928e+09 69, 3.22928e+09 70, 3.22928e+09 71, 3.22928e+09 72, 3.23062e+09 73, 3.23062e+09 74, 3.23062e+09 75, 3.23196e+09 76, 3.23196e+09 77, 3.23196e+09 78, 3.23196e+09 79, 3.23196e+09 80, 3.23196e+09 81, 3.23196e+09 82, 3.23196e+09 83, 3.23196e+09 84, 3.23196e+09 85, 3.23196e+09 86, 3.23196e+09 87, 3.23331e+09 88, 3.23331e+09 89, 3.23331e+09 90, 3.23331e+09 91, 3.23331e+09 92, 3.23331e+09 93, 3.23331e+09 94, 3.23465e+09 95, 3.23465e+09 96, 3.23465e+09 97, 3.23465e+09 98, 3.23465e+09 99, 3.23465e+09 100, 3.23465e+09 101, 3.23465e+09 102, 3.23465e+09 103, 3.23465e+09 104, 3.23465e+09 105, 3.23465e+09 106, 3.23465e+09 107, 3.23599e+09 108, 3.23599e+09 109, 3.23599e+09 110, 3.23599e+09 111, 3.23599e+09 112, 3.23599e+09 113, 3.23599e+09 114, 3.23599e+09 115, 3.23599e+09 116, 3.23599e+09 117, 3.23599e+09 118, 3.23599e+09 119, 3.23599e+09 120, 3.23599e+09 121, 3.23599e+09 122, 3.23599e+09 123, 3.23733e+09 124, 3.23733e+09 125, 3.23733e+09 126, 3.23733e+09 127, 3.23733e+09 128, 3.23733e+09 129, 3.23733e+09 130, 3.23733e+09 131, 3.23733e+09 132, 3.23733e+09 133, 3.23733e+09 134, 3.23733e+09 135, 3.23733e+09 136, 3.23733e+09 137, 3.23733e+09 138, 3.23733e+09 139, 3.23733e+09 140, 3.23733e+09 141, 3.23867e+09 142, 3.23867e+09 143, 3.23867e+09 144, 3.23867e+09 145, 3.23867e+09 146, 3.23867e+09 147, 3.23867e+09 148, 3.23867e+09 149, 3.23867e+09 150, 3.23867e+09 151, 3.23867e+09 152, 3.23867e+09 153, 3.23867e+09 154, 3.23867e+09 155, 3.23867e+09 156, 3.23867e+09 157, 3.23867e+09 158, 3.23867e+09 159, 3.24002e+09 160, 3.24002e+09 161, 3.24002e+09 162, 3.24002e+09 163, 3.24002e+09 164, 3.24002e+09 165, 3.24002e+09 166, 3.24002e+09 167, 3.24002e+09 168, 3.24002e+09 169, 3.24002e+09 170, 3.24002e+09 171, 3.24002e+09 172, 3.24002e+09 173, 
3.24002e+09 174, 3.24002e+09 175, 3.24002e+09 176, 3.24002e+09 177, 3.24002e+09 178, 3.24002e+09 179, 3.24002e+09 180, 3.24002e+09 181, 3.24002e+09 182, 3.24002e+09 183, 3.24136e+09 184, 3.24136e+09 185, 3.24136e+09 186, 3.24136e+09 187, 3.24136e+09 188, 3.24136e+09 189, 3.24136e+09 190, 3.24136e+09 191, 3.24136e+09 192, 3.24136e+09 193, 3.24136e+09 194, 3.24136e+09 195, 3.24136e+09 196, 3.24136e+09 197, 3.24136e+09 198, 3.24136e+09 199, 3.24136e+09 200, 3.24136e+09 201, 3.24136e+09 202, 3.24136e+09 203, 3.24136e+09 204, 3.24136e+09 205, 3.24136e+09 206, 3.2427e+09 207, 3.2427e+09 208, 3.2427e+09 209, 3.2427e+09 210, 3.2427e+09 211, 3.2427e+09 212, 3.2427e+09 213, 3.2427e+09 214, 3.2427e+09 215, 3.2427e+09 216, 3.2427e+09 217, 3.2427e+09 218, 3.2427e+09 219, 3.2427e+09 220, 3.2427e+09 221, 3.2427e+09 222, 3.2427e+09 223, 3.2427e+09 224, 3.24404e+09 225, 3.24404e+09 226, 3.24404e+09 227, 3.24404e+09 228, 3.24404e+09 229, 3.24404e+09 230, 3.24404e+09 231, 3.24404e+09 232, 3.24404e+09 233, 3.24404e+09 234, 3.24404e+09 235, 3.24404e+09 236, 3.24404e+09 237, 3.24404e+09 238, 3.24404e+09 239, 3.24404e+09 240, 3.24404e+09 241, 3.24404e+09 242, 3.24404e+09 243, 3.24404e+09 244, 3.24404e+09 245, 3.24404e+09 246, 3.24404e+09 247, 3.24538e+09 248, 3.24538e+09 249, 3.24538e+09 250, 3.24538e+09 251, 3.24538e+09 252, 3.24538e+09 253, 3.24538e+09 254, 3.24538e+09 255, 3.24538e+09 256, 3.24538e+09 257, 3.24538e+09 258, 3.24538e+09 259, 3.24538e+09 260, 3.24538e+09 261, 3.24538e+09 262, 3.24538e+09 263, 3.24538e+09 264, 3.24538e+09 265, 3.24538e+09 266, 3.24538e+09 267, 3.24538e+09 268, 3.24538e+09 269, 3.24538e+09 270, 3.24538e+09 271, 3.24538e+09 272, 3.24538e+09 273, 3.24673e+09 274, 3.24673e+09 275, 3.24673e+09 276, 3.24673e+09 277, 3.24673e+09 278, 3.24673e+09 279, 3.24673e+09 280, 3.24673e+09 281, 3.24673e+09 282, 3.24673e+09 283, 3.24673e+09 284, 3.24673e+09 285, 3.24673e+09 286, 3.24673e+09 287, 3.24673e+09 288, 3.24673e+09 289, 3.24673e+09 290, 3.24673e+09 291, 3.24673e+09 
292, 3.24673e+09 293, 3.24673e+09 294, 3.24673e+09 295, 3.24673e+09 296, 3.24673e+09 297, 3.24807e+09 298, 3.24807e+09 299, 3.24807e+09 300, 3.24807e+09 301, 3.24807e+09 302, 3.24807e+09 303, 3.24807e+09 304, 3.24807e+09 305, 3.24807e+09 306, 3.24807e+09 307, 3.24807e+09 308, 3.24807e+09 309, 3.24807e+09 310, 3.24807e+09 311, 3.24807e+09 312, 3.24807e+09 313, 3.24807e+09 314, 3.24807e+09 315, 3.24807e+09 316, 3.24807e+09 317, 3.24807e+09 318, 3.24807e+09 319, 3.24807e+09 320, 3.24807e+09 321, 3.24807e+09 322, 3.24941e+09 323, 3.24941e+09 324, 3.24941e+09 325, 3.24941e+09 326, 3.24941e+09 327, 3.24941e+09 328, 3.24941e+09 329, 3.24941e+09 330, 3.24941e+09 331, 3.24941e+09 332, 3.24941e+09 333, 3.24941e+09 334, 3.24941e+09 335, 3.24941e+09 336, 3.24941e+09 337, 3.24941e+09 338, 3.24941e+09 339, 3.24941e+09 340, 3.25075e+09 341, 3.25075e+09 342, 3.25075e+09 343, 3.25075e+09 344, 3.25075e+09 345, 3.25075e+09 346, 3.25075e+09 347, 3.25075e+09 348, 3.25075e+09 349, 3.25075e+09 350, 3.25075e+09 351, 3.25075e+09 352, 3.25075e+09 353, 3.25075e+09 354, 3.25075e+09 355, 3.25075e+09 356, 3.25075e+09 357, 3.25075e+09 358, 3.25075e+09 359, 3.25075e+09 360, 3.25075e+09 361, 3.25075e+09 362, 3.25075e+09 363, 3.25075e+09 364, 3.25075e+09 365, 3.25075e+09 366, 3.25075e+09 367, 3.25075e+09 368, 3.25075e+09 369, 3.25075e+09 370, 3.25075e+09 371, 3.25075e+09 372, 3.25075e+09 373, 3.2521e+09 374, 3.2521e+09 375, 3.2521e+09 376, 3.2521e+09 377, 3.2521e+09 378, 3.2521e+09 379, 3.2521e+09 380, 3.2521e+09 381, 3.2521e+09 382, 3.2521e+09 383, 3.2521e+09 384, 3.2521e+09 385, 3.2521e+09 386, 3.2521e+09 387, 3.2521e+09 388, 3.2521e+09 389, 3.2521e+09 390, 3.2521e+09 391, 3.2521e+09 392, 3.2521e+09 393, 3.2521e+09 394, 3.2521e+09 395, 3.2521e+09 396, 3.2521e+09 397, 3.2521e+09 398, 3.2521e+09 399, 3.25344e+09 400, 3.25344e+09 401, 3.25344e+09 402, 3.25344e+09 403, 3.25344e+09 404, 3.25344e+09 405, 3.25344e+09 406, 3.25344e+09 407, 3.25344e+09 408, 3.25344e+09 409, 3.25344e+09 410, 3.25344e+09 
411, 3.25344e+09 412, 3.25344e+09 413, 3.25344e+09 414, 3.25344e+09 415, 3.25344e+09 416, 3.25344e+09 417, 3.25344e+09 418, 3.25344e+09 419, 3.25344e+09 420, 3.25344e+09 421, 3.25344e+09 422, 3.25344e+09 423, 3.25344e+09 424, 3.25344e+09 425, 3.25344e+09 426, 3.25344e+09 427, 3.25344e+09 428, 3.25478e+09 429, 3.25478e+09 430, 3.25478e+09 431, 3.25478e+09 432, 3.25478e+09 433, 3.25478e+09 434, 3.25478e+09 435, 3.25478e+09 436, 3.25478e+09 437, 3.25478e+09 438, 3.25478e+09 439, 3.25478e+09 440, 3.25478e+09 441, 3.25478e+09 442, 3.25478e+09 443, 3.25478e+09 444, 3.25478e+09 445, 3.25478e+09 446, 3.25478e+09 447, 3.25478e+09 448, 3.25478e+09 449, 3.25612e+09 450, 3.25612e+09 451, 3.25612e+09 452, 3.25612e+09 453, 3.25612e+09 454, 3.25612e+09 455, 3.25612e+09 456, 3.25612e+09 457, 3.25612e+09 458, 3.25612e+09 459, 3.25612e+09 460, 3.25612e+09 461, 3.25612e+09 462, 3.25612e+09 463, 3.25612e+09 464, 3.25612e+09 465, 3.25612e+09 466, 3.25612e+09 467, 3.25612e+09 468, 3.25612e+09 469, 3.25612e+09 470, 3.25612e+09 471, 3.25612e+09 472, 3.25612e+09 473, 3.25612e+09 474, 3.25612e+09 475, 3.25612e+09 476, 3.25612e+09 477, 3.25612e+09 478, 3.25612e+09 479, 3.25612e+09 480, 3.25612e+09 481, 3.25612e+09 482, 3.25612e+09 483, 3.25612e+09 484, 3.25612e+09 485, 3.25612e+09 486, 3.25746e+09 487, 3.25746e+09 488, 3.25746e+09 489, 3.25746e+09 490, 3.25746e+09 491, 3.25746e+09 492, 3.25746e+09 493, 3.25746e+09 494, 3.25746e+09 495, 3.25746e+09 496, 3.25746e+09 497, 3.25746e+09 498, 3.25746e+09 499, 3.25746e+09 500, 3.25746e+09 501, 3.25746e+09 502, 3.25746e+09 503, 3.25746e+09 504, 3.25746e+09 505, 3.25746e+09 506, 3.25746e+09 507, 3.25746e+09 508, 3.25746e+09 509, 3.25746e+09 510, 3.25746e+09 511, 3.25746e+09 512, 3.25746e+09 513, 3.25746e+09 514, 3.25746e+09 515, 3.25746e+09 516, 3.25746e+09 517, 3.25746e+09 518, 3.25746e+09 519, 3.25746e+09 520, 3.25746e+09 521, 3.25881e+09 522, 3.25881e+09 523, 3.25881e+09 524, 3.25881e+09 525, 3.25881e+09 526, 3.25881e+09 527, 3.25881e+09 528, 
3.25881e+09 529, 3.25881e+09 530, 3.25881e+09 531, 3.25881e+09 532, 3.25881e+09 533, 3.25881e+09 534, 3.25881e+09 535, 3.25881e+09 536, 3.25881e+09 537, 3.25881e+09 538, 3.25881e+09 539, 3.25881e+09 540, 3.25881e+09 541, 3.25881e+09 542, 3.25881e+09 543, 3.25881e+09 544, 3.25881e+09 545, 3.25881e+09 546, 3.25881e+09 547, 3.25881e+09 548, 3.25881e+09 549, 3.25881e+09 550, 3.25881e+09 551, 3.25881e+09 552, 3.25881e+09 553, 3.25881e+09 554, 3.25881e+09 555, 3.26015e+09 556, 3.26015e+09 557, 3.26015e+09 558, 3.26015e+09 559, 3.26015e+09 560, 3.26015e+09 561, 3.26015e+09 562, 3.26015e+09 563, 3.26015e+09 564, 3.26015e+09 565, 3.26015e+09 566, 3.26015e+09 567, 3.26015e+09 568, 3.26015e+09 569, 3.26015e+09 570, 3.26015e+09 571, 3.26015e+09 572, 3.26015e+09 573, 3.26015e+09 574, 3.26015e+09 575, 3.26015e+09 576, 3.26015e+09 577, 3.26015e+09 578, 3.26015e+09 579, 3.26015e+09 580, 3.26015e+09 581, 3.26149e+09 582, 3.26149e+09 583, 3.26149e+09 584, 3.26149e+09 585, 3.26149e+09 586, 3.26149e+09 587, 3.26149e+09 588, 3.26149e+09 589, 3.26149e+09 590, 3.26149e+09 591, 3.26149e+09 592, 3.26149e+09 593, 3.26149e+09 594, 3.26149e+09 595, 3.26149e+09 596, 3.26149e+09 597, 3.26149e+09 598, 3.26149e+09 599, 3.26149e+09 600, 3.26149e+09 601, 3.26149e+09 602, 3.26283e+09 603, 3.26283e+09 604, 3.26283e+09 605, 3.26283e+09 606, 3.26283e+09 607, 3.26283e+09 608, 3.26283e+09 609, 3.26283e+09 610, 3.26283e+09 611, 3.26283e+09 612, 3.26283e+09 613, 3.26283e+09 614, 3.26283e+09 615, 3.26283e+09 616, 3.26283e+09 617, 3.26283e+09 618, 3.26283e+09 619, 3.26283e+09 620, 3.26283e+09 621, 3.26283e+09 622, 3.26283e+09 623, 3.26418e+09 624, 3.26418e+09 625, 3.26418e+09 626, 3.26418e+09 627, 3.26418e+09 628, 3.26418e+09 629, 3.26418e+09 630, 3.26418e+09 631, 3.26418e+09 632, 3.26418e+09 633, 3.26418e+09 634, 3.26418e+09 635, 3.26418e+09 636, 3.26418e+09 637, 3.26418e+09 638, 3.26418e+09 639, 3.26418e+09 640, 3.26418e+09 641, 3.26418e+09 642, 3.26418e+09 643, 3.26418e+09 644, 3.26418e+09 645, 
3.26418e+09 646, 3.26418e+09 647, 3.26418e+09 648, 3.26418e+09 649, 3.26418e+09 650, 3.26418e+09 651, 3.26418e+09 652, 3.26418e+09 653, 3.26418e+09 654, 3.26418e+09 655, 3.26418e+09 656, 3.26418e+09 657, 3.26418e+09 658, 3.26418e+09 659, 3.26418e+09 660, 3.26552e+09 661, 3.26552e+09 662, 3.26552e+09 663, 3.26552e+09 664, 3.26552e+09 665, 3.26552e+09 666, 3.26552e+09 667, 3.26552e+09 668, 3.26552e+09 669, 3.26552e+09 670, 3.26552e+09 671, 3.26552e+09 672, 3.26552e+09 673, 3.26552e+09 674, 3.26552e+09 675, 3.26552e+09 676, 3.26552e+09 677, 3.26552e+09 678, 3.26552e+09 679, 3.26552e+09 680, 3.26552e+09 681, 3.26552e+09 682, 3.26552e+09 683, 3.26552e+09 684, 3.26552e+09 685, 3.26552e+09 686, 3.26552e+09 687, 3.26552e+09 688, 3.26552e+09 689, 3.26686e+09 690, 3.26686e+09 691, 3.26686e+09 692, 3.26686e+09 693, 3.26686e+09 694, 3.26686e+09 695, 3.26686e+09 696, 3.26686e+09 697, 3.26686e+09 698, 3.26686e+09 699, 3.26686e+09 700, 3.26686e+09 701, 3.26686e+09 702, 3.26686e+09 703, 3.26686e+09 704, 3.26686e+09 705, 3.26686e+09 706, 3.26686e+09 707, 3.26686e+09 708, 3.26686e+09 709, 3.26686e+09 710, 3.26686e+09 711, 3.26686e+09 712, 3.26686e+09 713, 3.26686e+09 714, 3.26686e+09 715, 3.26686e+09 716, 3.26686e+09 717, 3.26686e+09 718, 3.2682e+09 719, 3.2682e+09 720, 3.2682e+09 721, 3.2682e+09 722, 3.2682e+09 723, 3.2682e+09 724, 3.2682e+09 725, 3.2682e+09 726, 3.2682e+09 727, 3.2682e+09 728, 3.2682e+09 729, 3.2682e+09 730, 3.2682e+09 731, 3.2682e+09 732, 3.2682e+09 733, 3.2682e+09 734, 3.2682e+09 735, 3.2682e+09 736, 3.2682e+09 737, 3.2682e+09 738, 3.2682e+09 739, 3.2682e+09 740, 3.2682e+09 741, 3.2682e+09 742, 3.2682e+09 743, 3.2682e+09 744, 3.26954e+09 745, 3.26954e+09 746, 3.26954e+09 747, 3.26954e+09 748, 3.26954e+09 749, 3.26954e+09 750, 3.26954e+09 751, 3.26954e+09 752, 3.26954e+09 753, 3.26954e+09 754, 3.26954e+09 755, 3.26954e+09 756, 3.26954e+09 757, 3.26954e+09 758, 3.26954e+09 759, 3.26954e+09 760, 3.26954e+09 761, 3.26954e+09 762, 3.26954e+09 763, 3.26954e+09 764, 
3.26954e+09 765, 3.26954e+09 766, 3.26954e+09 767, 3.27089e+09 768, 3.27089e+09 769, 3.27089e+09 770, 3.27089e+09 771, 3.27089e+09 772, 3.27089e+09 773, 3.27089e+09 774, 3.27089e+09 775, 3.27089e+09 776, 3.27089e+09 777, 3.27089e+09 778, 3.27089e+09 779, 3.27089e+09 780, 3.27089e+09 781, 3.27089e+09 782, 3.27089e+09 783, 3.27089e+09 784, 3.27089e+09 785, 3.27089e+09 786, 3.27089e+09 787, 3.27089e+09 788, 3.27089e+09 789, 3.27089e+09 790, 3.27223e+09 791, 3.27223e+09 792, 3.27223e+09 793, 3.27223e+09 794, 3.27223e+09 795, 3.27223e+09 796, 3.27223e+09 797, 3.27223e+09 798, 3.27223e+09 799, 3.27223e+09 800, 3.27223e+09 801, 3.27223e+09 802, 3.27223e+09 803, 3.27223e+09 804, 3.27223e+09 805, 3.27223e+09 806, 3.27223e+09 807, 3.27223e+09 808, 3.27223e+09 809, 3.27223e+09 810, 3.27357e+09 811, 3.27357e+09 812, 3.27357e+09 813, 3.27357e+09 814, 3.27357e+09 815, 3.27357e+09 816, 3.27357e+09 817, 3.27357e+09 818, 3.27357e+09 819, 3.27357e+09 820, 3.27491e+09 821, 3.27491e+09 822, 3.27491e+09 823, 3.27491e+09 824, 3.27491e+09 825, 3.27491e+09 826, 3.27491e+09 827, 3.27491e+09 828, 3.27491e+09 829, 3.27491e+09 830, 3.27491e+09 831, 3.27491e+09 832, 3.27491e+09 833, 3.27625e+09 834, 3.27625e+09 835, 3.27625e+09 836, 3.27625e+09 837, 3.27625e+09 838, 3.27625e+09 839, 3.27625e+09 840, 3.27625e+09 841, 3.27625e+09 842, 3.27625e+09 843, 3.27625e+09 844, 3.27625e+09 845, 3.2776e+09 846, 3.2776e+09 847, 3.2776e+09 848, 3.2776e+09 849, 3.2776e+09 850, 3.2776e+09 851, 3.2776e+09 852, 3.2776e+09 853, 3.2776e+09 854, 3.2776e+09 855, 3.2776e+09 856, 3.2776e+09 857, 3.2776e+09 858, 3.2776e+09 859, 3.27894e+09 860, 3.27894e+09 861, 3.27894e+09 862, 3.27894e+09 863, 3.27894e+09 864, 3.27894e+09 865, 3.27894e+09 866, 3.27894e+09 867, 3.27894e+09 868, 3.27894e+09 869, 3.27894e+09 870, 3.28028e+09 871, 3.28028e+09 872, 3.28028e+09 873, 3.28028e+09 874, 3.28028e+09 875, 3.28028e+09 876, 3.28028e+09 877, 3.28028e+09 878, 3.28028e+09 879, 3.28028e+09 880, 3.28028e+09 881, 3.28028e+09 882, 
3.28028e+09 883, 3.28028e+09 884, 3.28028e+09 885, 3.28162e+09 886, 3.28162e+09 887, 3.28162e+09 888, 3.28162e+09 889, 3.28162e+09 890, 3.28162e+09 891, 3.28162e+09 892, 3.28162e+09 893, 3.28162e+09 894, 3.28297e+09 895, 3.28297e+09 896, 3.28297e+09 897, 3.28297e+09 898, 3.28297e+09 899, 3.28297e+09 900, 3.28431e+09 901, 3.28431e+09 902, 3.28431e+09 903, 3.28431e+09 904, 3.28431e+09 905, 3.28431e+09 906, 3.28431e+09 907, 3.28431e+09 908, 3.28431e+09 909, 3.28431e+09 910, 3.28431e+09 911, 3.28431e+09 912, 3.28431e+09 913, 3.28565e+09 914, 3.28565e+09 915, 3.28565e+09 916, 3.28565e+09 917, 3.28565e+09 918, 3.28565e+09 919, 3.28699e+09 920, 3.28699e+09 921, 3.28699e+09 922, 3.28699e+09 923, 3.28699e+09 924, 3.28699e+09 925, 3.28833e+09 926, 3.28833e+09 927, 3.28833e+09 928, 3.28833e+09 929, 3.28833e+09 930, 3.28833e+09 931, 3.28833e+09 932, 3.28968e+09 933, 3.28968e+09 934, 3.28968e+09 935, 3.28968e+09 936, 3.28968e+09 937, 3.28968e+09 938, 3.29102e+09 939, 3.29102e+09 940, 3.29236e+09 941, 3.29236e+09 942, 3.29236e+09 943, 3.29236e+09 944, 3.29236e+09 945, 3.2937e+09 946, 3.2937e+09 947, 3.2937e+09 948, 3.29505e+09 949, 3.29505e+09 950, 3.29639e+09 951, 3.29773e+09 952, 3.29773e+09 953, 3.29773e+09 954, 3.29773e+09 955, 3.29773e+09 956, 3.29907e+09 957, 3.29907e+09 958, 3.30041e+09 959, 3.30041e+09 960, 3.30176e+09 961, 3.30176e+09 962, 3.30176e+09 963, 3.30176e+09 964, 3.3031e+09 965, 3.3031e+09 966, 3.30444e+09 967, 3.30578e+09 968, 3.30578e+09 969, 3.30578e+09 970, 3.30578e+09 971, 3.30578e+09 972, 3.30712e+09 973, 3.30712e+09 974, 3.30847e+09 975, 3.30981e+09 976, 3.31518e+09 977, 3.31518e+09 978, 3.31652e+09 979, 3.32055e+09 980, 3.32323e+09 981, 3.32592e+09 982, 3.32592e+09 983, 3.32726e+09 984, 3.3286e+09 985, 3.33128e+09 986, 3.33128e+09 987, 3.33397e+09 988, 3.35142e+09 989, 3.35544e+09 990, 3.35679e+09 991, 3.70575e+09 992, 4.54998e+09 993, 4.68554e+09 994, 5.29757e+09 995, 5.413e+09 996, 8.53222e+09 997, 9.34558e+09 998, 1.00247e+10 999, 4.0715e+10 Latency 
typical: 3.25746e+09 usec Latency best : 3.20378e+09 usec Latency worst : 4.0715e+10 usec rdma_bw -H: reports invalid usage :/ From mst at mellanox.co.il Mon Jan 23 15:39:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 01:39:07 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123233438.GQ5074@us.ibm.com> References: <20060123233438.GQ5074@us.ibm.com> Message-ID: <20060123233907.GC29917@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > On 23.01.2006 [23:22:45 +0200], Michael S. Tsirkin wrote: > > Quoting r. Nishanth Aravamudan : > > > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > > > > > On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > > > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > > > > Quoting r. Shirley Ma : > > > > > > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > > > > > > > > > > > > > > > > > >If true, it seems that this line > > > > > > >typedef unsigned long long cycles_t; > > > > > > >should be replaced by > > > > > > >typedef unsigned long cycles_t; > > > > > > > > > > > > Yes. > > > > > > > > > > OK, I did it this way. > > > > > # svn ci get_clock.h > > > > > Sending get_clock.h > > > > > Transmitting file data . > > > > > Committed revision 5163. > > > > > > > > > > You might want to try this rev out. > > > > > > > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > > > > then run 5163. > > > > > > Looks like 5162/5163 is fine building wise. 
Here is what I got for > > > rdma_lat for a 32-bit server and 32-bit client: > > > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > > > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > > > Latency typical: 3.25746e+09 usec > > > Latency best : 3.19975e+09 usec > > > Latency worst : 4.21767e+10 usec > > > > > > and for rdma_bw: > > > > > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > > > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > > > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > > > Bandwidth average: 4.3446e-07 MB/sec > > > Service Demand peak (#0 to #999): 17301 cycles/KB > > > Service Demand Avg : 0 cycles/KB > > > > > > So it's still present... > > > > > > Thanks, > > > Nish > > > > Hmm. Could you please try running e.g. rdma_lat with -H to get all the results? > > rdma_lat -H: > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x304b04 RKey 0x2340032 VAddr 0x00000010019001 > remote address: LID 0x08 QPN 0x140406 PSN 0xc99ca RKey 0x2340032 VAddr 0x00000010019001 > #, usec > Latency typical: 3.25746e+09 usec > Latency best : 3.20378e+09 usec > Latency worst : 4.0715e+10 usec Could the high/low bits be swapped? What happens if you change cycles_t from long long to long? Could you try running the clock_test utility? -- MST From nacc at us.ibm.com Mon Jan 23 15:40:03 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 15:40:03 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123230132.GA29917@mellanox.co.il> References: <20060123211454.GM5074@us.ibm.com> <20060123230132.GA29917@mellanox.co.il> Message-ID: <20060123234003.GR5074@us.ibm.com> On 24.01.2006 [01:01:32 +0200], Michael S. Tsirkin wrote: > Quoting r. 
Nishanth Aravamudan : > > Looks like 5162/5163 is fine building wise. Here is what I got for > > rdma_lat for a 32-bit server and 32-bit client: > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > > Latency typical: 3.25746e+09 usec > > Latency best : 3.19975e+09 usec > > Latency worst : 4.21767e+10 usec > > > > and for rdma_bw: > > > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > > Bandwidth average: 4.3446e-07 MB/sec > > Service Demand peak (#0 to #999): 17301 cycles/KB > > Service Demand Avg : 0 cycles/KB > > > > So it's still present... > > > > I have just uploaded a simple utility which I called clock_test which > measures a clock once a second: this way you'll know whether mftb > is measuring time properly. Will it get built by running make in the perftest directory? Any special usage I should know about? Thanks, Nish From mst at mellanox.co.il Mon Jan 23 15:44:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 01:44:42 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123234003.GR5074@us.ibm.com> References: <20060123234003.GR5074@us.ibm.com> Message-ID: <20060123234442.GD29917@mellanox.co.il> Quoting r. Nishanth Aravamudan : > > I have just uploaded a simple utility which I called clock_test which > > measures a clock once a second: this way you'll know whether mftb > > is measuring time properly. > > Will it get built by running make in the perftest directory? Yes. > Any special usage I should know about? Look at its source, you'll see. 
You just run it for a while and it will print out the time taken from mftb each second. Kill it with CTRL-C. -- MST From nacc at us.ibm.com Mon Jan 23 15:48:50 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 23 Jan 2006 15:48:50 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123233907.GC29917@mellanox.co.il> References: <20060123233438.GQ5074@us.ibm.com> <20060123233907.GC29917@mellanox.co.il> Message-ID: <20060123234850.GS5074@us.ibm.com> On 24.01.2006 [01:39:07 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > > > On 23.01.2006 [23:22:45 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Nishanth Aravamudan : > > > > Subject: Re: Re: Userspace testing results (many kernels,many svn trees) > > > > > > > > On 23.01.2006 [10:24:12 -0800], Nishanth Aravamudan wrote: > > > > > On 23.01.2006 [19:55:05 +0200], Michael S. Tsirkin wrote: > > > > > > Quoting r. Shirley Ma : > > > > > > > Subject: Re: Re: Userspace testing results (many kernels,?many svn trees) > > > > > > > > > > > > > > > > > > > > > >If true, it seems that this line > > > > > > > >typedef unsigned long long cycles_t; > > > > > > > >should be replaced by > > > > > > > >typedef unsigned long cycles_t; > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > OK, I did it this way. > > > > > > # svn ci get_clock.h > > > > > > Sending get_clock.h > > > > > > Transmitting file data . > > > > > > Committed revision 5163. > > > > > > > > > > > > You might want to try this rev out. > > > > > > > > > > heh, ok. I'm going to let the 5162 version of a 32/32 setup finish, > > > > > then run 5163. > > > > > > > > Looks like 5162/5163 is fine building wise. 
Here is what I got for > > > > rdma_lat for a 32-bit server and 32-bit client: > > > > > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x253f3e RKey 0x2340032 VAddr 0x00000010019001 > > > > remote address: LID 0x08 QPN 0x140406 PSN 0xa79d77 RKey 0x2340032 VAddr 0x00000010019001 > > > > Latency typical: 3.25746e+09 usec > > > > Latency best : 3.19975e+09 usec > > > > Latency worst : 4.21767e+10 usec > > > > > > > > and for rdma_bw: > > > > > > > > loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x1b3ee5 RKey 0x23a0032 VAddr 0x000000f7fce000 > > > > remote address: LID 0x08, QPN 0x150406, PSN 0x2fa0a9, RKey 0x23a0032 VAddr 0x000000f7fb8000 > > > > Bandwidth peak (#0 to #999): 4.3446e-07 MB/sec > > > > Bandwidth average: 4.3446e-07 MB/sec > > > > Service Demand peak (#0 to #999): 17301 cycles/KB > > > > Service Demand Avg : 0 cycles/KB > > > > > > > > So it's still present... > > > > > > > > Thanks, > > > > Nish > > > > > > Hmm. Could you please try running e.g. rdma_lat with -H to get all the results? > > > > rdma_lat -H: > > > > loading libehca local address: LID 0x0d QPN 0x140406 PSN 0x304b04 RKey 0x2340032 VAddr 0x00000010019001 > > remote address: LID 0x08 QPN 0x140406 PSN 0xc99ca RKey 0x2340032 VAddr 0x00000010019001 > > #, usec > > Latency typical: 3.25746e+09 usec > > Latency best : 3.20378e+09 usec > > Latency worst : 4.0715e+10 usec > > Could the high/low bits be swapped? I was thinking that might be it, but wasn't sure. > What happens if you change cycles_t from long long to long? > Could you try running the clock_test utility? I'll try the latter first (I found the usage from the file and it is built by make). Could you send me a patch to do the former, in case I need to? It can be a patch that applies directly in the perftest directory. 
Thanks, Nish From rdreier at cisco.com Mon Jan 23 15:53:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 15:53:19 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <20060123215516.GG29214@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 23 Jan 2006 13:55:16 -0800") References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> <20060123215516.GG29214@esmail.cup.hp.com> Message-ID: > Yes, but we need to start somewhere. Until someone submits > a driver that does all the things you mention, it makes > sense to move forward with what has been proposed to date. I agree with this, and overall I am very much in favor of getting iWARP support all the way upstream. The reason I want to take time to make sure that we have the right code before we merge it is that I get the feeling that there may be elements of a) using the IB tree to get changes upstream that would be vetoed on netdev and b) trying to get openib and the kernel community to accept code just so a vendor can meet a product marketing deadline. BTW, upon reflection, the best idea for moving this forward might be to push the Ammasso driver along with the rest of the iWARP patches, so that there's some more context for review. Just because a vendor is out of business is no reason for Linux not to have a driver for a piece of hardware. - R. From arlin.r.davis at intel.com Mon Jan 23 16:14:33 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 23 Jan 2006 16:14:33 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: <59278FC0C48A994BABABD069571E45680DB4FE26@orsmsx401.amr.corp.intel.com> Arkady, Response inline... 
________________________________
From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
Sent: Tuesday, January 17, 2006 7:16 AM
To: Davis, Arlin R; Lentini, James
Cc: dat-discussions at yahoogroups.com; openib-general at openib.org
Subject: RE: [RFC] DAT 2.0 immediate data proposal

Arlin, a few things need to be addressed.

1. Correlation with local and remote invalidate. This potentially affects both DAT_DTOs and post operations.

How does this differ from normal sends or writes?

2. Need a precise definition of CONFIRM_FLAG in a transport-independent fashion. What guarantees does the DAT Provider "provide" on successful local completion? A remote-end guarantee? My understanding is that you are trying to create two models, one for IB and one for iWARP. So for IB, Consumers will use CONFIRM_FLAG, and for iWARP, IMMED_FLAG. The Provider will indicate in Provider_attr which model it supports. The issue I have with this is that I do not see a model that a Consumer can use to write transport-independent code. It looks like Immed_flag can be made transport independent. But with the "sender" specifying the behavior, a protocol extension is needed for IB. IB will always deliver immediate data in the header, not in the payload, and the remote Provider can control how it is delivered to a Consumer. But this means that there is no need for DTO_flags on the Send side. Instead they can be used on the Recv side or controlled purely by the Provider.

Maybe we need to just go back to one model and always deliver via the event? With the post_recv_immed requirements, other transports have a mechanism to emulate and create the necessary resources on the recv side to place idata and copy to event when operation is completed. Would this work for iWARP?

Two different models for receiving idata should be avoided if at all possible.

3. Need to define error behavior for new operations, async errors, EP behavior.

I will work on updating the draft.
post_send_immed will look much like post_send and post_rdma_write_immed will look a lot like post_rdma_write, with some additional errors based on the post receive buffer requirement.

4. Need to define DAT_Provider attributes for immediate data and dto_flags behavior.

5. Is the Solicited_wait completion_flag value now applicable to RDMA_write with immediate data?

yes, applicable to send, send_immed, and write_immed

6. Does dto_completion_data xfer_length include the immediate_data size or not?

no

7. What memory privileges are needed for a recv buffer for immediate data?

Based on the operation... write_immed would require write privileges and send_immed would require recv privileges.

8. SRQ interaction?

Good question. all post_recv_immed or all post_recv?

9. What happens if a buffer posted for a recv operation, NOT recv_immed, is matched for an incoming recv/rdma_write op?

The rules should be: Can receive a send, send_immed, or write_immed with recv_immed. Cannot receive send_immed or write_immed on a recv. However, I am not sure how you would enforce this on IB (DTO error on the receiving side?) since the idata is delivered via CQ and does not require a special receive post descriptor.

10. Change dat_ep_post_write_immed to dat_ep_post_rdma_write_immed to be consistent with current terminology.

Ok

11. Need to clean up the operation description to make it clear that the Send|RDMA_write and the immediate data part are a single atomic operation. The current "followed by" language is misleading. Make it explicit that there is a single local DTO completion and a single remote DTO completion.

Ok, I will clean that up

12. Is your intention that post_recv_immed can ONLY accept immediate data and is not capable of receiving any message?

No, the intention is to extend the post_recv to handle 32bit idata which may arrive with or without other send or rdma_write data.

Does it make more sense to add a dto_flags to the existing post_recv?

13.
size should be num_segments for dat_ep_post_recv_immed()

ok

From mshefty at ichips.intel.com Mon Jan 23 16:28:22 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Jan 2006 16:28:22 -0800
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
In-Reply-To: <59278FC0C48A994BABABD069571E45680DB4FE26@orsmsx401.amr.corp.intel.com>
References: <59278FC0C48A994BABABD069571E45680DB4FE26@orsmsx401.amr.corp.intel.com>
Message-ID: <43D574A6.3050208@ichips.intel.com>

Davis, Arlin R wrote:
> *Maybe we need to just go back to one model and always deliver via the
> event? With the post_recv_immed requirements, other transports have a
> mechanism to emulate and create the necessary resources on the recv side
> to place idata and copy to event when operation is completed. Would this
> work for iWARP?*

You don't want post_recv_immed. The receiver shouldn't have to indicate whether a receive will get immediate data or not.

> 12. Is your intention that post_recv_immed can ONLY accept immediate
> data and is not capable to recv any message?
>
> *No, the intention is to extend the post_recv to handle 32bit idata
> which may arrive with or without other send or rdma_write data.*
>
> *Does it make more sense to add a dto_flags to the existing post_recv?*

This looks like an API designed around hardware that doesn't support immediate data, rather than one that actually does. Post_recv_immed doesn't map to IB.

- Sean

From caitlinb at broadcom.com Mon Jan 23 16:47:45 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Mon, 23 Jan 2006 16:47:45 -0800
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C337E@NT-SJCA-0751.brcm.ad.broadcom.com>

> Maybe we need to just go back to one model and always deliver
> via the event?
> With the post_recv_immed requirements, other
> transports have a mechanism to emulate and create the
> necessary resources on the recv side to place idata and copy
> to event when operation is completed. Would this work for iWARP?
>
> Two different models for receiving idata should be avoided if
> at all possible.

Always delivering by the event is not feasible for an iWARP vendor. If you are working over RDMAC verbs then the work completion is no longer accessible by the time the Work Completion is reaped. So copying from the receive buffer to the event does not work since the location of the receive buffer is now known only to the application.

The same problem exists in the opposite direction for InfiniBand HCAs using standard verbs. They cannot copy from the CQE to the receive buffer.

So the user is stuck checking a flag or the event type to know where their data is. This is not terribly user friendly, but it is the best that can be offered if we want to enable this optimization. The need to check the flag does reduce the value of the optimization though.

> 6. Does dto_completion_data xfer_length include immediate_data
> size or not?
>
> no

Then how does the receiver know how much data there is? Even if an iWARP Provider attempts to optimize immediate placement into the CQ, it will end up setting the xfer_length whenever the packet is received out of order. So it is far simpler for the application to simply know that the data will be in the buffer, and that the xfer_length will be set. It doesn't need to worry about whether they were set by the cq_poll verb or by the hardware.

> 11. Need to clean up operation description to make it clear
> that Send|RDMA_write and immediate data part
> is a single atomic operation. The current "followed by"
> language is misleading.
> Make it explicit that there is a single local DTO completion
> and single remote DTO completion.
> > > > Ok, I will clean that up > > The best mapping available over RDMAC-compliant firmware for an iWARP NIC would be to post two operations (RDMA Write followed by a short Send). That would require additional space in the send and completion queues since a completion for the write can only be suppressed for a successful completion. Whether these extra slots were required would be an IA attribute. And the requirement is that nothing for that QP can come between the iWARP Write and the Send. How the provider does that is up to it. Options include locking over both posts and a composite work request. Anyone working over existing RDMAC-compliant verbs will have to use the first approach. > > 12. Is your intention that post_recv_immed can ONLY accept > immediate data and is not > > capable of receiving any message? > > > > No, the intention is to extend the post_recv to handle 32bit > idata which may arrive with or without other send or rdma_write data. > > > > Does it make more sense to add a dto_flags to the existing post_recv? > > How does this map to iWARP? When the data can be sent as an immediate OR as data, then when received it can be placed into the receive buffer or even potentially directly into the CQ when everything aligns just right. But an iWARP sender has to place the immediate value as the first four bytes of a Send message. There is no other mapping that makes sense. Shoving the rest of the message up is complex, as is using the last four bytes of the message since the last four bytes *could* cross a DDP Segment boundary, and would require the user to provide a buffer that was 4 bytes larger.
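The iWARP mapping described in the message above, where the 32-bit immediate value travels as the first four bytes of a Send message, can be sketched roughly as follows. This is only an illustration: the helper names idata_pack and idata_unpack are hypothetical and are not part of DAT, the RDMAC verbs, or any provider, and the use of network byte order and a transfer length that counts the four immediate bytes are assumptions (the thread leaves the xfer_length question open).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IDATA_LEN 4u /* size of the 32-bit immediate value on the wire */

/*
 * Hypothetical sender-side helper: lay out a Send payload as
 * [4 bytes idata, big-endian][user data].
 * Returns the total payload length, or 0 if 'out' is too small.
 */
static size_t idata_pack(uint8_t *out, size_t out_len, uint32_t idata,
                         const void *data, size_t data_len)
{
    assert(out != NULL);
    if (out_len < IDATA_LEN || out_len - IDATA_LEN < data_len)
        return 0;
    out[0] = (uint8_t)(idata >> 24);
    out[1] = (uint8_t)(idata >> 16);
    out[2] = (uint8_t)(idata >> 8);
    out[3] = (uint8_t)idata;
    if (data_len != 0)
        memcpy(out + IDATA_LEN, data, data_len);
    return IDATA_LEN + data_len;
}

/*
 * Hypothetical receiver-side helper: split a completed receive buffer
 * into the immediate value and the remaining user data.
 * 'xfer_len' is assumed to include the four immediate bytes.
 * Returns 0 on success, -1 if the message is too short to hold idata.
 */
static int idata_unpack(const uint8_t *buf, size_t xfer_len,
                        uint32_t *idata, const uint8_t **data,
                        size_t *data_len)
{
    assert(buf != NULL);
    if (xfer_len < IDATA_LEN)
        return -1;
    *idata = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
             ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    *data = buf + IDATA_LEN;
    *data_len = xfer_len - IDATA_LEN;
    return 0;
}
```

With this layout the user data always starts four bytes into the receive buffer, so the receiver never has to check a flag to find it; the cost is that every receive buffer must be posted four bytes larger than the largest expected message.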
From iod00d at hp.com Mon Jan 23 17:57:08 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 23 Jan 2006 17:57:08 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> <20060123215516.GG29214@esmail.cup.hp.com> Message-ID: <20060124015708.GQ29214@esmail.cup.hp.com> On Mon, Jan 23, 2006 at 03:53:19PM -0800, Roland Dreier wrote: > > Yes, but we need to start somewhere. Until someone submits > > a driver that does all the things you mention, it makes > > sense to move forward with what has been proposed to date. > > I agree with this, and overall I am very much in favor of getting > iWARP support all the way upstream. *nod* BTW, this is a message that needs to be repeated regularly until iWARP support *is* upstream. The opposite perception is still lingering in some places because of discussions from 1 and 2 years ago. > The reason I want to take time to make sure that we have the right > code before we merge it is that I get the feeling that there may be > elements of a) using the IB tree to get changes upstream that would be > vetoed on netdev Yeah, that has happened before. And I expect netdev folks might strongly object (if they haven't already) to some "sideband" method of managing TCP/IP config when TCP/IP is exclusively running on an RNIC (TOE with RDMA front-end). IMHO, that seems like the "hardest to fix" issue so everyone is happy. Most of the other details can be negotiated. > and b) trying to get openib and the kernel community > to accept code just so a vendor can meet a product marketing deadline. TTM via kernel.org? BWHAHAHA! :^) Sorry, I can't take that seriously. :) > BTW, upon reflection, the best idea for moving this forward might be > to push the Ammasso driver along with the rest of the iWARP patches, > so that there's some more context for review.
Just because a vendor > is out of business is no reason for Linux not to have a driver for a > piece of hardware. "Exactly." says the co-maintainer of the parisc-linux port. :) thanks, grant From Arkady.Kanevsky at netapp.com Mon Jan 23 19:39:16 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 23 Jan 2006 22:39:16 -0500 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: Arlin, comments inline. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 ________________________________ From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] Sent: Monday, January 23, 2006 7:15 PM To: Kanevsky, Arkady; Lentini, James Cc: openib-general at openib.org; dat-discussions at yahoogroups.com Subject: RE: [RFC] DAT 2.0 immediate data proposal Arkady, Response inline... ________________________________ From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] Sent: Tuesday, January 17, 2006 7:16 AM To: Davis, Arlin R; Lentini, James Cc: dat-discussions at yahoogroups.com; openib-general at openib.org Subject: RE: [RFC] DAT 2.0 immediate data proposal Arlin, a few things need to be addressed. 1. correlation with local and remote invalidate This potentially affects both DAT_DTOs and post operations How does this differ from normal sends or writes? [AK] We had added a new Send_with_Invalidate. The completion also states whether RMR was invalidated and which one. But the text for interaction is added throughout the completion and post operations. See the latest draft of uDAPL and kDAPL 1.3 specs on the DAT reflector. 2. Need a precise definition of CONFIRM_FLAG in a transport-independent fashion. What guarantees does the DAT Provider "provide" on successful local completion? Remote end guarantee? My understanding is that what you are trying to do is create 2 models, one for IB and one for iWARP.
So for IB, Consumers will use CONFIRM_FLAG, and for iWARP, IMMED_FLAG. The Provider will indicate in Provider_attr which model it supports. The issue I have with it is that I do not see a model that a Consumer can use to create transport-independent code. It looks like Immed_flag can be made transport independent. But with the "sender" specifying the behavior, a protocol extension is needed for IB. IB will always deliver Immediate data in the header, not in the payload, and the remote Provider can control how it is delivered to a Consumer. But this means that there is no need for DTO_flags on the Send side. Instead it can be used on the Recv side or controlled purely by the Provider. Maybe we need to just go back to one model and always deliver via the event? With the post_recv_immed requirements, other transports have a mechanism to emulate and create the necessary resources on the recv side to place idata and copy to event when operation is completed. Would this work for iWARP? Two different models for receiving idata should be avoided if at all possible. [AK] Caitlin already responded to this. 3. Need to define error behavior for new operations, async errors, EP behavior. I will work on updating the draft. post_send_immed will look much like post_send and post_rdma_write_immed will look a lot like post_rdma_write with some additional errors based on the post receive buffer requirement. [AK] Also consider if you want to add remote invalidate to the new operation. 4. Need to define DAT_Provider attributes for immediate data and dto_flags behavior 5. Is the Solicited_wait completion_flag value now applicable to RDMA_write with immediate data? yes, applicable to send, send_immed, and write_immed 6. Does dto_completion_data xfer_length include immediate_data size or not? no [AK] It can work both ways. Either we include 4 extra bytes for immediate data or not. The Consumer just has to know. The real data always starts at a 4-byte boundary into the buffer if immediate data is returned inline.
We need to state how immediate data is positioned if it is smaller than 4 bytes. 7. What memory privileges are needed for a recv buffer for immediate data? Based on the operation... write_immed would require write privileges and send_immed would require recv privileges. 8. SRQ interaction? Good question. all post_recv_immed or all post_recv? [AK] Will this work for the user model? Not supporting handling immediate recv and regular recv with potential immediate data on one SRQ. 9. What happens if a buffer for a recv operation (NOT recv_immed) is matched to an incoming recv/rdma_write op? The rules should be: Can receive a send, send_immed, or write_immed with recv_immed. Cannot receive send_immed or write_immed on a recv. However, I am not sure how you would enforce this on IB (DTO error on the receiving side?) since the idata is delivered via CQ and does not require a special receive post descriptor. [AK] We can make this a Provider attribute. Or we can state that if immed data is returned in the event then there is no error for recv. 10. Change dat_ep_post_write_immed to dat_ep_post_rdma_write_immed to be consistent with current terminology. Ok 11. Need to clean up operation description to make it clear that Send|RDMA_write and immediate data part is a single atomic operation. The current "followed by" language is misleading. Make it explicit that there is a single local DTO completion and single remote DTO completion. Ok, I will clean that up 12. Is your intention that post_recv_immed can ONLY accept immediate data and is not capable of receiving any message? No, the intention is to extend the post_recv to handle 32bit idata which may arrive with or without other send or rdma_write data. Does it make more sense to add a dto_flags to the existing post_recv? 13. size should be num_segments for dat_ep_post_recv_immed() ok
From tom at opengridcomputing.com Mon Jan 23 21:15:52 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 23 Jan 2006 23:15:52 -0600 Subject: [openib-general] [PATCH] RFC: AMSO1100 iWARP Driver Message-ID: <1138079753.4758.40.camel@strider.opengridcomputing.com> Given some of the discussion re: support for the AMSO1100, enclosed is a patch for an OpenIB provider in support of the AMSO1100. While we use these devices extensively for testing of iWARP support at OGC, the driver has not seen anywhere near the kind of attention that the mthca driver has. This patch requires the previously submitted iWARP core support and CMA patch. Please review and offer suggestions as to what we can do to improve it. There are some known issues with ULPs that do not filter based on node type and can become confused and crash when loading and unloading this driver. Patches are available for these ULP add_one and remove_one handlers, but these are trivial and can be considered separately. Index: Kconfig =================================================================== --- Kconfig (revision 5098) +++ Kconfig (working copy) @@ -32,6 +32,8 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/hw/amso1100/Kconfig" + source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" Index: Makefile =================================================================== --- Makefile (revision 5098) +++ Makefile (working copy) @@ -1,6 +1,7 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SDP) += ulp/sdp/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ Index: hw/amso1100/cc_ae.h =================================================================== --- hw/amso1100/cc_ae.h (revision 0) +++ hw/amso1100/cc_ae.h (revision 0) @@ -0,0 +1,108 @@ +/* + * Copyright (c)
2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CC_AE_H_ +#define _CC_AE_H_ + +/* + * WARNING: If you change this file, also bump CC_IVN_BASE + * in common/include/clustercore/cc_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. 
+ */ +typedef enum { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR + /* WARNING If you add more id's, make sure their values fit in eight bits. */ +} cc_event_id_t; + +/* + * Resource Indicators and Identifiers + */ +typedef enum { + CC_RES_IND_QP = 1, + CC_RES_IND_EP, + CC_RES_IND_CQ, + CC_RES_IND_SRQ, +} cc_resource_indicator_t; + +#endif /* _CC_AE_H_ */ Index: hw/amso1100/Kconfig =================================================================== --- hw/amso1100/Kconfig (revision 0) +++ hw/amso1100/Kconfig (revision 0) @@ -0,0 +1,15 @@ +config INFINIBAND_AMSO1100 + tristate "Ammasso 1100 HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for the Ammasso 1100 host + channel adapter (HCA). 
+ +config INFINIBAND_AMSO1100_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_AMSO1100 + default n + ---help--- + This option causes the amso1100 driver to produce a bunch of + debug messages. Select this if you are developing the driver + or trying to diagnose a problem. Index: hw/amso1100/c2_intr.c =================================================================== --- hw/amso1100/c2_intr.c (revision 0) +++ hw/amso1100/c2_intr.c (revision 0) @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "c2.h" +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void +c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(c2dev->hint_count)) { + mq_index = c2_read32(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void +handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + dprintk(KERN_INFO "handle_mq: stray activity for mq_index=%d\n", mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + spin_lock(&c2dev->aeq_lock); + c2_ae_event(c2dev, mq_index); + spin_unlock(&c2dev->aeq_lock); + break; + default: + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void +handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + ccwr_hdr_t *host_msg; + ccwr_hdr_t tmp; + struct c2_mq *reply_vq; + struct c2_vq_req* req; + + reply_vq = (struct c2_mq *)c2dev->qptr_array[mq_index]; + + { + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + dprintk("handle_vq: adapter_msg=%p\n", adapter_msg); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... 
+ */ + if (!host_msg) { + dprintk("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *)(unsigned long)host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting. + */ + vq_repbuf_free(c2dev, host_msg); + dprintk("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + req->reply_msg = (u64)(unsigned long)(reply_msg); + atomic_set(&req->reply_ready, 1); + dprintk("handle_vq: wakeup req %p\n", req); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); + } + +} + Index: hw/amso1100/c2_mq.c =================================================================== --- hw/amso1100/c2_mq.c (revision 0) +++ hw/amso1100/c2_mq.c (revision 0) @@ -0,0 +1,182 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) + +void * +c2_mq_alloc(struct c2_mq *q) +{ + assert(q); + assert(q->magic == C2_MQ_MAGIC); + assert(q->type == C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef C2_DEBUG + ccwr_hdr_t *m = (ccwr_hdr_t*)(q->msg_pool + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + assert(m->magic == be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + dprintk("c2_mq_alloc %p\n", m); + return m; +#else + return q->msg_pool + q->priv * q->msg_size; +#endif + } +} + +void +c2_mq_produce(struct c2_mq *q) +{ + assert(q); + assert(q->magic == C2_MQ_MAGIC); + assert(q->type == C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + BUMP(q, q->priv); + q->hint_count++; + /* Update peer's offset. 
*/ + q->peer->shared = cpu_to_be16(q->priv); + } +} + +void * +c2_mq_consume(struct c2_mq *q) +{ + assert(q); + assert(q->magic == C2_MQ_MAGIC); + assert(q->type == C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef C2_DEBUG + ccwr_hdr_t *m = (ccwr_hdr_t*) + (q->msg_pool + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + assert(m->magic == be32_to_cpu(CCWR_MAGIC)); +#endif + dprintk("c2_mq_consume %p\n", m); + return m; +#else + return q->msg_pool + q->priv * q->msg_size; +#endif + } +} + +void +c2_mq_free(struct c2_mq *q) +{ + assert(q); + assert(q->magic == C2_MQ_MAGIC); + assert(q->type == C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { +#ifdef C2_DEBUG +{ + dprintk("c2_mq_free %p\n", (ccwr_hdr_t*)(q->msg_pool + q->priv * q->msg_size)); +} +#endif + +#ifdef CCMSGMAGIC +{ + ccwr_hdr_t *m = (ccwr_hdr_t*) + (q->msg_pool + q->priv * q->msg_size); + m->magic = cpu_to_be32(~CCWR_MAGIC); +} +#endif + BUMP(q, q->priv); + /* Update peer's offset. */ + q->peer->shared = cpu_to_be16(q->priv); + } +} + + +void +c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + assert(q); + assert(q->magic == C2_MQ_MAGIC); + assert(q->type == C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + assert(!c2_mq_empty(q)); + BUMP_SHARED(q, *q->shared); + } +} + + +u32 +c2_mq_count(struct c2_mq *q) +{ + s32 count; + + assert(q); + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32)count; +} + +void +c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, + u32 msg_size, u8 *pool_start, u16 *peer, + u32 type) +{ + assert(q->shared); + + /* This code assumes the byte swapping has already been done! 
*/ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool = pool_start; + q->peer = (struct c2_mq_shared *)peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} + Index: hw/amso1100/cc_wr.h =================================================================== --- hw/amso1100/cc_wr.h (revision 0) +++ hw/amso1100/cc_wr.h (revision 0) @@ -0,0 +1,1340 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _CC_WR_H_ +#define _CC_WR_H_ +#include "cc_types.h" +/* + * WARNING: If you change this file, also bump CC_IVN_BASE + * in common/include/clustercore/cc_ivn.h. + */ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define WR_BUILD_STR_LEN 64 + +#ifdef _MSC_VER +#define PACKED +#pragma pack(push) +#pragma pack(1) +#define __inline__ __inline +#else +#define PACKED __attribute__ ((packed)) +#endif + +/* + * WARNING: All of these structs need to align any 64bit types on + * 64 bit boundaries! 64bit types include u64 and u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +typedef struct { + /* wqe_count is part of the cqe. It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. 
+ */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} PACKED ccwr_hdr_t; + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +typedef enum { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +} PACKED cc_rnic_flags_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u16 flags; /* See cc_rnic_flags_t */ + u16 port_num; +} PACKED ccwr_rnic_open_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_open_rep_t; + +typedef union { + ccwr_rnic_open_req_t req; + ccwr_rnic_open_rep_t rep; +} PACKED ccwr_rnic_open_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_query_req_t; + +/* + * WR_RNIC_QUERY + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; /* See cc_rnic_flags_t */ + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} PACKED ccwr_rnic_query_rep_t; + +typedef union { + ccwr_rnic_query_req_t req; + ccwr_rnic_query_rep_t rep; +} PACKED ccwr_rnic_query_t; + +/* + * WR_RNIC_GETCONFIG + */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 option; /* see cc_getconfig_cmd_t 
*/ + u64 reply_buf; + u32 reply_buf_len; +} PACKED ccwr_rnic_getconfig_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 option; /* see cc_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} PACKED ccwr_rnic_getconfig_rep_t; + +typedef union { + ccwr_rnic_getconfig_req_t req; + ccwr_rnic_getconfig_rep_t rep; +} PACKED ccwr_rnic_getconfig_t; + +/* + * WR_RNIC_SETCONFIG + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 option; /* See cc_setconfig_cmd_t */ + /* variable data and pad See cc_netaddr_t and + * cc_route_t + */ + u8 data[0]; +} PACKED ccwr_rnic_setconfig_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_rnic_setconfig_rep_t; + +typedef union { + ccwr_rnic_setconfig_req_t req; + ccwr_rnic_setconfig_rep_t rep; +} PACKED ccwr_rnic_setconfig_t; + +/* + * WR_RNIC_CLOSE + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_close_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_rnic_close_rep_t; + +typedef union { + ccwr_rnic_close_req_t req; + ccwr_rnic_close_rep_t rep; +} PACKED ccwr_rnic_close_t; + +/* + *------------------------ CQ ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} PACKED ccwr_cq_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} PACKED ccwr_cq_create_rep_t; + +typedef union { + ccwr_cq_create_req_t req; + ccwr_cq_create_rep_t rep; +} PACKED ccwr_cq_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} PACKED ccwr_cq_modify_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cq_modify_rep_t; + +typedef union { + ccwr_cq_modify_req_t req; + ccwr_cq_modify_rep_t rep; +} PACKED ccwr_cq_modify_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 cq_handle; +} PACKED 
ccwr_cq_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cq_destroy_rep_t; + +typedef union { + ccwr_cq_destroy_req_t req; + ccwr_cq_destroy_rep_t rep; +} PACKED ccwr_cq_destroy_t; + +/* + *------------------------ PD ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_pd_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_pd_alloc_rep_t; + +typedef union { + ccwr_pd_alloc_req_t req; + ccwr_pd_alloc_rep_t rep; +} PACKED ccwr_pd_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_pd_dealloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_pd_dealloc_rep_t; + +typedef union { + ccwr_pd_dealloc_req_t req; + ccwr_pd_dealloc_rep_t rep; +} PACKED ccwr_pd_dealloc_t; + +/* + *------------------------ SRQ ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} PACKED ccwr_srq_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} PACKED ccwr_srq_create_rep_t; + +typedef union { + ccwr_srq_create_req_t req; + ccwr_srq_create_rep_t rep; +} PACKED ccwr_srq_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 srq_handle; +} PACKED ccwr_srq_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_srq_destroy_rep_t; + +typedef union { + ccwr_srq_destroy_req_t req; + ccwr_srq_destroy_rep_t rep; +} PACKED ccwr_srq_destroy_t; + +/* + *------------------------ QP ------------------------ + */ +typedef enum { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? 
*/ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? */ +} PACKED ccwr_qp_flags_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see ccwr_qp_flags_t */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} PACKED ccwr_qp_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} PACKED ccwr_qp_create_rep_t; + +typedef union { + ccwr_qp_create_req_t req; + ccwr_qp_create_rep_t rep; +} PACKED ccwr_qp_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; +} PACKED ccwr_qp_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see ccwr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} PACKED ccwr_qp_query_rep_t; + +typedef union { + ccwr_qp_query_req_t req; + ccwr_qp_query_rep_t rep; +} PACKED ccwr_qp_query_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} PACKED ccwr_qp_modify_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} PACKED ccwr_qp_modify_rep_t; + +typedef union { + ccwr_qp_modify_req_t req; + ccwr_qp_modify_rep_t rep; +} PACKED ccwr_qp_modify_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; +} PACKED ccwr_qp_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_qp_destroy_rep_t; + +typedef union { + ccwr_qp_destroy_req_t req; + ccwr_qp_destroy_rep_t rep; +} PACKED ccwr_qp_destroy_t; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See ccwr_ae_active_connect_results_t + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} PACKED ccwr_qp_connect_req_t; + +typedef struct { + ccwr_qp_connect_req_t req; + /* no synchronous reply. 
*/ +} PACKED ccwr_qp_connect_t; + + +/* + *------------------------ MM ------------------------ + */ + +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ +} PACKED ccwr_nsmr_stag_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_stag_alloc_rep_t; + +typedef union { + ccwr_nsmr_stag_alloc_req_t req; + ccwr_nsmr_stag_alloc_rep_t rep; +} PACKED ccwr_nsmr_stag_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_register_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_register_rep_t; + +typedef union { + ccwr_nsmr_register_req_t req; + ccwr_nsmr_register_rep_t rep; +} PACKED ccwr_nsmr_register_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 flags; /* See ccwr_mr_flags_t */ + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_pbl_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_nsmr_pbl_rep_t; + +typedef union { + ccwr_nsmr_pbl_req_t req; + ccwr_nsmr_pbl_rep_t rep; +} PACKED ccwr_nsmr_pbl_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED ccwr_mr_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ + u32 pbl_depth; +} PACKED ccwr_mr_query_rep_t; + +typedef union { + ccwr_mr_query_req_t req; + ccwr_mr_query_rep_t rep; +} PACKED ccwr_mr_query_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED 
ccwr_mw_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ +} PACKED ccwr_mw_query_rep_t; + +typedef union { + ccwr_mw_query_req_t req; + ccwr_mw_query_rep_t rep; +} PACKED ccwr_mw_query_t; + + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED ccwr_stag_dealloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_stag_dealloc_rep_t; + +typedef union { + ccwr_stag_dealloc_req_t req; + ccwr_stag_dealloc_rep_t rep; +} PACKED ccwr_stag_dealloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_reregister_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_reregister_rep_t; + +typedef union { + ccwr_nsmr_reregister_req_t req; + ccwr_nsmr_reregister_rep_t rep; +} PACKED ccwr_nsmr_reregister_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} PACKED ccwr_smr_register_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 stag_index; +} PACKED ccwr_smr_register_rep_t; + +typedef union { + ccwr_smr_register_req_t req; + ccwr_smr_register_rep_t rep; +} PACKED ccwr_smr_register_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_mw_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 stag_index; +} PACKED ccwr_mw_alloc_rep_t; + +typedef union { + ccwr_mw_alloc_req_t req; + ccwr_mw_alloc_rep_t rep; +} PACKED ccwr_mw_alloc_t; + +/* + *------------------------ WRs ----------------------- + */ + +typedef struct { + ccwr_hdr_t hdr; /* Has status and WR Type */ +} PACKED 
ccwr_user_hdr_t; + +/* Completion queue entry. */ +typedef struct { + ccwr_hdr_t hdr; /* Has status and WR Type */ + u64 qp_user_context;/* cc_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} PACKED ccwr_ce_t; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the ccwr_hdr_t (eight bits). + */ +typedef enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +} PACKED cc_sq_flags_t; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +typedef struct { + ccwr_user_hdr_t user_hdr; +} PACKED cc_sq_hdr_t; + +/* + * Same as above but for post-rq WRs. + */ +typedef struct { + ccwr_user_hdr_t user_hdr; +} PACKED cc_rq_hdr_t; + +/* + * use the same struct for all sends. + */ +typedef struct { + cc_sq_hdr_t sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, ccwr_send_se_inv_req_t; + +typedef ccwr_ce_t ccwr_send_rep_t; + +typedef union { + ccwr_send_req_t req; + ccwr_send_rep_t rep; +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, ccwr_send_se_inv_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} PACKED ccwr_rdma_write_req_t; + +typedef ccwr_ce_t ccwr_rdma_write_rep_t; + +typedef union { + ccwr_rdma_write_req_t req; + ccwr_rdma_write_rep_t rep; +} PACKED ccwr_rdma_write_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t; + +typedef ccwr_ce_t ccwr_rdma_read_rep_t; + +typedef union { + ccwr_rdma_read_req_t req; + ccwr_rdma_read_rep_t rep; +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 
va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; /* see ccwr_mr_flags_t */ +} PACKED ccwr_mw_bind_req_t; + +typedef ccwr_ce_t ccwr_mw_bind_rep_t; + +typedef union { + ccwr_mw_bind_req_t req; + ccwr_mw_bind_rep_t rep; +} PACKED ccwr_mw_bind_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_fastreg_req_t; + +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t; + +typedef union { + ccwr_nsmr_fastreg_req_t req; + ccwr_nsmr_fastreg_rep_t rep; +} PACKED ccwr_nsmr_fastreg_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} PACKED ccwr_stag_invalidate_req_t; + +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t; + +typedef union { + ccwr_stag_invalidate_req_t req; + ccwr_stag_invalidate_rep_t rep; +} PACKED ccwr_stag_invalidate_t; + +typedef union { + cc_sq_hdr_t sq_hdr; + ccwr_send_req_t send; + ccwr_send_se_req_t send_se; + ccwr_send_inv_req_t send_inv; + ccwr_send_se_inv_req_t send_se_inv; + ccwr_rdma_write_req_t rdma_write; + ccwr_rdma_read_req_t rdma_read; + ccwr_mw_bind_req_t mw_bind; + ccwr_nsmr_fastreg_req_t nsmr_fastreg; + ccwr_stag_invalidate_req_t stag_inv; +} PACKED ccwr_sqwr_t; + + +/* + * RQ WRs + */ +typedef struct { + cc_rq_hdr_t rq_hdr; + u8 data[0]; /* array of SGEs */ +} PACKED ccwr_rqwr_t, ccwr_recv_req_t; + +typedef ccwr_ce_t ccwr_recv_rep_t; + +typedef union { + ccwr_recv_req_t req; + ccwr_recv_rep_t rep; +} PACKED ccwr_recv_t; + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef ccwr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR.
NULL if this + * is not affiliated with an rnic + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, CC_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see cc_resource_indicator_t */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} PACKED ccwr_ae_hdr_t; + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state. + */ +typedef struct { + ccwr_ae_hdr_t ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} PACKED ccwr_ae_active_connect_results_t; + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host. + * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +typedef struct { + ccwr_ae_hdr_t ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg.
*/ +} PACKED ccwr_ae_connection_request_t; + +typedef union { + ccwr_ae_hdr_t ae_generic; + ccwr_ae_active_connect_results_t ae_active_connect_results; + ccwr_ae_connection_request_t ae_connection_request; +} PACKED ccwr_ae_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} PACKED ccwr_init_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_init_rep_t; + +typedef union { + ccwr_init_req_t req; + ccwr_init_rep_t rep; +} PACKED ccwr_init_t; + +/* + * For upgrading flash. + */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_flash_init_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} PACKED ccwr_flash_init_rep_t; + +typedef union { + ccwr_flash_init_req_t req; + ccwr_flash_init_rep_t rep; +} PACKED ccwr_flash_init_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 len; +} PACKED ccwr_flash_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 status; +} PACKED ccwr_flash_rep_t; + +typedef union { + ccwr_flash_req_t req; + ccwr_flash_rep_t rep; +} PACKED ccwr_flash_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 size; +} PACKED ccwr_buf_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} PACKED ccwr_buf_alloc_rep_t; + +typedef union { + ccwr_buf_alloc_req_t req; + ccwr_buf_alloc_rep_t rep; +} PACKED ccwr_buf_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} PACKED ccwr_buf_free_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_buf_free_rep_t; + +typedef union { + ccwr_buf_free_req_t req; + ccwr_buf_free_rep_t rep; +} PACKED ccwr_buf_free_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 offset; + u32 size; + 
u32 type; + u32 flags; +} PACKED ccwr_flash_write_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 status; +} PACKED ccwr_flash_write_rep_t; + +typedef union { + ccwr_flash_write_req_t req; + ccwr_flash_write_rep_t rep; +} PACKED ccwr_flash_write_t; + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. Newly established LLP connections are passed up + * via an AE. See ccwr_ae_connection_request_t + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional tcp listen bl */ +} PACKED ccwr_ep_listen_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} PACKED ccwr_ep_listen_create_rep_t; + +typedef union { + ccwr_ep_listen_create_req_t req; + ccwr_ep_listen_create_rep_t rep; +} PACKED ccwr_ep_listen_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; +} PACKED ccwr_ep_listen_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_ep_listen_destroy_rep_t; + +typedef union { + ccwr_ep_listen_destroy_req_t req; + ccwr_ep_listen_destroy_rep_t rep; +} PACKED ccwr_ep_listen_destroy_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; +} PACKED ccwr_ep_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} PACKED ccwr_ep_query_rep_t; + +typedef union { + ccwr_ep_query_req_t req; + ccwr_ep_query_rep_t rep; +} PACKED ccwr_ep_query_t; + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See ccwr_ae_connection_request_t.
+ */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg. */ +} PACKED ccwr_cr_accept_req_t; + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cr_accept_rep_t; + +typedef union { + ccwr_cr_accept_req_t req; + ccwr_cr_accept_rep_t rep; +} PACKED ccwr_cr_accept_t; + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous ccwr_ae_connection_request_t AE sent by the adapter. + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} PACKED ccwr_cr_reject_req_t; + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cr_reject_rep_t; + +typedef union { + ccwr_cr_reject_req_t req; + ccwr_cr_reject_rep_t rep; +} PACKED ccwr_cr_reject_t; + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +typedef struct { + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} PACKED ccwr_console_req_t; + +/* + * flags used in the console reply. + */ +typedef enum { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} PACKED cc_console_flags_t; + +/* + * Console reply message. 
+ * hdr.result contains the cc_status_t error if the reply was _not_ generated, + * or CC_OK if the reply was generated. + */ +typedef struct { + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ + u32 flags; /* see cc_console_flags_t */ +} PACKED ccwr_console_rep_t; + +typedef union { + ccwr_console_req_t req; + ccwr_console_rep_t rep; +} PACKED ccwr_console_t; + + +/* + * Giant union with all WRs. Makes life easier... + */ +typedef union { + ccwr_hdr_t hdr; + ccwr_user_hdr_t user_hdr; + ccwr_rnic_open_t rnic_open; + ccwr_rnic_query_t rnic_query; + ccwr_rnic_getconfig_t rnic_getconfig; + ccwr_rnic_setconfig_t rnic_setconfig; + ccwr_rnic_close_t rnic_close; + ccwr_cq_create_t cq_create; + ccwr_cq_modify_t cq_modify; + ccwr_cq_destroy_t cq_destroy; + ccwr_pd_alloc_t pd_alloc; + ccwr_pd_dealloc_t pd_dealloc; + ccwr_srq_create_t srq_create; + ccwr_srq_destroy_t srq_destroy; + ccwr_qp_create_t qp_create; + ccwr_qp_query_t qp_query; + ccwr_qp_modify_t qp_modify; + ccwr_qp_destroy_t qp_destroy; + ccwr_qp_connect_t qp_connect; + ccwr_nsmr_stag_alloc_t nsmr_stag_alloc; + ccwr_nsmr_register_t nsmr_register; + ccwr_nsmr_pbl_t nsmr_pbl; + ccwr_mr_query_t mr_query; + ccwr_mw_query_t mw_query; + ccwr_stag_dealloc_t stag_dealloc; + ccwr_sqwr_t sqwr; + ccwr_rqwr_t rqwr; + ccwr_ce_t ce; + ccwr_ae_t ae; + ccwr_init_t init; + ccwr_ep_listen_create_t ep_listen_create; + ccwr_ep_listen_destroy_t ep_listen_destroy; + ccwr_cr_accept_t cr_accept; + ccwr_cr_reject_t cr_reject; + ccwr_console_t console; + ccwr_flash_init_t flash_init; + ccwr_flash_t flash; + ccwr_buf_alloc_t buf_alloc; + ccwr_buf_free_t buf_free; + ccwr_flash_write_t flash_write; +} PACKED ccwr_t; + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types + * in the ccwr_t union can be passed in. 
+ */ +static __inline__ u8 +cc_wr_get_id(void *wr) +{ + return ((ccwr_hdr_t *)wr)->id; +} +static __inline__ void +c2_wr_set_id(void *wr, u8 id) +{ + ((ccwr_hdr_t *)wr)->id = id; +} +static __inline__ u8 +cc_wr_get_result(void *wr) +{ + return ((ccwr_hdr_t *)wr)->result; +} +static __inline__ void +cc_wr_set_result(void *wr, u8 result) +{ + ((ccwr_hdr_t *)wr)->result = result; +} +static __inline__ u8 +cc_wr_get_flags(void *wr) +{ + return ((ccwr_hdr_t *)wr)->flags; +} +static __inline__ void +cc_wr_set_flags(void *wr, u8 flags) +{ + ((ccwr_hdr_t *)wr)->flags = flags; +} +static __inline__ u8 +cc_wr_get_sge_count(void *wr) +{ + return ((ccwr_hdr_t *)wr)->sge_count; +} +static __inline__ void +cc_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((ccwr_hdr_t *)wr)->sge_count = sge_count; +} +static __inline__ u32 +cc_wr_get_wqe_count(void *wr) +{ + return ((ccwr_hdr_t *)wr)->wqe_count; +} +static __inline__ void +cc_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((ccwr_hdr_t *)wr)->wqe_count = wqe_count; +} + +#undef PACKED + +#ifdef _MSC_VER +#pragma pack(pop) +#endif + +#endif /* _CC_WR_H_ */ Index: hw/amso1100/c2.c =================================================================== --- hw/amso1100/c2.c (revision 0) +++ hw/amso1100/c2.c (revision 0) @@ -0,0 +1,1221 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +char *rnic_ip_addr = "192.168.69.169"; +module_param(rnic_ip_addr, charp, S_IRUGO); +MODULE_PARM_DESC(rnic_ip_addr, "IP Address for the AMSO1100 Adapter"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats* c2_get_stats(struct net_device *netdev); + +extern void c2_rnic_interrupt(struct c2_dev *c2dev); + +static struct pci_device_id c2_pci_table[] = { + { 0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID }, + { 0 } +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + dprintk(KERN_INFO PFX "%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) 
+{ + struct net_device *netdev = c2_port->netdev; + + assert(netdev != NULL); + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, dma_addr_t base, + void __iomem *mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem)*tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + for (i = 0, elem = tx_ring->start, tx_desc = vaddr, txp_desc = mmio_txp_ring; + i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) + { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + c2_write64((void *)txp_desc + C2_TXP_ADDR, cpu_to_be64(0x1122334455667788ULL)); + c2_write16((void *)txp_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write16((void *)txp_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = base + (i + 1) * sizeof(*tx_desc); + } + } + + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
+ */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, dma_addr_t base, + void __iomem *mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + for (i = 0, elem = rx_ring->start, rx_desc = vaddr, rxp_desc = mmio_rxp_ring; + i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) + { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + c2_write16((void *)rxp_desc + C2_RXP_STATUS, cpu_to_be16(RXP_HRXD_OK)); + c2_write16((void *)rxp_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16((void *)rxp_desc + C2_RXP_LEN, cpu_to_be16(0)); + c2_write64((void *)rxp_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); + c2_write16((void *)rxp_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + dprintk(KERN_ERR PFX "%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_FROMDEVICE); + + /* Set the sk_buff 
RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + /* c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); */ + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)maplen - sizeof(*rxp_hdr))); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; + } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + 
struct c2_tx_desc *tx_desc = elem->ht_desc; + + tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = c2_read16(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_DONE)); + c2_port->netstats.tx_dropped++; + break; + } else { + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0x1122334455667788ULL)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for(elem = tx_ring->to_clean; elem != tx_ring->to_use; elem = elem->next) + { + txp_htxd.flags = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + //txp_htxd.addr = be64_to_cpu(c2_read64(elem->hw_desc + C2_TXP_ADDR)); + txp_htxd.len = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_LEN)); + dprintk(KERN_INFO PFX + "%s: tx done slot %3Zu status 0x%x len %5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + dprintk(KERN_ERR PFX "BAD RXP_HRXD\n"); + dprintk(KERN_ERR PFX " rx_desc : %p\n", rx_desc); + dprintk(KERN_ERR PFX " index : %Zu\n", elem - c2_port->rx_ring.start); + dprintk(KERN_ERR PFX " len : %u\n", rx_desc->len); + dprintk(KERN_ERR PFX " rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *)__pa((unsigned long)rxp_hdr)); + dprintk(KERN_ERR PFX " flags : 0x%x\n", rxp_hdr->flags); + dprintk(KERN_ERR PFX " status: 0x%x\n", rxp_hdr->status); + dprintk(KERN_ERR PFX " len : %u\n", rxp_hdr->len); + dprintk(KERN_ERR PFX " rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're 
dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the descriptor to the adapter's rx ring */ + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)elem->maplen - sizeof(*rxp_hdr))); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(elem->mapaddr)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + + dprintk(KERN_INFO PFX "packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for(elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; elem = elem->next) + { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *)skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + + if (netif_msg_rx_status(c2_port)) + dprintk(KERN_INFO PFX "%s: rx done slot %3Zu status 0x%x len %5u bytes\n", + netdev->name, elem - rx_ring->start, + rxp_hdr->flags, rxp_hdr->len); + + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Allocate and map a new skb for replenishing the host RX desc */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + 
/* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, PCI_DMA_FROMDEVICE); + + /* + * Skip past the leading 8 bytes comprising the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. + * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct c2_rxp_hdr" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *)dev_id; + + assert(c2dev != NULL); + + /* Process CCILNET interrupts */ + netisr0 = c2_read32(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. 
+ */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + c2_write32(c2dev->regs + C2_NISR0, netisr0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = c2_read32(c2dev->regs + C2_DISR); + if (dmaisr) { + c2_write32(c2dev->regs + C2_DISR, dmaisr); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + assert(c2dev != NULL); + + if (netif_msg_ifup(c2_port)) + dprintk(KERN_INFO PFX "%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + dprintk(KERN_ERR PFX "Unable to allocate memory for host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + dprintk(KERN_ERR PFX "Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + dprintk(KERN_ERR PFX "Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, c2dev->mmio_txp_ring))) { + dprintk(KERN_ERR PFX "Unable to create TX 
ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for(i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) + { + rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; + rxp_hdr->flags = 0; + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + c2_write32(c2dev->regs + C2_IDIS, 0); + netimr0 = c2_read32(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + c2_write32(c2dev->regs + C2_NIMR0, netimr0); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + dprintk(KERN_INFO PFX "%s: disabling interface\n", netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + c2_write32(c2dev->regs + C2_IDIS, 1); + c2_write32(c2dev->regs + C2_NIMR0, 0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + 
c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx|C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. + */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + dprintk(KERN_ERR PFX "c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + dprintk("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + dprintk(KERN_WARNING PFX "%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru 
additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) + { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = pci_map_page(c2dev->pcidev, frag->page, frag->page_offset, + maplen, PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (netif_msg_tx_queued(c2_port)) + dprintk(KERN_DEBUG PFX "%s: tx queued, slot %3Zu, len %5u bytes, avail = %u\n", + netdev->name, elem - tx_ring->start, maplen, c2_port->tx_avail); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + dprintk(KERN_INFO PFX "%s: transmit queue full\n", netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static int c2_set_mac_address(struct net_device *netdev, void *p) +{ + return -1; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + dprintk(KERN_DEBUG PFX "%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + 
c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, void __iomem *mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + dprintk(KERN_ERR PFX "c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->set_mac_address = c2_set_mac_address; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ +#if 1 + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); +#else + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_RDMA_ENADDR, 6); +#endif + /* Validate the MAC address */ + if(!is_valid_ether_addr(netdev->dev_addr)) { + dprintk(KERN_ERR PFX "Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + assert(pcidev != NULL); + assert(ent != NULL); + + dprintk(KERN_INFO PFX "AMSO1100 
Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + dprintk(KERN_ERR PFX "%s: Unable to enable PCI device\n", pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + dprintk(KERN_INFO PFX "BAR0 size = 0x%lX bytes\n", reg0_len); + dprintk(KERN_INFO PFX "BAR2 size = 0x%lX bytes\n", reg2_len); + dprintk(KERN_INFO PFX "BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || + !(reg4_flags & IORESOURCE_MEM)) { + dprintk (KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = -ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || + (reg4_len < C2_REG4_SIZE)) { + dprintk (KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + dprintk(KERN_ERR PFX "%s: Unable to request regions\n", pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + dprintk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + dprintk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI 
registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + dprintk(KERN_ERR PFX "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) + { + if (c2_magic[i] != c2_read8(mmio_regs + C2_REGS_MAGIC + i)) { + dprintk(KERN_ERR PFX + "Invalid PCI regs magic [%d/%Zd: got 0x%x, exp 0x%x]\n", + i + 1, sizeof(c2_magic), + c2_read8(mmio_regs + C2_REGS_MAGIC + i), c2_magic[i]); + dprintk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + dprintk(KERN_ERR PFX "Version mismatch [fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + dprintk(KERN_ERR PFX "IVN mismatch [fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev*)ib_alloc_device(sizeof *c2dev); + if (!c2dev) { + dprintk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_HRX_CUR)) - 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + dprintk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", 
+ pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + dprintk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + dprintk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + dprintk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + dprintk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ + c2dev->pa = (void *)(reg4_start + C2_PCI_REGS_OFFSET); + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, kva_map_size); + if (c2dev->kva == 0UL) { + dprintk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + 
c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + dprintk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + assert(netdev != NULL); + + /* Unregister with OpenIB */ + ib_unregister_device(&c2dev->ibdev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); 
+module_exit(c2_exit_module); Index: hw/amso1100/c2_qp.c =================================================================== --- hw/amso1100/c2_qp.c (revision 0) +++ hw/amso1100/c2_qp.c (revision 0) @@ -0,0 +1,840 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include "c2.h" +#include "c2_vq.h" +#include "cc_status.h" + +#define C2_MAX_ORD_PER_QP 128 +#define C2_MAX_IRD_PER_QP 128 + +#define CC_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define CC_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define CC_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +#define NO_SUPPORT -1 +static const u8 c2_opcode[] = { + [IB_WR_SEND] = CC_WR_TYPE_SEND, + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_WRITE] = CC_WR_TYPE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_READ] = CC_WR_TYPE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, +}; + +void c2_qp_event(struct c2_dev *c2dev, u32 qpn, + enum ib_event_type event_type) +{ + struct c2_qp *qp; + struct ib_event event; + + spin_lock(&c2dev->qp_table.lock); + qp = c2_array_get(&c2dev->qp_table.qp, qpn & (c2dev->max_qp - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&c2dev->qp_table.lock); + + if (!qp) { + dprintk("Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &c2dev->ibdev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_c2_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return C2_QP_STATE_IDLE; + case IB_QPS_RTS: return C2_QP_STATE_RTS; + case IB_QPS_SQD: return C2_QP_STATE_CLOSING; + case IB_QPS_SQE: return C2_QP_STATE_CLOSING; + case IB_QPS_ERR: return C2_QP_STATE_ERROR; + default: return -1; + } +} + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, 
int attr_mask) +{ + ccwr_qp_modify_req_t wr; + ccwr_qp_modify_rep_t *reply; + struct c2_vq_req *vq_req; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + if (attr_mask & IB_QP_STATE) { + + /* Ensure the state is valid */ + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); + + } else if (attr_mask & IB_QP_CUR_STATE) { + + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->cur_qp_state)); + } else { + err = 0; + goto bail0; + } + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (ccwr_t *)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (ccwr_qp_modify_rep_t *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int destroy_qp(struct c2_dev *c2dev, + struct c2_qp *qp) +{ + struct c2_vq_req *vq_req; + ccwr_qp_destroy_req_t wr; + ccwr_qp_destroy_rep_t *reply; + int err; + + /* + * Allocate a verb request message + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Initialize the WR + */ + c2_wr_set_id(&wr, CCWR_QP_DESTROY); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = 
c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_qp_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ( (err = c2_errno(reply)) != 0) { + // XXX print error + } + + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_alloc_qp(struct c2_dev *c2dev, + struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, + struct c2_qp *qp) +{ + ccwr_qp_create_req_t wr; + ccwr_qp_create_rep_t *reply; + struct c2_vq_req *vq_req; + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void *mmap; + int err; + + qp->qpn = c2_alloc(&c2dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr+1); + wr.rq_depth = 
cpu_to_be32(qp_attrs->cap.max_recv_wr+1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); // XXX no write depth? + wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared)); + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared)); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long)qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (ccwr_qp_create_rep_t*)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ( (err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->sq_mq_start)); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* 
Initialize the RQ MQ */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->rq_mq_start)); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_init(&qp->rq_mq, + be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_irq(&c2dev->qp_table.lock); + c2_array_set(&c2dev->qp_table.qp, + qp->qpn & (c2dev->max_qp - 1), qp); + spin_unlock_irq(&c2dev->qp_table.lock); + + return 0; + +bail6: + iounmap(qp->sq_mq.peer); +bail5: + destroy_qp(c2dev, qp); +bail4: + vq_repbuf_free(c2dev, reply); +bail3: + vq_req_free(c2dev, vq_req); +bail2: + c2_free_mqsp(qp->rq_mq.shared); +bail1: + c2_free_mqsp(qp->sq_mq.shared); +bail0: + c2_free(&c2dev->qp_table.alloc, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, + struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&c2dev->qp_table.lock); + c2_array_clear(&c2dev->qp_table.qp, + qp->qpn & (c2dev->max_qp - 1)); + spin_unlock(&c2dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + /* + * Destroy qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. 
+ */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. + */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + c2_free(&c2dev->qp_table.alloc, qp->qpn); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensure the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumers SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(cc_data_addr_t *dst, struct ib_sge *src, int count, u32 *p_len, u8 *actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. + * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot+src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. 
+ */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. + * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void +c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPU from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. [TOT] + */ + while (c2_read32(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + c2_write32(c2dev->regs + PCI_BAR0_ADAPTER_HINT, CC_HINT_MAKE(mq_index, shared)); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates a MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +qp_wr_post(struct c2_mq *q, ccwr_t *wr, struct c2_qp *qp, u32 size) +{ + ccwr_t *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } + +#ifdef CCMSGMAGIC + ((ccwr_hdr_t *)wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. 
+ */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *)msg, (void *)wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + ccwr_t wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, CC_WR_TYPE_SEND_SE); + msg_size = sizeof(ccwr_send_se_req_t); + } else { + c2_wr_set_id(&wr, CC_WR_TYPE_SEND); + msg_size = sizeof(ccwr_send_req_t); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(cc_data_addr_t) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((cc_data_addr_t*)&(wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, + &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(ccwr_rdma_write_req_t) + + (sizeof(cc_data_addr_t) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((cc_data_addr_t*) + &(wr.sqwr.rdma_write.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, + &actual_sge_count); + 
wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_READ); + msg_size = sizeof(ccwr_rdma_read_req_t); + + /* iWARP only supports 1 SGE for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! + */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. 
+ */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + ccwr_t wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + assert(ib_wr->num_sge < 256); + err = move_sgl((cc_data_addr_t*)&(wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, + &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... 
+ */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + int err; + + spin_lock_init(&c2dev->qp_table.lock); + + err = c2_alloc_init(&c2dev->qp_table.alloc, + c2dev->max_qp, + 0); + if (err) + return err; + + err = c2_array_init(&c2dev->qp_table.qp, + c2dev->max_qp); + if (err) { + c2_alloc_cleanup(&c2dev->qp_table.alloc); + return err; + } + + return 0; +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + c2_alloc_cleanup(&c2dev->qp_table.alloc); +} Index: hw/amso1100/cc_ivn.h =================================================================== --- hw/amso1100/cc_ivn.h (revision 0) +++ hw/amso1100/cc_ivn.h (revision 0) @@ -0,0 +1,57 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CC_IVN_H_ +#define _CC_IVN_H_ + +/* + * The following value must be incremented each time structures shared + * between the firmware and host drivers are changed. This includes + * structures, types, and the maximum number of queue pairs. + */ +#define CC_IVN_BASE 18 + +/* Used to mask off the CCMSGMAGIC bit */ +#define CC_IVN_MASK 0x7fffffff + + +/* + * The high order bit indicates a CCMSGMAGIC build, which changes the + * adapter<->host message formats. + */ +#ifdef CCMSGMAGIC +#define CC_IVN (CC_IVN_BASE | 0x80000000) +#else +#define CC_IVN (CC_IVN_BASE & 0x7fffffff) +#endif + +#endif /* _CC_IVN_H_ */ Index: hw/amso1100/c2_mq.h =================================================================== --- hw/amso1100/c2_mq.h (revision 0) +++ hw/amso1100/c2_mq.h (revision 0) @@ -0,0 +1,104 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64-sizeof(u16)-2*sizeof(u8)-sizeof(u32)-sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + u8* msg_pool; + u16 hint_count; + u16 priv; + struct c2_mq_shared *peer; + u16* shared; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) + +static __inline__ int +c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int +c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size-1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void * c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void * c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, + u32 msg_size, u8 *pool_start, + u16 *peer, u32 type); + +#endif /* _C2_MQ_H_ */ Index: hw/amso1100/c2_user.h =================================================================== --- hw/amso1100/c2_user.h (revision 0) +++ hw/amso1100/c2_user.h (revision 0) @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ Index: hw/amso1100/c2_ae.c =================================================================== --- hw/amso1100/c2_ae.c (revision 0) +++ hw/amso1100/c2_ae.c (revision 0) @@ -0,0 +1,216 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include +#include "cc_status.h" +#include "cc_ae.h" + +static int c2_convert_cm_status(u32 cc_status) +{ + switch (cc_status) { + case CC_CONN_STATUS_SUCCESS: + return 0; + case CC_CONN_STATUS_REJECTED: + return -ENETRESET; + case CC_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case CC_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case CC_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case CC_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case CC_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case CC_CONN_STATUS_INVALID_QP: + return -EINVAL; + case CC_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + default: + panic("Unable to convert CM status: %d\n", cc_status); + break; + } +} + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + ccwr_t *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + cc_resource_indicator_t resource_indicator; + cc_event_id_t event_id; + u8 *pdata = NULL; + + /* + * retrieve the message + */ + wr = c2_mq_consume(mq); + if (!wr) + return; + + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = (void *)(unsigned long)wr->ae.ae_generic.user_context; + + cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + switch (resource_indicator) { + case CC_RES_IND_QP: { + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = + wr->ae.ae_active_connect_results.laddr; + 
cm_event.remote_addr.sin_addr.s_addr = + wr->ae.ae_active_connect_results.raddr; + cm_event.local_addr.sin_port = + wr->ae.ae_active_connect_results.lport; + cm_event.remote_addr.sin_port = + wr->ae.ae_active_connect_results.rport; + cm_event.private_data_len = + be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); + + if (cm_event.private_data_len) { + /* XXX */ + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + dprintk("Ignored connect request -- no memory for pdata " + "private_data_len=%d\n", cm_event.private_data_len); + goto ignore_it; + } + + memcpy(pdata, + wr->ae.ae_active_connect_results.private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (qp->cm_id->event_handler) + qp->cm_id->event_handler(qp->cm_id, &cm_event); + + break; + + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if (qp->ibqp.event_handler) + (*qp->ibqp.event_handler)(&ib_event, + qp->ibqp.qp_context); + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + default: + cm_event.event = IW_CM_EVENT_CLOSE; + if (qp->cm_id->event_handler) + qp->cm_id->event_handler(qp->cm_id, &cm_event); + + } + break; + } + + case CC_RES_IND_EP: { + + struct iw_cm_id* cm_id = (struct iw_cm_id*)resource_user_context; + + dprintk("CC_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + dprintk("%s: Invalid event_id: %d\n", __FUNCTION__, event_id); + break; + } + + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_id = + wr->ae.ae_connection_request.cr_handle; + cm_event.local_addr.sin_addr.s_addr = + wr->ae.ae_connection_request.laddr; + cm_event.remote_addr.sin_addr.s_addr = + wr->ae.ae_connection_request.raddr; + 
cm_event.local_addr.sin_port = + wr->ae.ae_connection_request.lport; + cm_event.remote_addr.sin_port = + wr->ae.ae_connection_request.rport; + cm_event.private_data_len = + be32_to_cpu(wr->ae.ae_connection_request.private_data_length); + + if (cm_event.private_data_len) { + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + dprintk("Ignored connect request -- no memory for pdata " + "private_data_len=%d\n", cm_event.private_data_len); + goto ignore_it; + } + memcpy(pdata, + wr->ae.ae_connection_request.private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case CC_RES_IND_CQ: { + struct c2_cq *cq = (struct c2_cq *)resource_user_context; + + dprintk("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, cq->ibcq.cq_context); + } + + default: + break; + } + + ignore_it: + c2_mq_free(mq); +} Index: hw/amso1100/c2.h =================================================================== --- hw/amso1100/c2.h (revision 0) +++ hw/amso1100/c2.h (revision 0) @@ -0,0 +1,617 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef __C2_H +#define __C2_H + +#include +#include +#include +#include +#include +#include + +#include "c2_provider.h" +#include "c2_mq.h" +#include "cc_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#ifdef C2_DEBUG +#define assert(expr) \ + if(!(expr)) { \ + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ + #expr, __FILE__, __FUNCTION__, __LINE__); \ + } +#define dprintk(fmt, args...) do {printk(KERN_INFO PFX fmt, ##args);} while (0) +#else +#define assert(expr) do {} while (0) +#define dprintk(fmt, args...) 
do {} while (0) +#endif /* C2_DEBUG */ + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 + }; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 fw_rxd_cur; +}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; 
+ +enum { + C2_PCI_HRX_INT = 1<<8, + C2_PCI_HTX_INT = 1<<17, + C2_PCI_HRX_QUI = 1<<31, +}; + +/* + * Cepheus registers in BAR0. + */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1<<0, + TXP_HTXD_UNINIT = 1<<1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1<<0, + RXP_HRXD_DONE = 1<<1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1<<0, + RXP_HRXD_BUF_OV = 1<<1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. 
+ * + */ +struct sp_chunk { + struct sp_chunk* next; + u32 gfp_mask; + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + struct c2_alloc alloc; + struct c2_array pd; +}; + +struct c2_qp_table { + struct c2_alloc alloc; + u32 rdb_base; + int rdb_shift; + int sqp_start; + spinlock_t lock; + struct c2_array qp; + struct c2_icm_table *qp_table; + struct c2_icm_table *eqp_table; + struct c2_icm_table *rdb_table; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u64 fw_ver; + u32 adapter_handle; + u32 hw_rev; + u32 device_cap_flags; + u32 vendor_id; + u32 vendor_part_id; + void __iomem *kva; /* KVA device memory */ + void __iomem *pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t* host_msg_cache; + //kmem_cache_t* ae_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* RNIC Limits */ + u32 max_mr; + u32 max_mr_size; + u32 max_qp; + u32 max_qp_wr; + u32 max_sge; + u32 max_cq; + u32 max_cqe; + u32 max_pd; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; +#if 0 + struct c2_mr_table mr_table; +#endif + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client 
MQs */ + struct sp_chunk* kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + u16 aeq_shared; + u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. + */ + int hthead; /* index of first free entry */ + void* htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void* htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + spinlock_t aeq_lock; + spinlock_t rnic_lock; + + + u16 hint_count; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the cc_adapter_pci_regs_t + * struct. 
+ */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem *addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef writeq +static inline void writeq(u64 val, void __iomem *addr) +{ + writel((u32) (val), addr); + writel((u32) (val >> 32), (addr + 4)); +} +#endif + +/* Read from memory-mapped device */ +static inline u64 c2_read64(const void __iomem *addr) +{ + return readq(addr); +} + +static inline u32 c2_read32(const void __iomem *addr) +{ + return readl(addr); +} + +static inline u16 c2_read16(const void __iomem *addr) +{ + return readw(addr); +} + +static inline u8 c2_read8(const void __iomem *addr) +{ + return readb(addr); +} + +/* Write to memory-mapped device */ +static inline void c2_write64(void __iomem *addr, u64 val) +{ + writeq(val, addr); +} + +static inline void c2_write32(void __iomem *addr, u32 val) +{ + writel(val, addr); +} + +static inline void c2_write16(void __iomem *addr, u16 val) +{ + writew(val, addr); +} + +static inline void c2_write8(void __iomem *addr, u8 val) +{ + writeb(val, addr); +} + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + c2_write32(c2dev->mmio_txp_ring + 4092, cpu_to_be32(cur_rx)) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(c2_read32(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device* ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch(c2_wr_get_result(reply)) { + case CC_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + 
return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev* c2dev); +extern void c2_rnic_term(struct c2_dev* c2dev); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern int __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, struct c2_ucontext *ctx, + struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq 
*ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id* cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id* cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, + int pbl_depth, u32 length, u64 *va, + cc_acf_t acf, struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* Allocators */ +extern u32 c2_alloc(struct c2_alloc *alloc); +extern void c2_free(struct c2_alloc *alloc, u32 obj); +extern int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved); +extern void c2_alloc_cleanup(struct c2_alloc *alloc); +extern int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** root); +extern void c2_free_mqsp_pool(struct sp_chunk* root); +extern u16* c2_alloc_mqsp(struct sp_chunk* head); +extern void c2_free_mqsp(u16* mqsp); +extern int c2_array_init(struct c2_array *array, int nent); +extern void c2_array_clear(struct c2_array *array, int index); +extern int c2_array_set(struct c2_array *array, int index, void *value); +extern void *c2_array_get(struct c2_array *array, int index); + +#endif + Index: hw/amso1100/c2_vq.c =================================================================== --- hw/amso1100/c2_vq.c (revision 0) +++ hw/amso1100/c2_vq.c (revision 0) @@ -0,0 +1,272 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. + * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. 
+ *
+ *
+ * The kernel verbs handlers do a "get" to put a 2nd reference on the
+ * VQ Request object. If the kernel verbs handler exits before the adapter
+ * can respond, this extra reference will keep the VQ Request object around
+ * until the adapter's reply can be processed. The reason we need this is
+ * because a pointer to this object is stuffed into the context field of
+ * the verbs work request message, and reflected back in the reply message.
+ * It is used in the interrupt handler (handle_vq()) to wake up the appropriate
+ * kernel verb handler that is blocked awaiting the verb reply.
+ * So handle_vq() will do a "put" on the object when it's done accessing it.
+ * NOTE: If we guarantee that the kernel verb handler will never bail before
+ * getting the reply, then we don't need these refcnts.
+ *
+ *
+ * VQ Request objects are freed by the kernel verbs handlers only
+ * after the verb has been processed, or when the adapter fails and
+ * does not reply.
+ *
+ *
+ * Verbs Reply Buffers:
+ *
+ * VQ Reply bufs are local host memory copies of an outstanding Verb Request reply
+ * message. They are always allocated by the kernel verbs handlers, and _may_ be
+ * freed by either the kernel verbs handler -or- the interrupt handler. The
+ * kernel verbs handler _must_ free the repbuf, then free the vq request object
+ * in that order.
+ */
+
+int
+vq_init(struct c2_dev* c2dev)
+{
+	sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", (char ) ('0' + c2dev->devnum));
+	c2dev->host_msg_cache = kmem_cache_create(c2dev->vq_cache_name,
+						   c2dev->rep_vq.msg_size, 0,
+						   SLAB_HWCACHE_ALIGN, NULL, NULL);
+	if (c2dev->host_msg_cache == NULL) {
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+void
+vq_term(struct c2_dev* c2dev)
+{
+	kmem_cache_destroy(c2dev->host_msg_cache);
+}
+
+/* vq_req_alloc - allocate a VQ Request Object and initialize it.
+ * The refcnt is set to 1.
+ */
+struct c2_vq_req *
+vq_req_alloc(struct c2_dev *c2dev)
+{
+	struct c2_vq_req *r;
+
+	r = (struct c2_vq_req *)kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL);
+	if (r) {
+		init_waitqueue_head(&r->wait_object);
+		r->reply_msg = (u64)NULL;
+		atomic_set(&r->refcnt, 1);
+		atomic_set(&r->reply_ready, 0);
+	}
+	return r;
+}
+
+
+/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler
+ * has already freed the VQ Reply Buffer if it existed.
+ */
+void
+vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r)
+{
+	r->reply_msg = (u64)NULL;
+	if (atomic_dec_and_test(&r->refcnt)) {
+		kfree(r);
+	}
+}
+
+/* vq_req_get - reference a VQ Request Object. Done
+ * only in the kernel verbs handlers.
+ */
+void
+vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r)
+{
+	atomic_inc(&r->refcnt);
+}
+
+
+/* vq_req_put - dereference and potentially free a VQ Request Object.
+ *
+ * This is only called by handle_vq() on the interrupt when it is done processing
+ * a verb reply message. If the associated kernel verbs handler has already bailed,
+ * then this put will actually free the VQ Request object _and_ the VQ Reply Buffer
+ * if it exists.
+ */
+void
+vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r)
+{
+	if (atomic_dec_and_test(&r->refcnt)) {
+		if (r->reply_msg != (u64)NULL)
+			vq_repbuf_free(c2dev, (void *)(unsigned long)r->reply_msg);
+		kfree(r);
+	}
+}
+
+
+/*
+ * vq_repbuf_alloc - allocate a VQ Reply Buffer.
+ */
+void *
+vq_repbuf_alloc(struct c2_dev *c2dev)
+{
+	return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC);
+}
+
+/*
+ * vq_send_wr - post a verbs request message to the Verbs Request Queue.
+ * If a message is not available in the MQ, then block until one is available.
+ * NOTE: handle_mq() on the interrupt context will wake up threads blocked here.
+ * When the adapter drains the Verbs Request Queue, it inserts MQ index 0 into the
+ * adapter->host activity fifo and interrupts the host.
+ */
+int
+vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr)
+{
+	void *msg;
+	wait_queue_t __wait;
+
+	/*
+	 * grab adapter vq lock
+	 */
+	spin_lock(&c2dev->vqlock);
+
+	/*
+	 * allocate msg
+	 */
+	msg = c2_mq_alloc(&c2dev->req_vq);
+
+	/*
+	 * If we cannot get a msg, then we'll wait.
+	 * When messages are available, the int handler will wake_up()
+	 * any waiters.
+	 */
+	while (msg == NULL) {
+		init_waitqueue_entry(&__wait, current);
+		add_wait_queue(&c2dev->req_vq_wo, &__wait);
+		spin_unlock(&c2dev->vqlock);
+		for (;;) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (!c2_mq_full(&c2dev->req_vq)) {
+				break;
+			}
+			if (!signal_pending(current)) {
+				schedule_timeout(1*HZ); /* 1 second... */
+				continue;
+			}
+			set_current_state(TASK_RUNNING);
+			remove_wait_queue(&c2dev->req_vq_wo, &__wait);
+			return -EINTR;
+		}
+		set_current_state(TASK_RUNNING);
+		remove_wait_queue(&c2dev->req_vq_wo, &__wait);
+		spin_lock(&c2dev->vqlock);
+		msg = c2_mq_alloc(&c2dev->req_vq);
+	}
+
+	/*
+	 * copy wr into adapter msg
+	 */
+	memcpy(msg, wr, c2dev->req_vq.msg_size);
+
+	/*
+	 * post msg
+	 */
+	c2_mq_produce(&c2dev->req_vq);
+
+	/*
+	 * release adapter vq lock
+	 */
+	spin_unlock(&c2dev->vqlock);
+	return 0;
+}
+
+
+/*
+ * vq_wait_for_reply - block until the adapter posts a Verb Reply Message.
+ */
+int
+vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req)
+{
+	wait_queue_t __wait;
+	int rc = 0;
+
+	/*
+	 * Add this request to the wait queue.
+	 */
+	init_waitqueue_entry(&__wait, current);
+	add_wait_queue(&req->wait_object, &__wait);
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (atomic_read(&req->reply_ready)) {
+			break;
+		}
+		if (schedule_timeout(60*HZ) == 0) {
+			rc = -ETIMEDOUT;
+			break;
+		}
+	}
+	set_current_state(TASK_RUNNING);
+	remove_wait_queue(&req->wait_object, &__wait);
+	return rc;
+}
+
+/*
+ * vq_repbuf_free - Free a Verbs Reply Buffer.
+ */ +void +vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} Index: hw/amso1100/README =================================================================== --- hw/amso1100/README (revision 0) +++ hw/amso1100/README (revision 0) @@ -0,0 +1,11 @@ + +This is the OpenIB iWARP driver for the AMSO1100 HCA from +Open Grid Computing. The adapter is a 1Gb RDMA capable PCI-X RNIC. + +The driver implements an iWARP CM Provider and OpenIB verbs +provider. The company that created the device (Ammasso, Inc.) +is no longer in business, however, limited quantities of the cards +are available for development purposes from Open Grid Computing. + +Please contact 512-343-9196 x 108 or e-mail tom at opengridcomputing.com +for more information. Index: hw/amso1100/c2_provider.c =================================================================== --- hw/amso1100/c2_provider.c (revision 0) +++ hw/amso1100/c2_provider.c (revision 0) @@ -0,0 +1,704 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" +#include "c2_user.h" + +static int c2_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct c2_dev* c2dev = to_c2dev(ibdev); + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + memset(props, 0, sizeof *props); + + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + memcpy(&props->node_guid, c2dev->netdev->dev_addr, 6); + + props->fw_ver = c2dev->fw_ver; + props->device_cap_flags = c2dev->device_cap_flags; + props->vendor_id = c2dev->vendor_id; + props->vendor_part_id = c2dev->vendor_part_id; + props->hw_ver = c2dev->hw_rev; + props->max_mr_size = ~0ull; + props->max_qp = c2dev->max_qp; + props->max_qp_wr = c2dev->max_qp_wr; + props->max_sge = c2dev->max_sge; + props->max_cq = c2dev->max_cq; + props->max_cqe = c2dev->max_cqe; + props->max_mr = c2dev->max_mr; + props->max_pd = c2dev->max_pd; + props->max_qp_rd_atom = 0; + props->max_qp_init_rd_atom = 0; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int c2_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = 
IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP| + IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 128; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 1; + props->active_speed = 1; + + return 0; +} + +static int c2_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return 0; +} + +static int c2_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int c2_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct c2_dev* c2dev = to_c2dev(ibdev); + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + memcpy(&(gid->raw[0]),c2dev->netdev->dev_addr, MAX_ADDR_LEN); + + return 0; +} + +/* Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
+ */ +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct c2_alloc_ucontext_resp uresp; + struct c2_ucontext *context; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + memset(&uresp, 0, sizeof uresp); + + uresp.qp_tab_size = to_c2dev(ibdev)->max_qp; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + /* The OpenIB user context is logically similar to the RNIC + * Instance of our existing driver + */ + /* context->rnic_p = rnic_open */ + + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { + kfree(context); + return ERR_PTR(-EFAULT); + } + + return &context->ibucontext; +} + +static int c2_dealloc_ucontext(struct ib_ucontext *context) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_mmap_uar(struct ib_ucontext *context, + struct vm_area_struct *vma) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_pd* pd; + int err; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + if (context) { + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof (__u32))) { + c2_pd_free(to_c2dev(ibdev), pd); + kfree(pd); + return ERR_PTR(-EFAULT); + } + } + + return &pd->ibpd; +} + +static int c2_dealloc_pd(struct ib_pd *pd) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *c2_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int c2_ah_destroy(struct 
ib_ah *ah)
+{
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static struct ib_qp *c2_create_qp(struct ib_pd *pd,
+				  struct ib_qp_init_attr *init_attr,
+				  struct ib_udata *udata)
+{
+	struct c2_qp *qp;
+	int err;
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	switch(init_attr->qp_type) {
+	case IB_QPT_RC:
+		qp = kmalloc(sizeof(*qp), GFP_KERNEL);
+		if (!qp) {
+			dprintk("%s: Unable to allocate QP\n", __FUNCTION__);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		if (pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		err = c2_alloc_qp(to_c2dev(pd->device),
+				  to_c2pd(pd),
+				  init_attr,
+				  qp);
+		if (err && pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		break;
+	default:
+		dprintk("%s: Invalid QP type: %d\n", __FUNCTION__, init_attr->qp_type);
+		return ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (err) {
+		kfree(qp);
+		return ERR_PTR(err);
+	}
+
+	return &qp->ibqp;
+}
+
+static int c2_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct c2_qp *qp = to_c2qp(ib_qp);
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	c2_free_qp(to_c2dev(ib_qp->device), qp);
+	kfree(qp);
+
+	return 0;
+}
+
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+				  struct ib_ucontext *context,
+				  struct ib_udata *udata)
+{
+	struct c2_cq *cq;
+	int err;
+
+	cq = kmalloc(sizeof(*cq), GFP_KERNEL);
+	if (!cq) {
+		dprintk("%s: Unable to allocate CQ\n", __FUNCTION__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq);
+	if (err) {
+		dprintk("%s: error initializing CQ\n", __FUNCTION__);
+		kfree(cq);
+		return ERR_PTR(err);
+	}
+
+	return &cq->ibcq;
+}
+
+static int c2_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct c2_cq *cq = to_c2cq(ib_cq);
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	c2_free_cq(to_c2dev(ib_cq->device), cq);
+	kfree(cq);
+
+	return 0;
+}
+
+static inline u32 c2_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ?
CC_ACF_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? CC_ACF_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? CC_ACF_LOCAL_WRITE : 0) | + CC_ACF_LOCAL_READ | CC_ACF_WINDOW_BIND; +} + +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct c2_mr *mr; + u64 **page_list; + u32 total_len; + int err, i, j, k, pbl_depth; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + pbl_depth = 0; + total_len = 0; + + for (i = 0; i < num_phys_buf; i++) { + + int size; + + if (buffer_list[i].addr & ~PAGE_MASK) { + dprintk("Unaligned Memory Buffer: 0x%x\n", + (unsigned int)buffer_list[i].addr); + return ERR_PTR(-EINVAL); + } + + if (!buffer_list[i].size) { + dprintk("Invalid Buffer Size\n"); + return ERR_PTR(-EINVAL); + } + + size = buffer_list[i].size; + total_len += size; + while (size) { + pbl_depth++; + size -= PAGE_SIZE; + } + } + + page_list = kmalloc(sizeof(u64 *) * pbl_depth, GFP_KERNEL); + if (!page_list) + return ERR_PTR(-ENOMEM); + + for (i = 0, j = 0; i < num_phys_buf; i++) { + + int naddrs; + + naddrs = (u32)buffer_list[i].size % ~PAGE_MASK; + for (k = 0; k < naddrs; k++) + page_list[j++] = + (u64 *)(unsigned long)(buffer_list[i].addr + (k << PAGE_SHIFT)); + } + + mr = kmalloc(sizeof(*mr), GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + mr->pd = to_c2pd(ib_pd); + + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, + pbl_depth, total_len, iova_start, + c2_convert_access(acc), mr); + kfree(page_list); + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + bl.size = 4096; + kva = (u64)(unsigned long)kmalloc(bl.size, GFP_KERNEL); + if (!kva) + return ERR_PTR(-ENOMEM); + + bl.addr = __pa(kva); + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); +} 
+ +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int c2_dereg_mr(struct ib_mr *ib_mr) +{ + struct c2_mr *mr = to_c2mr(ib_mr); + int err; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); + if (err) + dprintk("c2_stag_dealloc failed: %d\n", err); + else + kfree(mr); + + return err; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "%x\n", dev->hw_rev); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "%x.%x.%x\n", + (int)(dev->fw_ver >> 32), + (int)(dev->fw_ver >> 16) & 0xffff, + (int)(dev->fw_ver & 0xffff)); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "AMSO1100\n"); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *c2_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int 
attr_mask)
+{
+	int err;
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	err = c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, attr_mask);
+
+	return err;
+}
+
+static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_process_mad(struct ib_device *ibdev,
+			  int mad_flags,
+			  u8 port_num,
+			  struct ib_wc *in_wc,
+			  struct ib_grh *in_grh,
+			  struct ib_mad *in_mad,
+			  struct ib_mad *out_mad)
+{
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_connect(struct iw_cm_id* cm_id,
+		      const void* pdata, u8 pdata_len)
+{
+	int err;
+	struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp);
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	if (cm_id->qp == NULL)
+		return -EINVAL;
+
+	/* Cache the cm_id in the qp */
+	qp->cm_id = cm_id;
+
+	err = c2_llp_connect(cm_id, pdata, pdata_len);
+
+	return err;
+}
+
+static int c2_disconnect(struct iw_cm_id* cm_id, int abrupt)
+{
+	struct ib_qp_attr attr;
+	struct ib_qp *ib_qp = cm_id->qp;
+	int err;
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	if (ib_qp == 0)
+		/* If this is a listening endpoint, there is no QP */
+		return 0;
+
+	memset(&attr, 0, sizeof(struct ib_qp_attr));
+	if (abrupt)
+		attr.qp_state = IB_QPS_ERR;
+	else
+		attr.qp_state = IB_QPS_SQD;
+
+	err = c2_modify_qp(ib_qp, &attr, IB_QP_STATE);
+	return err;
+}
+
+static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+	struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp);
+
+	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
+
+	/* Cache the cm_id in the qp */
+	qp->cm_id = cm_id;
+
+	err = c2_llp_accept(cm_id, pdata, pdata_len);
+
+
return err; +} + +static int c2_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) +{ + int err; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + err = c2_llp_reject(cm_id, pdata, pdata_len); + return err; +} + +static int c2_getpeername(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr ) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + *local_addr = cm_id->local_addr; + *remote_addr = cm_id->remote_addr; + return 0; +} + +static int c2_service_create(struct iw_cm_id* cm_id, int backlog) +{ + int err; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + err = c2_llp_service_create(cm_id, backlog); + return err; +} + +static int c2_service_destroy(struct iw_cm_id* cm_id) +{ + int err; + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + err = c2_llp_service_destroy(cm_id); + + return err; +} + +int c2_register_device(struct c2_dev *dev) +{ + int ret; + int i; + + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); + dev->ibdev.owner = THIS_MODULE; + + dev->ibdev.node_type = IB_NODE_RNIC; + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->netdev->dev_addr, 6); + dev->ibdev.phys_port_cnt = 1; + dev->ibdev.dma_device = &dev->pcidev->dev; + dev->ibdev.class_dev.dev = &dev->pcidev->dev; + dev->ibdev.query_device = c2_query_device; + dev->ibdev.query_port = c2_query_port; + dev->ibdev.modify_port = c2_modify_port; + dev->ibdev.query_pkey = c2_query_pkey; + dev->ibdev.query_gid = c2_query_gid; + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; + dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; + dev->ibdev.mmap = c2_mmap_uar; + dev->ibdev.alloc_pd = c2_alloc_pd; + dev->ibdev.dealloc_pd = c2_dealloc_pd; + dev->ibdev.create_ah = c2_ah_create; + dev->ibdev.destroy_ah = c2_ah_destroy; + dev->ibdev.create_qp = c2_create_qp; + dev->ibdev.modify_qp = 
c2_modify_qp; + dev->ibdev.destroy_qp = c2_destroy_qp; + dev->ibdev.create_cq = c2_create_cq; + dev->ibdev.destroy_cq = c2_destroy_cq; + dev->ibdev.poll_cq = c2_poll_cq; + dev->ibdev.get_dma_mr = c2_get_dma_mr; + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; + dev->ibdev.reg_user_mr = c2_reg_user_mr; + dev->ibdev.dereg_mr = c2_dereg_mr; + + dev->ibdev.alloc_fmr = 0; + dev->ibdev.unmap_fmr = 0; + dev->ibdev.dealloc_fmr = 0; + dev->ibdev.map_phys_fmr = 0; + + dev->ibdev.attach_mcast = c2_multicast_attach; + dev->ibdev.detach_mcast = c2_multicast_detach; + dev->ibdev.process_mad = c2_process_mad; + + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); + dev->ibdev.iwcm->connect = c2_connect; + dev->ibdev.iwcm->disconnect = c2_disconnect; + dev->ibdev.iwcm->accept = c2_accept; + dev->ibdev.iwcm->reject = c2_reject; + dev->ibdev.iwcm->getpeername = c2_getpeername; + dev->ibdev.iwcm->create_listen = c2_service_create; + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; + + ret = ib_register_device(&dev->ibdev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + c2_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ibdev); + return ret; + } + } + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return 0; +} + +void c2_unregister_device(struct c2_dev *dev) +{ + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + ib_unregister_device(&dev->ibdev); +} Index: hw/amso1100/c2_alloc.c =================================================================== --- hw/amso1100/c2_alloc.c (revision 0) +++ hw/amso1100/c2_alloc.c (revision 0) @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "c2.h" + +/* Trivial bitmap-based allocator */ +u32 c2_alloc(struct c2_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj < alloc->max) { + set_bit(obj, alloc->table); + alloc->last = obj; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void c2_free(struct c2_alloc *alloc, u32 obj) +{ + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + spin_unlock(&alloc->lock); +} + +int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved) +{ + int i; + + alloc->last = 0; + alloc->max = num; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void c2_alloc_cleanup(struct c2_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *c2_array_get(struct c2_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int c2_array_set(struct c2_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void c2_array_clear(struct c2_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int c2_array_init(struct c2_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void c2_array_cleanup(struct c2_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +static int c2_alloc_mqsp_chunk(unsigned int gfp_mask, struct sp_chunk** head) +{ + int i; + struct sp_chunk* new_head; + + new_head = (struct sp_chunk*)__get_free_page(gfp_mask|GFP_DMA); + if (new_head == NULL) + return -ENOMEM; + + new_head->next = NULL; + new_head->head = 0; + new_head->gfp_mask = gfp_mask; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE-sizeof(struct sp_chunk*)-sizeof(u16)) / sizeof(u16)-1; + i++) { + new_head->shared_ptr[i] = i+1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** root) { + return 
c2_alloc_mqsp_chunk(gfp_mask, root); +} + +void c2_free_mqsp_pool(struct sp_chunk* root) +{ + struct sp_chunk* next; + + while (root) { + next = root->next; + free_page((unsigned long)root); + root = next; + } +} + +u16* c2_alloc_mqsp(struct sp_chunk* head) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(head->gfp_mask, &head->next) == 0) { + head = head->next; + mqsp = head->head; + head->head = + head->shared_ptr[mqsp]; + break; + } + else + return 0; + } + else + head = head->next; + } + if (head) + return &(head->shared_ptr[mqsp]); + return 0; +} + +void c2_free_mqsp(u16* mqsp) +{ + struct sp_chunk* head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk*)((unsigned long)mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long)mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long)&(((struct sp_chunk*)0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} Index: hw/amso1100/cc_types.h =================================================================== --- hw/amso1100/cc_types.h (revision 0) +++ hw/amso1100/cc_types.h (revision 0) @@ -0,0 +1,297 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CC_TYPES_H_ +#define _CC_TYPES_H_ + +#include + +#ifndef NULL +#define NULL 0 +#endif +#ifndef TRUE +#define TRUE 1 +#endif +#ifndef FALSE +#define FALSE 0 +#endif + +#define PTR_TO_CTX(p) (u64)(u32)(p) + +#define CC_PTR_TO_64(p) (u64)(u32)(p) +#define CC_64_TO_PTR(c) (void*)(u32)(c) + + + +/* + * not really a "type" however this needs + * to be common between adapter and host. + * this is the best place to put it. + */ +#define CC_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define CC_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. Thus + * they are placed here since everyone includes cc_types.h... 
 + */ +typedef enum { + CC_CQ_NOTIFICATION_TYPE_NONE = 1, + CC_CQ_NOTIFICATION_TYPE_NEXT, + CC_CQ_NOTIFICATION_TYPE_NEXT_SE +} cc_cq_notification_type_t; + +typedef enum { + CC_CFG_ADD_ADDR = 1, + CC_CFG_DEL_ADDR = 2, + CC_CFG_ADD_ROUTE = 3, + CC_CFG_DEL_ROUTE = 4 +} cc_setconfig_cmd_t; + +typedef enum { + CC_GETCONFIG_ROUTES = 1, + CC_GETCONFIG_ADDRS +} cc_getconfig_cmd_t; + +/* + * CCIL Work Request Identifiers + */ +typedef enum { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatibility between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues. 
+ */ +#if 0 + CCWR_SEND = 1, + CCWR_SEND_SE, + CCWR_SEND_INV, + CCWR_SEND_SE_INV, +#else + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, +#endif + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, /* WARNING: This must always be the last user wr id defined! */ +} ccwr_ids_t; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +typedef enum { + CC_WR_TYPE_SEND = CCWR_SEND, + CC_WR_TYPE_SEND_SE = CCWR_SEND_SE, + CC_WR_TYPE_SEND_INV = CCWR_SEND_INV, + CC_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + CC_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + CC_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + CC_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + CC_WR_TYPE_BIND_MW = CCWR_MW_BIND, + CC_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + CC_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + CC_WR_TYPE_RECV = CCWR_RECV, + CC_WR_TYPE_NOP = CCWR_NOP, +} cc_wr_type_t; + +/* + * These are used as bitfields for efficient comparison of multiple possible + * states. + */ +typedef enum { + CC_QP_STATE_IDLE = 0x01, /* initial state */ + CC_QP_STATE_CONNECTING = 0x02, /* LLP is connecting */ + CC_QP_STATE_RTS = 0x04, /* RDDP/RDMAP enabled */ + CC_QP_STATE_CLOSING = 0x08, /* LLP is shutting down */ + CC_QP_STATE_TERMINATE = 0x10, /* Connection Terminat[ing|ed] */ + CC_QP_STATE_ERROR = 0x20, /* Error state to flush everything */ +} cc_qp_state_t; + +typedef struct _cc_netaddr_s { + u32 ip_addr; + u32 netmask; + u32 mtu; +} cc_netaddr_t; + +typedef struct _cc_route_s { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +} cc_route_t; + +/* + * A Scatter Gather Entry. 
 + */ +typedef u32 cc_stag_t; + +typedef struct { + cc_stag_t stag; + u32 length; + u64 to; +} cc_data_addr_t; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +typedef enum { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. */ + MEM_VA_BASED = 0x0002, /* Not Zero-based */ + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ + MEM_LOCAL_READ = 0x0008, /* allow local reads */ + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ + MEM_SHARED = 0x0100, /* set if MR is shared */ + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ +} cc_mm_flags_t; + +/* + * CCIL API ACF flags defined in terms of the low level mem flags. + * This minimizes translation needed in the user API + */ +typedef enum { + CC_ACF_LOCAL_READ = MEM_LOCAL_READ, + CC_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, + CC_ACF_REMOTE_READ = MEM_REMOTE_READ, + CC_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, + CC_ACF_WINDOW_BIND = MEM_WINDOW_BIND +} cc_acf_t; + +/* + * Image types of objects written to flash + */ +#define CC_FLASH_IMG_BITFILE 1 +#define CC_FLASH_IMG_OPTION_ROM 2 +#define CC_FLASH_IMG_VPD 3 + +/* + * To fix bug 1815 we define the maximum allowable size of the + * terminate message (per the IETF spec; refer to the IETF + * protocol specification, section 12.1.6, page 64). + * The message is prefixed by 20 bytes of DDP info. + * + * Then the message has 6 bytes for the terminate control + * and DDP segment length info plus a DDP header (either + * 14 or 18 bytes) plus 28 bytes for the RDMA header. 
 + * Thus the max size is: + * 20 + (6 + 18 + 28) = 72 + */ +#define CC_MAX_TERMINATE_MESSAGE_SIZE (72) +#endif Index: hw/amso1100/c2_rnic.c =================================================================== --- hw/amso1100/c2_rnic.c (revision 0) +++ hw/amso1100/c2_rnic.c (revision 0) @@ -0,0 +1,581 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
 + * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#ifdef NETEVENT_NOTIFIER +#include +#include +#include +#endif + + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + ccwr_init_req_t wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(__pa(&c2dev->hint_count)); + wr.q0_host_shared = + cpu_to_be64(__pa(c2dev->req_vq.shared)); + wr.q1_host_shared = + cpu_to_be64(__pa(c2dev->rep_vq.shared)); + wr.q1_host_msg_pool = + cpu_to_be64(__pa(c2dev->rep_vq.msg_pool)); + wr.q2_host_shared = + cpu_to_be64(__pa(c2dev->aeq.shared)); + wr.q2_host_msg_pool = + cpu_to_be64(__pa(c2dev->aeq.msg_pool)); + + /* Post the init message */ + err = vq_send_wr(c2dev, (ccwr_t *)&wr); + + return err; +} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + ccwr_init_req_t wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the term message */ + vq_send_wr(c2dev, (ccwr_t *)&wr); + c2dev->init = 0; + + return; +} + +/* + * Hack to hard code an ip address + */ +extern char *rnic_ip_addr; +static int c2_setconfig_hack(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + ccwr_rnic_setconfig_req_t *wr; + ccwr_rnic_setconfig_rep_t *reply; + cc_netaddr_t netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(cc_netaddr_t); + wr = kmalloc(sizeof(*wr) 
+ len, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long)vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(CC_CFG_ADD_ADDR); + + netaddr.ip_addr = in_aton(rnic_ip_addr); + netaddr.netmask = htonl(0xFFFFFF00); + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (ccwr_t *)wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (ccwr_rnic_setconfig_rep_t *)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + +bail1: + kfree(wr); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + ccwr_t wr; + ccwr_rnic_open_rep_t* reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long)(vq_req); + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = (unsigned long)c2dev; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (ccwr_rnic_open_rep_t*)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ( (err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + +bail1: + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ 
+static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + ccwr_t wr; + ccwr_rnic_close_rep_t *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long)vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (ccwr_rnic_close_rep_t*)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ( (err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + +bail1: + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} +#ifdef NETEVENT_NOTIFIER +static int netevent_notifier(struct notifier_block *self, unsigned long event, void* data) +{ + int i; + u8* ha; + struct neighbour* neigh = data; + struct netevent_redirect* redir = data; + struct netevent_route_change* rev = data; + + switch (event) { + case NETEVENT_ROUTE_UPDATE: + printk(KERN_ERR "NETEVENT_ROUTE_UPDATE:\n"); + printk(KERN_ERR "fib_flags : %d\n", + rev->fib_info->fib_flags); + printk(KERN_ERR "fib_protocol : %d\n", + rev->fib_info->fib_protocol); + printk(KERN_ERR "fib_prefsrc : %08x\n", + rev->fib_info->fib_prefsrc); + printk(KERN_ERR "fib_priority : %d\n", + rev->fib_info->fib_priority); + break; + + case NETEVENT_NEIGH_UPDATE: + printk(KERN_ERR "NETEVENT_NEIGH_UPDATE:\n"); + printk(KERN_ERR "nud_state : %d\n", neigh->nud_state); + printk(KERN_ERR "refcnt : %d\n", neigh->refcnt); + printk(KERN_ERR "used : %d\n", neigh->used); + printk(KERN_ERR "confirmed : %d\n", neigh->confirmed); + printk(KERN_ERR " ha: "); + for (i=0; i < neigh->dev->addr_len; i+=4) { + ha = &neigh->ha[i]; + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], 
ha[2], ha[3]); + } + printk("\n"); + + printk(KERN_ERR "%8s: ", neigh->dev->name); + for (i=0; i < neigh->dev->addr_len; i+=4) { + ha = &neigh->ha[i]; + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); + } + printk("\n"); + break; + + case NETEVENT_REDIRECT: + printk(KERN_ERR "NETEVENT_REDIRECT:\n"); + printk(KERN_ERR "old: "); + for (i=0; i < redir->old->neighbour->dev->addr_len; i+=4) { + ha = &redir->old->neighbour->ha[i]; + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); + } + printk("\n"); + + printk(KERN_ERR "new: "); + for (i=0; i < redir->new->neighbour->dev->addr_len; i+=4) { + ha = &redir->new->neighbour->ha[i]; + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); + } + printk("\n"); + break; + + default: + printk(KERN_ERR "NETEVENT_WTFO:\n"); + } + + return NOTIFY_DONE; +} + +static struct notifier_block nb = { + .notifier_call = netevent_notifier, +}; +#endif +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initializing the various limits and resource pools that + * comprise the RNIC instance. 
 + */ +int c2_rnic_init(struct c2_dev* c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Initialize the adapter limits */ + c2dev->max_mr = C2_MAX_MRS; + c2dev->max_mr_size = ~0; + c2dev->max_qp = C2_MAX_QPS; + c2dev->max_qp_wr = C2_MAX_QP_WR; + c2dev->max_sge = C2_MAX_SGES; + c2dev->max_cq = C2_MAX_CQS; + c2dev->max_cqe = C2_MAX_CQES; + c2dev->max_pd = C2_MAX_PDS; + + /* Device capabilities */ + c2dev->device_cap_flags = + ( + IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | + IB_DEVICE_MW | + IB_DEVICE_ARP + ); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS*sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Initialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS*sizeof(void *)); + c2dev->qptr_array[0] = (void *)&c2dev->req_vq; + c2dev->qptr_array[1] = (void *)&c2dev->rep_vq; + c2dev->qptr_array[2] = (void *)&c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->aeq_lock); + + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool. 
 + */ + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!c2dev->req_vq.shared || + !c2dev->rep_vq.shared || + !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_init(&c2dev->req_vq, 0, + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2_mq_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronous Event Queue */ + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2_mq_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) { + goto bail3; + } + + /* Enable interrupts on the adapter */ + c2_write32(c2dev->regs + C2_IDIS, 0); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) { + goto bail4; + } + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) { + goto bail4; + } + + /* Initialize the PD pool */ 
 + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + err = c2_init_qp_table(c2dev); + if (err) + goto bail6; + + /* XXX hardcode an address */ + err = c2_setconfig_hack(c2dev); + if (err) + goto bail7; + +#ifdef NETEVENT_NOTIFIER + register_netevent_notifier(&nb); +#endif + return 0; + +bail7: + c2_cleanup_qp_table(c2dev); +bail6: + c2_cleanup_pd_table(c2dev); +bail5: + c2_rnic_close(c2dev); +bail4: + vq_term(c2dev); +bail3: + kfree(q2_pages); +bail2: + kfree(q1_pages); +bail1: + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); +bail0: + vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to cleanup the RNIC resources. + */ +void c2_rnic_term(struct c2_dev* c2dev) +{ +#ifdef NETEVENT_NOTIFIER + unregister_netevent_notifier(&nb); +#endif + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + c2_write32(c2dev->regs + C2_IDIS, 1); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Free the asynchronous event queue */ + kfree(c2dev->aeq.msg_pool); + + /* Free the verbs reply queue */ + kfree(c2dev->rep_vq.msg_pool); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} Index: hw/amso1100/c2_vq.h =================================================================== --- hw/amso1100/c2_vq.h (revision 0) +++ hw/amso1100/c2_vq.h (revision 0) @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_VQ_H_ +#define _C2_VQ_H_ +#include + +#include "c2.h" +#include "c2_wr.h" + +struct c2_vq_req{ + u64 reply_msg; /* ptr to reply msg */ + wait_queue_head_t wait_object; /* wait object for vq reqs */ + atomic_t reply_ready; /* set when reply is ready */ + atomic_t refcnt; /* used to cancel WRs... 
*/ +}; + +extern int vq_init(struct c2_dev* c2dev); +extern void vq_term(struct c2_dev* c2dev); + +extern struct c2_vq_req* vq_req_alloc(struct c2_dev *c2dev); +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); +extern int vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr); + +extern void* vq_repbuf_alloc(struct c2_dev *c2dev); +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); + +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req); +#endif /* _C2_VQ_H_ */ Index: hw/amso1100/c2_wr.h =================================================================== --- hw/amso1100/c2_wr.h (revision 0) +++ hw/amso1100/c2_wr.h (revision 0) @@ -0,0 +1,1343 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CC_WR_H_ +#define _CC_WR_H_ +#include "cc_types.h" +/* + * WARNING: If you change this file, also bump CC_IVN_BASE + * in common/include/clustercore/cc_ivn.h. + */ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +/* + * Build String Length. It must be the same as CC_BUILD_STR_LEN in ccil_api.h + */ +#define WR_BUILD_STR_LEN 64 + +#ifdef _MSC_VER +#define PACKED +#pragma pack(push) +#pragma pack(1) +#define __inline__ __inline +#else +#define PACKED __attribute__ ((packed)) +#endif + +/* + * WARNING: All of these structs need to align any 64-bit types on + * 64-bit boundaries! 64-bit types include u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +typedef struct { + /* wqe_count is part of the cqe. It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. 
+ */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} PACKED ccwr_hdr_t; + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +typedef enum { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +} PACKED cc_rnic_flags_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u16 flags; /* See cc_rnic_flags_t */ + u16 port_num; +} PACKED ccwr_rnic_open_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_open_rep_t; + +typedef union { + ccwr_rnic_open_req_t req; + ccwr_rnic_open_rep_t rep; +} PACKED ccwr_rnic_open_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_query_req_t; + +/* + * WR_RNIC_QUERY + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; /* See cc_rnic_flags_t */ + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} PACKED ccwr_rnic_query_rep_t; + +typedef union { + ccwr_rnic_query_req_t req; + ccwr_rnic_query_rep_t rep; +} PACKED ccwr_rnic_query_t; + +/* + * WR_RNIC_GETCONFIG + */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 option; /* see cc_getconfig_cmd_t 
*/ + u64 reply_buf; + u32 reply_buf_len; +} PACKED ccwr_rnic_getconfig_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 option; /* see cc_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} PACKED ccwr_rnic_getconfig_rep_t; + +typedef union { + ccwr_rnic_getconfig_req_t req; + ccwr_rnic_getconfig_rep_t rep; +} PACKED ccwr_rnic_getconfig_t; + +/* + * WR_RNIC_SETCONFIG + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 option; /* See cc_setconfig_cmd_t */ + /* variable data and pad See cc_netaddr_t and + * cc_route_t + */ + u8 data[0]; +} PACKED ccwr_rnic_setconfig_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_rnic_setconfig_rep_t; + +typedef union { + ccwr_rnic_setconfig_req_t req; + ccwr_rnic_setconfig_rep_t rep; +} PACKED ccwr_rnic_setconfig_t; + +/* + * WR_RNIC_CLOSE + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_rnic_close_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_rnic_close_rep_t; + +typedef union { + ccwr_rnic_close_req_t req; + ccwr_rnic_close_rep_t rep; +} PACKED ccwr_rnic_close_t; + +/* + *------------------------ CQ ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} PACKED ccwr_cq_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} PACKED ccwr_cq_create_rep_t; + +typedef union { + ccwr_cq_create_req_t req; + ccwr_cq_create_rep_t rep; +} PACKED ccwr_cq_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} PACKED ccwr_cq_modify_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cq_modify_rep_t; + +typedef union { + ccwr_cq_modify_req_t req; + ccwr_cq_modify_rep_t rep; +} PACKED ccwr_cq_modify_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 cq_handle; +} PACKED 
ccwr_cq_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cq_destroy_rep_t; + +typedef union { + ccwr_cq_destroy_req_t req; + ccwr_cq_destroy_rep_t rep; +} PACKED ccwr_cq_destroy_t; + +/* + *------------------------ PD ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_pd_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_pd_alloc_rep_t; + +typedef union { + ccwr_pd_alloc_req_t req; + ccwr_pd_alloc_rep_t rep; +} PACKED ccwr_pd_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_pd_dealloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_pd_dealloc_rep_t; + +typedef union { + ccwr_pd_dealloc_req_t req; + ccwr_pd_dealloc_rep_t rep; +} PACKED ccwr_pd_dealloc_t; + +/* + *------------------------ SRQ ------------------------ + */ +typedef struct { + ccwr_hdr_t hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} PACKED ccwr_srq_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} PACKED ccwr_srq_create_rep_t; + +typedef union { + ccwr_srq_create_req_t req; + ccwr_srq_create_rep_t rep; +} PACKED ccwr_srq_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 srq_handle; +} PACKED ccwr_srq_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_srq_destroy_rep_t; + +typedef union { + ccwr_srq_destroy_req_t req; + ccwr_srq_destroy_rep_t rep; +} PACKED ccwr_srq_destroy_t; + +/* + *------------------------ QP ------------------------ + */ +typedef enum { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? 
*/ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? */ +} PACKED ccwr_qp_flags_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see ccwr_qp_flags_t */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} PACKED ccwr_qp_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} PACKED ccwr_qp_create_rep_t; + +typedef union { + ccwr_qp_create_req_t req; + ccwr_qp_create_rep_t rep; +} PACKED ccwr_qp_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; +} PACKED ccwr_qp_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see ccwr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} PACKED ccwr_qp_query_rep_t; + +typedef union { + ccwr_qp_query_req_t req; + ccwr_qp_query_rep_t rep; +} PACKED ccwr_qp_query_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} PACKED ccwr_qp_modify_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} PACKED ccwr_qp_modify_rep_t; + +typedef union { + ccwr_qp_modify_req_t req; + ccwr_qp_modify_rep_t rep; +} PACKED ccwr_qp_modify_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; +} PACKED ccwr_qp_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_qp_destroy_rep_t; + +typedef union { + ccwr_qp_destroy_req_t req; + ccwr_qp_destroy_rep_t rep; +} PACKED ccwr_qp_destroy_t; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See ccwr_ae_active_connect_results_t + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} PACKED ccwr_qp_connect_req_t; + +typedef struct { + ccwr_qp_connect_req_t req; + /* no synchronous reply. 
*/ +} PACKED ccwr_qp_connect_t; + + +/* + *------------------------ MM ------------------------ + */ + +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ +} PACKED ccwr_nsmr_stag_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_stag_alloc_rep_t; + +typedef union { + ccwr_nsmr_stag_alloc_req_t req; + ccwr_nsmr_stag_alloc_rep_t rep; +} PACKED ccwr_nsmr_stag_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_register_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_register_rep_t; + +typedef union { + ccwr_nsmr_register_req_t req; + ccwr_nsmr_register_rep_t rep; +} PACKED ccwr_nsmr_register_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 flags; /* See ccwr_mr_flags_t */ + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_pbl_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_nsmr_pbl_rep_t; + +typedef union { + ccwr_nsmr_pbl_req_t req; + ccwr_nsmr_pbl_rep_t rep; +} PACKED ccwr_nsmr_pbl_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED ccwr_mr_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ + u32 pbl_depth; +} PACKED ccwr_mr_query_rep_t; + +typedef union { + ccwr_mr_query_req_t req; + ccwr_mr_query_rep_t rep; +} PACKED ccwr_mr_query_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED 
ccwr_mw_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; /* See ccwr_mr_flags_t */ +} PACKED ccwr_mw_query_rep_t; + +typedef union { + ccwr_mw_query_req_t req; + ccwr_mw_query_rep_t rep; +} PACKED ccwr_mw_query_t; + + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 stag_index; +} PACKED ccwr_stag_dealloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_stag_dealloc_rep_t; + +typedef union { + ccwr_stag_dealloc_req_t req; + ccwr_stag_dealloc_rep_t rep; +} PACKED ccwr_stag_dealloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_reregister_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 pbl_depth; + u32 stag_index; +} PACKED ccwr_nsmr_reregister_rep_t; + +typedef union { + ccwr_nsmr_reregister_req_t req; + ccwr_nsmr_reregister_rep_t rep; +} PACKED ccwr_nsmr_reregister_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 va; + u32 rnic_handle; + u16 flags; /* See ccwr_mr_flags_t */ + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} PACKED ccwr_smr_register_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 stag_index; +} PACKED ccwr_smr_register_rep_t; + +typedef union { + ccwr_smr_register_req_t req; + ccwr_smr_register_rep_t rep; +} PACKED ccwr_smr_register_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 pd_id; +} PACKED ccwr_mw_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 stag_index; +} PACKED ccwr_mw_alloc_rep_t; + +typedef union { + ccwr_mw_alloc_req_t req; + ccwr_mw_alloc_rep_t rep; +} PACKED ccwr_mw_alloc_t; + +/* + *------------------------ WRs ----------------------- + */ + +typedef struct { + ccwr_hdr_t hdr; /* Has status and WR Type */ +} PACKED 
ccwr_user_hdr_t; + +/* Completion queue entry. */ +typedef struct { + ccwr_hdr_t hdr; /* Has status and WR Type */ + u64 qp_user_context;/* cc_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} PACKED ccwr_ce_t; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the ccwr_hdr_t (eight bits). + */ +typedef enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +} PACKED cc_sq_flags_t; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +typedef struct { + ccwr_user_hdr_t user_hdr; +} PACKED cc_sq_hdr_t; + +/* + * Same as above but for post-rq WRs. + */ +typedef struct { + ccwr_user_hdr_t user_hdr; +} PACKED cc_rq_hdr_t; + +/* + * use the same struct for all sends. + */ +typedef struct { + cc_sq_hdr_t sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, ccwr_send_se_inv_req_t; + +typedef ccwr_ce_t ccwr_send_rep_t; + +typedef union { + ccwr_send_req_t req; + ccwr_send_rep_t rep; +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, ccwr_send_se_inv_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} PACKED ccwr_rdma_write_req_t; + +typedef ccwr_ce_t ccwr_rdma_write_rep_t; + +typedef union { + ccwr_rdma_write_req_t req; + ccwr_rdma_write_rep_t rep; +} PACKED ccwr_rdma_write_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t; + +typedef ccwr_ce_t ccwr_rdma_read_rep_t; + +typedef union { + ccwr_rdma_read_req_t req; + ccwr_rdma_read_rep_t rep; +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 
va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; /* see ccwr_mr_flags_t */ +} PACKED ccwr_mw_bind_req_t; + +typedef ccwr_ce_t ccwr_mw_bind_rep_t; + +typedef union { + ccwr_mw_bind_req_t req; + ccwr_mw_bind_rep_t rep; +} PACKED ccwr_mw_bind_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} PACKED ccwr_nsmr_fastreg_req_t; + +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t; + +typedef union { + ccwr_nsmr_fastreg_req_t req; + ccwr_nsmr_fastreg_rep_t rep; +} PACKED ccwr_nsmr_fastreg_t; + +typedef struct { + cc_sq_hdr_t sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} PACKED ccwr_stag_invalidate_req_t; + +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t; + +typedef union { + ccwr_stag_invalidate_req_t req; + ccwr_stag_invalidate_rep_t rep; +} PACKED ccwr_stag_invalidate_t; + +typedef union { + cc_sq_hdr_t sq_hdr; + ccwr_send_req_t send; + ccwr_send_se_req_t send_se; + ccwr_send_inv_req_t send_inv; + ccwr_send_se_inv_req_t send_se_inv; + ccwr_rdma_write_req_t rdma_write; + ccwr_rdma_read_req_t rdma_read; + ccwr_mw_bind_req_t mw_bind; + ccwr_nsmr_fastreg_req_t nsmr_fastreg; + ccwr_stag_invalidate_req_t stag_inv; +} PACKED ccwr_sqwr_t; + + +/* + * RQ WRs + */ +typedef struct { + cc_rq_hdr_t rq_hdr; + u8 data[0]; /* array of SGEs */ +} PACKED ccwr_rqwr_t, ccwr_recv_req_t; + +typedef ccwr_ce_t ccwr_recv_rep_t; + +typedef union { + ccwr_recv_req_t req; + ccwr_recv_rep_t rep; +} PACKED ccwr_recv_t; + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef ccwr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. 
NULL if this + * is not affiliated with an rnic + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, CC_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see cc_resource_indicator_t */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} PACKED ccwr_ae_hdr_t; + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state + */ +typedef struct { + ccwr_ae_hdr_t ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} PACKED ccwr_ae_active_connect_results_t; + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host. + * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +typedef struct { + ccwr_ae_hdr_t ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. 
*/ +} PACKED ccwr_ae_connection_request_t; + +typedef union { + ccwr_ae_hdr_t ae_generic; + ccwr_ae_active_connect_results_t ae_active_connect_results; + ccwr_ae_connection_request_t ae_connection_request; +} PACKED ccwr_ae_t; + +typedef struct { + ccwr_hdr_t hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} PACKED ccwr_init_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_init_rep_t; + +typedef union { + ccwr_init_req_t req; + ccwr_init_rep_t rep; +} PACKED ccwr_init_t; + +/* + * For upgrading flash. + */ + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; +} PACKED ccwr_flash_init_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} PACKED ccwr_flash_init_rep_t; + +typedef union { + ccwr_flash_init_req_t req; + ccwr_flash_init_rep_t rep; +} PACKED ccwr_flash_init_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 len; +} PACKED ccwr_flash_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 status; +} PACKED ccwr_flash_rep_t; + +typedef union { + ccwr_flash_req_t req; + ccwr_flash_rep_t rep; +} PACKED ccwr_flash_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 size; +} PACKED ccwr_buf_alloc_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} PACKED ccwr_buf_alloc_rep_t; + +typedef union { + ccwr_buf_alloc_req_t req; + ccwr_buf_alloc_rep_t rep; +} PACKED ccwr_buf_alloc_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} PACKED ccwr_buf_free_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_buf_free_rep_t; + +typedef union { + ccwr_buf_free_req_t req; + ccwr_buf_free_rep_t rep; +} PACKED ccwr_buf_free_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 offset; + u32 size; + 
u32 type; + u32 flags; +} PACKED ccwr_flash_write_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 status; +} PACKED ccwr_flash_write_rep_t; + +typedef union { + ccwr_flash_write_req_t req; + ccwr_flash_write_rep_t rep; +} PACKED ccwr_flash_write_t; + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. Newly established LLP connections are passed up + * via an AE. See ccwr_ae_connection_request_t + */ +typedef struct { + ccwr_hdr_t hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional tcp listen bl */ +} PACKED ccwr_ep_listen_create_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} PACKED ccwr_ep_listen_create_rep_t; + +typedef union { + ccwr_ep_listen_create_req_t req; + ccwr_ep_listen_create_rep_t rep; +} PACKED ccwr_ep_listen_create_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; +} PACKED ccwr_ep_listen_destroy_req_t; + +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_ep_listen_destroy_rep_t; + +typedef union { + ccwr_ep_listen_destroy_req_t req; + ccwr_ep_listen_destroy_rep_t rep; +} PACKED ccwr_ep_listen_destroy_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; +} PACKED ccwr_ep_query_req_t; + +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} PACKED ccwr_ep_query_rep_t; + +typedef union { + ccwr_ep_query_req_t req; + ccwr_ep_query_rep_t rep; +} PACKED ccwr_ep_query_t; + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See ccwr_ae_connection_request_t. 
+ */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg. */ +} PACKED ccwr_cr_accept_req_t; + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cr_accept_rep_t; + +typedef union { + ccwr_cr_accept_req_t req; + ccwr_cr_accept_rep_t rep; +} PACKED ccwr_cr_accept_t; + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous ccwr_ae_connection_request_t AE sent by the adapter. + */ +typedef struct { + ccwr_hdr_t hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} PACKED ccwr_cr_reject_req_t; + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +typedef struct { + ccwr_hdr_t hdr; +} PACKED ccwr_cr_reject_rep_t; + +typedef union { + ccwr_cr_reject_req_t req; + ccwr_cr_reject_rep_t rep; +} PACKED ccwr_cr_reject_t; + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +typedef struct { + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} PACKED ccwr_console_req_t; + +/* + * flags used in the console reply. + */ +typedef enum { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} PACKED cc_console_flags_t; + +/* + * Console reply message. 
+ * hdr.result contains the cc_status_t error if the reply was _not_ generated, + * or CC_OK if the reply was generated. + */ +typedef struct { + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ + u32 flags; /* see cc_console_flags_t */ +} PACKED ccwr_console_rep_t; + +typedef union { + ccwr_console_req_t req; + ccwr_console_rep_t rep; +} PACKED ccwr_console_t; + + +/* + * Giant union with all WRs. Makes life easier... + */ +typedef union { + ccwr_hdr_t hdr; + ccwr_user_hdr_t user_hdr; + ccwr_rnic_open_t rnic_open; + ccwr_rnic_query_t rnic_query; + ccwr_rnic_getconfig_t rnic_getconfig; + ccwr_rnic_setconfig_t rnic_setconfig; + ccwr_rnic_close_t rnic_close; + ccwr_cq_create_t cq_create; + ccwr_cq_modify_t cq_modify; + ccwr_cq_destroy_t cq_destroy; + ccwr_pd_alloc_t pd_alloc; + ccwr_pd_dealloc_t pd_dealloc; + ccwr_srq_create_t srq_create; + ccwr_srq_destroy_t srq_destroy; + ccwr_qp_create_t qp_create; + ccwr_qp_query_t qp_query; + ccwr_qp_modify_t qp_modify; + ccwr_qp_destroy_t qp_destroy; + ccwr_qp_connect_t qp_connect; + ccwr_nsmr_stag_alloc_t nsmr_stag_alloc; + ccwr_nsmr_register_t nsmr_register; + ccwr_nsmr_pbl_t nsmr_pbl; + ccwr_mr_query_t mr_query; + ccwr_mw_query_t mw_query; + ccwr_stag_dealloc_t stag_dealloc; + ccwr_sqwr_t sqwr; + ccwr_rqwr_t rqwr; + ccwr_ce_t ce; + ccwr_ae_t ae; + ccwr_init_t init; + ccwr_ep_listen_create_t ep_listen_create; + ccwr_ep_listen_destroy_t ep_listen_destroy; + ccwr_cr_accept_t cr_accept; + ccwr_cr_reject_t cr_reject; + ccwr_console_t console; + ccwr_flash_init_t flash_init; + ccwr_flash_t flash; + ccwr_buf_alloc_t buf_alloc; + ccwr_buf_free_t buf_free; + ccwr_flash_write_t flash_write; +} PACKED ccwr_t; + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types + * in the ccwr_t union can be passed in. 
+ */ +static __inline__ u8 +c2_wr_get_id(void *wr) +{ + return ((ccwr_hdr_t *)wr)->id; +} +static __inline__ void +c2_wr_set_id(void *wr, u8 id) +{ + ((ccwr_hdr_t *)wr)->id = id; +} +static __inline__ u8 +c2_wr_get_result(void *wr) +{ + return ((ccwr_hdr_t *)wr)->result; +} +static __inline__ void +c2_wr_set_result(void *wr, u8 result) +{ + ((ccwr_hdr_t *)wr)->result = result; +} +static __inline__ u8 +c2_wr_get_flags(void *wr) +{ + return ((ccwr_hdr_t *)wr)->flags; +} +static __inline__ void +c2_wr_set_flags(void *wr, u8 flags) +{ + ((ccwr_hdr_t *)wr)->flags = flags; +} +static __inline__ u8 +c2_wr_get_sge_count(void *wr) +{ + return ((ccwr_hdr_t *)wr)->sge_count; +} +static __inline__ void +c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((ccwr_hdr_t *)wr)->sge_count = sge_count; +} +static __inline__ u32 +c2_wr_get_wqe_count(void *wr) +{ + return ((ccwr_hdr_t *)wr)->wqe_count; +} +static __inline__ void +c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((ccwr_hdr_t *)wr)->wqe_count = wqe_count; +} + +#undef PACKED + +#ifdef _MSC_VER +#pragma pack(pop) +#endif + +#endif /* _CC_WR_H_ */ Index: hw/amso1100/c2_cm.c =================================================================== --- hw/amso1100/c2_cm.c (revision 0) +++ hw/amso1100/c2_cm.c (revision 0) @@ -0,0 +1,415 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp = to_c2qp(cm_id->qp); + ccwr_qp_connect_req_t *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + /* + * only support the max private_data length + */ + if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { + return -EINVAL; + } + + /* + * Create and send a WR_QP_CONNECT... 
+ */ + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the caller's buf into + * the WR. + */ + if (pdata) { + wr->private_data_length = cpu_to_be32(pdata_len); + memcpy(&wr->private_data[0], pdata, pdata_len); + } else { + wr->private_data_length = 0; + } + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (ccwr_t*)wr); + vq_req_free(c2dev, vq_req); +bail0: + kfree(wr); + return err; +} + +int +c2_llp_service_create(struct iw_cm_id* cm_id, int backlog) +{ + struct c2_dev *c2dev; + ccwr_ep_listen_create_req_t wr; + ccwr_ep_listen_create_rep_t *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64)(unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64)(unsigned long)cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_ep_listen_create_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ( (err = c2_errno(reply)) != 0) { + goto bail1; + } + + /* + * get the adapter handle + */ + cm_id->provider_id = reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + +bail1: + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int +c2_llp_service_destroy(struct iw_cm_id* cm_id) +{ + + struct c2_dev *c2dev; + ccwr_ep_listen_destroy_req_t wr; + ccwr_ep_listen_destroy_rep_t *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = cm_id->provider_id; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_ep_listen_destroy_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ( (err = c2_errno(reply)) != 0) { + goto bail1; + } + +bail1: + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int +c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp = to_c2qp(cm_id->qp); + ccwr_cr_accept_req_t *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + ccwr_cr_accept_rep_t *reply; /* VQ Reply msg ptr. */ + int err; + + /* Make sure there's a bound QP */ + if (qp == 0) + return -EINVAL; + + /* + * only support the max private_data length + */ + if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { + return -EINVAL; + } + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * Build the WR + */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long)vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32)cm_id->provider_id; + wr->qp_handle = qp->adapter_handle; + if (pdata) { + wr->private_data_length = cpu_to_be32(pdata_len); + memcpy(&wr->private_data[0], pdata, pdata_len); + } else { + wr->private_data_length = 0; + } + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * Process reply + */ + reply = (ccwr_cr_accept_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + +bail1: + kfree(wr); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int +c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + ccwr_cr_reject_req_t wr; + struct c2_vq_req *vq_req; + ccwr_cr_reject_rep_t *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)cm_id->provider_id; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_cr_reject_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + Index: hw/amso1100/c2_provider.h =================================================================== --- hw/amso1100/c2_provider.h (revision 0) +++ hw/amso1100/c2_provider.h (revision 0) @@ -0,0 +1,174 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_PROVIDER_H +#define C2_PROVIDER_H + +#include +#include + +#include "c2_mq.h" +#include + +#define C2_MPT_FLAG_ATOMIC (1 << 14) +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) + +struct c2_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + + +/* The user context keeps track of objects allocated for a + * particular user-mode client. */ +struct c2_ucontext { + struct ib_ucontext ibucontext; + + int index; /* rnic index (minor) */ + int port; /* Which GigE port */ + + /* + * Shared HT pages for user-accessible MQs. + */ + int hthead; /* index of first free entry */ + void* htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void* htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + u64 adapter_hint_uva; /* Activity FIFO */ +}; + +struct c2_mtt; + +/* All objects associated with a PD are kept in the + * associated user context if present. 
+ */ +struct c2_pd { + struct ib_pd ibpd; + u32 pd_id; + atomic_t sqp_count; +}; + +struct c2_mr { + struct ib_mr ibmr; + struct c2_pd *pd; +}; + +struct c2_av; + +enum c2_ah_type { + C2_AH_ON_HCA, + C2_AH_PCI_POOL, + C2_AH_KMALLOC +}; + +struct c2_ah { + struct ib_ah ibah; +}; + +struct c2_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int is_kernel; + wait_queue_head_t wait; + + u32 adapter_handle; + struct c2_mq mq; +}; + +struct c2_wq { + spinlock_t lock; +}; +struct iw_cm_id; +struct c2_qp { + struct ib_qp ibqp; + struct iw_cm_id* cm_id; + spinlock_t lock; + atomic_t refcount; + wait_queue_head_t wait; + int qpn; + + u32 adapter_handle; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u8 state; + + struct c2_mq sq_mq; + struct c2_mq rq_mq; +}; + +struct c2_cr_query_attrs { + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +}; + +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct c2_pd, ibpd); +} + +static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct c2_ucontext, ibucontext); +} + +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct c2_mr, ibmr); +} + + +static inline struct c2_ah *to_c2ah(struct ib_ah *ibah) +{ + return container_of(ibah, struct c2_ah, ibah); +} + +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct c2_cq, ibcq); +} + +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct c2_qp, ibqp); +} +#endif /* C2_PROVIDER_H */ Index: hw/amso1100/c2_pd.c =================================================================== --- hw/amso1100/c2_pd.c (revision 0) +++ hw/amso1100/c2_pd.c (revision 0) @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. 
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "c2.h" +#include "c2_provider.h" + +int c2_pd_alloc(struct c2_dev *dev, int privileged, struct c2_pd *pd) +{ + int err = 0; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_id = c2_alloc(&dev->pd_table.alloc); + if (pd->pd_id == -1) + return -ENOMEM; + + return err; +} + +void c2_pd_free(struct c2_dev *dev, struct c2_pd *pd) +{ + might_sleep(); + c2_free(&dev->pd_table.alloc, pd->pd_id); +} + +int __devinit c2_init_pd_table(struct c2_dev *dev) +{ + return c2_alloc_init(&dev->pd_table.alloc, + dev->max_pd, + 0); +} + +void __devexit c2_cleanup_pd_table(struct c2_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + c2_alloc_cleanup(&dev->pd_table.alloc); +} Index: hw/amso1100/c2_cq.c =================================================================== --- hw/amso1100/c2_cq.c (revision 0) +++ hw/amso1100/c2_cq.c (revision 0) @@ -0,0 +1,401 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "cc_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(ccwr_ce_t) + 32-1) & ~(32-1)) + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2dev->qptr_array[mq_index]; + + if (!cq) { + dprintk("Completion event for bogus CQ %08x\n", mq_index); + return; + } + + assert(cq->ibcq.comp_handler); + (*cq->ibcq.comp_handler)(&cq->ibcq, cq->ibcq.cq_context); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2dev->qptr_array[mq_index]; + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + ccwr_ce_t *msg; + + while (priv != cpu_to_be16(*q->shared)) { + msg = (ccwr_ce_t *)(q->msg_pool + priv * q->msg_size); + if (msg->qp_user_context == (u64)(unsigned long)qp) { + msg->qp_user_context = (u64)0; + } + BUMP(q, priv); + } + } + + spin_unlock_irq(&cq->lock); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case CC_OK: return IB_WC_SUCCESS; + case CCERR_FLUSHED: return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: return 
IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: return IB_WC_MW_BIND_ERR; + default: return IB_WC_GENERAL_ERR; + } +} + + +static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, + struct ib_wc *entry) +{ + ccwr_ce_t *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable to process the completion. + * try pulling the next message + */ + while ( (qp = (struct c2_qp *)(unsigned long)ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case CC_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case CC_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case CC_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case CC_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case CC_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, be32_to_cpu(c2_wr_get_wqe_count(ce))+1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled <
num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared volatile *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT; + else if (notify == IB_CQ_SOLICITED) + shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT_SE; + else + return -EINVAL; + + shared->armed = CQ_WAIT_FOR_DMA|CQ_ARMED; + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. + */ + { + volatile char c; + c = shared->armed; + } + + return 0; +} + +static void c2_free_cq_buf(struct c2_mq *mq) +{ + int npages; + + npages = ((mq->q_size * mq->msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; + free_pages((unsigned long)mq->msg_pool, npages); +} + +static int c2_alloc_cq_buf(struct c2_mq *mq, int q_size, int msg_size) +{ + unsigned long pool_start; + int npages; + + npages = ( (q_size * msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; + + pool_start = __get_free_pages(GFP_KERNEL, npages); + if (!pool_start) + return -ENOMEM; + + c2_mq_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *)pool_start, + 0, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + ccwr_cq_create_req_t wr; + ccwr_cq_create_rep_t* reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(&cq->mq, entries+1, C2_CQ_MSG_SIZE); + 
if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(__pa(cq->mq.shared)); + wr.msg_pool = cpu_to_be64(__pa(cq->mq.msg_pool)); + wr.user_context = (u64)(unsigned long)(cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (ccwr_cq_create_rep_t*)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ( (err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->adapter_shared)); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + /* XXX qptr_array lock? 
*/ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + +bail3: + vq_repbuf_free(c2dev, reply); +bail2: + vq_req_free(c2dev, vq_req); +bail1: + c2_free_cq_buf(&cq->mq); +bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, + struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + ccwr_cq_destroy_req_t wr; + ccwr_cq_destroy_rep_t *reply; + + might_sleep(); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + wr.hdr.context = (unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (ccwr_cq_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); + +//bail2: + vq_repbuf_free(c2dev, reply); +bail1: + vq_req_free(c2dev, vq_req); +bail0: + if (cq->is_kernel) { + c2_free_cq_buf(&cq->mq); + } + + return; +} + Index: hw/amso1100/Makefile =================================================================== --- hw/amso1100/Makefile (revision 0) +++ hw/amso1100/Makefile (revision 0) @@ -0,0 +1,22 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG +EXTRA_CFLAGS += -DC2_DEBUG +endif + +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o + +iw_c2-y := \ + c2.o \ + c2_provider.o \ + c2_rnic.o \ + c2_alloc.o \ + c2_mq.o \ + c2_ae.o \ + c2_vq.o \ + c2_intr.o \ + c2_cq.o \ + c2_qp.o \ + c2_cm.o \ + c2_mm.o \ + c2_pd.o Index: hw/amso1100/c2_mm.c =================================================================== --- hw/amso1100/c2_mm.c (revision 0) +++ hw/amso1100/c2_mm.c (revision 0) @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. 
+ */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. */ + ccwr_nsmr_pbl_req_t *wr; /* PBL WR ptr */ + ccwr_nsmr_pbl_rep_t *reply; /* reply ptr */ + int err, pbl_virt, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + break; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(ccwr_nsmr_pbl_req_t)) / sizeof(u64); + wr = (ccwr_nsmr_pbl_req_t*)kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. + */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct cuz we're gonna wait for a reply. + * also make this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * int handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long)vq_req; + } + + /* + * if pbl_virt is set then va is a virtual address that describes a + * virtually contiguous memory allocation. the wr needs the start of + * each virtual page to be converted to the corresponding physical + * address of the page. 
+ * + * if pbl_virt is not set then va is an array of physical addresses and + * there is no conversion to do. just fill in the wr with what is in + * the array. + */ + for (i=0; i < count; i++) { + if (pbl_virt) { + /* XXX */ //wr->paddrs[i] = cpu_to_be64(user_virt_to_phys(va)); + va += PAGE_SIZE; + } else { + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)((void **)va)[i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)wr); + if (err) { + if (count <= pbe_count) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + } + + /* + * Now wait for the reply... + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_nsmr_pbl_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); +bail0: + kfree(wr); + return err; +} + +#define CC_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, + int pbl_depth, u32 length, u64 *va, + cc_acf_t acf, struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + ccwr_nsmr_register_req_t *wr; + ccwr_nsmr_register_rep_t *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > CC_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long)vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + 
sizeof(ccwr_nsmr_register_req_t)) / + sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(PAGE_SIZE); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(0); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)addr_list[i]); + } + + /* + * reference the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (ccwr_t *)wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = (ccwr_nsmr_register_rep_t *)(unsigned long)(vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ( (err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose.
+ */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long)NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long)&addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + +bail2: + vq_repbuf_free(c2dev, reply); +bail1: + kfree(wr); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int +c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + ccwr_stag_dealloc_req_t wr; /* work request */ + ccwr_stag_dealloc_rep_t *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64)(unsigned long)vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (ccwr_t*)&wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (ccwr_stag_dealloc_rep_t*)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); +bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + Index: hw/amso1100/cc_status.h =================================================================== --- hw/amso1100/cc_status.h (revision 0) +++ hw/amso1100/cc_status.h (revision 0) @@ -0,0 +1,163 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _CC_STATUS_H_ +#define _CC_STATUS_H_ + +/* + * Verbs Status Codes + */ +typedef enum { + CC_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 62, 
+ CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; used internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +} cc_status_t; + +/* + * Verbs and Completion Status Code types... + */ +typedef cc_status_t cc_verbs_status_t; +typedef cc_status_t cc_wc_status_t; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +typedef enum { + CC_CONN_STATUS_SUCCESS = CC_OK, + CC_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + CC_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + CC_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + CC_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + CC_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + CC_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + CC_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + CC_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + CC_CONN_STATUS_REJECTED = CCERR_CONN_RESET, +} cc_connect_status_t; + +/* + * Flash programming status codes.
+ */ +typedef enum { + CC_FLASH_STATUS_SUCCESS = 0x0000, + CC_FLASH_STATUS_VERIFY_ERR = 0x0002, + CC_FLASH_STATUS_IMAGE_ERR = 0x0004, + CC_FLASH_STATUS_ECLBS = 0x0400, + CC_FLASH_STATUS_PSLBS = 0x0800, + CC_FLASH_STATUS_VPENS = 0x1000, +} cc_flash_status_t; + +#endif /* _CC_STATUS_H_ */ From rdreier at cisco.com Mon Jan 23 21:52:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 21:52:46 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123233907.GC29917@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 24 Jan 2006 01:39:07 +0200") References: <20060123233438.GQ5074@us.ibm.com> <20060123233907.GC29917@mellanox.co.il> Message-ID: Michael> Could the high/low bits be swapped? What happends if you Michael> change cycles_t from long long to long? Could you try Michael> running the clock_test utility? What seems to be happening is that mftb is giving the low 32 bits of the timebase (as expected on ppc32). Since your get_cycles() is returning a long long, those 32 bits get put in the most significant 32 bits of the return value, and the low 32 bits are garbage (ppc is big endian). If I compile clock_test for ppc32, I see that get_cycles() compiles to: 1000064c : 1000064c: 7c 6c 42 e6 mftb r3 10000650: 4e 80 00 20 blr For comparison, a function like unsigned long long blah(void) { return 0x100000002ull; } compiles to 00000000 : 0: 38 60 00 01 li r3,1 4: 38 80 00 02 li r4,2 8: 4e 80 00 20 blr In other words the convention on ppc32 is that unsigned long long return values have the high 32 bits in r3 and the low 32 bits in r4. I think you want to use something like typedef unsigned long long cycles_t; static inline cycles_t get_cycles() { unsigned long low, hi, hi2; do { asm volatile ("mftbu %0" : "=r" (hi)); asm volatile ("mftb %0" : "=r" (low)); asm volatile ("mftbu %0" : "=r" (hi2)); } while (hi != hi2); return ((unsigned long long) hi << 32) | low; } for ppc32. 
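To see why the second mftbu read matters, here is a portable sketch of the same hi/low/hi retry. The callback type, the fake_tb counter and demo_rollover() are hypothetical stand-ins for the mftbu/mftb instructions so that the logic can be run on any host; none of them come from perftest.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 32-bit timebase register read (mftbu or mftb). */
typedef uint32_t (*tb_read_fn)(void *ctx);

/*
 * Read a 64-bit counter through two 32-bit halves, retrying if the
 * upper half changed while the lower half was being read.
 */
static uint64_t read_timebase64(tb_read_fn read_upper, tb_read_fn read_lower,
				void *ctx)
{
	uint32_t hi, lo, hi2;

	do {
		hi = read_upper(ctx);
		lo = read_lower(ctx);
		hi2 = read_upper(ctx);
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}

/* Simulated timebase (assumption for the demo): advances one tick per read. */
struct fake_tb {
	uint64_t now;
};

static uint32_t fake_upper(void *ctx)
{
	struct fake_tb *tb = ctx;
	uint32_t v = (uint32_t)(tb->now >> 32);

	tb->now++;
	return v;
}

static uint32_t fake_lower(void *ctx)
{
	struct fake_tb *tb = ctx;
	uint32_t v = (uint32_t)tb->now;

	tb->now++;
	return v;
}

/* Start one tick before the lower half wraps, so the first attempt straddles the carry. */
static uint64_t demo_rollover(void)
{
	struct fake_tb tb = { 0x00000001FFFFFFFFULL };

	return read_timebase64(fake_upper, fake_lower, &tb);
}
```

In this simulation the first attempt pairs the old upper half with an already-wrapped lower half, which would give 0x100000000 (about 2^32 ticks too low); the retry is what makes the two halves consistent.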
However, this is not quite enough to make things work on all powerpc systems, because the timebase does not necessarily run at the same speed as the CPU. For example, on an IBM JS20 blade, clock_test prints 1 sec = 6536.8 usec 1 sec = 6537.05 usec (both as a 32-bit and 64-bit executable) because, as /proc/cpuinfo shows: processor : 0 cpu : PPC970FX, altivec supported clock : 2194.624509MHz revision : 3.0 processor : 1 cpu : PPC970FX, altivec supported clock : 2194.624509MHz revision : 3.0 timebase : 14318000 machine : CHRP IBM,8842-P2C the timebase runs at about 14.3 MHz, or approx 153 times slower than the CPU clock. I'm not sure how you want to fix this in perftest. - R. From rdreier at cisco.com Mon Jan 23 22:07:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Jan 2006 22:07:53 -0800 Subject: [openib-general] [PATCH] RFC: AMSO1100 iWARP Driver In-Reply-To: <1138079753.4758.40.camel@strider.opengridcomputing.com> (Tom Tucker's message of "Mon, 23 Jan 2006 23:15:52 -0600") References: <1138079753.4758.40.camel@strider.opengridcomputing.com> Message-ID: Tom, thanks for posting this. After thinking this over, I really think the amso1100 driver belongs upstream. If Linux is shipping ARCnet and Sound Blaster CD-ROM support, then you've got a long wait until your card is obsolete enough to forget about. And having a real iWARP driver just makes things a lot easier to justify and understand, although I would still like to get buy-in from people like NetEffect and Chelsio. Anyway, some easy comments from a quick skim: > +/* > + * WARNING: If you change this file, also bump CC_IVN_BASE > + * in common/include/clustercore/cc_ivn.h. > + */ Uh, where's clustercore/cc_ivn.h? > +typedef enum { > ... > +} cc_event_id_t; > +typedef enum { > ... > +} cc_resource_indicator_t; typedefs that create foo_t are strongly deprecated in the kernel. Just do enum cc_event_id { ... }; and use "enum cc_event_id" everywhere. 
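For concreteness, a sketch of that conversion applied to the resource-indicator enum from cc_ae.h. The enumerator values are the ones in the patch; the tag name enum cc_resource_indicator and the helper cc_res_ind_name() are only illustrative guesses and are not taken from the driver.

```c
/* Was: typedef enum { ... } cc_resource_indicator_t; */
enum cc_resource_indicator {
	CC_RES_IND_QP = 1,
	CC_RES_IND_EP,
	CC_RES_IND_CQ,
	CC_RES_IND_SRQ,
};

/* Hypothetical helper, only here to show the type spelled out at a use site. */
static const char *cc_res_ind_name(enum cc_resource_indicator ind)
{
	switch (ind) {
	case CC_RES_IND_QP:
		return "QP";
	case CC_RES_IND_EP:
		return "EP";
	case CC_RES_IND_CQ:
		return "CQ";
	case CC_RES_IND_SRQ:
		return "SRQ";
	}
	return "unknown";
}
```

The same mechanical change would presumably apply to the other foo_t typedefs in the patch (cc_event_id_t, cc_status_t and so on).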
> + switch (mq_index) { > + case (0): no need for parentheses here. and can the magic (0), (1), (2) be given names that say what they mean? > + struct c2_mq_shared volatile *shared; volatile declarations are almost always a bug... use proper locking or memory barriers to say what you mean instead. > + /* > + * Now read back shared->armed to make the PCI > + * write synchronous. This is necessary for > + * correct cq notification semantics. > + */ > + { > + volatile char c; > + c = shared->armed; > + } If you're reading across PCI you should be using readb(). > + qp->adapter_handle = reply->qp_handle; > + qp->state = IB_QPS_RESET; > + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; > + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; > + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; whitespace damage alert > +#define assert(expr) \ > + if(!(expr)) { \ > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > + } probably just use BUG_ON() -- then you get a traceback too. > +struct c2_adapter_pci_regs { > + char reg_magic[8]; > + u32 version; > + u32 ivn; > + u32 pci_window_size; > + u32 q0_q_size; Indent with tabs not 4 spaces (lots of other places too) > +static inline u32 c2_read32(const void __iomem *addr) > +{ > + return readl(addr); > +} Any reason for not using readl() directly (and similarly for all the other c2_readxx c2_writexx funcs)? - R. From tom at opengridcomputing.com Mon Jan 23 22:58:46 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 24 Jan 2006 00:58:46 -0600 Subject: [openib-general] [PATCH] RFC: AMSO1100 iWARP Driver In-Reply-To: References: <1138079753.4758.40.camel@strider.opengridcomputing.com> Message-ID: <1138085926.675.12.camel@strider.opengridcomputing.com> Roland: Thanks for the comments. I'll take a pass through given your review and repost. Thanks, again, Tom On Mon, 2006-01-23 at 22:07 -0800, Roland Dreier wrote: > Tom, thanks for posting this. 
After thinking this over, I really > think the amso1100 driver belongs upstream. If Linux is shipping > ARCnet and Sound Blaster CD-ROM support, then you've got a long wait > until your card is obsolete enough to forget about. And having a real > iWARP driver just makes things a lot easier to justify and understand, > although I would still like to get buy-in from people like NetEffect > and Chelsio. > > Anyway, some easy comments from a quick skim: > > > +/* > > + * WARNING: If you change this file, also bump CC_IVN_BASE > > + * in common/include/clustercore/cc_ivn.h. > > + */ > > Uh, where's clustercore/cc_ivn.h? time-warp comment. > > > +typedef enum { > > ... > > +} cc_event_id_t; > > > +typedef enum { > > ... > > +} cc_resource_indicator_t; > > typedefs that create foo_t are strongly deprecated in the kernel. > Just do > > enum cc_event_id { > ... > }; > got it. > and use "enum cc_event_id" everywhere. > > > + switch (mq_index) { > > + case (0): > > no need for parentheses here. and can the magic (0), (1), (2) be > given names that say what they mean? yeah - I noticed this too. > > > + struct c2_mq_shared volatile *shared; > > volatile declarations are almost always a bug... use proper locking or > memory barriers to say what you mean instead. agreed. wmb() > > > + /* > > + * Now read back shared->armed to make the PCI > > + * write synchronous. This is necessary for > > + * correct cq notification semantics. > > + */ > > + { > > + volatile char c; > > + c = shared->armed; > > + } > > If you're reading across PCI you should be using readb(). agreed. > > > + qp->adapter_handle = reply->qp_handle; > > + qp->state = IB_QPS_RESET; > > + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; > > + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; > > + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; > > whitespace damage alert it's all over... vi vs. emacs wars. > > > +#define assert(expr) \ > > + if(!(expr)) { \ > > + printk(KERN_ERR PFX "Assertion failed! 
%s, %s, %s, line %d\n",\ > > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > > + } > > probably just use BUG_ON() -- then you get a traceback too. > yep. > > +struct c2_adapter_pci_regs { > > + char reg_magic[8]; > > + u32 version; > > + u32 ivn; > > + u32 pci_window_size; > > + u32 q0_q_size; > > Indent with tabs not 4 spaces (lots of other places too) > > > +static inline u32 c2_read32(const void __iomem *addr) > > +{ > > + return readl(addr); > > +} > > Any reason for not using readl() directly (and similarly for all the > other c2_readxx c2_writexx funcs)? we used to share firmware/device driver code... These could be reduced to their Linux equivalent readl, etc.... > > - R. From mst at mellanox.co.il Tue Jan 24 00:10:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 10:10:50 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123234850.GS5074@us.ibm.com> References: <20060123234850.GS5074@us.ibm.com> Message-ID: <20060124081050.GC26123@mellanox.co.il> Quoting r. Nishanth Aravamudan : > > What happends if you change cycles_t from long long to long? > > Could you try running the clock_test utility? > > I'll try the latter first (I found the usage from the file and it is > built by make). Could you send me a patch to do the former, in case I > need to? It can be a patch that applies directly in the perftest > directory. > > Thanks, > Nish > Just update get_clock.h to r5163. -- MST
From mst at mellanox.co.il Tue Jan 24 01:57:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 11:57:13 +0200 Subject: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: References: Message-ID: <20060124095713.GP26724@mellanox.co.il> Quoting Roland Dreier : > However, this is not quite enough to make things work on > all powerpc systems, because the timebase does not necessarily run at > the same speed as the CPU. For example, on an IBM JS20 blade, > clock_test prints > > 1 sec = 6536.8 usec > 1 sec = 6537.05 usec > > (both as a 32-bit and 64-bit executable) because, as /proc/cpuinfo shows: > > processor : 0 > cpu : PPC970FX, altivec supported > clock : 2194.624509MHz > revision : 3.0 > > processor : 1 > cpu : PPC970FX, altivec supported > clock : 2194.624509MHz > revision : 3.0 > > timebase : 14318000 > machine : CHRP IBM,8842-P2C > > the timebase runs at about 14.3 MHz, or approx 153 times slower than > the CPU clock. Right, the PPC book clearly says "Since the update frequency of the Time Base is implementation-dependent, the algorithm for converting the current value in the Time Base to time of day is also implementation-dependent." But I was hoping this would be 1:1 for most systems. > I'm not sure how you want to fix this in perftest. I plan on implementing a small program that will use msleep to measure the timebase rate.
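One way such a measurement might be reduced to a rate, as a rough sketch only: sleep for several different lengths, record how many timebase ticks elapsed each time, and take the least-squares slope of ticks over seconds. The function names and the sample numbers below are made up for illustration (the samples assume the 14318000 Hz timebase from the /proc/cpuinfo output quoted above plus a fixed 500-tick overhead per measurement); this is not code from perftest.

```c
#include <stddef.h>

/*
 * Least-squares slope of ticks over seconds, i.e. the estimated
 * timebase rate in Hz. Returns 0.0 when the samples cannot define
 * a slope (fewer than two, or all the same sleep length).
 */
static double estimate_timebase_hz(const double *seconds, const double *ticks,
				   size_t n)
{
	double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
	double denom;
	size_t i;

	for (i = 0; i < n; i++) {
		sx += seconds[i];
		sy += ticks[i];
		sxx += seconds[i] * seconds[i];
		sxy += seconds[i] * ticks[i];
	}

	denom = (double)n * sxx - sx * sx;
	if (n < 2 || denom == 0.0)
		return 0.0;

	return ((double)n * sxy - sx * sy) / denom;
}

/* Made-up samples: 14318000 ticks/s plus a constant 500-tick overhead. */
static double demo_estimate(void)
{
	const double seconds[] = { 0.1, 0.2, 0.4 };
	const double ticks[] = { 1432300.0, 2864100.0, 5727700.0 };

	return estimate_timebase_hz(seconds, ticks, 3);
}
```

Fitting a slope rather than dividing a single tick count by a single sleep length means any constant per-measurement overhead ends up in the intercept instead of skewing the rate.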
(Something like linear regression should do it). Tests will get a new option to pass in the timebase rate rather than guessing it from /proc/cpuinfo. -- MST From krkumar2 at in.ibm.com Tue Jan 24 02:42:13 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 24 Jan 2006 16:12:13 +0530 Subject: [openib-general] [PATCH] RFC: AMSO1100 iWARP Driver In-Reply-To: <1138079753.4758.40.camel@strider.opengridcomputing.com> Message-ID: Hi Tom, - c2_create_qp() should kfree(qp) on error and not pd. Some very (very) MINOR nits : - c2_pd_alloc() should be called c2_pd_id_alloc() ? And why is might_sleep() required for this and c2_pd_free() ? Shouldn't that be in c2_alloc_pd() before the kmalloc() ? - netevent_notifier : why is it using KERN_ERR and not KERN_INFO ? - c2_mq_init() does a return at the end of routine, can be removed : + return; - Remove typecasts of void *, eg : + reply_vq = (struct c2_mq *)c2dev->qptr_array[mq_index]; - Change (for consistency and to be clear) : + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); to + rx_ring->start = kmalloc(sizeof(*rx_ring->start) * rx_ring->count, GFP_KERNEL); - In c2_tx_clean, you can do : + if (netif_queue_stopped(c2_port->netdev) && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); - Lots of + if (err) { + break; + } (braces for one line, not a big deal but can remove) - c2_init_qp_table() can be written : + if (err) + c2_alloc_cleanup(&c2dev->qp_table.alloc); + return err; removing some redundant returns. Thanks, - KK openib-general-bounces at openib.org wrote on 01/24/2006 10:45:52 AM: > > > Given some of the discussion re: support for the AMSO1100, enclosed is a > patch for an OpenIB provider in support of the AMSO1100. While we use > these devices extensively for testing of iWARP support at OGC, the > driver has not seen anywhere near the kind of attention that the mthca > driver has. > > This patch requires the previously submitted iWARP core support and CMA > patch. 
> > Please review and offer suggestions as to what we can do to improve it. > There are some known issues with ULP that do not filter based on node > type and can become confused and crash when loading and unloading this > driver. > > Patches are available for these ULP add_one and remove_one handlers, but > these are trivial and can be considered separately. > > Index: Kconfig > =================================================================== > --- Kconfig (revision 5098) > +++ Kconfig (working copy) > @@ -32,6 +32,8 @@ > > source "drivers/infiniband/hw/mthca/Kconfig" > > +source "drivers/infiniband/hw/amso1100/Kconfig" > + > source "drivers/infiniband/hw/ehca/Kconfig" > > source "drivers/infiniband/ulp/ipoib/Kconfig" > Index: Makefile > =================================================================== > --- Makefile (revision 5098) > +++ Makefile (working copy) > @@ -1,6 +1,7 @@ > obj-$(CONFIG_INFINIBAND) += core/ > obj-$(CONFIG_IPATH_CORE) += hw/ipath/ > obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ > +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ > obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ > obj-$(CONFIG_INFINIBAND_SDP) += ulp/sdp/ > obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ > Index: hw/amso1100/cc_ae.h > =================================================================== > --- hw/amso1100/cc_ae.h (revision 0) > +++ hw/amso1100/cc_ae.h (revision 0) > @@ -0,0 +1,108 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#ifndef _CC_AE_H_ > +#define _CC_AE_H_ > + > +/* > + * WARNING: If you change this file, also bump CC_IVN_BASE > + * in common/include/clustercore/cc_ivn.h. > + */ > + > +/* > + * Asynchronous Event Identifiers > + * > + * These start at 0x80 only so it's obvious from inspection that > + * they are not work-request statuses. This isn't critical. > + * > + * NOTE: these event id's must fit in eight bits. 
> + */ > +typedef enum { > + CCAE_REMOTE_SHUTDOWN = 0x80, > + CCAE_ACTIVE_CONNECT_RESULTS, > + CCAE_CONNECTION_REQUEST, > + CCAE_LLP_CLOSE_COMPLETE, > + CCAE_TERMINATE_MESSAGE_RECEIVED, > + CCAE_LLP_CONNECTION_RESET, > + CCAE_LLP_CONNECTION_LOST, > + CCAE_LLP_SEGMENT_SIZE_INVALID, > + CCAE_LLP_INVALID_CRC, > + CCAE_LLP_BAD_FPDU, > + CCAE_INVALID_DDP_VERSION, > + CCAE_INVALID_RDMA_VERSION, > + CCAE_UNEXPECTED_OPCODE, > + CCAE_INVALID_DDP_QUEUE_NUMBER, > + CCAE_RDMA_READ_NOT_ENABLED, > + CCAE_RDMA_WRITE_NOT_ENABLED, > + CCAE_RDMA_READ_TOO_SMALL, > + CCAE_NO_L_BIT, > + CCAE_TAGGED_INVALID_STAG, > + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, > + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, > + CCAE_TAGGED_INVALID_PD, > + CCAE_WRAP_ERROR, > + CCAE_BAD_CLOSE, > + CCAE_BAD_LLP_CLOSE, > + CCAE_INVALID_MSN_RANGE, > + CCAE_INVALID_MSN_GAP, > + CCAE_IRRQ_OVERFLOW, > + CCAE_IRRQ_MSN_GAP, > + CCAE_IRRQ_MSN_RANGE, > + CCAE_IRRQ_INVALID_STAG, > + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, > + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, > + CCAE_IRRQ_INVALID_PD, > + CCAE_IRRQ_WRAP_ERROR, > + CCAE_CQ_SQ_COMPLETION_OVERFLOW, > + CCAE_CQ_RQ_COMPLETION_ERROR, > + CCAE_QP_SRQ_WQE_ERROR, > + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, > + CCAE_CQ_OVERFLOW, > + CCAE_CQ_OPERATION_ERROR, > + CCAE_SRQ_LIMIT_REACHED, > + CCAE_QP_RQ_LIMIT_REACHED, > + CCAE_SRQ_CATASTROPHIC_ERROR, > + CCAE_RNIC_CATASTROPHIC_ERROR > + /* WARNING If you add more id's, make sure their values fit in eight bits. 
*/ > +} cc_event_id_t; > + > +/* > + * Resource Indicators and Identifiers > + */ > +typedef enum { > + CC_RES_IND_QP = 1, > + CC_RES_IND_EP, > + CC_RES_IND_CQ, > + CC_RES_IND_SRQ, > +} cc_resource_indicator_t; > + > +#endif /* _CC_AE_H_ */ > Index: hw/amso1100/Kconfig > =================================================================== > --- hw/amso1100/Kconfig (revision 0) > +++ hw/amso1100/Kconfig (revision 0) > @@ -0,0 +1,15 @@ > +config INFINIBAND_AMSO1100 > + tristate "Ammasso 1100 HCA support" > + depends on PCI && INFINIBAND > + ---help--- > + This is a low-level driver for the Ammasso 1100 host > + channel adapter (HCA). > + > +config INFINIBAND_AMSO1100_DEBUG > + bool "Verbose debugging output" > + depends on INFINIBAND_AMSO1100 > + default n > + ---help--- > + This option causes the amso1100 driver to produce a bunch of > + debug messages. Select this if you are developing the driver > + or trying to diagnose a problem. > Index: hw/amso1100/c2_intr.c > =================================================================== > --- hw/amso1100/c2_intr.c (revision 0) > +++ hw/amso1100/c2_intr.c (revision 0) > @@ -0,0 +1,177 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#include "c2.h" > +#include "c2_vq.h" > + > +static void handle_mq(struct c2_dev *c2dev, u32 index); > +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); > + > +/* > + * Handle RNIC interrupts > + */ > +void > +c2_rnic_interrupt(struct c2_dev *c2dev) > +{ > + unsigned int mq_index; > + > + while (c2dev->hints_read != be16_to_cpu(c2dev->hint_count)) { > + mq_index = c2_read32(c2dev->regs + PCI_BAR0_HOST_HINT); > + if (mq_index & 0x80000000) { > + break; > + } > + > + c2dev->hints_read++; > + handle_mq(c2dev, mq_index); > + } > + > +} > + > +/* > + * Top level MQ handler > + */ > +static void > +handle_mq(struct c2_dev *c2dev, u32 mq_index) > +{ > + if (c2dev->qptr_array[mq_index] == NULL) { > + dprintk(KERN_INFO "handle_mq: stray activity for mq_index=%d\n", mq_index); > + return; > + } > + > + switch (mq_index) { > + case (0): > + /* > + * An index of 0 in the activity queue > + * indicates the req vq now has messages > + * available... > + * > + * Wake up any waiters waiting on req VQ > + * message availability. 
> + */ > + wake_up(&c2dev->req_vq_wo); > + break; > + case (1): > + handle_vq(c2dev, mq_index); > + break; > + case (2): > + spin_lock(&c2dev->aeq_lock); > + c2_ae_event(c2dev, mq_index); > + spin_unlock(&c2dev->aeq_lock); > + break; > + default: > + c2_cq_event(c2dev, mq_index); > + break; > + } > + > + return; > +} > + > +/* > + * Handles verbs WR replies. > + */ > +static void > +handle_vq(struct c2_dev *c2dev, u32 mq_index) > +{ > + void *adapter_msg, *reply_msg; > + ccwr_hdr_t *host_msg; > + ccwr_hdr_t tmp; > + struct c2_mq *reply_vq; > + struct c2_vq_req* req; > + > + reply_vq = (struct c2_mq *)c2dev->qptr_array[mq_index]; > + > + { > + > + /* > + * get next msg from mq_index into adapter_msg. > + * don't free it yet. > + */ > + adapter_msg = c2_mq_consume(reply_vq); > + dprintk("handle_vq: adapter_msg=%p\n", adapter_msg); > + if (adapter_msg == NULL) { > + return; > + } > + > + host_msg = vq_repbuf_alloc(c2dev); > + > + /* > + * If we can't get a host buffer, then we'll still > + * wakeup the waiter, we just won't give him the msg. > + * It is assumed the waiter will deal with this... > + */ > + if (!host_msg) { > + dprintk("handle_vq: no repbufs!\n"); > + > + /* > + * just copy the WR header into a local variable. > + * this allows us to still demux on the context > + */ > + host_msg = &tmp; > + memcpy(host_msg, adapter_msg, sizeof(tmp)); > + reply_msg = NULL; > + } else { > + memcpy(host_msg, adapter_msg, reply_vq->msg_size); > + reply_msg = host_msg; > + } > + > + /* > + * consume the msg from the MQ > + */ > + c2_mq_free(reply_vq); > + > + /* > + * wakeup the waiter. > + */ > + req = (struct c2_vq_req *)(unsigned long)host_msg->context; > + if (req == NULL) { > + /* > + * We should never get here, as the adapter should > + * never send us a reply that we're not expecting. 
> + */ > + vq_repbuf_free(c2dev, host_msg); > + dprintk("handle_vq: UNEXPECTEDLY got NULL req\n"); > + return; > + } > + req->reply_msg = (u64)(unsigned long)(reply_msg); > + atomic_set(&req->reply_ready, 1); > + dprintk("handle_vq: wakeup req %p\n", req); > + wake_up(&req->wait_object); > + > + /* > + * If the request was cancelled, then this put will > + * free the vq_req memory...and reply_msg!!! > + */ > + vq_req_put(c2dev, req); > + } > + > +} > + > Index: hw/amso1100/c2_mq.c > =================================================================== > --- hw/amso1100/c2_mq.c (revision 0) > +++ hw/amso1100/c2_mq.c (revision 0) > @@ -0,0 +1,182 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#include "c2.h" > +#include "c2_mq.h" > + > +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size > +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) > + > +void * > +c2_mq_alloc(struct c2_mq *q) > +{ > + assert(q); > + assert(q->magic == C2_MQ_MAGIC); > + assert(q->type == C2_MQ_ADAPTER_TARGET); > + > + if (c2_mq_full(q)) { > + return NULL; > + } else { > +#ifdef C2_DEBUG > + ccwr_hdr_t *m = (ccwr_hdr_t*)(q->msg_pool + q->priv * q->msg_size); > +#ifdef CCMSGMAGIC > + assert(m->magic == be32_to_cpu(~CCWR_MAGIC)); > + m->magic = cpu_to_be32(CCWR_MAGIC); > +#endif > + dprintk("c2_mq_alloc %p\n", m); > + return m; > +#else > + return q->msg_pool + q->priv * q->msg_size; > +#endif > + } > +} > + > +void > +c2_mq_produce(struct c2_mq *q) > +{ > + assert(q); > + assert(q->magic == C2_MQ_MAGIC); > + assert(q->type == C2_MQ_ADAPTER_TARGET); > + > + if (!c2_mq_full(q)) { > + BUMP(q, q->priv); > + q->hint_count++; > + /* Update peer's offset. 
*/ > + q->peer->shared = cpu_to_be16(q->priv); > + } > +} > + > +void * > +c2_mq_consume(struct c2_mq *q) > +{ > + assert(q); > + assert(q->magic == C2_MQ_MAGIC); > + assert(q->type == C2_MQ_HOST_TARGET); > + > + if (c2_mq_empty(q)) { > + return NULL; > + } else { > +#ifdef C2_DEBUG > + ccwr_hdr_t *m = (ccwr_hdr_t*) > + (q->msg_pool + q->priv * q->msg_size); > +#ifdef CCMSGMAGIC > + assert(m->magic == be32_to_cpu(CCWR_MAGIC)); > +#endif > + dprintk("c2_mq_consume %p\n", m); > + return m; > +#else > + return q->msg_pool + q->priv * q->msg_size; > +#endif > + } > +} > + > +void > +c2_mq_free(struct c2_mq *q) > +{ > + assert(q); > + assert(q->magic == C2_MQ_MAGIC); > + assert(q->type == C2_MQ_HOST_TARGET); > + > + if (!c2_mq_empty(q)) { > +#ifdef C2_DEBUG > +{ > + dprintk("c2_mq_free %p\n", (ccwr_hdr_t*)(q->msg_pool + q->priv * q->msg_size)); > +} > +#endif > + > +#ifdef CCMSGMAGIC > +{ > + ccwr_hdr_t *m = (ccwr_hdr_t*) > + (q->msg_pool + q->priv * q->msg_size); > + m->magic = cpu_to_be32(~CCWR_MAGIC); > +} > +#endif > + BUMP(q, q->priv); > + /* Update peer's offset. */ > + q->peer->shared = cpu_to_be16(q->priv); > + } > +} > + > + > +void > +c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) > +{ > + assert(q); > + assert(q->magic == C2_MQ_MAGIC); > + assert(q->type == C2_MQ_ADAPTER_TARGET); > + > + while (wqe_count--) { > + assert(!c2_mq_empty(q)); > + BUMP_SHARED(q, *q->shared); > + } > +} > + > + > +u32 > +c2_mq_count(struct c2_mq *q) > +{ > + s32 count; > + > + assert(q); > + if (q->type == C2_MQ_HOST_TARGET) { > + count = be16_to_cpu(*q->shared) - q->priv; > + } else { > + count = q->priv - be16_to_cpu(*q->shared); > + } > + > + if (count < 0) { > + count += q->q_size; > + } > + > + return (u32)count; > +} > + > +void > +c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, > + u32 msg_size, u8 *pool_start, u16 *peer, > + u32 type) > +{ > + assert(q->shared); > + > + /* This code assumes the byte swapping has already been done! 
*/ > + q->index = index; > + q->q_size = q_size; > + q->msg_size = msg_size; > + q->msg_pool = pool_start; > + q->peer = (struct c2_mq_shared *)peer; > + q->magic = C2_MQ_MAGIC; > + q->type = type; > + q->priv = 0; > + q->hint_count = 0; > + return; > +} > + > Index: hw/amso1100/cc_wr.h > =================================================================== > --- hw/amso1100/cc_wr.h (revision 0) > +++ hw/amso1100/cc_wr.h (revision 0) > @@ -0,0 +1,1340 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > +#ifndef _CC_WR_H_ > +#define _CC_WR_H_ > +#include "cc_types.h" > +/* > + * WARNING: If you change this file, also bump CC_IVN_BASE > + * in common/include/clustercore/cc_ivn.h. > + */ > + > +#ifdef CCDEBUG > +#define CCWR_MAGIC 0xb07700b0 > +#endif > + > +#define WR_BUILD_STR_LEN 64 > + > +#ifdef _MSC_VER > +#define PACKED > +#pragma pack(push) > +#pragma pack(1) > +#define __inline__ __inline > +#else > +#define PACKED __attribute__ ((packed)) > +#endif > + > +/* > + * WARNING: All of these structs need to align any 64bit types on > + * 64 bit boundaries! 64bit types include u64 and u64. > + */ > + > +/* > + * Clustercore Work Request Header. Be sensitive to field layout > + * and alignment. > + */ > +typedef struct { > + /* wqe_count is part of the cqe. It is put here so the > + * adapter can write to it while the wr is pending without > + * clobbering part of the wr. This word need not be dma'd > + * from the host to adapter by libccil, but we copy it anyway > + * to make the memcpy to the adapter better aligned. > + */ > + u32 wqe_count; > + > + /* Put these fields next so that later 32- and 64-bit > + * quantities are naturally aligned. 
> + */ > + u8 id; > + u8 result; /* adapter -> host */ > + u8 sge_count; /* host -> adapter */ > + u8 flags; /* host -> adapter */ > + > + u64 context; > +#ifdef CCMSGMAGIC > + u32 magic; > + u32 pad; > +#endif > +} PACKED ccwr_hdr_t; > + > +/* > + *------------------------ RNIC ------------------------ > + */ > + > +/* > + * WR_RNIC_OPEN > + */ > + > +/* > + * Flags for the RNIC WRs > + */ > +typedef enum { > + RNIC_IRD_STATIC = 0x0001, > + RNIC_ORD_STATIC = 0x0002, > + RNIC_QP_STATIC = 0x0004, > + RNIC_SRQ_SUPPORTED = 0x0008, > + RNIC_PBL_BLOCK_MODE = 0x0010, > + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, > + RNIC_CQ_OVF_DETECTED = 0x0040, > + RNIC_PRIV_MODE = 0x0080 > +} PACKED cc_rnic_flags_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u16 flags; /* See cc_rnic_flags_t */ > + u16 port_num; > +} PACKED ccwr_rnic_open_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_open_rep_t; > + > +typedef union { > + ccwr_rnic_open_req_t req; > + ccwr_rnic_open_rep_t rep; > +} PACKED ccwr_rnic_open_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_query_req_t; > + > +/* > + * WR_RNIC_QUERY > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u32 vendor_id; > + u32 part_number; > + u32 hw_version; > + u32 fw_ver_major; > + u32 fw_ver_minor; > + u32 fw_ver_patch; > + char fw_ver_build_str[WR_BUILD_STR_LEN]; > + u32 max_qps; > + u32 max_qp_depth; > + u32 max_srq_depth; > + u32 max_send_sgl_depth; > + u32 max_rdma_sgl_depth; > + u32 max_cqs; > + u32 max_cq_depth; > + u32 max_cq_event_handlers; > + u32 max_mrs; > + u32 max_pbl_depth; > + u32 max_pds; > + u32 max_global_ird; > + u32 max_global_ord; > + u32 max_qp_ird; > + u32 max_qp_ord; > + u32 flags; /* See cc_rnic_flags_t */ > + u32 max_mws; > + u32 pbe_range_low; > + u32 pbe_range_high; > + u32 max_srqs; > + u32 page_size; > +} PACKED ccwr_rnic_query_rep_t; > + > +typedef union { > + ccwr_rnic_query_req_t 
req; > + ccwr_rnic_query_rep_t rep; > +} PACKED ccwr_rnic_query_t; > + > +/* > + * WR_RNIC_GETCONFIG > + */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 option; /* see cc_getconfig_cmd_t */ > + u64 reply_buf; > + u32 reply_buf_len; > +} PACKED ccwr_rnic_getconfig_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 option; /* see cc_getconfig_cmd_t */ > + u32 count_len; /* length of the number of addresses configured */ > +} PACKED ccwr_rnic_getconfig_rep_t; > + > +typedef union { > + ccwr_rnic_getconfig_req_t req; > + ccwr_rnic_getconfig_rep_t rep; > +} PACKED ccwr_rnic_getconfig_t; > + > +/* > + * WR_RNIC_SETCONFIG > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 option; /* See cc_setconfig_cmd_t */ > + /* variable data and pad See cc_netaddr_t and > + * cc_route_t > + */ > + u8 data[0]; > +} PACKED ccwr_rnic_setconfig_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_rnic_setconfig_rep_t; > + > +typedef union { > + ccwr_rnic_setconfig_req_t req; > + ccwr_rnic_setconfig_rep_t rep; > +} PACKED ccwr_rnic_setconfig_t; > + > +/* > + * WR_RNIC_CLOSE > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_close_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_rnic_close_rep_t; > + > +typedef union { > + ccwr_rnic_close_req_t req; > + ccwr_rnic_close_rep_t rep; > +} PACKED ccwr_rnic_close_t; > + > +/* > + *------------------------ CQ ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_ht; > + u64 user_context; > + u64 msg_pool; > + u32 rnic_handle; > + u32 msg_size; > + u32 depth; > +} PACKED ccwr_cq_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 mq_index; > + u32 adapter_shared; > + u32 cq_handle; > +} PACKED ccwr_cq_create_rep_t; > + > +typedef union { > + ccwr_cq_create_req_t req; > + ccwr_cq_create_rep_t rep; > +} PACKED ccwr_cq_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > 
+ u32 rnic_handle; > + u32 cq_handle; > + u32 new_depth; > + u64 new_msg_pool; > +} PACKED ccwr_cq_modify_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cq_modify_rep_t; > + > +typedef union { > + ccwr_cq_modify_req_t req; > + ccwr_cq_modify_rep_t rep; > +} PACKED ccwr_cq_modify_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 cq_handle; > +} PACKED ccwr_cq_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cq_destroy_rep_t; > + > +typedef union { > + ccwr_cq_destroy_req_t req; > + ccwr_cq_destroy_rep_t rep; > +} PACKED ccwr_cq_destroy_t; > + > +/* > + *------------------------ PD ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_pd_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_pd_alloc_rep_t; > + > +typedef union { > + ccwr_pd_alloc_req_t req; > + ccwr_pd_alloc_rep_t rep; > +} PACKED ccwr_pd_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_pd_dealloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_pd_dealloc_rep_t; > + > +typedef union { > + ccwr_pd_dealloc_req_t req; > + ccwr_pd_dealloc_rep_t rep; > +} PACKED ccwr_pd_dealloc_t; > + > +/* > + *------------------------ SRQ ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_ht; > + u64 user_context; > + u32 rnic_handle; > + u32 srq_depth; > + u32 srq_limit; > + u32 sgl_depth; > + u32 pd_id; > +} PACKED ccwr_srq_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 srq_depth; > + u32 sgl_depth; > + u32 msg_size; > + u32 mq_index; > + u32 mq_start; > + u32 srq_handle; > +} PACKED ccwr_srq_create_rep_t; > + > +typedef union { > + ccwr_srq_create_req_t req; > + ccwr_srq_create_rep_t rep; > +} PACKED ccwr_srq_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 srq_handle; > +} PACKED 
ccwr_srq_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_srq_destroy_rep_t; > + > +typedef union { > + ccwr_srq_destroy_req_t req; > + ccwr_srq_destroy_rep_t rep; > +} PACKED ccwr_srq_destroy_t; > + > +/* > + *------------------------ QP ------------------------ > + */ > +typedef enum { > + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ > + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ > + QP_MW_BIND = 0x00000004, /* MWs enabled */ > + QP_ZERO_STAG = 0x00000008, /* enabled? */ > + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ > + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ > + /* enabled? */ > +} PACKED ccwr_qp_flags_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_sq_ht; > + u64 shared_rq_ht; > + u64 user_context; > + u32 rnic_handle; > + u32 sq_cq_handle; > + u32 rq_cq_handle; > + u32 sq_depth; > + u32 rq_depth; > + u32 srq_handle; > + u32 srq_limit; > + u32 flags; /* see ccwr_qp_flags_t */ > + u32 send_sgl_depth; > + u32 recv_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 ord; > + u32 ird; > + u32 pd_id; > +} PACKED ccwr_qp_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 sq_depth; > + u32 rq_depth; > + u32 send_sgl_depth; > + u32 recv_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 ord; > + u32 ird; > + u32 sq_msg_size; > + u32 sq_mq_index; > + u32 sq_mq_start; > + u32 rq_msg_size; > + u32 rq_mq_index; > + u32 rq_mq_start; > + u32 qp_handle; > +} PACKED ccwr_qp_create_rep_t; > + > +typedef union { > + ccwr_qp_create_req_t req; > + ccwr_qp_create_rep_t rep; > +} PACKED ccwr_qp_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > +} PACKED ccwr_qp_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u32 rnic_handle; > + u32 sq_depth; > + u32 rq_depth; > + u32 send_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 recv_sgl_depth; > + u32 ord; > + u32 ird; > + u16 qp_state; > + u16 flags; 
/* see ccwr_qp_flags_t */ > + u32 qp_id; > + u32 local_addr; > + u32 remote_addr; > + u16 local_port; > + u16 remote_port; > + u32 terminate_msg_length; /* 0 if not present */ > + u8 data[0]; > + /* Terminate Message in-line here. */ > +} PACKED ccwr_qp_query_rep_t; > + > +typedef union { > + ccwr_qp_query_req_t req; > + ccwr_qp_query_rep_t rep; > +} PACKED ccwr_qp_query_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 stream_msg; > + u32 stream_msg_length; > + u32 rnic_handle; > + u32 qp_handle; > + u32 next_qp_state; > + u32 ord; > + u32 ird; > + u32 sq_depth; > + u32 rq_depth; > + u32 llp_ep_handle; > +} PACKED ccwr_qp_modify_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 ord; > + u32 ird; > + u32 sq_depth; > + u32 rq_depth; > + u32 sq_msg_size; > + u32 sq_mq_index; > + u32 sq_mq_start; > + u32 rq_msg_size; > + u32 rq_mq_index; > + u32 rq_mq_start; > +} PACKED ccwr_qp_modify_rep_t; > + > +typedef union { > + ccwr_qp_modify_req_t req; > + ccwr_qp_modify_rep_t rep; > +} PACKED ccwr_qp_modify_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > +} PACKED ccwr_qp_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_qp_destroy_rep_t; > + > +typedef union { > + ccwr_qp_destroy_req_t req; > + ccwr_qp_destroy_rep_t rep; > +} PACKED ccwr_qp_destroy_t; > + > +/* > + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can > + * only be posted when a QP is in IDLE state. After the connect request is > + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. > + * No synchronous reply from adapter to this WR. The results of > + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS > + * See ccwr_ae_active_connect_results_t > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > + u32 remote_addr; > + u16 remote_port; > + u16 pad; > + u32 private_data_length; > + u8 private_data[0]; /* Private data in-line. 
*/ > +} PACKED ccwr_qp_connect_req_t; > + > +typedef struct { > + ccwr_qp_connect_req_t req; > + /* no synchronous reply. */ > +} PACKED ccwr_qp_connect_t; > + > + > +/* > + *------------------------ MM ------------------------ > + */ > + > +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pbl_depth; > + u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > +} PACKED ccwr_nsmr_stag_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_stag_alloc_rep_t; > + > +typedef union { > + ccwr_nsmr_stag_alloc_req_t req; > + ccwr_nsmr_stag_alloc_rep_t rep; > +} PACKED ccwr_nsmr_stag_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 pd_id; > + u32 pbl_depth; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_register_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_register_rep_t; > + > +typedef union { > + ccwr_nsmr_register_req_t req; > + ccwr_nsmr_register_rep_t rep; > +} PACKED ccwr_nsmr_register_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 flags; /* See ccwr_mr_flags_t */ > + u32 stag_index; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_pbl_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_nsmr_pbl_rep_t; > + > +typedef union { > + ccwr_nsmr_pbl_req_t req; > + ccwr_nsmr_pbl_rep_t rep; > +} PACKED ccwr_nsmr_pbl_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_mr_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u8 stag_key; > + u8 pad[3]; > + 
u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > + u32 pbl_depth; > +} PACKED ccwr_mr_query_rep_t; > + > +typedef union { > + ccwr_mr_query_req_t req; > + ccwr_mr_query_rep_t rep; > +} PACKED ccwr_mr_query_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_mw_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u8 stag_key; > + u8 pad[3]; > + u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > +} PACKED ccwr_mw_query_rep_t; > + > +typedef union { > + ccwr_mw_query_req_t req; > + ccwr_mw_query_rep_t rep; > +} PACKED ccwr_mw_query_t; > + > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_stag_dealloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_stag_dealloc_rep_t; > + > +typedef union { > + ccwr_stag_dealloc_req_t req; > + ccwr_stag_dealloc_rep_t rep; > +} PACKED ccwr_stag_dealloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 stag_index; > + u32 pd_id; > + u32 pbl_depth; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + u32 pad1; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_reregister_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_reregister_rep_t; > + > +typedef union { > + ccwr_nsmr_reregister_req_t req; > + ccwr_nsmr_reregister_rep_t rep; > +} PACKED ccwr_nsmr_reregister_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 stag_index; > + u32 pd_id; > +} PACKED ccwr_smr_register_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 stag_index; > +} PACKED ccwr_smr_register_rep_t; > + > +typedef union { > + ccwr_smr_register_req_t req; > + 
ccwr_smr_register_rep_t rep; > +} PACKED ccwr_smr_register_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_mw_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 stag_index; > +} PACKED ccwr_mw_alloc_rep_t; > + > +typedef union { > + ccwr_mw_alloc_req_t req; > + ccwr_mw_alloc_rep_t rep; > +} PACKED ccwr_mw_alloc_t; > + > +/* > + *------------------------ WRs ----------------------- > + */ > + > +typedef struct { > + ccwr_hdr_t hdr; /* Has status and WR Type */ > +} PACKED ccwr_user_hdr_t; > + > +/* Completion queue entry. */ > +typedef struct { > + ccwr_hdr_t hdr; /* Has status and WR Type */ > + u64 qp_user_context;/* cc_user_qp_t * */ > + u32 qp_state; /* Current QP State */ > + u32 handle; /* QPID or EP Handle */ > + u32 bytes_rcvd; /* valid for RECV WCs */ > + u32 stag; > +} PACKED ccwr_ce_t; > + > + > +/* > + * Flags used for all post-sq WRs. These must fit in the flags > + * field of the ccwr_hdr_t (eight bits). > + */ > +typedef enum { > + SQ_SIGNALED = 0x01, > + SQ_READ_FENCE = 0x02, > + SQ_FENCE = 0x04, > +} PACKED cc_sq_flags_t; > + > +/* > + * Common fields for all post-sq WRs. Namely the standard header and a > + * secondary header with fields common to all post-sq WRs. > + */ > +typedef struct { > + ccwr_user_hdr_t user_hdr; > +} PACKED cc_sq_hdr_t; > + > +/* > + * Same as above but for post-rq WRs. > + */ > +typedef struct { > + ccwr_user_hdr_t user_hdr; > +} PACKED cc_rq_hdr_t; > + > +/* > + * use the same struct for all sends. 
> + */ > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u32 sge_len; > + u32 remote_stag; > + u8 data[0]; /* SGE array */ > +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, > ccwr_send_se_inv_req_t; > + > +typedef ccwr_ce_t ccwr_send_rep_t; > + > +typedef union { > + ccwr_send_req_t req; > + ccwr_send_rep_t rep; > +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, ccwr_send_se_inv_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 remote_to; > + u32 remote_stag; > + u32 sge_len; > + u8 data[0]; /* SGE array */ > +} PACKED ccwr_rdma_write_req_t; > + > +typedef ccwr_ce_t ccwr_rdma_write_rep_t; > + > +typedef union { > + ccwr_rdma_write_req_t req; > + ccwr_rdma_write_rep_t rep; > +} PACKED ccwr_rdma_write_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 local_to; > + u64 remote_to; > + u32 local_stag; > + u32 remote_stag; > + u32 length; > +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t; > + > +typedef ccwr_ce_t ccwr_rdma_read_rep_t; > + > +typedef union { > + ccwr_rdma_read_req_t req; > + ccwr_rdma_read_rep_t rep; > +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 va; > + u8 stag_key; > + u8 pad[3]; > + u32 mw_stag_index; > + u32 mr_stag_index; > + u32 length; > + u32 flags; /* see ccwr_mr_flags_t; */ > +} PACKED ccwr_mw_bind_req_t; > + > +typedef ccwr_ce_t ccwr_mw_bind_rep_t; > + > +typedef union { > + ccwr_mw_bind_req_t req; > + ccwr_mw_bind_rep_t rep; > +} PACKED ccwr_mw_bind_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 va; > + u8 stag_key; > + u8 pad[3]; > + u32 stag_index; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_fastreg_req_t; > + > +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t; > + > +typedef union { > + ccwr_nsmr_fastreg_req_t req; > + ccwr_nsmr_fastreg_rep_t rep; > +} PACKED ccwr_nsmr_fastreg_t; > + > +typedef 
struct { > + cc_sq_hdr_t sq_hdr; > + u8 stag_key; > + u8 pad[3]; > + u32 stag_index; > +} PACKED ccwr_stag_invalidate_req_t; > + > +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t; > + > +typedef union { > + ccwr_stag_invalidate_req_t req; > + ccwr_stag_invalidate_rep_t rep; > +} PACKED ccwr_stag_invalidate_t; > + > +typedef union { > + cc_sq_hdr_t sq_hdr; > + ccwr_send_req_t send; > + ccwr_send_se_req_t send_se; > + ccwr_send_inv_req_t send_inv; > + ccwr_send_se_inv_req_t send_se_inv; > + ccwr_rdma_write_req_t rdma_write; > + ccwr_rdma_read_req_t rdma_read; > + ccwr_mw_bind_req_t mw_bind; > + ccwr_nsmr_fastreg_req_t nsmr_fastreg; > + ccwr_stag_invalidate_req_t stag_inv; > +} PACKED ccwr_sqwr_t; > + > + > +/* > + * RQ WRs > + */ > +typedef struct { > + cc_rq_hdr_t rq_hdr; > + u8 data[0]; /* array of SGEs */ > +} PACKED ccwr_rqwr_t, ccwr_recv_req_t; > + > +typedef ccwr_ce_t ccwr_recv_rep_t; > + > +typedef union { > + ccwr_recv_req_t req; > + ccwr_recv_rep_t rep; > +} PACKED ccwr_recv_t; > + > +/* > + * All AEs start with this header. Most AEs only need to convey the > + * information in the header. Some, like LLP connection events, need > + * more info. The union typedef ccwr_ae_t has all the possible AEs. > + * > + * hdr.context is the user_context from the rnic_open WR. NULL if this > + * is not affiliated with an rnic > + * > + * hdr.id is the AE identifier (e.g., CCAE_REMOTE_SHUTDOWN, > + * CCAE_LLP_CLOSE_COMPLETE) > + * > + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, CC_RES_IND_SRQ > + * > + * user_context is the context passed down when the host created the resource. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; /* user context for this res.
*/ > + u32 resource_type; /* see cc_resource_indicator_t */ > + u32 resource; /* handle for resource */ > + u32 qp_state; /* current QP State */ > +} PACKED ccwr_ae_hdr_t; > + > +/* > + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, > + * the adapter moves the QP into RTS state > + */ > +typedef struct { > + ccwr_ae_hdr_t ae_hdr; > + u32 laddr; > + u32 raddr; > + u16 lport; > + u16 rport; > + u32 private_data_length; > + u8 private_data[0]; /* data is in-line in the msg. */ > +} PACKED ccwr_ae_active_connect_results_t; > + > +/* > + * When connections are established by the stack (and the private data > + * MPA frame is received), the adapter will generate an event to the host. > + * The details of the connection, any private data, and the new connection > + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the > + * AE queue: > + */ > +typedef struct { > + ccwr_ae_hdr_t ae_hdr; > + u32 cr_handle; /* connreq handle (sock ptr) */ > + u32 laddr; > + u32 raddr; > + u16 lport; > + u16 rport; > + u32 private_data_length; > + u8 private_data[0]; /* data is in-line in the msg. */ > +} PACKED ccwr_ae_connection_request_t; > + > +typedef union { > + ccwr_ae_hdr_t ae_generic; > + ccwr_ae_active_connect_results_t ae_active_connect_results; > + ccwr_ae_connection_request_t ae_connection_request; > +} PACKED ccwr_ae_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 hint_count; > + u64 q0_host_shared; > + u64 q1_host_shared; > + u64 q1_host_msg_pool; > + u64 q2_host_shared; > + u64 q2_host_msg_pool; > +} PACKED ccwr_init_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_init_rep_t; > + > +typedef union { > + ccwr_init_req_t req; > + ccwr_init_rep_t rep; > +} PACKED ccwr_init_t; > + > +/* > + * For upgrading flash.
> + */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_flash_init_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 adapter_flash_buf_offset; > + u32 adapter_flash_len; > +} PACKED ccwr_flash_init_rep_t; > + > +typedef union { > + ccwr_flash_init_req_t req; > + ccwr_flash_init_rep_t rep; > +} PACKED ccwr_flash_init_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 len; > +} PACKED ccwr_flash_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 status; > +} PACKED ccwr_flash_rep_t; > + > +typedef union { > + ccwr_flash_req_t req; > + ccwr_flash_rep_t rep; > +} PACKED ccwr_flash_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 size; > +} PACKED ccwr_buf_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 offset; /* 0 if mem not available */ > + u32 size; /* 0 if mem not available */ > +} PACKED ccwr_buf_alloc_rep_t; > + > +typedef union { > + ccwr_buf_alloc_req_t req; > + ccwr_buf_alloc_rep_t rep; > +} PACKED ccwr_buf_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 offset; /* Must match value from alloc */ > + u32 size; /* Must match value from alloc */ > +} PACKED ccwr_buf_free_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_buf_free_rep_t; > + > +typedef union { > + ccwr_buf_free_req_t req; > + ccwr_buf_free_rep_t rep; > +} PACKED ccwr_buf_free_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 offset; > + u32 size; > + u32 type; > + u32 flags; > +} PACKED ccwr_flash_write_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 status; > +} PACKED ccwr_flash_write_rep_t; > + > +typedef union { > + ccwr_flash_write_req_t req; > + ccwr_flash_write_rep_t rep; > +} PACKED ccwr_flash_write_t; > + > +/* > + * Messages for LLP connection setup. > + */ > + > +/* > + * Listen Request. This allocates a listening endpoint to allow passive > + * connection setup. 
Newly established LLP connections are passed up > + * via an AE. See ccwr_ae_connection_request_t > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; /* returned in AEs. */ > + u32 rnic_handle; > + u32 local_addr; /* local addr, or 0 */ > + u16 local_port; /* 0 means "pick one" */ > + u16 pad; > + u32 backlog; /* traditional TCP listen backlog */ > +} PACKED ccwr_ep_listen_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 ep_handle; /* handle to new listening ep */ > + u16 local_port; /* resulting port... */ > + u16 pad; > +} PACKED ccwr_ep_listen_create_rep_t; > + > +typedef union { > + ccwr_ep_listen_create_req_t req; > + ccwr_ep_listen_create_rep_t rep; > +} PACKED ccwr_ep_listen_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; > +} PACKED ccwr_ep_listen_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_ep_listen_destroy_rep_t; > + > +typedef union { > + ccwr_ep_listen_destroy_req_t req; > + ccwr_ep_listen_destroy_rep_t rep; > +} PACKED ccwr_ep_listen_destroy_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; > +} PACKED ccwr_ep_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 local_addr; > + u32 remote_addr; > + u16 local_port; > + u16 remote_port; > +} PACKED ccwr_ep_query_rep_t; > + > +typedef union { > + ccwr_ep_query_req_t req; > + ccwr_ep_query_rep_t rep; > +} PACKED ccwr_ep_query_t; > + > + > +/* > + * The host passes this down to indicate acceptance of a pending iWARP > + * connection. The cr_handle was obtained from the CONNECTION_REQUEST > + * AE passed up by the adapter. See ccwr_ae_connection_request_t. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; /* QP to bind to this LLP conn */ > + u32 ep_handle; /* LLP handle to accept */ > + u32 private_data_length; > + u8 private_data[0]; /* data in-line in msg.
*/ > +} PACKED ccwr_cr_accept_req_t; > + > +/* > + * adapter sends reply when private data is successfully submitted to > + * the LLP. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cr_accept_rep_t; > + > +typedef union { > + ccwr_cr_accept_req_t req; > + ccwr_cr_accept_rep_t rep; > +} PACKED ccwr_cr_accept_t; > + > +/* > + * The host sends this down if a given iWARP connection request was > + * rejected by the consumer. The cr_handle was obtained from a > + * previous ccwr_ae_connection_request_t AE sent by the adapter. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; /* LLP handle to reject */ > +} PACKED ccwr_cr_reject_req_t; > + > +/* > + * Dunno if this is needed, but we'll add it for now. The adapter will > + * send the reject_reply after the LLP endpoint has been destroyed. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cr_reject_rep_t; > + > +typedef union { > + ccwr_cr_reject_req_t req; > + ccwr_cr_reject_rep_t rep; > +} PACKED ccwr_cr_reject_t; > + > +/* > + * console command. Used to implement a debug console over the verbs > + * request and reply queues. > + */ > + > +/* > + * Console request message. It contains: > + * - message hdr with id = CCWR_CONSOLE > + * - the physaddr/len of host memory to be used for the reply. > + * - the command string. eg: "netstat -s" or "zoneinfo" > + */ > +typedef struct { > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > + u64 reply_buf; /* pinned host buf for reply */ > + u32 reply_buf_len; /* length of reply buffer */ > + u8 command[0]; /* NUL terminated ascii string */ > + /* containing the command req */ > +} PACKED ccwr_console_req_t; > + > +/* > + * flags used in the console reply. > + */ > +typedef enum { > + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ > +} PACKED cc_console_flags_t; > + > +/* > + * Console reply message. 
> + * hdr.result contains the cc_status_t error if the reply was _not_ generated, > + * or CC_OK if the reply was generated. > + */ > +typedef struct { > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > + u32 flags; /* see cc_console_flags_t */ > +} PACKED ccwr_console_rep_t; > + > +typedef union { > + ccwr_console_req_t req; > + ccwr_console_rep_t rep; > +} PACKED ccwr_console_t; > + > + > +/* > + * Giant union with all WRs. Makes life easier... > + */ > +typedef union { > + ccwr_hdr_t hdr; > + ccwr_user_hdr_t user_hdr; > + ccwr_rnic_open_t rnic_open; > + ccwr_rnic_query_t rnic_query; > + ccwr_rnic_getconfig_t rnic_getconfig; > + ccwr_rnic_setconfig_t rnic_setconfig; > + ccwr_rnic_close_t rnic_close; > + ccwr_cq_create_t cq_create; > + ccwr_cq_modify_t cq_modify; > + ccwr_cq_destroy_t cq_destroy; > + ccwr_pd_alloc_t pd_alloc; > + ccwr_pd_dealloc_t pd_dealloc; > + ccwr_srq_create_t srq_create; > + ccwr_srq_destroy_t srq_destroy; > + ccwr_qp_create_t qp_create; > + ccwr_qp_query_t qp_query; > + ccwr_qp_modify_t qp_modify; > + ccwr_qp_destroy_t qp_destroy; > + ccwr_qp_connect_t qp_connect; > + ccwr_nsmr_stag_alloc_t nsmr_stag_alloc; > + ccwr_nsmr_register_t nsmr_register; > + ccwr_nsmr_pbl_t nsmr_pbl; > + ccwr_mr_query_t mr_query; > + ccwr_mw_query_t mw_query; > + ccwr_stag_dealloc_t stag_dealloc; > + ccwr_sqwr_t sqwr; > + ccwr_rqwr_t rqwr; > + ccwr_ce_t ce; > + ccwr_ae_t ae; > + ccwr_init_t init; > + ccwr_ep_listen_create_t ep_listen_create; > + ccwr_ep_listen_destroy_t ep_listen_destroy; > + ccwr_cr_accept_t cr_accept; > + ccwr_cr_reject_t cr_reject; > + ccwr_console_t console; > + ccwr_flash_init_t flash_init; > + ccwr_flash_t flash; > + ccwr_buf_alloc_t buf_alloc; > + ccwr_buf_free_t buf_free; > + ccwr_flash_write_t flash_write; > +} PACKED ccwr_t; > + > + > +/* > + * Accessors for the wr fields that are packed together tightly to > + * reduce the wr message size. 
The wr arguments are void* so that > + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types > + * in the ccwr_t union can be passed in. > + */ > +static __inline__ u8 > +cc_wr_get_id(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->id; > +} > +static __inline__ void > +c2_wr_set_id(void *wr, u8 id) > +{ > + ((ccwr_hdr_t *)wr)->id = id; > +} > +static __inline__ u8 > +cc_wr_get_result(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->result; > +} > +static __inline__ void > +cc_wr_set_result(void *wr, u8 result) > +{ > + ((ccwr_hdr_t *)wr)->result = result; > +} > +static __inline__ u8 > +cc_wr_get_flags(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->flags; > +} > +static __inline__ void > +cc_wr_set_flags(void *wr, u8 flags) > +{ > + ((ccwr_hdr_t *)wr)->flags = flags; > +} > +static __inline__ u8 > +cc_wr_get_sge_count(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->sge_count; > +} > +static __inline__ void > +cc_wr_set_sge_count(void *wr, u8 sge_count) > +{ > + ((ccwr_hdr_t *)wr)->sge_count = sge_count; > +} > +static __inline__ u32 > +cc_wr_get_wqe_count(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->wqe_count; > +} > +static __inline__ void > +cc_wr_set_wqe_count(void *wr, u32 wqe_count) > +{ > + ((ccwr_hdr_t *)wr)->wqe_count = wqe_count; > +} > + > +#undef PACKED > + > +#ifdef _MSC_VER > +#pragma pack(pop) > +#endif > + > +#endif /* _CC_WR_H_ */ > Index: hw/amso1100/c2.c > =================================================================== > --- hw/amso1100/c2.c (revision 0) > +++ hw/amso1100/c2.c (revision 0) > @@ -0,0 +1,1221 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > + > +#include > +#include "c2.h" > +#include "c2_provider.h" > + > +MODULE_AUTHOR("Tom Tucker "); > +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_VERSION(DRV_VERSION); > + > +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK > + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; > + > +static int debug = -1; /* defaults above */ > +module_param(debug, int, 0); > +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); > + > +char *rnic_ip_addr = "192.168.69.169"; > +module_param(rnic_ip_addr, charp, S_IRUGO); > +MODULE_PARM_DESC(rnic_ip_addr, "IP Address for the AMSO1100 Adapter"); > + > +static int c2_up(struct net_device *netdev); > +static int c2_down(struct net_device *netdev); > +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); > +static void c2_tx_interrupt(struct net_device *netdev); > +static void c2_rx_interrupt(struct net_device *netdev); > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); > +static void c2_tx_timeout(struct net_device *netdev); > +static int c2_change_mtu(struct net_device *netdev, int new_mtu); > +static void c2_reset(struct c2_port *c2_port); > +static struct net_device_stats* c2_get_stats(struct net_device *netdev); > + > +extern void c2_rnic_interrupt(struct c2_dev *c2dev); > + > +static struct pci_device_id c2_pci_table[] = { > + { 0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID }, > + { 0 } > +}; > + > +MODULE_DEVICE_TABLE(pci, c2_pci_table); > + > +static void c2_print_macaddr(struct net_device *netdev) > +{ > + dprintk(KERN_INFO PFX "%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " > + "IRQ %u\n", netdev->name, > + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], > + 
netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], > + netdev->irq); > +} > + > +static void c2_set_rxbufsize(struct c2_port *c2_port) > +{ > + struct net_device *netdev = c2_port->netdev; > + > + assert(netdev != NULL); > + > + if (netdev->mtu > RX_BUF_SIZE) > + c2_port->rx_buf_size = netdev->mtu + ETH_HLEN + sizeof(struct > c2_rxp_hdr) + NET_IP_ALIGN; > + else > + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; > +} > + > +/* > + * Allocate TX ring elements and chain them together. > + * One-to-one association of adapter descriptors with ring elements. > + */ > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, dma_addr_t base, > + void __iomem *mmio_txp_ring) > +{ > + struct c2_tx_desc *tx_desc; > + struct c2_txp_desc *txp_desc; > + struct c2_element *elem; > + int i; > + > + tx_ring->start = kmalloc(sizeof(*elem)*tx_ring->count, GFP_KERNEL); > + if (!tx_ring->start) > + return -ENOMEM; > + > + for (i = 0, elem = tx_ring->start, tx_desc = vaddr, txp_desc = mmio_txp_ring; > + i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) > + { > + tx_desc->len = 0; > + tx_desc->status = 0; > + > + /* Set TXP_HTXD_UNINIT */ > + c2_write64((void *)txp_desc + C2_TXP_ADDR, cpu_to_be64(0x1122334455667788ULL)); > + c2_write16((void *)txp_desc + C2_TXP_LEN, cpu_to_be16(0)); > + c2_write16((void *)txp_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); > + > + elem->skb = NULL; > + elem->ht_desc = tx_desc; > + elem->hw_desc = txp_desc; > + > + if (i == tx_ring->count - 1) { > + elem->next = tx_ring->start; > + tx_desc->next_offset = base; > + } else { > + elem->next = elem + 1; > + tx_desc->next_offset = base + (i + 1) * sizeof(*tx_desc); > + } > + } > + > + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; > + > + return 0; > +} > + > +/* > + * Allocate RX ring elements and chain them together. > + * One-to-one association of adapter descriptors with ring elements. 
> + */ > +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, dma_addr_t base, > + void __iomem *mmio_rxp_ring) > +{ > + struct c2_rx_desc *rx_desc; > + struct c2_rxp_desc *rxp_desc; > + struct c2_element *elem; > + int i; > + > + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); > + if (!rx_ring->start) > + return -ENOMEM; > + > + for (i = 0, elem = rx_ring->start, rx_desc = vaddr, rxp_desc = mmio_rxp_ring; > + i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) > + { > + rx_desc->len = 0; > + rx_desc->status = 0; > + > + /* Set RXP_HRXD_UNINIT */ > + c2_write16((void *)rxp_desc + C2_RXP_STATUS, cpu_to_be16(RXP_HRXD_OK)); > + c2_write16((void *)rxp_desc + C2_RXP_COUNT, cpu_to_be16(0)); > + c2_write16((void *)rxp_desc + C2_RXP_LEN, cpu_to_be16(0)); > + c2_write64((void *)rxp_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); > + c2_write16((void *)rxp_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); > + > + elem->skb = NULL; > + elem->ht_desc = rx_desc; > + elem->hw_desc = rxp_desc; > + > + if (i == rx_ring->count - 1) { > + elem->next = rx_ring->start; > + rx_desc->next_offset = base; > + } else { > + elem->next = elem + 1; > + rx_desc->next_offset = base + (i + 1) * sizeof(*rx_desc); > + } > + } > + > + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; > + > + return 0; > +} > + > +/* Setup buffer for receiving */ > +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) > +{ > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_rx_desc *rx_desc = elem->ht_desc; > + struct sk_buff *skb; > + dma_addr_t mapaddr; > + u32 maplen; > + struct c2_rxp_hdr *rxp_hdr; > + > + skb = dev_alloc_skb(c2_port->rx_buf_size); > + if (unlikely(!skb)) { > + dprintk(KERN_ERR PFX "%s: out of memory for receive\n", > + c2_port->netdev->name); > + return -ENOMEM; > + } > + > + /* Zero out the rxp hdr in the sk_buff */ > + memset(skb->data, 0, sizeof(*rxp_hdr)); > + > + skb->dev = c2_port->netdev; > + > + maplen 
= c2_port->rx_buf_size; > + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_FROMDEVICE); > + > + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ > + rxp_hdr = (struct c2_rxp_hdr *) skb->data; > + rxp_hdr->flags = RXP_HRXD_READY; > + > + /* c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); */ > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)maplen - > sizeof(*rxp_hdr))); > + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(mapaddr)); > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); > + > + elem->skb = skb; > + elem->mapaddr = mapaddr; > + elem->maplen = maplen; > + rx_desc->len = maplen; > + > + return 0; > +} > + > +/* > + * Allocate buffers for the Rx ring > + * For receive: rx_ring.to_clean is next received frame > + */ > +static int c2_rx_fill(struct c2_port *c2_port) > +{ > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + int ret = 0; > + > + elem = rx_ring->start; > + do { > + if (c2_rx_alloc(c2_port, elem)) { > + ret = 1; > + break; > + } > + } while ((elem = elem->next) != rx_ring->start); > + > + rx_ring->to_clean = rx_ring->start; > + return ret; > +} > + > +/* Free all buffers in RX ring, assumes receiver stopped */ > +static void c2_rx_clean(struct c2_port *c2_port) > +{ > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + > + elem = rx_ring->start; > + do { > + rx_desc = elem->ht_desc; > + rx_desc->len = 0; > + > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16(0)); > + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); > + > + if (elem->skb) { > + 
pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, > + PCI_DMA_FROMDEVICE); > + dev_kfree_skb(elem->skb); > + elem->skb = NULL; > + } > + } while ((elem = elem->next) != rx_ring->start); > +} > + > +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) > +{ > + struct c2_tx_desc *tx_desc = elem->ht_desc; > + > + tx_desc->len = 0; > + > + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, PCI_DMA_TODEVICE); > + > + if (elem->skb) { > + dev_kfree_skb_any(elem->skb); > + elem->skb = NULL; > + } > + > + return 0; > +} > + > +/* Free all buffers in TX ring, assumes transmitter stopped */ > +static void c2_tx_clean(struct c2_port *c2_port) > +{ > + struct c2_ring *tx_ring = &c2_port->tx_ring; > + struct c2_element *elem; > + struct c2_txp_desc txp_htxd; > + int retry; > + unsigned long flags; > + > + spin_lock_irqsave(&c2_port->tx_lock, flags); > + > + elem = tx_ring->start; > + > + do { > + retry = 0; > + do { > + txp_htxd.flags = c2_read16(elem->hw_desc + C2_TXP_FLAGS); > + > + if (txp_htxd.flags == TXP_HTXD_READY) { > + retry = 1; > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0)); > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_DONE)); > + c2_port->netstats.tx_dropped++; > + break; > + } else { > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); > + c2_write64(elem->hw_desc + C2_TXP_ADDR, > cpu_to_be64(0x1122334455667788ULL)); > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); > + } > + > + c2_tx_free(c2_port->c2dev, elem); > + > + } while ((elem = elem->next) != tx_ring->start); > + } while (retry); > + > + c2_port->tx_avail = c2_port->tx_ring.count - 1; > + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; > + > + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) > + netif_wake_queue(c2_port->netdev); > + > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > +} > + > +/* > + * Process transmit 
descriptors marked 'DONE' by the firmware, > + * freeing up their unneeded sk_buffs. > + */ > +static void c2_tx_interrupt(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *tx_ring = &c2_port->tx_ring; > + struct c2_element *elem; > + struct c2_txp_desc txp_htxd; > + > + spin_lock(&c2_port->tx_lock); > + > + for(elem = tx_ring->to_clean; elem != tx_ring->to_use; elem = elem->next) > + { > + txp_htxd.flags = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_FLAGS)); > + > + if (txp_htxd.flags != TXP_HTXD_DONE) > + break; > + > + if (netif_msg_tx_done(c2_port)) { > + /* PCI reads are expensive in fast path */ > + //txp_htxd.addr = be64_to_cpu(c2_read64(elem->hw_desc + C2_TXP_ADDR)); > + txp_htxd.len = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_LEN)); > + dprintk(KERN_INFO PFX > + "%s: tx done slot %3Zu status 0x%x len %5u bytes\n", > + netdev->name, elem - tx_ring->start, > + txp_htxd.flags, txp_htxd.len); > + } > + > + c2_tx_free(c2dev, elem); > + ++(c2_port->tx_avail); > + } > + > + tx_ring->to_clean = elem; > + > + if (netif_queue_stopped(netdev) && c2_port->tx_avail > MAX_SKB_FRAGS + 1) > + netif_wake_queue(netdev); > + > + spin_unlock(&c2_port->tx_lock); > +} > + > +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) > +{ > + struct c2_rx_desc *rx_desc = elem->ht_desc; > + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; > + > + if (rxp_hdr->status != RXP_HRXD_OK || > + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { > + dprintk(KERN_ERR PFX "BAD RXP_HRXD\n"); > + dprintk(KERN_ERR PFX " rx_desc : %p\n", rx_desc); > + dprintk(KERN_ERR PFX " index : %Zu\n", elem - c2_port->rx_ring.start); > + dprintk(KERN_ERR PFX " len : %u\n", rx_desc->len); > + dprintk(KERN_ERR PFX " rxp_hdr : %p [PA %p]\n", rxp_hdr, > + (void *)__pa((unsigned long)rxp_hdr)); > + dprintk(KERN_ERR PFX " flags : 0x%x\n", rxp_hdr->flags); > + dprintk(KERN_ERR PFX 
" status: 0x%x\n", rxp_hdr->status); > + dprintk(KERN_ERR PFX " len : %u\n", rxp_hdr->len); > + dprintk(KERN_ERR PFX " rsvd : 0x%x\n", rxp_hdr->rsvd); > + } > + > + /* Setup the skb for reuse since we're dropping this pkt */ > + elem->skb->tail = elem->skb->data = elem->skb->head; > + > + /* Zero out the rxp hdr in the sk_buff */ > + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); > + > + /* Write the descriptor to the adapter's rx ring */ > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)elem->maplen - > sizeof(*rxp_hdr))); > + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(elem->mapaddr)); > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); > + > + dprintk(KERN_INFO PFX "packet dropped\n"); > + c2_port->netstats.rx_dropped++; > +} > + > +static void c2_rx_interrupt(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + struct c2_rxp_hdr *rxp_hdr; > + struct sk_buff *skb; > + dma_addr_t mapaddr; > + u32 maplen, buflen; > + unsigned long flags; > + > + spin_lock_irqsave(&c2dev->lock, flags); > + > + /* Begin where we left off */ > + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; > + > + for(elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; elem = elem->next) > + { > + rx_desc = elem->ht_desc; > + mapaddr = elem->mapaddr; > + maplen = elem->maplen; > + skb = elem->skb; > + rxp_hdr = (struct c2_rxp_hdr *)skb->data; > + > + if (rxp_hdr->flags != RXP_HRXD_DONE) > + break; > + > + if (netif_msg_rx_status(c2_port)) > + dprintk(KERN_INFO PFX "%s: rx done slot %3Zu status 0x%x len %5u bytes\n", > + netdev->name, elem - rx_ring->start, > + rxp_hdr->flags, rxp_hdr->len); > + > + buflen = rxp_hdr->len; > + > + /* Sanity 
check the RXP header */ > + if (rxp_hdr->status != RXP_HRXD_OK || > + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* Allocate and map a new skb for replenishing the host RX desc */ > + if (c2_rx_alloc(c2_port, elem)) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* Unmap the old skb */ > + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, PCI_DMA_FROMDEVICE); > + > + /* > + * Skip past the leading 8 bytes comprising of the > + * "struct c2_rxp_hdr", prepended by the adapter > + * to the usual Ethernet header ("struct ethhdr"), > + * to the start of the raw Ethernet packet. > + * > + * Fix up the various fields in the sk_buff before > + * passing it up to netif_rx(). The transfer size > + * (in bytes) specified by the adapter len field of > + * the "struct rxp_hdr_t" does NOT include the > + * "sizeof(struct c2_rxp_hdr)". > + */ > + skb->data += sizeof(*rxp_hdr); > + skb->tail = skb->data + buflen; > + skb->len = buflen; > + skb->dev = netdev; > + skb->protocol = eth_type_trans(skb, netdev); > + > + netif_rx(skb); > + > + netdev->last_rx = jiffies; > + c2_port->netstats.rx_packets++; > + c2_port->netstats.rx_bytes += buflen; > + } > + > + /* Save where we left off */ > + rx_ring->to_clean = elem; > + c2dev->cur_rx = elem - rx_ring->start; > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > + > + spin_unlock_irqrestore(&c2dev->lock, flags); > +} > + > +/* > + * Handle netisr0 TX & RX interrupts. > + */ > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) > +{ > + unsigned int netisr0, dmaisr; > + int handled = 0; > + struct c2_dev *c2dev = (struct c2_dev *)dev_id; > + > + assert(c2dev != NULL); > + > + /* Process CCILNET interrupts */ > + netisr0 = c2_read32(c2dev->regs + C2_NISR0); > + if (netisr0) { > + > + /* > + * There is an issue with the firmware that always > + * provides the status of RX for both TX & RX > + * interrupts. So process both queues here. 
> + */ > + c2_rx_interrupt(c2dev->netdev); > + c2_tx_interrupt(c2dev->netdev); > + > + /* Clear the interrupt */ > + c2_write32(c2dev->regs + C2_NISR0, netisr0); > + handled++; > + } > + > + /* Process RNIC interrupts */ > + dmaisr = c2_read32(c2dev->regs + C2_DISR); > + if (dmaisr) { > + c2_write32(c2dev->regs + C2_DISR, dmaisr); > + c2_rnic_interrupt(c2dev); > + handled++; > + } > + > + if (handled) { > + return IRQ_HANDLED; > + } else { > + return IRQ_NONE; > + } > +} > + > +static int c2_up(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_element *elem; > + struct c2_rxp_hdr *rxp_hdr; > + size_t rx_size, tx_size; > + int ret, i; > + unsigned int netimr0; > + > + assert(c2dev != NULL); > + > + if (netif_msg_ifup(c2_port)) > + dprintk(KERN_INFO PFX "%s: enabling interface\n", netdev->name); > + > + /* Set the Rx buffer size based on MTU */ > + c2_set_rxbufsize(c2_port); > + > + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ > + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); > + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); > + > + c2_port->mem_size = tx_size + rx_size; > + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, > + &c2_port->dma); > + if (c2_port->mem == NULL) { > + dprintk(KERN_ERR PFX "Unable to allocate memory for host descriptor rings\n"); > + return -ENOMEM; > + } > + > + memset(c2_port->mem, 0, c2_port->mem_size); > + > + /* Create the Rx host descriptor ring */ > + if ((ret = c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, > + c2dev->mmio_rxp_ring))) { > + dprintk(KERN_ERR PFX "Unable to create RX ring\n"); > + goto bail0; > + } > + > + /* Allocate Rx buffers for the host descriptor ring */ > + if (c2_rx_fill(c2_port)) { > + dprintk(KERN_ERR PFX "Unable to fill RX ring\n"); > + goto bail1; > + } > + > + /* Create the Tx host descriptor ring */ > + if ((ret = 
c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, > + c2_port->dma + rx_size, c2dev->mmio_txp_ring))) { > + dprintk(KERN_ERR PFX "Unable to create TX ring\n"); > + goto bail1; > + } > + > + /* Set the TX pointer to where we left off */ > + c2_port->tx_avail = c2_port->tx_ring.count - 1; > + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = c2_port->tx_ring. > start + c2dev->cur_tx; > + > + /* missing: Initialize MAC */ > + > + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); > + > + /* Reset the adapter, ensures the driver is in sync with the RXP */ > + c2_reset(c2_port); > + > + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ > + for(i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; > + i++, elem++) > + { > + rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; > + rxp_hdr->flags = 0; > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); > + } > + > + /* Enable network packets */ > + netif_start_queue(netdev); > + > + /* Enable IRQ */ > + c2_write32(c2dev->regs + C2_IDIS, 0); > + netimr0 = c2_read32(c2dev->regs + C2_NIMR0); > + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); > + c2_write32(c2dev->regs + C2_NIMR0, netimr0); > + > + return 0; > + > + bail1: > + c2_rx_clean(c2_port); > + kfree(c2_port->rx_ring.start); > + > + bail0: > + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); > + > + return ret; > +} > + > +static int c2_down(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + > + if (netif_msg_ifdown(c2_port)) > + dprintk(KERN_INFO PFX "%s: disabling interface\n", netdev->name); > + > + /* Wait for all the queued packets to get sent */ > + c2_tx_interrupt(netdev); > + > + /* Disable network packets */ > + netif_stop_queue(netdev); > + > + /* Disable IRQs by clearing the interrupt mask */ > + c2_write32(c2dev->regs + C2_IDIS, 1); > + c2_write32(c2dev->regs + C2_NIMR0, 
0); > + > + /* missing: Stop transmitter */ > + > + /* missing: Stop receiver */ > + > + /* Reset the adapter, ensures the driver is in sync with the RXP */ > + c2_reset(c2_port); > + > + /* missing: Turn off LEDs here */ > + > + /* Free all buffers in the host descriptor rings */ > + c2_tx_clean(c2_port); > + c2_rx_clean(c2_port); > + > + /* Free the host descriptor rings */ > + kfree(c2_port->rx_ring.start); > + kfree(c2_port->tx_ring.start); > + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); > + > + return 0; > +} > + > +static void c2_reset(struct c2_port *c2_port) > +{ > + struct c2_dev *c2dev = c2_port->c2dev; > + unsigned int cur_rx = c2dev->cur_rx; > + > + /* Tell the hardware to quiesce */ > + C2_SET_CUR_RX(c2dev, cur_rx|C2_PCI_HRX_QUI); > + > + /* > + * The hardware will reset the C2_PCI_HRX_QUI bit once > + * the RXP is quiesced. Wait 2 seconds for this. > + */ > + ssleep(2); > + > + cur_rx = C2_GET_CUR_RX(c2dev); > + > + if (cur_rx & C2_PCI_HRX_QUI) > + dprintk(KERN_ERR PFX "c2_reset: failed to quiesce the hardware!\n"); > + > + cur_rx &= ~C2_PCI_HRX_QUI; > + > + c2dev->cur_rx = cur_rx; > + > + dprintk("Current RX: %u\n", c2dev->cur_rx); > +} > + > +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *tx_ring = &c2_port->tx_ring; > + struct c2_element *elem; > + dma_addr_t mapaddr; > + u32 maplen; > + unsigned long flags; > + unsigned int i; > + > + spin_lock_irqsave(&c2_port->tx_lock, flags); > + > + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { > + netif_stop_queue(netdev); > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > + > + dprintk(KERN_WARNING PFX "%s: Tx ring full when queue awake!\n", > + netdev->name); > + return NETDEV_TX_BUSY; > + } > + > + maplen = skb_headlen(skb); > + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, 
PCI_DMA_TODEVICE); > + > + elem = tx_ring->to_use; > + elem->skb = skb; > + elem->mapaddr = mapaddr; > + elem->maplen = maplen; > + > + /* Tell HW to xmit */ > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); > + > + c2_port->netstats.tx_packets++; > + c2_port->netstats.tx_bytes += maplen; > + > + /* Loop thru additional data fragments and queue them */ > + if (skb_shinfo(skb)->nr_frags) { > + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > + { > + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; > + maplen = frag->size; > + mapaddr = pci_map_page(c2dev->pcidev, frag->page, frag->page_offset, > + maplen, PCI_DMA_TODEVICE); > + > + elem = elem->next; > + elem->skb = NULL; > + elem->mapaddr = mapaddr; > + elem->maplen = maplen; > + > + /* Tell HW to xmit */ > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); > + > + c2_port->netstats.tx_packets++; > + c2_port->netstats.tx_bytes += maplen; > + } > + } > + > + tx_ring->to_use = elem->next; > + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); > + > + if (netif_msg_tx_queued(c2_port)) > + dprintk(KERN_DEBUG PFX "%s: tx queued, slot %3Zu, len %5u bytes, avail = %u\n", > + netdev->name, elem - tx_ring->start, maplen, c2_port->tx_avail); > + > + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { > + netif_stop_queue(netdev); > + if (netif_msg_tx_queued(c2_port)) > + dprintk(KERN_INFO PFX "%s: transmit queue full\n", netdev->name); > + } > + > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > + > + netdev->trans_start = jiffies; > + > + return NETDEV_TX_OK; > +} > + > +static struct net_device_stats *c2_get_stats(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + > + return 
&c2_port->netstats; > +} > + > +static int c2_set_mac_address(struct net_device *netdev, void *p) > +{ > + return -1; > +} > + > +static void c2_tx_timeout(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + > + if (netif_msg_timer(c2_port)) > + dprintk(KERN_DEBUG PFX "%s: tx timeout\n", netdev->name); > + > + c2_tx_clean(c2_port); > +} > + > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > +{ > + int ret = 0; > + > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > + return -EINVAL; > + > + netdev->mtu = new_mtu; > + > + if (netif_running(netdev)) { > + c2_down(netdev); > + > + c2_up(netdev); > + } > + > + return ret; > +} > + > +/* Initialize network device */ > +static struct net_device *c2_devinit(struct c2_dev *c2dev, void __iomem *mmio_addr) > +{ > + struct c2_port *c2_port = NULL; > + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); > + > + if (!netdev) { > + dprintk(KERN_ERR PFX "c2_port etherdev alloc failed"); > + return NULL; > + } > + > + SET_MODULE_OWNER(netdev); > + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); > + > + netdev->open = c2_up; > + netdev->stop = c2_down; > + netdev->hard_start_xmit = c2_xmit_frame; > + netdev->get_stats = c2_get_stats; > + netdev->tx_timeout = c2_tx_timeout; > + netdev->set_mac_address = c2_set_mac_address; > + netdev->change_mtu = c2_change_mtu; > + netdev->watchdog_timeo = C2_TX_TIMEOUT; > + netdev->irq = c2dev->pcidev->irq; > + > + c2_port = netdev_priv(netdev); > + c2_port->netdev = netdev; > + c2_port->c2dev = c2dev; > + c2_port->msg_enable = netif_msg_init(debug, default_msg); > + c2_port->tx_ring.count = C2_NUM_TX_DESC; > + c2_port->rx_ring.count = C2_NUM_RX_DESC; > + > + spin_lock_init(&c2_port->tx_lock); > + > + /* Copy our 48-bit ethernet hardware address */ > +#if 1 > + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); > +#else > + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_RDMA_ENADDR, 6); > +#endif > + /* Validate 
the MAC address */ > + if(!is_valid_ether_addr(netdev->dev_addr)) { > + dprintk(KERN_ERR PFX "Invalid MAC Address\n"); > + c2_print_macaddr(netdev); > + free_netdev(netdev); > + return NULL; > + } > + > + c2dev->netdev = netdev; > + > + return netdev; > +} > + > +static int __devinit c2_probe(struct pci_dev *pcidev, const struct pci_device_id *ent) > +{ > + int ret = 0, i; > + unsigned long reg0_start, reg0_flags, reg0_len; > + unsigned long reg2_start, reg2_flags, reg2_len; > + unsigned long reg4_start, reg4_flags, reg4_len; > + unsigned kva_map_size; > + struct net_device *netdev = NULL; > + struct c2_dev *c2dev = NULL; > + void __iomem *mmio_regs = NULL; > + > + assert(pcidev != NULL); > + assert(ent != NULL); > + > + dprintk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", > + DRV_VERSION); > + > + /* Enable PCI device */ > + ret = pci_enable_device(pcidev); > + if (ret) { > + dprintk(KERN_ERR PFX "%s: Unable to enable PCI device\n", pci_name(pcidev)); > + goto bail0; > + } > + > + reg0_start = pci_resource_start(pcidev, BAR_0); > + reg0_len = pci_resource_len(pcidev, BAR_0); > + reg0_flags = pci_resource_flags(pcidev, BAR_0); > + > + reg2_start = pci_resource_start(pcidev, BAR_2); > + reg2_len = pci_resource_len(pcidev, BAR_2); > + reg2_flags = pci_resource_flags(pcidev, BAR_2); > + > + reg4_start = pci_resource_start(pcidev, BAR_4); > + reg4_len = pci_resource_len(pcidev, BAR_4); > + reg4_flags = pci_resource_flags(pcidev, BAR_4); > + > + dprintk(KERN_INFO PFX "BAR0 size = 0x%lX bytes\n", reg0_len); > + dprintk(KERN_INFO PFX "BAR2 size = 0x%lX bytes\n", reg2_len); > + dprintk(KERN_INFO PFX "BAR4 size = 0x%lX bytes\n", reg4_len); > + > + /* Make sure PCI base addr are MMIO */ > + if (!(reg0_flags & IORESOURCE_MEM) || > + !(reg2_flags & IORESOURCE_MEM) || > + !(reg4_flags & IORESOURCE_MEM)) { > + dprintk (KERN_ERR PFX "PCI regions not an MMIO resource\n"); > + ret = -ENODEV; > + goto bail1; > + } > + > + /* Check for weird/broken PCI region 
reporting */ > + if ((reg0_len < C2_REG0_SIZE) || > + (reg2_len < C2_REG2_SIZE) || > + (reg4_len < C2_REG4_SIZE)) { > + dprintk (KERN_ERR PFX "Invalid PCI region sizes\n"); > + ret = -ENODEV; > + goto bail1; > + } > + > + /* Reserve PCI I/O and memory resources */ > + ret = pci_request_regions(pcidev, DRV_NAME); > + if (ret) { > + dprintk(KERN_ERR PFX "%s: Unable to request regions\n", pci_name(pcidev)); > + goto bail1; > + } > + > + if ((sizeof(dma_addr_t) > 4)) { > + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); > + if (ret < 0) { > + dprintk(KERN_ERR PFX "64b DMA configuration failed\n"); > + goto bail2; > + } > + } else { > + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); > + if (ret < 0) { > + dprintk(KERN_ERR PFX "32b DMA configuration failed\n"); > + goto bail2; > + } > + } > + > + /* Enables bus-mastering on the device */ > + pci_set_master(pcidev); > + > + /* Remap the adapter PCI registers in BAR4 */ > + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, > + sizeof(struct c2_adapter_pci_regs)); > + if (mmio_regs == 0UL) { > + dprintk(KERN_ERR PFX "Unable to remap adapter PCI registers in BAR4\n"); > + ret = -EIO; > + goto bail2; > + } > + > + /* Validate PCI regs magic */ > + for (i = 0; i < sizeof(c2_magic); i++) > + { > + if (c2_magic[i] != c2_read8(mmio_regs + C2_REGS_MAGIC + i)) { > + dprintk(KERN_ERR PFX > + "Invalid PCI regs magic [%d/%Zd: got 0x%x, exp 0x%x]\n", > + i + 1, sizeof(c2_magic), > + c2_read8(mmio_regs + C2_REGS_MAGIC + i), c2_magic[i]); > + dprintk(KERN_ERR PFX "Adapter not claimed\n"); > + iounmap(mmio_regs); > + ret = -EIO; > + goto bail2; > + } > + } > + > + /* Validate the adapter version */ > + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { > + dprintk(KERN_ERR PFX "Version mismatch [fw=%u, c2=%u], Adapter not claimed\n", > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), C2_VERSION); > + ret = -EINVAL; > + iounmap(mmio_regs); > + goto bail2; > + } > + > + /* Validate the adapter IVN */ > + if 
(be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)) != C2_IVN) { > + dprintk(KERN_ERR PFX "IVN mismatch [fw=0x%x, c2=0x%x], Adapter not claimed\n", > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); > + ret = -EINVAL; > + iounmap(mmio_regs); > + goto bail2; > + } > + > + /* Allocate hardware structure */ > + c2dev = (struct c2_dev*)ib_alloc_device(sizeof *c2dev); > + if (!c2dev) { > + dprintk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", > + pci_name(pcidev)); > + ret = -ENOMEM; > + iounmap(mmio_regs); > + goto bail2; > + } > + > + memset(c2dev, 0, sizeof(*c2dev)); > + spin_lock_init(&c2dev->lock); > + c2dev->pcidev = pcidev; > + c2dev->cur_tx = 0; > + > + /* Get the last RX index */ > + c2dev->cur_rx = (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_HRX_CUR)) - > 0xffffc000) / sizeof(struct c2_rxp_desc); > + > + /* Request an interrupt line for the driver */ > + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); > + if (ret) { > + dprintk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", > + pci_name(pcidev), pcidev->irq); > + iounmap(mmio_regs); > + goto bail3; > + } > + > + /* Set driver specific data */ > + pci_set_drvdata(pcidev, c2dev); > + > + /* Initialize network device */ > + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { > + iounmap(mmio_regs); > + goto bail4; > + } > + > + /* Save off the actual size prior to unmapping mmio_regs */ > + kva_map_size = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_PCI_WINSIZE)); > + > + /* Unmap the adapter PCI registers in BAR4 */ > + iounmap(mmio_regs); > + > + /* Register network device */ > + ret = register_netdev(netdev); > + if (ret) { > + dprintk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", ret); > + goto bail5; > + } > + > + /* Disable network packets */ > + netif_stop_queue(netdev); > + > + /* Remap the adapter HRXDQ PA space to kernel VA space */ > + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, > + C2_RXP_HRXDQ_SIZE); > + if 
(c2dev->mmio_rxp_ring == 0UL) { > + dprintk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); > + ret = -EIO; > + goto bail6; > + } > + > + /* Remap the adapter HTXDQ PA space to kernel VA space */ > + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, > + C2_TXP_HTXDQ_SIZE); > + if (c2dev->mmio_txp_ring == 0UL) { > + dprintk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); > + ret = -EIO; > + goto bail7; > + } > + > + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > + > + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ > + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); > + if (c2dev->regs == 0UL) { > + dprintk(KERN_ERR PFX "Unable to remap BAR0\n"); > + ret = -EIO; > + goto bail8; > + } > + > + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ > + c2dev->pa = (void *)(reg4_start + C2_PCI_REGS_OFFSET); > + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, kva_map_size); > + if (c2dev->kva == 0UL) { > + dprintk(KERN_ERR PFX "Unable to remap BAR4\n"); > + ret = -EIO; > + goto bail9; > + } > + > + /* Print out the MAC address */ > + c2_print_macaddr(netdev); > + > + ret = c2_rnic_init(c2dev); > + if (ret) { > + dprintk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); > + goto bail10; > + } > + > + c2_register_device(c2dev); > + > + return 0; > + > + bail10: > + iounmap(c2dev->kva); > + > + bail9: > + iounmap(c2dev->regs); > + > + bail8: > + iounmap(c2dev->mmio_txp_ring); > + > + bail7: > + iounmap(c2dev->mmio_rxp_ring); > + > + bail6: > + unregister_netdev(netdev); > + > + bail5: > + free_netdev(netdev); > + > + bail4: > + free_irq(pcidev->irq, c2dev); > + > + bail3: > + ib_dealloc_device(&c2dev->ibdev); > + > + bail2: > + pci_release_regions(pcidev); > + > + bail1: > + pci_disable_device(pcidev); > + > + bail0: > + return ret; > +} > + > +static void __devexit c2_remove(struct pci_dev *pcidev) > +{ > + struct c2_dev *c2dev = 
pci_get_drvdata(pcidev); > + struct net_device *netdev = c2dev->netdev; > + > + assert(netdev != NULL); > + > + /* Unregister with OpenIB */ > + ib_unregister_device(&c2dev->ibdev); > + > + /* Clean up the RNIC resources */ > + c2_rnic_term(c2dev); > + > + /* Remove network device from the kernel */ > + unregister_netdev(netdev); > + > + /* Free network device */ > + free_netdev(netdev); > + > + /* Free the interrupt line */ > + free_irq(pcidev->irq, c2dev); > + > + /* missing: Turn LEDs off here */ > + > + /* Unmap adapter PA space */ > + iounmap(c2dev->kva); > + iounmap(c2dev->regs); > + iounmap(c2dev->mmio_txp_ring); > + iounmap(c2dev->mmio_rxp_ring); > + > + /* Free the hardware structure */ > + ib_dealloc_device(&c2dev->ibdev); > + > + /* Release reserved PCI I/O and memory resources */ > + pci_release_regions(pcidev); > + > + /* Disable PCI device */ > + pci_disable_device(pcidev); > + > + /* Clear driver specific data */ > + pci_set_drvdata(pcidev, NULL); > +} > + > +static struct pci_driver c2_pci_driver = { > + .name = DRV_NAME, > + .id_table = c2_pci_table, > + .probe = c2_probe, > + .remove = __devexit_p(c2_remove), > +}; > + > +static int __init c2_init_module(void) > +{ > + return pci_module_init(&c2_pci_driver); > +} > + > +static void __exit c2_exit_module(void) > +{ > + pci_unregister_driver(&c2_pci_driver); > +} > + > +module_init(c2_init_module); > +module_exit(c2_exit_module); > Index: hw/amso1100/c2_qp.c > =================================================================== > --- hw/amso1100/c2_qp.c (revision 0) > +++ hw/amso1100/c2_qp.c (revision 0) > @@ -0,0 +1,840 @@ > +/* > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. > + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
> + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + * > + */ > + > +#include "c2.h" > +#include "c2_vq.h" > +#include "cc_status.h" > + > +#define C2_MAX_ORD_PER_QP 128 > +#define C2_MAX_IRD_PER_QP 128 > + > +#define CC_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) > +#define CC_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) > +#define CC_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) > + > +enum c2_qp_state { > + C2_QP_STATE_IDLE = 0x01, > + C2_QP_STATE_CONNECTING = 0x02, > + C2_QP_STATE_RTS = 0x04, > + C2_QP_STATE_CLOSING = 0x08, > + C2_QP_STATE_TERMINATE = 0x10, > + C2_QP_STATE_ERROR = 0x20, > +}; > + > +#define NO_SUPPORT -1 > +static const u8 c2_opcode[] = { > + [IB_WR_SEND] = CC_WR_TYPE_SEND, > + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, > + [IB_WR_RDMA_WRITE] = CC_WR_TYPE_RDMA_WRITE, > + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, > + [IB_WR_RDMA_READ] = CC_WR_TYPE_RDMA_READ, > + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, > + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, > +}; > + > +void c2_qp_event(struct c2_dev *c2dev, u32 qpn, > + enum ib_event_type event_type) > +{ > + struct c2_qp *qp; > + struct ib_event event; > + > + spin_lock(&c2dev->qp_table.lock); > + qp = c2_array_get(&c2dev->qp_table.qp, qpn & (c2dev->max_qp - 1)); > + if (qp) > + atomic_inc(&qp->refcount); > + spin_unlock(&c2dev->qp_table.lock); > + > + if (!qp) { > + dprintk("Async event for bogus QP %08x\n", qpn); > + return; > + } > + > + event.device = &c2dev->ibdev; > + event.event = event_type; > + event.element.qp = &qp->ibqp; > + if (qp->ibqp.event_handler) > + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); > + > + if (atomic_dec_and_test(&qp->refcount)) > + wake_up(&qp->wait); > +} > + > +static int to_c2_state(enum ib_qp_state ib_state) > +{ > + switch (ib_state) { > + case IB_QPS_RESET: return C2_QP_STATE_IDLE; > + case IB_QPS_RTS: return C2_QP_STATE_RTS; > + case IB_QPS_SQD: return C2_QP_STATE_CLOSING; > + case IB_QPS_SQE: return C2_QP_STATE_CLOSING; > + case IB_QPS_ERR: return C2_QP_STATE_ERROR; > + default: return 
-1; > + } > +} > + > +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF > + > +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, > + struct ib_qp_attr *attr, int attr_mask) > +{ > + ccwr_qp_modify_req_t wr; > + ccwr_qp_modify_rep_t *reply; > + struct c2_vq_req *vq_req; > + int err; > + > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) > + return -ENOMEM; > + > + c2_wr_set_id(&wr, CCWR_QP_MODIFY); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.qp_handle = qp->adapter_handle; > + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > + > + if (attr_mask & IB_QP_STATE) { > + > + /* Ensure the state is valid */ > + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) > + return -EINVAL; > + > + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); > + > + } else if (attr_mask & IB_QP_CUR_STATE) { > + > + if (attr->cur_qp_state != IB_QPS_RTR && > + attr->cur_qp_state != IB_QPS_RTS && > + attr->cur_qp_state != IB_QPS_SQD && > + attr->cur_qp_state != IB_QPS_SQE) > + return -EINVAL; > + else > + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->cur_qp_state)); > + } else { > + err = 0; > + goto bail0; > + } > + > + /* reference the request struct */ > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, (ccwr_t *)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) > + goto bail0; > + > + reply = (ccwr_qp_modify_rep_t *)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + err = c2_errno(reply); > + > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +static int destroy_qp(struct c2_dev *c2dev, > + struct c2_qp *qp) > +{ > + struct c2_vq_req *vq_req; > + ccwr_qp_destroy_req_t wr; > + 
ccwr_qp_destroy_rep_t *reply; > + int err; > + > + /* > + * Allocate a verb request message > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + return -ENOMEM; > + } > + > + /* > + * Initialize the WR > + */ > + c2_wr_set_id(&wr, CCWR_QP_DESTROY); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.qp_handle = qp->adapter_handle; > + > + /* > + * reference the request struct. dereferenced in the int handler. > + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_qp_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + if ( (err = c2_errno(reply)) != 0) { > + // XXX print error > + } > + > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +int c2_alloc_qp(struct c2_dev *c2dev, > + struct c2_pd *pd, > + struct ib_qp_init_attr *qp_attrs, > + struct c2_qp *qp) > +{ > + ccwr_qp_create_req_t wr; > + ccwr_qp_create_rep_t *reply; > + struct c2_vq_req *vq_req; > + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); > + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); > + unsigned long peer_pa; > + u32 q_size, msg_size, mmap_size; > + void *mmap; > + int err; > + > + qp->qpn = c2_alloc(&c2dev->qp_table.alloc); > + if (qp->qpn == -1) > + return -ENOMEM; > + > + /* Allocate the SQ and RQ shared pointers */ > + qp->sq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + if (!qp->sq_mq.shared) { > + err = -ENOMEM; > + goto bail0; > + } > + > + qp->rq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + if (!qp->rq_mq.shared) { > + err = -ENOMEM; > + goto bail1; > + } > + > + /* Allocate 
the verbs request */ > + vq_req = vq_req_alloc(c2dev); > + if (vq_req == NULL) { > + err = -ENOMEM; > + goto bail2; > + } > + > + /* Initialize the work request */ > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_QP_CREATE); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.sq_cq_handle = send_cq->adapter_handle; > + wr.rq_cq_handle = recv_cq->adapter_handle; > + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr+1); > + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr+1); > + wr.srq_handle = 0; > + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | > + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); > + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); > + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); > + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); // > XXX no write depth? > + wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared)); > + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared)); > + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); > + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); > + wr.pd_id = pd->pd_id; > + wr.user_context = (unsigned long)qp; > + > + vq_req_get(c2dev, vq_req); > + > + /* Send the WR to the adapter */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail3; > + } > + > + /* Wait for the verb reply */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail3; > + } > + > + /* Process the reply */ > + reply = (ccwr_qp_create_rep_t*)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail3; > + } > + > + if ( (err = c2_wr_get_result(reply)) != 0) { > + goto bail4; > + } > + > + /* Fill in the kernel QP struct */ > + atomic_set(&qp->refcount, 1); > + qp->adapter_handle = reply->qp_handle; > + qp->state = IB_QPS_RESET; > + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; > + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; > + 
qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; > + > + /* Initialize the SQ MQ */ > + q_size = be32_to_cpu(reply->sq_depth); > + msg_size = be32_to_cpu(reply->sq_msg_size); > + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->sq_mq_start)); > + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); > + mmap = ioremap_nocache(peer_pa, mmap_size); > + if (!mmap) { > + err = -ENOMEM; > + goto bail5; > + } > + > + c2_mq_init(&qp->sq_mq, > + be32_to_cpu(reply->sq_mq_index), > + q_size, > + msg_size, > + mmap + sizeof(struct c2_mq_shared), /* pool start */ > + mmap, /* peer */ > + C2_MQ_ADAPTER_TARGET); > + > + /* Initialize the RQ mq */ > + q_size = be32_to_cpu(reply->rq_depth); > + msg_size = be32_to_cpu(reply->rq_msg_size); > + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->rq_mq_start)); > + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); > + mmap = ioremap_nocache(peer_pa, mmap_size); > + if (!mmap) { > + err = -ENOMEM; > + goto bail6; > + } > + > + c2_mq_init(&qp->rq_mq, > + be32_to_cpu(reply->rq_mq_index), > + q_size, > + msg_size, > + mmap + sizeof(struct c2_mq_shared), /* pool start */ > + mmap, /* peer */ > + C2_MQ_ADAPTER_TARGET); > + > + vq_repbuf_free(c2dev, reply); > + vq_req_free(c2dev, vq_req); > + > + spin_lock_irq(&c2dev->qp_table.lock); > + c2_array_set(&c2dev->qp_table.qp, > + qp->qpn & (c2dev->max_qp - 1), qp); > + spin_unlock_irq(&c2dev->qp_table.lock); > + > + return 0; > + > +bail6: > + iounmap(qp->sq_mq.peer); > +bail5: > + destroy_qp(c2dev, qp); > +bail4: > + vq_repbuf_free(c2dev, reply); > +bail3: > + vq_req_free(c2dev, vq_req); > +bail2: > + c2_free_mqsp(qp->rq_mq.shared); > +bail1: > + c2_free_mqsp(qp->sq_mq.shared); > +bail0: > + c2_free(&c2dev->qp_table.alloc, qp->qpn); > + return err; > +} > + > +void c2_free_qp(struct c2_dev *c2dev, > + struct c2_qp *qp) > +{ > + struct c2_cq *send_cq; > + struct c2_cq *recv_cq; > + > + send_cq = to_c2cq(qp->ibqp.send_cq); > + recv_cq = 
to_c2cq(qp->ibqp.recv_cq); > + > + /* > + * Lock CQs here, so that CQ polling code can do QP lookup > + * without taking a lock. > + */ > + spin_lock_irq(&send_cq->lock); > + if (send_cq != recv_cq) > + spin_lock(&recv_cq->lock); > + > + spin_lock(&c2dev->qp_table.lock); > + c2_array_clear(&c2dev->qp_table.qp, > + qp->qpn & (c2dev->max_qp - 1)); > + spin_unlock(&c2dev->qp_table.lock); > + > + if (send_cq != recv_cq) > + spin_unlock(&recv_cq->lock); > + spin_unlock_irq(&send_cq->lock); > + > + atomic_dec(&qp->refcount); > + wait_event(qp->wait, !atomic_read(&qp->refcount)); > + > + /* > + * Destroy qp in the rnic... > + */ > + destroy_qp(c2dev, qp); > + > + /* > + * Mark any unreaped CQEs as null and void. > + */ > + c2_cq_clean(c2dev, qp, send_cq->cqn); > + if (send_cq != recv_cq) > + c2_cq_clean(c2dev, qp, recv_cq->cqn); > + /* > + * Unmap the MQs and return the shared pointers > + * to the message pool. > + */ > + iounmap(qp->sq_mq.peer); > + iounmap(qp->rq_mq.peer); > + c2_free_mqsp(qp->sq_mq.shared); > + c2_free_mqsp(qp->rq_mq.shared); > + > + c2_free(&c2dev->qp_table.alloc, qp->qpn); > +} > + > +/* > + * Function: move_sgl > + * > + * Description: > + * Move an SGL from the user's work request struct into a CCIL Work Request > + * message, swapping to WR byte order and ensuring the total length doesn't > + * overflow. > + * > + * IN: > + * dst - ptr to CCIL Work Request message SGL memory. > + * src - ptr to the consumer's SGL memory. > + * > + * OUT: none > + * > + * Return: > + * CCIL status codes. > + */ > +static int > +move_sgl(cc_data_addr_t *dst, struct ib_sge *src, int count, u32 *p_len, u8 > *actual_count) > +{ > + u32 tot = 0; /* running total */ > + u8 acount = 0; /* running total non-0 len sge's */ > + > + while (count > 0) { > + /* > + * If the addition of this SGE causes the > + * total SGL length to exceed 2^32-1, then > + * fail-n-bail.
> + * > + * If the current total plus the next element length > + * wraps, then it will go negative and be less than the > + * current total... > + */ > + if ((tot+src->length) < tot) { > + return -EINVAL; > + } > + /* > + * Bug: 1456 (as well as 1498 & 1643) > + * Skip over any sge's supplied with len=0 > + */ > + if (src->length) { > + tot += src->length; > + dst->stag = cpu_to_be32(src->lkey); > + dst->to = cpu_to_be64(src->addr); > + dst->length = cpu_to_be32(src->length); > + dst++; > + acount++; > + } > + src++; > + count--; > + } > + > + if (acount == 0) { > + /* > + * Bug: 1476 (as well as 1498, 1456 and 1643) > + * Setup the SGL in the WR to make it easier for the RNIC. > + * This way, the FW doesn't have to deal with special cases. > + * Setting length=0 should be sufficient. > + */ > + dst->stag = 0; > + dst->to = 0; > + dst->length = 0; > + } > + > + *p_len = tot; > + *actual_count = acount; > + return 0; > +} > + > +/* > + * Function: c2_activity (private function) > + * > + * Description: > + * Post an mq index to the host->adapter activity fifo. > + * > + * IN: > + * c2dev - ptr to c2dev structure > + * mq_index - mq index to post > + * shared - value most recently written to shared > + * > + * OUT: > + * > + * Return: > + * none > + */ > +static inline void > +c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) > +{ > + /* > + * First read the register to see if the FIFO is full, and if so, > + * spin until it's not. This isn't perfect -- there is no > + * synchronization among the clients of the register, but in > + * practice it prevents multiple CPU from hammering the bus > + * with PCI RETRY. Note that when this does happen, the card > + * cannot get on the bus and the card and system hang in a > + * deadlock -- thus the need for this code. 
[TOT] > + */ > + while (c2_read32(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { > + set_current_state(TASK_UNINTERRUPTIBLE); > + schedule_timeout(0); > + } > + > + c2_write32(c2dev->regs + PCI_BAR0_ADAPTER_HINT, CC_HINT_MAKE(mq_index, shared)); > +} > + > +/* > + * Function: qp_wr_post > + * > + * Description: > + * This in-line function allocates a MQ msg, then moves the host-copy of > + * the completed WR into msg. Then it posts the message. > + * > + * IN: > + * q - ptr to user MQ. > + * wr - ptr to host-copy of the WR. > + * qp - ptr to user qp > + * size - Number of bytes to post. Assumed to be divisible by 4. > + * > + * OUT: none > + * > + * Return: > + * CCIL status codes. > + */ > +static int > +qp_wr_post(struct c2_mq *q, ccwr_t *wr, struct c2_qp *qp, u32 size) > +{ > + ccwr_t *msg; > + > + msg = c2_mq_alloc(q); > + if (msg == NULL) { > + return -EINVAL; > + } > + > +#ifdef CCMSGMAGIC > + ((ccwr_hdr_t *)wr)->magic = cpu_to_be32(CCWR_MAGIC); > +#endif > + > + /* > + * Since all header fields in the WR are the same as the > + * CQE, set the following so the adapter need not. 
> + */ > + c2_wr_set_result(wr, CCERR_PENDING); > + > + /* > + * Copy the wr down to the adapter > + */ > + memcpy((void *)msg, (void *)wr, size); > + > + c2_mq_produce(q); > + return 0; > +} > + > + > +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, > + struct ib_send_wr **bad_wr) > +{ > + struct c2_dev *c2dev = to_c2dev(ibqp->device); > + struct c2_qp *qp = to_c2qp(ibqp); > + ccwr_t wr; > + int err = 0; > + > + u32 flags; > + u32 tot_len; > + u8 actual_sge_count; > + u32 msg_size; > + > + if (qp->state > IB_QPS_RTS) > + return -EINVAL; > + > + while (ib_wr) { > + > + flags = 0; > + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; > + if (ib_wr->send_flags & IB_SEND_SIGNALED) { > + flags |= SQ_SIGNALED; > + } > + > + switch (ib_wr->opcode) { > + case IB_WR_SEND: > + if (ib_wr->send_flags & IB_SEND_SOLICITED) { > + c2_wr_set_id(&wr, CC_WR_TYPE_SEND_SE); > + msg_size = sizeof(ccwr_send_se_req_t); > + } else { > + c2_wr_set_id(&wr, CC_WR_TYPE_SEND); > + msg_size = sizeof(ccwr_send_req_t); > + } > + > + wr.sqwr.send.remote_stag = 0; > + msg_size += sizeof(cc_data_addr_t) * ib_wr->num_sge; > + if (ib_wr->num_sge > qp->send_sgl_depth) { > + err = -EINVAL; > + break; > + } > + if (ib_wr->send_flags & IB_SEND_FENCE) { > + flags |= SQ_READ_FENCE; > + } > + err = move_sgl((cc_data_addr_t*)&(wr.sqwr.send.data), > + ib_wr->sg_list, > + ib_wr->num_sge, > + &tot_len, > + &actual_sge_count); > + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); > + c2_wr_set_sge_count(&wr, actual_sge_count); > + break; > + case IB_WR_RDMA_WRITE: > + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_WRITE); > + msg_size = sizeof(ccwr_rdma_write_req_t) + > + (sizeof(cc_data_addr_t) * ib_wr->num_sge); > + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { > + err = -EINVAL; > + break; > + } > + if (ib_wr->send_flags & IB_SEND_FENCE) { > + flags |= SQ_READ_FENCE; > + } > + wr.sqwr.rdma_write.remote_stag = cpu_to_be32(ib_wr->wr.rdma.rkey); > + wr.sqwr.rdma_write.remote_to = 
cpu_to_be64(ib_wr->wr.rdma.remote_addr); > + err = move_sgl((cc_data_addr_t*) > + &(wr.sqwr.rdma_write.data), > + ib_wr->sg_list, > + ib_wr->num_sge, > + &tot_len, > + &actual_sge_count); > + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); > + c2_wr_set_sge_count(&wr, actual_sge_count); > + break; > + case IB_WR_RDMA_READ: > + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_READ); > + msg_size = sizeof(ccwr_rdma_read_req_t); > + > + /* IWarp only supports 1 sge for RDMA reads */ > + if (ib_wr->num_sge > 1) { > + err = -EINVAL; > + break; > + } > + > + /* > + * Move the local and remote stag/to/len into the WR. > + */ > + wr.sqwr.rdma_read.local_stag = > + cpu_to_be32(ib_wr->sg_list->lkey); > + wr.sqwr.rdma_read.local_to = > + cpu_to_be64(ib_wr->sg_list->addr); > + wr.sqwr.rdma_read.remote_stag = > + cpu_to_be32(ib_wr->wr.rdma.rkey); > + wr.sqwr.rdma_read.remote_to = > + cpu_to_be64(ib_wr->wr.rdma.remote_addr); > + wr.sqwr.rdma_read.length = > + cpu_to_be32(ib_wr->sg_list->length); > + break; > + default: > + /* error */ > + msg_size = 0; > + err = -EINVAL; > + break; > + } > + > + /* > + * If we had an error on the last wr build, then > + * break out. Possible errors include bogus WR > + * type, and a bogus SGL length... > + */ > + if (err) { > + break; > + } > + > + /* > + * Store flags > + */ > + c2_wr_set_flags(&wr, flags); > + > + /* > + * Post the puppy! > + */ > + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); > + if (err) { > + break; > + } > + > + /* > + * Enqueue mq index to activity FIFO.
> + */ > + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); > + > + ib_wr = ib_wr->next; > + } > + > + if (err) > + *bad_wr = ib_wr; > + return err; > +} > + > +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, > + struct ib_recv_wr **bad_wr) > +{ > + struct c2_dev *c2dev = to_c2dev(ibqp->device); > + struct c2_qp *qp = to_c2qp(ibqp); > + ccwr_t wr; > + int err = 0; > + > + if (qp->state > IB_QPS_RTS) > + return -EINVAL; > + > + /* > + * Try and post each work request > + */ > + while (ib_wr) { > + u32 tot_len; > + u8 actual_sge_count; > + > + if (ib_wr->num_sge > qp->recv_sgl_depth) { > + err = -EINVAL; > + break; > + } > + > + /* > + * Create local host-copy of the WR > + */ > + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; > + c2_wr_set_id(&wr, CCWR_RECV); > + c2_wr_set_flags(&wr, 0); > + > + /* sge_count is limited to eight bits. */ > + assert(ib_wr->num_sge < 256); > + err = move_sgl((cc_data_addr_t*)&(wr.rqwr.data), > + ib_wr->sg_list, > + ib_wr->num_sge, > + &tot_len, > + &actual_sge_count); > + c2_wr_set_sge_count(&wr, actual_sge_count); > + > + /* > + * If we had an error on the last wr build, then > + * break out. Possible errors include bogus WR > + * type, and a bogus SGL length... 
> + */ > + if (err) { > + break; > + } > + > + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); > + if (err) { > + break; > + } > + > + /* > + * Enqueue mq index to activity FIFO > + */ > + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); > + > + ib_wr = ib_wr->next; > + } > + > + if (err) > + *bad_wr = ib_wr; > + return err; > +} > + > +int __devinit c2_init_qp_table(struct c2_dev *c2dev) > +{ > + int err; > + > + spin_lock_init(&c2dev->qp_table.lock); > + > + err = c2_alloc_init(&c2dev->qp_table.alloc, > + c2dev->max_qp, > + 0); > + if (err) > + return err; > + > + err = c2_array_init(&c2dev->qp_table.qp, > + c2dev->max_qp); > + if (err) { > + c2_alloc_cleanup(&c2dev->qp_table.alloc); > + return err; > + } > + > + return 0; > +} > + > +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) > +{ > + c2_alloc_cleanup(&c2dev->qp_table.alloc); > +} > Index: hw/amso1100/cc_ivn.h > =================================================================== > --- hw/amso1100/cc_ivn.h (revision 0) > +++ hw/amso1100/cc_ivn.h (revision 0) > @@ -0,0 +1,57 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#ifndef _CC_IVN_H_ > +#define _CC_IVN_H_ > + > +/* > + * The following value must be incremented each time structures shared > + * between the firmware and host drivers are changed. This includes > + * structures, types, and Max number of queue pairs.. > + */ > +#define CC_IVN_BASE 18 > + > +/* Used to mask of the CCMSGMAGIC bit */ > +#define CC_IVN_MASK 0x7fffffff > + > + > +/* > + * The high order bit indicates a CCMSGMAGIC build, which changes the > + * adapter<->host message formats. > + */ > +#ifdef CCMSGMAGIC > +#define CC_IVN (CC_IVN_BASE | 0x80000000) > +#else > +#define CC_IVN (CC_IVN_BASE & 0x7fffffff) > +#endif > + > +#endif /* _CC_IVN_H_ */ > Index: hw/amso1100/c2_mq.h > =================================================================== > --- hw/amso1100/c2_mq.h (revision 0) > +++ hw/amso1100/c2_mq.h (revision 0) > @@ -0,0 +1,104 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef _C2_MQ_H_ > +#define _C2_MQ_H_ > +#include > +#include "c2_wr.h" > + > +enum c2_shared_regs { > + > + C2_SHARED_ARMED = 0x10, > + C2_SHARED_NOTIFY = 0x18, > + C2_SHARED_SHARED = 0x40, > +}; > + > +struct c2_mq_shared { > + u16 unused1; > + u8 armed; > + u8 notification_type; > + u32 unused2; > + u16 shared; > + /* Pad to 64 bytes. */ > + u8 pad[64-sizeof(u16)-2*sizeof(u8)-sizeof(u32)-sizeof(u16)]; > +}; > + > +enum c2_mq_type { > + C2_MQ_HOST_TARGET = 1, > + C2_MQ_ADAPTER_TARGET = 2, > +}; > + > +/* > + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. > + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
> + */ > +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ > +struct c2_mq { > + u32 magic; > + u8* msg_pool; > + u16 hint_count; > + u16 priv; > + struct c2_mq_shared *peer; > + u16* shared; > + u32 q_size; > + u32 msg_size; > + u32 index; > + enum c2_mq_type type; > +}; > + > +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size > +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) > + > +static __inline__ int > +c2_mq_empty(struct c2_mq *q) > +{ > + return q->priv == be16_to_cpu(*q->shared); > +} > + > +static __inline__ int > +c2_mq_full(struct c2_mq *q) > +{ > + return q->priv == (be16_to_cpu(*q->shared) + q->q_size-1) % q->q_size; > +} > + > +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); > +extern void * c2_mq_alloc(struct c2_mq *q); > +extern void c2_mq_produce(struct c2_mq *q); > +extern void * c2_mq_consume(struct c2_mq *q); > +extern void c2_mq_free(struct c2_mq *q); > +extern u32 c2_mq_count(struct c2_mq *q); > +extern void c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, > + u32 msg_size, u8 *pool_start, > + u16 *peer, u32 type); > + > +#endif /* _C2_MQ_H_ */ > Index: hw/amso1100/c2_user.h > =================================================================== > --- hw/amso1100/c2_user.h (revision 0) > +++ hw/amso1100/c2_user.h (revision 0) > @@ -0,0 +1,82 @@ > +/* > + * Copyright (c) 2005 Topspin Communications. All rights reserved. > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef C2_USER_H > +#define C2_USER_H > + > +#include > + > +/* > + * Make sure that all structs defined in this file remain laid out so > + * that they pack the same way on 32-bit and 64-bit architectures (to > + * avoid incompatibility between 32-bit userspace and 64-bit kernels). > + * In particular do not use pointer types -- pass pointers in __u64 > + * instead. 
> + */ > + > +struct c2_alloc_ucontext_resp { > + __u32 qp_tab_size; > + __u32 uarc_size; > +}; > + > +struct c2_alloc_pd_resp { > + __u32 pdn; > + __u32 reserved; > +}; > + > +struct c2_create_cq { > + __u32 lkey; > + __u32 pdn; > + __u64 arm_db_page; > + __u64 set_db_page; > + __u32 arm_db_index; > + __u32 set_db_index; > +}; > + > +struct c2_create_cq_resp { > + __u32 cqn; > + __u32 reserved; > +}; > + > +struct c2_create_qp { > + __u32 lkey; > + __u32 reserved; > + __u64 sq_db_page; > + __u64 rq_db_page; > + __u32 sq_db_index; > + __u32 rq_db_index; > +}; > + > +#endif /* C2_USER_H */ > Index: hw/amso1100/c2_ae.c > =================================================================== > --- hw/amso1100/c2_ae.c (revision 0) > +++ hw/amso1100/c2_ae.c (revision 0) > @@ -0,0 +1,216 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#include "c2.h" > +#include > +#include "cc_status.h" > +#include "cc_ae.h" > + > +static int c2_convert_cm_status(u32 cc_status) > +{ > + switch (cc_status) { > + case CC_CONN_STATUS_SUCCESS: > + return 0; > + case CC_CONN_STATUS_REJECTED: > + return -ENETRESET; > + case CC_CONN_STATUS_REFUSED: > + return -ECONNREFUSED; > + case CC_CONN_STATUS_TIMEDOUT: > + return -ETIMEDOUT; > + case CC_CONN_STATUS_NETUNREACH: > + return -ENETUNREACH; > + case CC_CONN_STATUS_HOSTUNREACH: > + return -EHOSTUNREACH; > + case CC_CONN_STATUS_INVALID_RNIC: > + return -EINVAL; > + case CC_CONN_STATUS_INVALID_QP: > + return -EINVAL; > + case CC_CONN_STATUS_INVALID_QP_STATE: > + return -EINVAL; > + default: > + panic("Unable to convert CM status: %d\n", cc_status); > + break; > + } > +} > + > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > +{ > + struct c2_mq *mq = c2dev->qptr_array[mq_index]; > + ccwr_t *wr; > + void *resource_user_context; > + struct iw_cm_event cm_event; > + struct ib_event ib_event; > + cc_resource_indicator_t resource_indicator; > + cc_event_id_t event_id; > + u8 *pdata = NULL; > + > + /* > + * retrieve the message > + */ > + wr = c2_mq_consume(mq); > + if (!wr) > + return; > + > + memset(&cm_event, 0, sizeof(cm_event)); > + > + event_id = c2_wr_get_id(wr); > + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); > + resource_user_context = (void *)(unsigned long)wr->ae.ae_generic.user_context; > + > + cm_event.status =
c2_convert_cm_status(c2_wr_get_result(wr)); > + > + switch (resource_indicator) { > + case CC_RES_IND_QP: { > + > + struct c2_qp *qp = (struct c2_qp *)resource_user_context; > + > + switch (event_id) { > + case CCAE_ACTIVE_CONNECT_RESULTS: > + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; > + cm_event.local_addr.sin_addr.s_addr = > + wr->ae.ae_active_connect_results.laddr; > + cm_event.remote_addr.sin_addr.s_addr = > + wr->ae.ae_active_connect_results.raddr; > + cm_event.local_addr.sin_port = > + wr->ae.ae_active_connect_results.lport; > + cm_event.remote_addr.sin_port = > + wr->ae.ae_active_connect_results.rport; > + cm_event.private_data_len = > + be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); > + > + if (cm_event.private_data_len) { > + /* XXX */ > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + dprintk("Ignored connect request -- no memory for pdata" > + "private_data_len=%d\n", cm_event.private_data_len); > + goto ignore_it; > + } > + > + memcpy(pdata, > + wr->ae.ae_active_connect_results.private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (qp->cm_id->event_handler) > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > + > + break; > + > + case CCAE_TERMINATE_MESSAGE_RECEIVED: > + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > + ib_event.device = &c2dev->ibdev; > + ib_event.element.qp = &qp->ibqp; > + ib_event.event = IB_EVENT_QP_REQ_ERR; > + > + if(qp->ibqp.event_handler) > + (*qp->ibqp.event_handler)(&ib_event, > + qp->ibqp.qp_context); > + case CCAE_BAD_CLOSE: > + case CCAE_LLP_CLOSE_COMPLETE: > + case CCAE_LLP_CONNECTION_RESET: > + case CCAE_LLP_CONNECTION_LOST: > + default: > + cm_event.event = IW_CM_EVENT_CLOSE; > + if (qp->cm_id->event_handler) > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > + > + } > + break; > + } > + > + case CC_RES_IND_EP: { > + > + struct iw_cm_id* cm_id = (struct 
iw_cm_id*)resource_user_context; > + > + dprintk("CC_RES_IND_EP event_id=%d\n", event_id); > + if (event_id != CCAE_CONNECTION_REQUEST) { > + dprintk("%s: Invalid event_id: %d\n", __FUNCTION__, event_id); > + break; > + } > + > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > + cm_event.provider_id = > + wr->ae.ae_connection_request.cr_handle; > + cm_event.local_addr.sin_addr.s_addr = > + wr->ae.ae_connection_request.laddr; > + cm_event.remote_addr.sin_addr.s_addr = > + wr->ae.ae_connection_request.raddr; > + cm_event.local_addr.sin_port = > + wr->ae.ae_connection_request.lport; > + cm_event.remote_addr.sin_port = > + wr->ae.ae_connection_request.rport; > + cm_event.private_data_len = > + be32_to_cpu(wr->ae.ae_connection_request.private_data_length); > + > + if (cm_event.private_data_len) { > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + dprintk("Ignored connect request -- no memory for pdata" > + "private_data_len=%d\n", cm_event.private_data_len); > + goto ignore_it; > + } > + memcpy(pdata, > + wr->ae.ae_connection_request.private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + } > + > + case CC_RES_IND_CQ: { > + struct c2_cq *cq = (struct c2_cq *)resource_user_context; > + > + dprintk("IB_EVENT_CQ_ERR\n"); > + ib_event.device = &c2dev->ibdev; > + ib_event.element.cq = &cq->ibcq; > + ib_event.event = IB_EVENT_CQ_ERR; > + > + if (cq->ibcq.event_handler) > + cq->ibcq.event_handler(&ib_event, cq->ibcq.cq_context); > + } > + > + default: > + break; > + } > + > + ignore_it: > + c2_mq_free(mq); > +} > Index: hw/amso1100/c2.h > =================================================================== > --- hw/amso1100/c2.h (revision 0) > +++ hw/amso1100/c2.h (revision 0) > @@ -0,0 +1,617 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef __C2_H > +#define __C2_H > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "c2_provider.h" > +#include "c2_mq.h" > +#include "cc_status.h" > + > +#define DRV_NAME "c2" > +#define DRV_VERSION "1.1" > +#define PFX DRV_NAME ": " > + > +#ifdef C2_DEBUG > +#define assert(expr) \ > + if(!(expr)) { \ > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > + } > +#define dprintk(fmt, args...) 
do {printk(KERN_INFO PFX fmt, ##args);} while (0) > +#else > +#define assert(expr) do {} while (0) > +#define dprintk(fmt, args...) do {} while (0) > +#endif /* C2_DEBUG */ > + > +#define BAR_0 0 > +#define BAR_2 2 > +#define BAR_4 4 > + > +#define RX_BUF_SIZE (1536 + 8) > +#define ETH_JUMBO_MTU 9000 > +#define C2_MAGIC "CEPHEUS" > +#define C2_VERSION 4 > +#define C2_IVN (18 & 0x7fffffff) > + > +#define C2_REG0_SIZE (16 * 1024) > +#define C2_REG2_SIZE (2 * 1024 * 1024) > +#define C2_REG4_SIZE (256 * 1024 * 1024) > +#define C2_NUM_TX_DESC 341 > +#define C2_NUM_RX_DESC 256 > +#define C2_PCI_REGS_OFFSET (0x10000) > +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) > +#define C2_RXP_HRXDQ_SIZE (4096) > +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) > +#define C2_TXP_HTXDQ_SIZE (4096) > +#define C2_TX_TIMEOUT (6*HZ) > + > +/* CEPHEUS */ > +static const u8 c2_magic[] = { > + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 > + }; > + > +enum adapter_pci_regs { > + C2_REGS_MAGIC = 0x0000, > + C2_REGS_VERS = 0x0008, > + C2_REGS_IVN = 0x000C, > + C2_REGS_PCI_WINSIZE = 0x0010, > + C2_REGS_Q0_QSIZE = 0x0014, > + C2_REGS_Q0_MSGSIZE = 0x0018, > + C2_REGS_Q0_POOLSTART = 0x001C, > + C2_REGS_Q0_SHARED = 0x0020, > + C2_REGS_Q1_QSIZE = 0x0024, > + C2_REGS_Q1_MSGSIZE = 0x0028, > + C2_REGS_Q1_SHARED = 0x0030, > + C2_REGS_Q2_QSIZE = 0x0034, > + C2_REGS_Q2_MSGSIZE = 0x0038, > + C2_REGS_Q2_SHARED = 0x0040, > + C2_REGS_ENADDR = 0x004C, > + C2_REGS_RDMA_ENADDR = 0x0054, > + C2_REGS_HRX_CUR = 0x006C, > +}; > + > +struct c2_adapter_pci_regs { > + char reg_magic[8]; > + u32 version; > + u32 ivn; > + u32 pci_window_size; > + u32 q0_q_size; > + u32 q0_msg_size; > + u32 q0_pool_start; > + u32 q0_shared; > + u32 q1_q_size; > + u32 q1_msg_size; > + u32 q1_pool_start; > + u32 q1_shared; > + u32 q2_q_size; > + u32 q2_msg_size; > + u32 q2_pool_start; > + u32 q2_shared; > + u32 log_start; > + u32 log_size; > + u8 host_enaddr[8]; > + u8 rdma_enaddr[8]; > + u32 crash_entry; > + u32 
crash_ready[2]; > + u32 fw_txd_cur; > + u32 fw_hrxd_cur; > + u32 fw_rxd_cur; > +}; > + > +enum pci_regs { > + C2_HISR = 0x0000, > + C2_DISR = 0x0004, > + C2_HIMR = 0x0008, > + C2_DIMR = 0x000C, > + C2_NISR0 = 0x0010, > + C2_NISR1 = 0x0014, > + C2_NIMR0 = 0x0018, > + C2_NIMR1 = 0x001C, > + C2_IDIS = 0x0020, > +}; > + > +enum { > + C2_PCI_HRX_INT = 1<<8, > + C2_PCI_HTX_INT = 1<<17, > + C2_PCI_HRX_QUI = 1<<31, > +}; > + > +/* > + * Cepheus registers in BAR0. > + */ > +struct c2_pci_regs { > + u32 hostisr; > + u32 dmaisr; > + u32 hostimr; > + u32 dmaimr; > + u32 netisr0; > + u32 netisr1; > + u32 netimr0; > + u32 netimr1; > + u32 int_disable; > +}; > + > +/* TXP flags */ > +enum c2_txp_flags { > + TXP_HTXD_DONE = 0, > + TXP_HTXD_READY = 1<<0, > + TXP_HTXD_UNINIT = 1<<1, > +}; > + > +/* RXP flags */ > +enum c2_rxp_flags { > + RXP_HRXD_UNINIT = 0, > + RXP_HRXD_READY = 1<<0, > + RXP_HRXD_DONE = 1<<1, > +}; > + > +/* RXP status */ > +enum c2_rxp_status { > + RXP_HRXD_ZERO = 0, > + RXP_HRXD_OK = 1<<0, > + RXP_HRXD_BUF_OV = 1<<1, > +}; > + > +/* TXP descriptor fields */ > +enum txp_desc { > + C2_TXP_FLAGS = 0x0000, > + C2_TXP_LEN = 0x0002, > + C2_TXP_ADDR = 0x0004, > +}; > + > +/* RXP descriptor fields */ > +enum rxp_desc { > + C2_RXP_FLAGS = 0x0000, > + C2_RXP_STATUS = 0x0002, > + C2_RXP_COUNT = 0x0004, > + C2_RXP_LEN = 0x0006, > + C2_RXP_ADDR = 0x0008, > +}; > + > +struct c2_txp_desc { > + u16 flags; > + u16 len; > + u64 addr; > +} __attribute__ ((packed)); > + > +struct c2_rxp_desc { > + u16 flags; > + u16 status; > + u16 count; > + u16 len; > + u64 addr; > +} __attribute__ ((packed)); > + > +struct c2_rxp_hdr { > + u16 flags; > + u16 status; > + u16 len; > + u16 rsvd; > +} __attribute__ ((packed)); > + > +struct c2_tx_desc { > + u32 len; > + u32 status; > + dma_addr_t next_offset; > +}; > + > +struct c2_rx_desc { > + u32 len; > + u32 status; > + dma_addr_t next_offset; > +}; > + > +struct c2_alloc { > + u32 last; > + u32 max; > + spinlock_t lock; > + unsigned long *table; 
> +}; > + > +struct c2_array { > + struct { > + void **page; > + int used; > + } *page_list; > +}; > + > +/* > + * The MQ shared pointer pool is organized as a linked list of > + * chunks. Each chunk contains a linked list of free shared pointers > + * that can be allocated to a given user mode client. > + * > + */ > +struct sp_chunk { > + struct sp_chunk* next; > + u32 gfp_mask; > + u16 head; > + u16 shared_ptr[0]; > +}; > + > +struct c2_pd_table { > + struct c2_alloc alloc; > + struct c2_array pd; > +}; > + > +struct c2_qp_table { > + struct c2_alloc alloc; > + u32 rdb_base; > + int rdb_shift; > + int sqp_start; > + spinlock_t lock; > + struct c2_array qp; > + struct c2_icm_table *qp_table; > + struct c2_icm_table *eqp_table; > + struct c2_icm_table *rdb_table; > +}; > + > +struct c2_element { > + struct c2_element *next; > + void *ht_desc; /* host descriptor */ > + void *hw_desc; /* hardware descriptor */ > + struct sk_buff *skb; > + dma_addr_t mapaddr; > + u32 maplen; > +}; > + > +struct c2_ring { > + struct c2_element *to_clean; > + struct c2_element *to_use; > + struct c2_element *start; > + unsigned long count; > +}; > + > +struct c2_dev { > + struct ib_device ibdev; > + void __iomem *regs; > + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ > + void __iomem *mmio_rxp_ring; > + spinlock_t lock; > + struct pci_dev *pcidev; > + struct net_device *netdev; > + unsigned int cur_tx; > + unsigned int cur_rx; > + u64 fw_ver; > + u32 adapter_handle; > + u32 hw_rev; > + u32 device_cap_flags; > + u32 vendor_id; > + u32 vendor_part_id; > + void __iomem *kva; /* KVA device memory */ > + void __iomem *pa; /* PA device memory */ > + void **qptr_array; > + > + kmem_cache_t* host_msg_cache; > + //kmem_cache_t* ae_msg_cache; > + > + struct list_head cca_link; /* adapter list */ > + struct list_head eh_wakeup_list; /* event wakeup list */ > + wait_queue_head_t req_vq_wo; > + > + /* RNIC Limits */ > + u32 max_mr; > + u32 max_mr_size; > + u32 max_qp; > + 
u32 max_qp_wr; > + u32 max_sge; > + u32 max_cq; > + u32 max_cqe; > + u32 max_pd; > + > + struct c2_pd_table pd_table; > + struct c2_qp_table qp_table; > +#if 0 > + struct c2_mr_table mr_table; > +#endif > + int ports; /* num of GigE ports */ > + int devnum; > + spinlock_t vqlock; /* sync vbs req MQ */ > + > + /* Verbs Queues */ > + struct c2_mq req_vq; /* Verbs Request MQ */ > + struct c2_mq rep_vq; /* Verbs Reply MQ */ > + struct c2_mq aeq; /* Async Events MQ */ > + > + /* Kernel client MQs */ > + struct sp_chunk* kern_mqsp_pool; > + > + /* Device updates these values when posting messages to a host > + * target queue */ > + u16 req_vq_shared; > + u16 rep_vq_shared; > + u16 aeq_shared; > + u16 irq_claimed; > + > + /* > + * Shared host target pages for user-accessible MQs. > + */ > + int hthead; /* index of first free entry */ > + void* htpages; /* kernel vaddr */ > + int htlen; /* length of htpages memory */ > + void* htuva; /* user mapped vaddr */ > + spinlock_t htlock; /* serialize allocation */ > + > + u64 adapter_hint_uva; /* access to the activity FIFO */ > + > + spinlock_t aeq_lock; > + spinlock_t rnic_lock; > + > + > + u16 hint_count; > + u16 hints_read; > + > + int init; /* TRUE if it's ready */ > + char ae_cache_name[16]; > + char vq_cache_name[16]; > +}; > + > +struct c2_port { > + u32 msg_enable; > + struct c2_dev *c2dev; > + struct net_device *netdev; > + > + spinlock_t tx_lock; > + u32 tx_avail; > + struct c2_ring tx_ring; > + struct c2_ring rx_ring; > + > + void *mem; /* PCI memory for host rings */ > + dma_addr_t dma; > + unsigned long mem_size; > + > + u32 rx_buf_size; > + > + struct net_device_stats netstats; > +}; > + > +/* > + * Activity FIFO registers in BAR0. > + */ > +#define PCI_BAR0_HOST_HINT 0x100 > +#define PCI_BAR0_ADAPTER_HINT 0x2000 > + > +/* > + * Ammasso PCI vendor id and Cepheus PCI device id. 
> + */ > +#define CQ_ARMED 0x01 > +#define CQ_WAIT_FOR_DMA 0x80 > + > +/* > + * The format of a hint is as follows: > + * Lower 16 bits are the count of hints for the queue. > + * Next 15 bits are the qp_index > + * Upper most bit depends on who reads it: > + * If read by producer, then it means Full (1) or Not-Full (0) > + * If read by consumer, then it means Empty (1) or Not-Empty (0) > + */ > +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) > +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) > +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) > + > + > +/* > + * The following defines the offset in SDRAM for the cc_adapter_pci_regs_t > + * struct. > + */ > +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 > + > +#ifndef readq > +static inline u64 readq(const void __iomem *addr) > +{ > + u64 ret = readl(addr + 4); > + ret <<= 32; > + ret |= readl(addr); > + > + return ret; > +} > +#endif > + > +#ifndef writeq > +static inline void writeq(u64 val, void __iomem *addr) > +{ > + writel((u32) (val), addr); > + writel((u32) (val >> 32), (addr + 4)); > +} > +#endif > + > +/* Read from memory-mapped device */ > +static inline u64 c2_read64(const void __iomem *addr) > +{ > + return readq(addr); > +} > + > +static inline u32 c2_read32(const void __iomem *addr) > +{ > + return readl(addr); > +} > + > +static inline u16 c2_read16(const void __iomem *addr) > +{ > + return readw(addr); > +} > + > +static inline u8 c2_read8(const void __iomem *addr) > +{ > + return readb(addr); > +} > + > +/* Write to memory-mapped device */ > +static inline void c2_write64(void __iomem *addr, u64 val) > +{ > + writeq(val, addr); > +} > + > +static inline void c2_write32(void __iomem *addr, u32 val) > +{ > + writel(val, addr); > +} > + > +static inline void c2_write16(void __iomem *addr, u16 val) > +{ > + writew(val, addr); > +} > + > +static inline void c2_write8(void __iomem *addr, u8 val) > +{ > + writeb(val, addr); > +} > + > +#define 
C2_SET_CUR_RX(c2dev, cur_rx) \ > + c2_write32(c2dev->mmio_txp_ring + 4092, cpu_to_be32(cur_rx)) > + > +#define C2_GET_CUR_RX(c2dev) \ > + be32_to_cpu(c2_read32(c2dev->mmio_txp_ring + 4092)) > + > +static inline struct c2_dev *to_c2dev(struct ib_device* ibdev) > +{ > + return container_of(ibdev, struct c2_dev, ibdev); > +} > + > +static inline int c2_errno(void *reply) > +{ > + switch(c2_wr_get_result(reply)) { > + case CC_OK: > + return 0; > + case CCERR_NO_BUFS: > + case CCERR_INSUFFICIENT_RESOURCES: > + case CCERR_ZERO_RDMA_READ_RESOURCES: > + return -ENOMEM; > + case CCERR_MR_IN_USE: > + case CCERR_QP_IN_USE: > + return -EBUSY; > + case CCERR_ADDR_IN_USE: > + return -EADDRINUSE; > + case CCERR_ADDR_NOT_AVAIL: > + return -EADDRNOTAVAIL; > + case CCERR_CONN_RESET: > + return -ECONNRESET; > + case CCERR_NOT_IMPLEMENTED: > + case CCERR_INVALID_WQE: > + return -ENOSYS; > + case CCERR_QP_NOT_PRIVILEGED: > + return -EPERM; > + case CCERR_STACK_ERROR: > + return -EPROTO; > + case CCERR_ACCESS_VIOLATION: > + case CCERR_BASE_AND_BOUNDS_VIOLATION: > + return -EFAULT; > + case CCERR_STAG_STATE_NOT_INVALID: > + case CCERR_INVALID_ADDRESS: > + case CCERR_INVALID_CQ: > + case CCERR_INVALID_EP: > + case CCERR_INVALID_MODIFIER: > + case CCERR_INVALID_MTU: > + case CCERR_INVALID_PD_ID: > + case CCERR_INVALID_QP: > + case CCERR_INVALID_RNIC: > + case CCERR_INVALID_STAG: > + return -EINVAL; > + default: > + return -EAGAIN; > + } > +} > + > +/* Device */ > +extern int c2_register_device(struct c2_dev *c2dev); > +extern void c2_unregister_device(struct c2_dev *c2dev); > +extern int c2_rnic_init(struct c2_dev* c2dev); > +extern void c2_rnic_term(struct c2_dev* c2dev); > + > +/* QPs */ > +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, > + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); > +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); > +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, > + struct ib_qp_attr *attr, int attr_mask); > 
+extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, > + struct ib_send_wr **bad_wr); > +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, > + struct ib_recv_wr **bad_wr); > +extern int __devinit c2_init_qp_table(struct c2_dev *c2dev); > +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); > + > +/* PDs */ > +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); > +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); > +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); > +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); > + > +/* CQs */ > +extern int c2_init_cq(struct c2_dev *c2dev, int entries, struct c2_ucontext *ctx, > + struct c2_cq *cq); > +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); > +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); > +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); > +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); > +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); > + > +/* CM */ > +extern int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_service_create(struct iw_cm_id* cm_id, int backlog); > +extern int c2_llp_service_destroy(struct iw_cm_id* cm_id); > + > +/* MM */ > +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, > + int pbl_depth, u32 length, u64 *va, > + cc_acf_t acf, struct c2_mr *mr); > +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); > + > +/* AE */ > +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); > + > +/* Allocators */ > +extern u32 c2_alloc(struct c2_alloc *alloc); > +extern void c2_free(struct c2_alloc *alloc, u32 obj); 
> +extern int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved); > +extern void c2_alloc_cleanup(struct c2_alloc *alloc); > +extern int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** root); > +extern void c2_free_mqsp_pool(struct sp_chunk* root); > +extern u16* c2_alloc_mqsp(struct sp_chunk* head); > +extern void c2_free_mqsp(u16* mqsp); > +extern int c2_array_init(struct c2_array *array, int nent); > +extern void c2_array_clear(struct c2_array *array, int index); > +extern int c2_array_set(struct c2_array *array, int index, void *value); > +extern void *c2_array_get(struct c2_array *array, int index); > + > +#endif > + > Index: hw/amso1100/c2_vq.c > =================================================================== > --- hw/amso1100/c2_vq.c (revision 0) > +++ hw/amso1100/c2_vq.c (revision 0) > @@ -0,0 +1,272 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#include > +#include > + > +#include "c2_vq.h" > + > +/* > + * Verbs Request Objects: > + * > + * VQ Request Objects are allocated by the kernel verbs handlers. > + * They contain a wait object, a refcnt, an atomic bool indicating that the > + * adapter has replied, and a copy of the verb reply work request. > + * A pointer to the VQ Request Object is passed down in the context > + * field of the work request message, and reflected back by the adapter > + * in the verbs reply message. The function handle_vq() in the interrupt > + * path will use this pointer to: > + * 1) append a copy of the verbs reply message > + * 2) mark that the reply is ready > + * 3) wake up the kernel verbs handler blocked awaiting the reply. > + * > + * > + * The kernel verbs handlers do a "get" to put a 2nd reference on the > + * VQ Request object. If the kernel verbs handler exits before the adapter > + * can respond, this extra reference will keep the VQ Request object around > + * until the adapter's reply can be processed. The reason we need this is > + * because a pointer to this object is stuffed into the context field of > + * the verbs work request message, and reflected back in the reply message. > + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate > + * kernel verb handler that is blocked awaiting the verb reply. > + * So handle_vq() will do a "put" on the object when it's done accessing it. 
> + * NOTE: If we guarantee that the kernel verb handler will never bail before > + * getting the reply, then we don't need these refcnts. > + * > + * > + * VQ Request objects are freed by the kernel verbs handlers only > + * after the verb has been processed, or when the adapter fails and > + * does not reply. > + * > + * > + * Verbs Reply Buffers: > + * > + * VQ Reply bufs are local host memory copies of an outstanding Verb Request reply > + * message. They are always allocated by the kernel verbs handlers, and _may_ be > + * freed by either the kernel verbs handler -or- the interrupt handler. The > + * kernel verbs handler _must_ free the repbuf, then free the vq request object > + * in that order. > + */ > + > +int > +vq_init(struct c2_dev* c2dev) > +{ > + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", (char ) ('0' + c2dev->devnum)); > + c2dev->host_msg_cache = kmem_cache_create(c2dev->vq_cache_name, > + c2dev->rep_vq.msg_size, 0, > + SLAB_HWCACHE_ALIGN, NULL, NULL); > + if (c2dev->host_msg_cache == NULL) { > + return -ENOMEM; > + } > + return 0; > +} > + > +void > +vq_term(struct c2_dev* c2dev) > +{ > + kmem_cache_destroy(c2dev->host_msg_cache); > +} > + > +/* vq_req_alloc - allocate a VQ Request Object and initialize it. > + * The refcnt is set to 1. > + */ > +struct c2_vq_req * > +vq_req_alloc(struct c2_dev *c2dev) > +{ > + struct c2_vq_req *r; > + > + r = (struct c2_vq_req *)kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); > + if (r) { > + init_waitqueue_head(&r->wait_object); > + r->reply_msg = (u64)NULL; > + atomic_set(&r->refcnt, 1); > + atomic_set(&r->reply_ready, 0); > + } > + return r; > +} > + > + > +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler > + * has already freed the VQ Reply Buffer if it existed.
> + */ > +void > +vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) > +{ > + r->reply_msg = (u64)NULL; > + if (atomic_dec_and_test(&r->refcnt)) { > + kfree(r); > + } > +} > + > +/* vq_req_get - reference a VQ Request Object. Done > + * only in the kernel verbs handlers. > + */ > +void > +vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) > +{ > + atomic_inc(&r->refcnt); > +} > + > + > +/* vq_req_put - dereference and potentially free a VQ Request Object. > + * > + * This is only called by handle_vq() on the interrupt when it is done processing > + * a verb reply message. If the associated kernel verbs handler has already bailed, > + * then this put will actually free the VQ Request object _and_ the VQ Reply Buffer > + * if it exists. > + */ > +void > +vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) > +{ > + if (atomic_dec_and_test(&r->refcnt)) { > + if (r->reply_msg != (u64)NULL) > + vq_repbuf_free(c2dev, (void *)(unsigned long)r->reply_msg); > + kfree(r); > + } > +} > + > + > +/* > + * vq_repbuf_alloc - allocate a VQ Reply Buffer. > + */ > +void * > +vq_repbuf_alloc(struct c2_dev *c2dev) > +{ > + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); > +} > + > +/* > + * vq_send_wr - post a verbs request message to the Verbs Request Queue. > + * If a message is not available in the MQ, then block until one is available. > + * NOTE: handle_mq() on the interrupt context will wake up threads blocked here. > + * When the adapter drains the Verbs Request Queue, it inserts MQ index 0 into the > + * adapter->host activity fifo and interrupts the host. > + */ > +int > +vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr) > +{ > + void *msg; > + wait_queue_t __wait; > + > + /* > + * grab adapter vq lock > + */ > + spin_lock(&c2dev->vqlock); > + > + /* > + * allocate msg > + */ > + msg = c2_mq_alloc(&c2dev->req_vq); > + > + /* > + * If we cannot get a msg, then we'll wait > + * When messages are available, the int handler will wake_up() > + * any waiters.
> + */ > + while (msg == NULL) { > + init_waitqueue_entry(&__wait, current); > + add_wait_queue(&c2dev->req_vq_wo, &__wait); > + spin_unlock(&c2dev->vqlock); > + for (;;) { > + set_current_state(TASK_INTERRUPTIBLE); > + if (!c2_mq_full(&c2dev->req_vq)) { > + break; > + } > + if (!signal_pending(current)) { > + schedule_timeout(1*HZ); /* 1 second... */ > + continue; > + } > + set_current_state(TASK_RUNNING); > + remove_wait_queue(&c2dev->req_vq_wo, &__wait); > + return -EINTR; > + } > + set_current_state(TASK_RUNNING); > + remove_wait_queue(&c2dev->req_vq_wo, &__wait); > + spin_lock(&c2dev->vqlock); > + msg = c2_mq_alloc(&c2dev->req_vq); > + } > + > + /* > + * copy wr into adapter msg > + */ > + memcpy(msg, wr, c2dev->req_vq.msg_size); > + > + /* > + * post msg > + */ > + c2_mq_produce(&c2dev->req_vq); > + > + /* > + * release adapter vq lock > + */ > + spin_unlock(&c2dev->vqlock); > + return 0; > +} > + > + > +/* > + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. > + */ > +int > +vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) > +{ > + wait_queue_t __wait; > + int rc = 0; > + > + /* > + * Add this request to the wait queue. > + */ > + init_waitqueue_entry(&__wait, current); > + add_wait_queue(&req->wait_object, &__wait); > + for (;;) { > + set_current_state(TASK_UNINTERRUPTIBLE); > + if (atomic_read(&req->reply_ready)) { > + break; > + } > + if (schedule_timeout(60*HZ) == 0) { > + rc = -ETIMEDOUT; > + break; > + } > + } > + set_current_state(TASK_RUNNING); > + remove_wait_queue(&req->wait_object, &__wait); > + return rc; > +} > + > +/* > + * vq_repbuf_free - Free a Verbs Reply Buffer. 
> + */ > +void > +vq_repbuf_free(struct c2_dev *c2dev, void *reply) > +{ > + kmem_cache_free(c2dev->host_msg_cache, reply); > +} > Index: hw/amso1100/README > =================================================================== > --- hw/amso1100/README (revision 0) > +++ hw/amso1100/README (revision 0) > @@ -0,0 +1,11 @@ > + > +This is the OpenIB iWARP driver for the AMSO1100 HCA from > +Open Grid Computing. The adapter is a 1Gb RDMA capable PCI-X RNIC. > + > +The driver implements an iWARP CM Provider and OpenIB verbs > +provider. The company that created the device (Ammasso, Inc.) > +is no longer in business, however, limited quantities of the cards > +are available for development purposes from Open Grid Computing. > + > +Please contact 512-343-9196 x 108 or e-mail tom at opengridcomputing.com > +for more information. > Index: hw/amso1100/c2_provider.c > =================================================================== > --- hw/amso1100/c2_provider.c (revision 0) > +++ hw/amso1100/c2_provider.c (revision 0) > @@ -0,0 +1,704 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > + > +#include > +#include "c2.h" > +#include "c2_provider.h" > +#include "c2_user.h" > + > +static int c2_query_device(struct ib_device *ibdev, > + struct ib_device_attr *props) > +{ > + struct c2_dev* c2dev = to_c2dev(ibdev); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + memset(props, 0, sizeof *props); > + > + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); > + memcpy(&props->node_guid, c2dev->netdev->dev_addr, 6); > + > + props->fw_ver = c2dev->fw_ver; > + props->device_cap_flags = c2dev->device_cap_flags; > + props->vendor_id = c2dev->vendor_id; > + props->vendor_part_id = c2dev->vendor_part_id; > + props->hw_ver = c2dev->hw_rev; > + props->max_mr_size = ~0ull; > + props->max_qp = c2dev->max_qp; > + props->max_qp_wr = c2dev->max_qp_wr; > + props->max_sge = c2dev->max_sge; > + props->max_cq = c2dev->max_cq; > + props->max_cqe = c2dev->max_cqe; > + props->max_mr = c2dev->max_mr; > + props->max_pd = c2dev->max_pd; > + props->max_qp_rd_atom = 0; > + 
props->max_qp_init_rd_atom = 0; > + props->local_ca_ack_delay = 0; > + > + return 0; > +} > + > +static int c2_query_port(struct ib_device *ibdev, > + u8 port, struct ib_port_attr *props) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + props->max_mtu = IB_MTU_4096; > + props->lid = 0; > + props->lmc = 0; > + props->sm_lid = 0; > + props->sm_sl = 0; > + props->state = IB_PORT_ACTIVE; > + props->phys_state = 0; > + props->port_cap_flags = > + IB_PORT_CM_SUP | > + IB_PORT_SNMP_TUNNEL_SUP | > + IB_PORT_REINIT_SUP | > + IB_PORT_DEVICE_MGMT_SUP | > + IB_PORT_VENDOR_CLASS_SUP| > + IB_PORT_BOOT_MGMT_SUP; > + props->gid_tbl_len = 128; > + props->pkey_tbl_len = 1; > + props->qkey_viol_cntr = 0; > + props->active_width = 1; > + props->active_speed = 1; > + > + return 0; > +} > + > +static int c2_modify_port(struct ib_device *ibdev, > + u8 port, int port_modify_mask, > + struct ib_port_modify *props) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return 0; > +} > + > +static int c2_query_pkey(struct ib_device *ibdev, > + u8 port, u16 index, u16 *pkey) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + *pkey = 0; > + return 0; > +} > + > +static int c2_query_gid(struct ib_device *ibdev, u8 port, > + int index, union ib_gid *gid) > +{ > + struct c2_dev* c2dev = to_c2dev(ibdev); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + memcpy(&(gid->raw[0]),c2dev->netdev->dev_addr, MAX_ADDR_LEN); > + > + return 0; > +} > + > +/* Allocate the user context data structure. This keeps track > + * of all objects associated with a particular user-mode client. 
> + */ > +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, > + struct ib_udata *udata) > +{ > + struct c2_alloc_ucontext_resp uresp; > + struct c2_ucontext *context; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + memset(&uresp, 0, sizeof uresp); > + > + uresp.qp_tab_size = to_c2dev(ibdev)->max_qp; > + > + context = kmalloc(sizeof *context, GFP_KERNEL); > + if (!context) > + return ERR_PTR(-ENOMEM); > + > + /* The OpenIB user context is logically similar to the RNIC > + * Instance of our existing driver > + */ > + /* context->rnic_p = rnic_open */ > + > + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { > + kfree(context); > + return ERR_PTR(-EFAULT); > + } > + > + return &context->ibucontext; > +} > + > +static int c2_dealloc_ucontext(struct ib_ucontext *context) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static int c2_mmap_uar(struct ib_ucontext *context, > + struct vm_area_struct *vma) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, > + struct ib_ucontext *context, > + struct ib_udata *udata) > +{ > + struct c2_pd* pd; > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + pd = kmalloc(sizeof *pd, GFP_KERNEL); > + if (!pd) > + return ERR_PTR(-ENOMEM); > + > + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); > + if (err) { > + kfree(pd); > + return ERR_PTR(err); > + } > + > + if (context) { > + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof (__u32))) { > + c2_pd_free(to_c2dev(ibdev), pd); > + kfree(pd); > + return ERR_PTR(-EFAULT); > + } > + } > + > + return &pd->ibpd; > +} > + > +static int c2_dealloc_pd(struct ib_pd *pd) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); > + kfree(pd); > + > + return 0; > +} > + > +static struct ib_ah *c2_ah_create(struct 
ib_pd *pd, > + struct ib_ah_attr *ah_attr) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return ERR_PTR(-ENOSYS); > +} > + > +static int c2_ah_destroy(struct ib_ah *ah) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static struct ib_qp *c2_create_qp(struct ib_pd *pd, > + struct ib_qp_init_attr *init_attr, > + struct ib_udata *udata) > +{ > + struct c2_qp *qp; > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + switch(init_attr->qp_type) { > + case IB_QPT_RC: > + qp = kmalloc(sizeof(*qp), GFP_KERNEL); > + if (!qp) { > + dprintk("%s: Unable to allocate QP\n", __FUNCTION__); > + return ERR_PTR(-ENOMEM); > + } > + > + if (pd->uobject) { > + /* XXX userspace specific */ > + } > + > + err = c2_alloc_qp(to_c2dev(pd->device), > + to_c2pd(pd), > + init_attr, > + qp); > + if (err && pd->uobject) { > + /* XXX userspace specific */ > + } > + > + break; > + default: > + dprintk("%s: Invalid QP type: %d\n", __FUNCTION__, init_attr->qp_type); > + return ERR_PTR(-EINVAL); > + break; > + } > + > + if (err) { > + kfree(pd); > + return ERR_PTR(err); > + } > + > + return &qp->ibqp; > +} > + > +static int c2_destroy_qp(struct ib_qp *ib_qp) > +{ > + struct c2_qp *qp = to_c2qp(ib_qp); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + c2_free_qp(to_c2dev(ib_qp->device), qp); > + kfree(qp); > + > + return 0; > +} > + > +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, > + struct ib_ucontext *context, > + struct ib_udata *udata) > +{ > + struct c2_cq *cq; > + int err; > + > + cq = kmalloc(sizeof(*cq), GFP_KERNEL); > + if (!cq) { > + dprintk("%s: Unable to allocate CQ\n", __FUNCTION__); > + return ERR_PTR(-ENOMEM); > + } > + > + err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); > + if (err) { > + dprintk("%s: error initializing CQ\n", __FUNCTION__); > + kfree(cq); > + return ERR_PTR(err); > + } > + > + return &cq->ibcq; > +} > + > 
+static int c2_destroy_cq(struct ib_cq *ib_cq) > +{ > + struct c2_cq *cq = to_c2cq(ib_cq); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + c2_free_cq(to_c2dev(ib_cq->device), cq); > + kfree(cq); > + > + return 0; > +} > + > +static inline u32 c2_convert_access(int acc) > +{ > + return (acc & IB_ACCESS_REMOTE_WRITE ? CC_ACF_REMOTE_WRITE : 0) | > + (acc & IB_ACCESS_REMOTE_READ ? CC_ACF_REMOTE_READ : 0) | > + (acc & IB_ACCESS_LOCAL_WRITE ? CC_ACF_LOCAL_WRITE : 0) | > + CC_ACF_LOCAL_READ | CC_ACF_WINDOW_BIND; > +} > + > +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, > + struct ib_phys_buf *buffer_list, > + int num_phys_buf, > + int acc, > + u64 *iova_start) > +{ > + struct c2_mr *mr; > + u64 **page_list; > + u32 total_len; > + int err, i, j, k, pbl_depth; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + pbl_depth = 0; > + total_len = 0; > + > + for (i = 0; i < num_phys_buf; i++) { > + > + int size; > + > + if (buffer_list[i].addr & ~PAGE_MASK) { > + dprintk("Unaligned Memory Buffer: 0x%x\n", > + (unsigned int)buffer_list[i].addr); > + return ERR_PTR(-EINVAL); > + } > + > + if (!buffer_list[i].size) { > + dprintk("Invalid Buffer Size\n"); > + return ERR_PTR(-EINVAL); > + } > + > + size = buffer_list[i].size; > + total_len += size; > + while (size) { > + pbl_depth++; > + size -= PAGE_SIZE; > + } > + } > + > + page_list = kmalloc(sizeof(u64 *) * pbl_depth, GFP_KERNEL); > + if (!page_list) > + return ERR_PTR(-ENOMEM); > + > + for (i = 0, j = 0; i < num_phys_buf; i++) { > + > + int naddrs; > + > + naddrs = (u32)buffer_list[i].size % ~PAGE_MASK; > + for (k = 0; k < naddrs; k++) > + page_list[j++] = > + (u64 *)(unsigned long)(buffer_list[i].addr + (k << PAGE_SHIFT)); > + } > + > + mr = kmalloc(sizeof(*mr), GFP_KERNEL); > + if (!mr) > + return ERR_PTR(-ENOMEM); > + > + mr->pd = to_c2pd(ib_pd); > + > + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, > + pbl_depth, total_len, iova_start, > + 
c2_convert_access(acc), mr); > + kfree(page_list); > + if (err) { > + kfree(mr); > + return ERR_PTR(err); > + } > + > + return &mr->ibmr; > +} > + > +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) > +{ > + struct ib_phys_buf bl; > + u64 kva; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + bl.size = 4096; > + kva = (u64)(unsigned long)kmalloc(bl.size, GFP_KERNEL); > + if (!kva) > + return ERR_PTR(-ENOMEM); > + > + bl.addr = __pa(kva); > + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); > +} > + > +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, > + int acc, struct ib_udata *udata) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return ERR_PTR(-ENOSYS); > +} > + > +static int c2_dereg_mr(struct ib_mr *ib_mr) > +{ > + struct c2_mr *mr = to_c2mr(ib_mr); > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); > + if (err) > + dprintk("c2_stag_dealloc failed: %d\n", err); > + else > + kfree(mr); > + > + return err; > +} > + > +static ssize_t show_rev(struct class_device *cdev, char *buf) > +{ > + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return sprintf(buf, "%x\n", dev->hw_rev); > +} > + > +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) > +{ > + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return sprintf(buf, "%x.%x.%x\n", > + (int)(dev->fw_ver >> 32), > + (int)(dev->fw_ver >> 16) & 0xffff, > + (int)(dev->fw_ver & 0xffff)); > +} > + > +static ssize_t show_hca(struct class_device *cdev, char *buf) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return sprintf(buf, "AMSO1100\n"); > +} > + > +static ssize_t show_board(struct class_device *cdev, char *buf) > +{ > + 
dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); > +} > + > +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); > +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); > +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); > +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); > + > +static struct class_device_attribute *c2_class_attributes[] = { > + &class_device_attr_hw_rev, > + &class_device_attr_fw_ver, > + &class_device_attr_hca_type, > + &class_device_attr_board_id > +}; > + > +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) > +{ > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + err = c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, attr_mask); > + > + return err; > +} > + > +static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static int c2_process_mad(struct ib_device *ibdev, > + int mad_flags, > + u8 port_num, > + struct ib_wc *in_wc, > + struct ib_grh *in_grh, > + struct ib_mad *in_mad, > + struct ib_mad *out_mad) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return -ENOSYS; > +} > + > +static int c2_connect(struct iw_cm_id* cm_id, > + const void* pdata, u8 pdata_len) > +{ > + int err; > + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + if (cm_id->qp == NULL) > + return -EINVAL; > + > + /* Cache the cm_id in the qp */ > + qp->cm_id = cm_id; > + > + err = c2_llp_connect(cm_id, pdata, pdata_len); > + > + return err; > +} > + > +static int c2_disconnect(struct iw_cm_id* 
cm_id, int abrupt) > +{ > + struct ib_qp_attr attr; > + struct ib_qp *ib_qp = cm_id->qp; > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + if (ib_qp == 0) > + /* If this is a lietening endpoint, there is no QP */ > + return 0; > + > + memset(&attr, 0, sizeof(struct ib_qp_attr)); > + if (abrupt) > + attr.qp_state = IB_QPS_ERR; > + else > + attr.qp_state = IB_QPS_SQD; > + > + err = c2_modify_qp(ib_qp, &attr, IB_QP_STATE); > + return err; > +} > + > +static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 pdata_len) > +{ > + int err; > + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + /* Cache the cm_id in the qp */ > + qp->cm_id = cm_id; > + > + err = c2_llp_accept(cm_id, pdata, pdata_len); > + > + return err; > +} > + > +static int c2_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > +{ > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + err = c2_llp_reject(cm_id, pdata, pdata_len); > + return err; > +} > + > +static int c2_getpeername(struct iw_cm_id* cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr ) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + *local_addr = cm_id->local_addr; > + *remote_addr = cm_id->remote_addr; > + return 0; > +} > + > +static int c2_service_create(struct iw_cm_id* cm_id, int backlog) > +{ > + int err; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + err = c2_llp_service_create(cm_id, backlog); > + return err; > +} > + > +static int c2_service_destroy(struct iw_cm_id* cm_id) > +{ > + int err; > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + > + err = c2_llp_service_destroy(cm_id); > + > + return err; > +} > + > +int c2_register_device(struct c2_dev *dev) > +{ > + int ret; > + int i; > + > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + 
strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); > + dev->ibdev.owner = THIS_MODULE; > + > + dev->ibdev.node_type = IB_NODE_RNIC; > + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); > + memcpy(&dev->ibdev.node_guid, dev->netdev->dev_addr, 6); > + dev->ibdev.phys_port_cnt = 1; > + dev->ibdev.dma_device = &dev->pcidev->dev; > + dev->ibdev.class_dev.dev = &dev->pcidev->dev; > + dev->ibdev.query_device = c2_query_device; > + dev->ibdev.query_port = c2_query_port; > + dev->ibdev.modify_port = c2_modify_port; > + dev->ibdev.query_pkey = c2_query_pkey; > + dev->ibdev.query_gid = c2_query_gid; > + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; > + dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; > + dev->ibdev.mmap = c2_mmap_uar; > + dev->ibdev.alloc_pd = c2_alloc_pd; > + dev->ibdev.dealloc_pd = c2_dealloc_pd; > + dev->ibdev.create_ah = c2_ah_create; > + dev->ibdev.destroy_ah = c2_ah_destroy; > + dev->ibdev.create_qp = c2_create_qp; > + dev->ibdev.modify_qp = c2_modify_qp; > + dev->ibdev.destroy_qp = c2_destroy_qp; > + dev->ibdev.create_cq = c2_create_cq; > + dev->ibdev.destroy_cq = c2_destroy_cq; > + dev->ibdev.poll_cq = c2_poll_cq; > + dev->ibdev.get_dma_mr = c2_get_dma_mr; > + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; > + dev->ibdev.reg_user_mr = c2_reg_user_mr; > + dev->ibdev.dereg_mr = c2_dereg_mr; > + > + dev->ibdev.alloc_fmr = 0; > + dev->ibdev.unmap_fmr = 0; > + dev->ibdev.dealloc_fmr = 0; > + dev->ibdev.map_phys_fmr = 0; > + > + dev->ibdev.attach_mcast = c2_multicast_attach; > + dev->ibdev.detach_mcast = c2_multicast_detach; > + dev->ibdev.process_mad = c2_process_mad; > + > + dev->ibdev.req_notify_cq = c2_arm_cq; > + dev->ibdev.post_send = c2_post_send; > + dev->ibdev.post_recv = c2_post_receive; > + > + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); > + dev->ibdev.iwcm->connect = c2_connect; > + dev->ibdev.iwcm->disconnect = c2_disconnect; > + dev->ibdev.iwcm->accept = c2_accept; > + dev->ibdev.iwcm->reject = c2_reject; 
> + dev->ibdev.iwcm->getpeername = c2_getpeername; > + dev->ibdev.iwcm->create_listen = c2_service_create; > + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; > + > + ret = ib_register_device(&dev->ibdev); > + if (ret) > + return ret; > + > + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { > + ret = class_device_create_file(&dev->ibdev.class_dev, > + c2_class_attributes[i]); > + if (ret) { > + ib_unregister_device(&dev->ibdev); > + return ret; > + } > + } > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + return 0; > +} > + > +void c2_unregister_device(struct c2_dev *dev) > +{ > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > + ib_unregister_device(&dev->ibdev); > +} > Index: hw/amso1100/c2_alloc.c > =================================================================== > --- hw/amso1100/c2_alloc.c (revision 0) > +++ hw/amso1100/c2_alloc.c (revision 0) > @@ -0,0 +1,255 @@ > +/* > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#include > +#include > +#include > + > +#include "c2.h" > + > +/* Trivial bitmap-based allocator */ > +u32 c2_alloc(struct c2_alloc *alloc) > +{ > + u32 obj; > + > + spin_lock(&alloc->lock); > + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); > + if (obj < alloc->max) { > + set_bit(obj, alloc->table); > + alloc->last = obj; > + } else > + obj = -1; > + > + spin_unlock(&alloc->lock); > + > + return obj; > +} > + > +void c2_free(struct c2_alloc *alloc, u32 obj) > +{ > + spin_lock(&alloc->lock); > + clear_bit(obj, alloc->table); > + alloc->last = min(alloc->last, obj); > + spin_unlock(&alloc->lock); > +} > + > +int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved) > +{ > + int i; > + > + alloc->last = 0; > + alloc->max = num; > + spin_lock_init(&alloc->lock); > + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), > + GFP_KERNEL); > + if (!alloc->table) > + return -ENOMEM; > + > + bitmap_zero(alloc->table, num); > + for (i = 0; i < reserved; ++i) > + set_bit(i, alloc->table); > + > + return 0; > +} > + > +void c2_alloc_cleanup(struct c2_alloc *alloc) > +{ > + kfree(alloc->table); > +} > + > +/* > + * Array of pointers with lazy allocation of leaf pages. Callers of > + * _get, _set and _clear methods must use a lock or otherwise > + * serialize access to the array. 
> + */ > + > +void *c2_array_get(struct c2_array *array, int index) > +{ > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > + > + if (array->page_list[p].page) { > + int i = index & (PAGE_SIZE / sizeof (void *) - 1); > + return array->page_list[p].page[i]; > + } else > + return NULL; > +} > + > +int c2_array_set(struct c2_array *array, int index, void *value) > +{ > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > + > + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ > + if (!array->page_list[p].page) > + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); > + > + if (!array->page_list[p].page) > + return -ENOMEM; > + > + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = > + value; > + ++array->page_list[p].used; > + > + return 0; > +} > + > +void c2_array_clear(struct c2_array *array, int index) > +{ > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > + > + if (--array->page_list[p].used == 0) { > + free_page((unsigned long) array->page_list[p].page); > + array->page_list[p].page = NULL; > + } > + > + if (array->page_list[p].used < 0) > + pr_debug("Array %p index %d page %d with ref count %d < 0\n", > + array, index, p, array->page_list[p].used); > +} > + > +int c2_array_init(struct c2_array *array, int nent) > +{ > + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; > + int i; > + > + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); > + if (!array->page_list) > + return -ENOMEM; > + > + for (i = 0; i < npage; ++i) { > + array->page_list[i].page = NULL; > + array->page_list[i].used = 0; > + } > + > + return 0; > +} > + > +void c2_array_cleanup(struct c2_array *array, int nent) > +{ > + int i; > + > + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) > + free_page((unsigned long) array->page_list[i].page); > + > + kfree(array->page_list); > +} > + > +static int c2_alloc_mqsp_chunk(unsigned int gfp_mask, struct sp_chunk** head) > +{ 
> + int i; > + struct sp_chunk* new_head; > + > + new_head = (struct sp_chunk*)__get_free_page(gfp_mask|GFP_DMA); > + if (new_head == NULL) > + return -ENOMEM; > + > + new_head->next = NULL; > + new_head->head = 0; > + new_head->gfp_mask = gfp_mask; > + > + /* build list where each index is the next free slot */ > + for (i = 0; > + i < (PAGE_SIZE-sizeof(struct sp_chunk*)-sizeof(u16)) / sizeof(u16)-1; > + i++) { > + new_head->shared_ptr[i] = i+1; > + } > + /* terminate list */ > + new_head->shared_ptr[i] = 0xFFFF; > + > + *head = new_head; > + return 0; > +} > + > +int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** root) { > + return c2_alloc_mqsp_chunk(gfp_mask, root); > +} > + > +void c2_free_mqsp_pool(struct sp_chunk* root) > +{ > + struct sp_chunk* next; > + > + while (root) { > + next = root->next; > + __free_page((struct page*)root); > + root = next; > + } > +} > + > +u16* c2_alloc_mqsp(struct sp_chunk* head) > +{ > + u16 mqsp; > + > + while (head) { > + mqsp = head->head; > + if (mqsp != 0xFFFF) { > + head->head = head->shared_ptr[mqsp]; > + break; > + } else if (head->next == NULL) { > + if (c2_alloc_mqsp_chunk(head->gfp_mask, &head->next) == 0) { > + head = head->next; > + mqsp = head->head; > + head->head = > + head->shared_ptr[mqsp]; > + break; > + } > + else > + return 0; > + } > + else > + head = head->next; > + } > + if (head) > + return &(head->shared_ptr[mqsp]); > + return 0; > +} > + > +void c2_free_mqsp(u16* mqsp) > +{ > + struct sp_chunk* head; > + u16 idx; > + > + /* The chunk containing this ptr begins at the page boundary */ > + head = (struct sp_chunk*)((unsigned long)mqsp & PAGE_MASK); > + > + /* Link head to new mqsp */ > + *mqsp = head->head; > + > + /* Compute the shared_ptr index */ > + idx = ((unsigned long)mqsp & ~PAGE_MASK) >> 1; > + idx -= (unsigned long)&(((struct sp_chunk*)0)->shared_ptr[0]) >> 1; > + > + /* Point this index at the head */ > + head->shared_ptr[idx] = head->head; > + > + /* Point head at this index */ > + 
head->head = idx; > +} > Index: hw/amso1100/cc_types.h > =================================================================== > --- hw/amso1100/cc_types.h (revision 0) > +++ hw/amso1100/cc_types.h (revision 0) > @@ -0,0 +1,297 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > +#ifndef _CC_TYPES_H_ > +#define _CC_TYPES_H_ > + > +#include > + > +#ifndef NULL > +#define NULL 0 > +#endif > +#ifndef TRUE > +#define TRUE 1 > +#endif > +#ifndef FALSE > +#define FALSE 0 > +#endif > + > +#define PTR_TO_CTX(p) (u64)(u32)(p) > + > +#define CC_PTR_TO_64(p) (u64)(u32)(p) > +#define CC_64_TO_PTR(c) (void*)(u32)(c) > + > + > + > +/* > + * not really a "type" however this needs > + * to be common between adapter and host. > + * this is the best place to put it. > + */ > +#define CC_QP_NO_ATTR_CHANGE 0xFFFFFFFF > + > +/* Maximum allowed size in bytes of private_data exchange > + * on connect. > + */ > +#define CC_MAX_PRIVATE_DATA_SIZE 200 > + > +/* > + * These types are shared among the adapter, host, and CCIL consumer. Thus > + * they are placed here since everyone includes cc_types.h... > + */ > +typedef enum { > + CC_CQ_NOTIFICATION_TYPE_NONE = 1, > + CC_CQ_NOTIFICATION_TYPE_NEXT, > + CC_CQ_NOTIFICATION_TYPE_NEXT_SE > +} cc_cq_notification_type_t; > + > +typedef enum { > + CC_CFG_ADD_ADDR = 1, > + CC_CFG_DEL_ADDR = 2, > + CC_CFG_ADD_ROUTE = 3, > + CC_CFG_DEL_ROUTE = 4 > +} cc_setconfig_cmd_t; > + > +typedef enum { > + CC_GETCONFIG_ROUTES = 1, > + CC_GETCONFIG_ADDRS > +} cc_getconfig_cmd_t; > + > +/* > + * CCIL Work Request Identifiers > + */ > +typedef enum { > + CCWR_RNIC_OPEN = 1, > + CCWR_RNIC_QUERY, > + CCWR_RNIC_SETCONFIG, > + CCWR_RNIC_GETCONFIG, > + CCWR_RNIC_CLOSE, > + CCWR_CQ_CREATE, > + CCWR_CQ_QUERY, > + CCWR_CQ_MODIFY, > + CCWR_CQ_DESTROY, > + CCWR_QP_CONNECT, > + CCWR_PD_ALLOC, > + CCWR_PD_DEALLOC, > + CCWR_SRQ_CREATE, > + CCWR_SRQ_QUERY, > + CCWR_SRQ_MODIFY, > + CCWR_SRQ_DESTROY, > + CCWR_QP_CREATE, > + CCWR_QP_QUERY, > + CCWR_QP_MODIFY, > + CCWR_QP_DESTROY, > + CCWR_NSMR_STAG_ALLOC, > + CCWR_NSMR_REGISTER, > + CCWR_NSMR_PBL, > + CCWR_STAG_DEALLOC, > + CCWR_NSMR_REREGISTER, > + CCWR_SMR_REGISTER, > + CCWR_MR_QUERY, > + CCWR_MW_ALLOC, > + CCWR_MW_QUERY, > + CCWR_EP_CREATE, > + CCWR_EP_GETOPT, > + CCWR_EP_SETOPT, > + 
CCWR_EP_DESTROY, > + CCWR_EP_BIND, > + CCWR_EP_CONNECT, > + CCWR_EP_LISTEN, > + CCWR_EP_SHUTDOWN, > + CCWR_EP_LISTEN_CREATE, > + CCWR_EP_LISTEN_DESTROY, > + CCWR_EP_QUERY, > + CCWR_CR_ACCEPT, > + CCWR_CR_REJECT, > + CCWR_CONSOLE, > + CCWR_TERM, > + CCWR_FLASH_INIT, > + CCWR_FLASH, > + CCWR_BUF_ALLOC, > + CCWR_BUF_FREE, > + CCWR_FLASH_WRITE, > + CCWR_INIT, /* WARNING: Don't move this ever again! */ > + > + > + > + /* Add new IDs here */ > + > + > + > + /* > + * WARNING: CCWR_LAST must always be the last verbs id defined! > + * All the preceding IDs are fixed, and must not change. > + * You can add new IDs, but must not remove or reorder > + * any IDs. If you do, YOU will ruin any hope of > + * compatibility between versions. > + */ > + CCWR_LAST, > + > + /* > + * Start over at 1 so that arrays indexed by user wr id's > + * begin at 1. This is OK since the verbs and user wr id's > + * are always used on disjoint sets of queues. > + */ > +#if 0 > + CCWR_SEND = 1, > + CCWR_SEND_SE, > + CCWR_SEND_INV, > + CCWR_SEND_SE_INV, > +#else > + /* > + * The order of the CCWR_SEND_XX verbs must > + * match the order of the RDMA_OPs > + */ > + CCWR_SEND = 1, > + CCWR_SEND_INV, > + CCWR_SEND_SE, > + CCWR_SEND_SE_INV, > +#endif > + CCWR_RDMA_WRITE, > + CCWR_RDMA_READ, > + CCWR_RDMA_READ_INV, > + CCWR_MW_BIND, > + CCWR_NSMR_FASTREG, > + CCWR_STAG_INVALIDATE, > + CCWR_RECV, > + CCWR_NOP, > + CCWR_UNIMPL, /* WARNING: This must always be the last user wr id defined!
*/ > +} ccwr_ids_t; > +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) > + > +/* > + * SQ/RQ Work Request Types > + */ > +typedef enum { > + CC_WR_TYPE_SEND = CCWR_SEND, > + CC_WR_TYPE_SEND_SE = CCWR_SEND_SE, > + CC_WR_TYPE_SEND_INV = CCWR_SEND_INV, > + CC_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, > + CC_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, > + CC_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, > + CC_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, > + CC_WR_TYPE_BIND_MW = CCWR_MW_BIND, > + CC_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, > + CC_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, > + CC_WR_TYPE_RECV = CCWR_RECV, > + CC_WR_TYPE_NOP = CCWR_NOP, > +} cc_wr_type_t; > + > +/* > + * These are used as bitfields for efficient comparison of multiple possible > + * states. > + */ > +typedef enum { > + CC_QP_STATE_IDLE = 0x01, /* initial state */ > + CC_QP_STATE_CONNECTING = 0x02, /* LLP is connecting */ > + CC_QP_STATE_RTS = 0x04, /* RDDP/RDMAP enabled */ > + CC_QP_STATE_CLOSING = 0x08, /* LLP is shutting down */ > + CC_QP_STATE_TERMINATE = 0x10, /* Connection Terminat[ing|ed] */ > + CC_QP_STATE_ERROR = 0x20, /* Error state to flush everything */ > +} cc_qp_state_t; > + > +typedef struct _cc_netaddr_s { > + u32 ip_addr; > + u32 netmask; > + u32 mtu; > +} cc_netaddr_t; > + > +typedef struct _cc_route_s { > + u32 ip_addr; /* 0 indicates the default route */ > + u32 netmask; /* netmask associated with dst */ > + u32 flags; > + union { > + u32 ipaddr; /* address of the nexthop interface */ > + u8 enaddr[6]; > + } nexthop; > +} cc_route_t; > + > +/* > + * A Scatter Gather Entry. > + */ > +typedef u32 cc_stag_t; > + > +typedef struct { > + cc_stag_t stag; > + u32 length; > + u64 to; > +} cc_data_addr_t; > + > +/* > + * MR and MW flags used by the consumer, RI, and RNIC. > + */ > +typedef enum { > + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ > + MEM_VA_BASED = 0x0002, /* Not Zero-based */ > + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ > + MEM_LOCAL_READ = 0x0008, /* allow local reads */ > + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ > + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ > + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ > + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ > + MEM_SHARED = 0x0100, /* set if MR is shared */ > + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ > +} cc_mm_flags_t; > + > +/* > + * CCIL API ACF flags defined in terms of the low level mem flags. > + * This minimizes translation needed in the user API > + */ > +typedef enum { > + CC_ACF_LOCAL_READ = MEM_LOCAL_READ, > + CC_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, > + CC_ACF_REMOTE_READ = MEM_REMOTE_READ, > + CC_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, > + CC_ACF_WINDOW_BIND = MEM_WINDOW_BIND > +} cc_acf_t; > + > +/* > + * Image types of objects written to flash > + */ > +#define CC_FLASH_IMG_BITFILE 1 > +#define CC_FLASH_IMG_OPTION_ROM 2 > +#define CC_FLASH_IMG_VPD 3 > + > +/* > + * To fix bug 1815 we define the maximum allowable size of the > + * terminate message (per the IETF spec; refer to the IETF > + * protocol specification, section 12.1.6, page 64). > + * The message is prefixed by 20 bytes of DDP info. > + * > + * Then the message has 6 bytes for the terminate control > + * and DDP segment length info plus a DDP header (either > + * 14 or 18 bytes) plus 28 bytes for the RDMA header. > + * Thus the max size is: > + * 20 + (6 + 18 + 28) = 72 > + */ > +#define CC_MAX_TERMINATE_MESSAGE_SIZE (72) > +#endif > Index: hw/amso1100/c2_rnic.c > =================================================================== > --- hw/amso1100/c2_rnic.c (revision 0) > +++ hw/amso1100/c2_rnic.c (revision 0) > @@ -0,0 +1,581 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
> + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + * > + */ > + > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#ifdef NETEVENT_NOTIFIER > +#include > +#include > +#include > +#endif > + > + > +#include > +#include > +#include > +#include > +#include "c2.h" > +#include "c2_vq.h" > + > +#define C2_MAX_MRS 32768 > +#define C2_MAX_QPS 16000 > +#define C2_MAX_WQE_SZ 256 > +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) > +#define C2_MAX_SGES 4 > +#define C2_MAX_CQS 32768 > +#define C2_MAX_CQES 4096 > +#define C2_MAX_PDS 16384 > + > +/* > + * Send the adapter INIT message to the amso1100 > + */ > +static int c2_adapter_init(struct c2_dev *c2dev) > +{ > + ccwr_init_req_t wr; > + int err; > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_INIT); > + wr.hdr.context = 0; > + wr.hint_count = cpu_to_be64(__pa(&c2dev->hint_count)); > + wr.q0_host_shared = > + cpu_to_be64(__pa(c2dev->req_vq.shared)); > + wr.q1_host_shared = > + cpu_to_be64(__pa(c2dev->rep_vq.shared)); > + wr.q1_host_msg_pool = > + cpu_to_be64(__pa(c2dev->rep_vq.msg_pool)); > + wr.q2_host_shared = > + cpu_to_be64(__pa(c2dev->aeq.shared)); > + wr.q2_host_msg_pool = > + cpu_to_be64(__pa(c2dev->aeq.msg_pool)); > + > + /* Post the init message */ > + err = vq_send_wr(c2dev, (ccwr_t *)&wr); > + > + return err; > +} > + > +/* > + * Send the adapter TERM message to the amso1100 > + */ > +static void c2_adapter_term(struct c2_dev *c2dev) > +{ > + ccwr_init_req_t wr; > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_TERM); > + wr.hdr.context = 0; > + > + /* Post the init message */ > + vq_send_wr(c2dev, (ccwr_t *)&wr); > + c2dev->init = 0; > + > + return; > +} > + > +/* > + * Hack to hard code an ip address > + */ > +extern char *rnic_ip_addr; > +static int c2_setconfig_hack(struct c2_dev *c2dev) > +{ > + struct c2_vq_req *vq_req; > + ccwr_rnic_setconfig_req_t *wr; 
> + ccwr_rnic_setconfig_rep_t *reply; > + cc_netaddr_t netaddr; > + int err, len; > + > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) > + return -ENOMEM; > + > + len = sizeof(cc_netaddr_t); > + wr = kmalloc(sizeof(*wr) + len, GFP_KERNEL); > + if (!wr) { > + err = -ENOMEM; > + goto bail0; > + } > + > + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); > + wr->hdr.context = (unsigned long)vq_req; > + wr->rnic_handle = c2dev->adapter_handle; > + wr->option = cpu_to_be32(CC_CFG_ADD_ADDR); > + > + netaddr.ip_addr = in_aton(rnic_ip_addr); > + netaddr.netmask = htonl(0xFFFFFF00); > + netaddr.mtu = 0; > + > + memcpy(wr->data, &netaddr, len); > + > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, (ccwr_t *)wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail1; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) > + goto bail1; > + > + reply = (ccwr_rnic_setconfig_rep_t *)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail1; > + } > + > + err = c2_errno(reply); > + vq_repbuf_free(c2dev, reply); > + > +bail1: > + kfree(wr); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +/* > + * Open a single RNIC instance to use with all > + * low level openib calls > + */ > +static int c2_rnic_open(struct c2_dev *c2dev) > +{ > + struct c2_vq_req *vq_req; > + ccwr_t wr; > + ccwr_rnic_open_rep_t* reply; > + int err; > + > + vq_req = vq_req_alloc(c2dev); > + if (vq_req == NULL) { > + return -ENOMEM; > + } > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); > + wr.rnic_open.req.hdr.context = (unsigned long)(vq_req); > + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); > + wr.rnic_open.req.port_num = cpu_to_be16(0); > + wr.rnic_open.req.user_context = (unsigned long)c2dev; > + > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, &wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + 
goto bail0; > + } > + > + reply = (ccwr_rnic_open_rep_t*)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + if ( (err = c2_errno(reply)) != 0) { > + goto bail1; > + } > + > + c2dev->adapter_handle = reply->rnic_handle; > + > +bail1: > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +/* > + * Close the RNIC instance > + */ > +static int c2_rnic_close(struct c2_dev *c2dev) > +{ > + struct c2_vq_req *vq_req; > + ccwr_t wr; > + ccwr_rnic_close_rep_t *reply; > + int err; > + > + vq_req = vq_req_alloc(c2dev); > + if (vq_req == NULL) { > + return -ENOMEM; > + } > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); > + wr.rnic_close.req.hdr.context = (unsigned long)vq_req; > + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; > + > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, &wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + reply = (ccwr_rnic_close_rep_t*)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + if ( (err = c2_errno(reply)) != 0) { > + goto bail1; > + } > + > + c2dev->adapter_handle = 0; > + > +bail1: > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > +#ifdef NETEVENT_NOTIFIER > +static int netevent_notifier(struct notifier_block *self, unsigned long > event, void* data) > +{ > + int i; > + u8* ha; > + struct neighbour* neigh = data; > + struct netevent_redirect* redir = data; > + struct netevent_route_change* rev = data; > + > + switch (event) { > + case NETEVENT_ROUTE_UPDATE: > + printk(KERN_ERR "NETEVENT_ROUTE_UPDATE:\n"); > + printk(KERN_ERR "fib_flags : %d\n", > + rev->fib_info->fib_flags); > + printk(KERN_ERR "fib_protocol : %d\n", > + rev->fib_info->fib_protocol); > + printk(KERN_ERR "fib_prefsrc 
: %08x\n", > + rev->fib_info->fib_prefsrc); > + printk(KERN_ERR "fib_priority : %d\n", > + rev->fib_info->fib_priority); > + break; > + > + case NETEVENT_NEIGH_UPDATE: > + printk(KERN_ERR "NETEVENT_NEIGH_UPDATE:\n"); > + printk(KERN_ERR "nud_state : %d\n", neigh->nud_state); > + printk(KERN_ERR "refcnt : %d\n", neigh->refcnt); > + printk(KERN_ERR "used : %d\n", neigh->used); > + printk(KERN_ERR "confirmed : %d\n", neigh->confirmed); > + printk(KERN_ERR " ha: "); > + for (i=0; i < neigh->dev->addr_len; i+=4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "%8s: ", neigh->dev->name); > + for (i=0; i < neigh->dev->addr_len; i+=4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > + } > + printk("\n"); > + break; > + > + case NETEVENT_REDIRECT: > + printk(KERN_ERR "NETEVENT_REDIRECT:\n"); > + printk(KERN_ERR "old: "); > + for (i=0; i < redir->old->neighbour->dev->addr_len; i+=4) { > + ha = &redir->old->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "new: "); > + for (i=0; i < redir->new->neighbour->dev->addr_len; i+=4) { > + ha = &redir->new->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > + } > + printk("\n"); > + break; > + > + default: > + printk(KERN_ERR "NETEVENT_WTFO:\n"); > + } > + > + return NOTIFY_DONE; > +} > + > +static struct notifier_block nb = { > + .notifier_call = netevent_notifier, > +}; > +#endif > +/* > + * Called by c2_probe to initialize the RNIC. This principally > + * involves initializing the various limits and resource pools that > + * comprise the RNIC instance.
> + */ > +int c2_rnic_init(struct c2_dev* c2dev) > +{ > + int err; > + u32 qsize, msgsize; > + void *q1_pages; > + void *q2_pages; > + void __iomem *mmio_regs; > + > + /* Initialize the adapter limits */ > + c2dev->max_mr = C2_MAX_MRS; > + c2dev->max_mr_size = ~0; > + c2dev->max_qp = C2_MAX_QPS; > + c2dev->max_qp_wr = C2_MAX_QP_WR; > + c2dev->max_sge = C2_MAX_SGES; > + c2dev->max_cq = C2_MAX_CQS; > + c2dev->max_cqe = C2_MAX_CQES; > + c2dev->max_pd = C2_MAX_PDS; > + > + /* Device capabilities */ > + c2dev->device_cap_flags = > + ( > + IB_DEVICE_RESIZE_MAX_WR | > + IB_DEVICE_CURR_QP_STATE_MOD | > + IB_DEVICE_SYS_IMAGE_GUID | > + IB_DEVICE_ZERO_STAG | > + IB_DEVICE_SEND_W_INV | > + IB_DEVICE_MW | > + IB_DEVICE_ARP > + ); > + > + /* Allocate the qptr_array */ > + c2dev->qptr_array = vmalloc(C2_MAX_CQS*sizeof(void *)); > + if (!c2dev->qptr_array) { > + return -ENOMEM; > + } > + > + /* Inialize the qptr_array */ > + memset(c2dev->qptr_array, 0, C2_MAX_CQS*sizeof(void *)); > + c2dev->qptr_array[0] = (void *)&c2dev->req_vq; > + c2dev->qptr_array[1] = (void *)&c2dev->rep_vq; > + c2dev->qptr_array[2] = (void *)&c2dev->aeq; > + > + /* Initialize data structures */ > + init_waitqueue_head(&c2dev->req_vq_wo); > + spin_lock_init(&c2dev->vqlock); > + spin_lock_init(&c2dev->aeq_lock); > + > + > + /* Allocate MQ shared pointer pool for kernel clients. User > + * mode client pools are hung off the user context > + */ > + err = c2_init_mqsp_pool(GFP_KERNEL, &c2dev->kern_mqsp_pool); > + if (err) { > + goto bail0; > + } > + > + /* Allocate shared pointers for Q0, Q1, and Q2 from > + * the shared pointer pool. 
> + */ > + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + c2dev->aeq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + if (!c2dev->req_vq.shared || > + !c2dev->rep_vq.shared || > + !c2dev->aeq.shared) { > + err = -ENOMEM; > + goto bail1; > + } > + > + mmio_regs = c2dev->kva; > + /* Initialize the Verbs Request Queue */ > + c2_mq_init(&c2dev->req_vq, 0, > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_QSIZE)), > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_MSGSIZE)), > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_POOLSTART)), > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_SHARED)), > + C2_MQ_ADAPTER_TARGET); > + > + /* Initialize the Verbs Reply Queue */ > + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_QSIZE)); > + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_MSGSIZE)); > + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); > + if (!q1_pages) { > + err = -ENOMEM; > + goto bail1; > + } > + c2_mq_init(&c2dev->rep_vq, > + 1, > + qsize, > + msgsize, > + q1_pages, > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_SHARED)), > + C2_MQ_HOST_TARGET); > + > + /* Initialize the Asynchronus Event Queue */ > + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_QSIZE)); > + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_MSGSIZE)); > + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); > + if (!q2_pages) { > + err = -ENOMEM; > + goto bail2; > + } > + c2_mq_init(&c2dev->aeq, > + 2, > + qsize, > + msgsize, > + q2_pages, > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_SHARED)), > + C2_MQ_HOST_TARGET); > + > + /* Initialize the verbs request allocator */ > + err = vq_init(c2dev); > + if (err) { > + goto bail3; > + } > + > + /* Enable interrupts on the adapter */ > + c2_write32(c2dev->regs + C2_IDIS, 0); > + > + /* create the WR init message */ > + err = c2_adapter_init(c2dev); > + if (err) { > + goto bail4; > + } > + 
c2dev->init++; > + > + /* open an adapter instance */ > + err = c2_rnic_open(c2dev); > + if (err) { > + goto bail4; > + } > + > + /* Initialize the PD pool */ > + err = c2_init_pd_table(c2dev); > + if (err) > + goto bail5; > + > + /* Initialize the QP pool */ > + err = c2_init_qp_table(c2dev); > + if (err) > + goto bail6; > + > + /* XXX hardcode an address */ > + err = c2_setconfig_hack(c2dev); > + if (err) > + goto bail7; > + > +#ifdef NETEVENT_NOTIFIER > + register_netevent_notifier(&nb); > +#endif > + return 0; > + > +bail7: > + c2_cleanup_qp_table(c2dev); > +bail6: > + c2_cleanup_pd_table(c2dev); > +bail5: > + c2_rnic_close(c2dev); > +bail4: > + vq_term(c2dev); > +bail3: > + kfree(q2_pages); > +bail2: > + kfree(q1_pages); > +bail1: > + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); > +bail0: > + vfree(c2dev->qptr_array); > + > + return err; > +} > + > +/* > + * Called by c2_remove to clean up the RNIC resources. > + */ > +void c2_rnic_term(struct c2_dev* c2dev) > +{ > +#ifdef NETEVENT_NOTIFIER > + unregister_netevent_notifier(&nb); > +#endif > + > + /* Close the open adapter instance */ > + c2_rnic_close(c2dev); > + > + /* Send the TERM message to the adapter */ > + c2_adapter_term(c2dev); > + > + /* Disable interrupts on the adapter */ > + c2_write32(c2dev->regs + C2_IDIS, 1); > + > + /* Free the QP pool */ > + c2_cleanup_qp_table(c2dev); > + > + /* Free the PD pool */ > + c2_cleanup_pd_table(c2dev); > + > + /* Free the verbs request allocator */ > + vq_term(c2dev); > + > + /* Free the asynchronous event queue */ > + kfree(c2dev->aeq.msg_pool); > + > + /* Free the verbs reply queue */ > + kfree(c2dev->rep_vq.msg_pool); > + > + /* Free the MQ shared pointer pool */ > + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); > + > + /* Free the qptr_array */ > + vfree(c2dev->qptr_array); > + > + return; > +} > Index: hw/amso1100/c2_vq.h > =================================================================== > --- hw/amso1100/c2_vq.h (revision 0) > +++ hw/amso1100/c2_vq.h (revision 0)
> @@ -0,0 +1,60 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#ifndef _C2_VQ_H_ > +#define _C2_VQ_H_ > +#include > + > +#include "c2.h" > +#include "c2_wr.h" > + > +struct c2_vq_req{ > + u64 reply_msg; /* ptr to reply msg */ > + wait_queue_head_t wait_object; /* wait object for vq reqs */ > + atomic_t reply_ready; /* set when reply is ready */ > + atomic_t refcnt; /* used to cancel WRs... 
*/ > +}; > + > +extern int vq_init(struct c2_dev* c2dev); > +extern void vq_term(struct c2_dev* c2dev); > + > +extern struct c2_vq_req* vq_req_alloc(struct c2_dev *c2dev); > +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); > +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); > +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); > +extern int vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr); > + > +extern void* vq_repbuf_alloc(struct c2_dev *c2dev); > +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); > + > +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req); > +#endif /* _C2_VQ_H_ */ > Index: hw/amso1100/c2_wr.h > =================================================================== > --- hw/amso1100/c2_wr.h (revision 0) > +++ hw/amso1100/c2_wr.h (revision 0) > @@ -0,0 +1,1343 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#ifndef _CC_WR_H_ > +#define _CC_WR_H_ > +#include "cc_types.h" > +/* > + * WARNING: If you change this file, also bump CC_IVN_BASE > + * in common/include/clustercore/cc_ivn.h. > + */ > + > +#ifdef CCDEBUG > +#define CCWR_MAGIC 0xb07700b0 > +#endif > + > +/* > + * Build String Length. It must be the same as CC_BUILD_STR_LEN in ccil_api.h > + */ > +#define WR_BUILD_STR_LEN 64 > + > +#ifdef _MSC_VER > +#define PACKED > +#pragma pack(push) > +#pragma pack(1) > +#define __inline__ __inline > +#else > +#define PACKED __attribute__ ((packed)) > +#endif > + > +/* > + * WARNING: All of these structs need to align any 64bit types on > + * 64 bit boundaries! 64bit types include u64 and u64. > + */ > + > +/* > + * Clustercore Work Request Header. Be sensitive to field layout > + * and alignment. > + */ > +typedef struct { > + /* wqe_count is part of the cqe. It is put here so the > + * adapter can write to it while the wr is pending without > + * clobbering part of the wr. This word need not be dma'd > + * from the host to adapter by libccil, but we copy it anyway > + * to make the memcpy to the adapter better aligned. > + */ > + u32 wqe_count; > + > + /* Put these fields next so that later 32- and 64-bit > + * quantities are naturally aligned. 
> + */ > + u8 id; > + u8 result; /* adapter -> host */ > + u8 sge_count; /* host -> adapter */ > + u8 flags; /* host -> adapter */ > + > + u64 context; > +#ifdef CCMSGMAGIC > + u32 magic; > + u32 pad; > +#endif > +} PACKED ccwr_hdr_t; > + > +/* > + *------------------------ RNIC ------------------------ > + */ > + > +/* > + * WR_RNIC_OPEN > + */ > + > +/* > + * Flags for the RNIC WRs > + */ > +typedef enum { > + RNIC_IRD_STATIC = 0x0001, > + RNIC_ORD_STATIC = 0x0002, > + RNIC_QP_STATIC = 0x0004, > + RNIC_SRQ_SUPPORTED = 0x0008, > + RNIC_PBL_BLOCK_MODE = 0x0010, > + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, > + RNIC_CQ_OVF_DETECTED = 0x0040, > + RNIC_PRIV_MODE = 0x0080 > +} PACKED cc_rnic_flags_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u16 flags; /* See cc_rnic_flags_t */ > + u16 port_num; > +} PACKED ccwr_rnic_open_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_open_rep_t; > + > +typedef union { > + ccwr_rnic_open_req_t req; > + ccwr_rnic_open_rep_t rep; > +} PACKED ccwr_rnic_open_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_query_req_t; > + > +/* > + * WR_RNIC_QUERY > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u32 vendor_id; > + u32 part_number; > + u32 hw_version; > + u32 fw_ver_major; > + u32 fw_ver_minor; > + u32 fw_ver_patch; > + char fw_ver_build_str[WR_BUILD_STR_LEN]; > + u32 max_qps; > + u32 max_qp_depth; > + u32 max_srq_depth; > + u32 max_send_sgl_depth; > + u32 max_rdma_sgl_depth; > + u32 max_cqs; > + u32 max_cq_depth; > + u32 max_cq_event_handlers; > + u32 max_mrs; > + u32 max_pbl_depth; > + u32 max_pds; > + u32 max_global_ird; > + u32 max_global_ord; > + u32 max_qp_ird; > + u32 max_qp_ord; > + u32 flags; /* See cc_rnic_flags_t */ > + u32 max_mws; > + u32 pbe_range_low; > + u32 pbe_range_high; > + u32 max_srqs; > + u32 page_size; > +} PACKED ccwr_rnic_query_rep_t; > + > +typedef union { > + ccwr_rnic_query_req_t 
req; > + ccwr_rnic_query_rep_t rep; > +} PACKED ccwr_rnic_query_t; > + > +/* > + * WR_RNIC_GETCONFIG > + */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 option; /* see cc_getconfig_cmd_t */ > + u64 reply_buf; > + u32 reply_buf_len; > +} PACKED ccwr_rnic_getconfig_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 option; /* see cc_getconfig_cmd_t */ > + u32 count_len; /* length of the number of addresses configured */ > +} PACKED ccwr_rnic_getconfig_rep_t; > + > +typedef union { > + ccwr_rnic_getconfig_req_t req; > + ccwr_rnic_getconfig_rep_t rep; > +} PACKED ccwr_rnic_getconfig_t; > + > +/* > + * WR_RNIC_SETCONFIG > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 option; /* See cc_setconfig_cmd_t */ > + /* variable data and pad See cc_netaddr_t and > + * cc_route_t > + */ > + u8 data[0]; > +} PACKED ccwr_rnic_setconfig_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_rnic_setconfig_rep_t; > + > +typedef union { > + ccwr_rnic_setconfig_req_t req; > + ccwr_rnic_setconfig_rep_t rep; > +} PACKED ccwr_rnic_setconfig_t; > + > +/* > + * WR_RNIC_CLOSE > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_rnic_close_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_rnic_close_rep_t; > + > +typedef union { > + ccwr_rnic_close_req_t req; > + ccwr_rnic_close_rep_t rep; > +} PACKED ccwr_rnic_close_t; > + > +/* > + *------------------------ CQ ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_ht; > + u64 user_context; > + u64 msg_pool; > + u32 rnic_handle; > + u32 msg_size; > + u32 depth; > +} PACKED ccwr_cq_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 mq_index; > + u32 adapter_shared; > + u32 cq_handle; > +} PACKED ccwr_cq_create_rep_t; > + > +typedef union { > + ccwr_cq_create_req_t req; > + ccwr_cq_create_rep_t rep; > +} PACKED ccwr_cq_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > 
+ u32 rnic_handle; > + u32 cq_handle; > + u32 new_depth; > + u64 new_msg_pool; > +} PACKED ccwr_cq_modify_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cq_modify_rep_t; > + > +typedef union { > + ccwr_cq_modify_req_t req; > + ccwr_cq_modify_rep_t rep; > +} PACKED ccwr_cq_modify_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 cq_handle; > +} PACKED ccwr_cq_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cq_destroy_rep_t; > + > +typedef union { > + ccwr_cq_destroy_req_t req; > + ccwr_cq_destroy_rep_t rep; > +} PACKED ccwr_cq_destroy_t; > + > +/* > + *------------------------ PD ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_pd_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_pd_alloc_rep_t; > + > +typedef union { > + ccwr_pd_alloc_req_t req; > + ccwr_pd_alloc_rep_t rep; > +} PACKED ccwr_pd_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_pd_dealloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_pd_dealloc_rep_t; > + > +typedef union { > + ccwr_pd_dealloc_req_t req; > + ccwr_pd_dealloc_rep_t rep; > +} PACKED ccwr_pd_dealloc_t; > + > +/* > + *------------------------ SRQ ------------------------ > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_ht; > + u64 user_context; > + u32 rnic_handle; > + u32 srq_depth; > + u32 srq_limit; > + u32 sgl_depth; > + u32 pd_id; > +} PACKED ccwr_srq_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 srq_depth; > + u32 sgl_depth; > + u32 msg_size; > + u32 mq_index; > + u32 mq_start; > + u32 srq_handle; > +} PACKED ccwr_srq_create_rep_t; > + > +typedef union { > + ccwr_srq_create_req_t req; > + ccwr_srq_create_rep_t rep; > +} PACKED ccwr_srq_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 srq_handle; > +} PACKED 
ccwr_srq_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_srq_destroy_rep_t; > + > +typedef union { > + ccwr_srq_destroy_req_t req; > + ccwr_srq_destroy_rep_t rep; > +} PACKED ccwr_srq_destroy_t; > + > +/* > + *------------------------ QP ------------------------ > + */ > +typedef enum { > + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ > + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ > + QP_MW_BIND = 0x00000004, /* MWs enabled */ > + QP_ZERO_STAG = 0x00000008, /* enabled? */ > + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ > + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ > + /* enabled? */ > +} PACKED ccwr_qp_flags_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 shared_sq_ht; > + u64 shared_rq_ht; > + u64 user_context; > + u32 rnic_handle; > + u32 sq_cq_handle; > + u32 rq_cq_handle; > + u32 sq_depth; > + u32 rq_depth; > + u32 srq_handle; > + u32 srq_limit; > + u32 flags; /* see ccwr_qp_flags_t */ > + u32 send_sgl_depth; > + u32 recv_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 ord; > + u32 ird; > + u32 pd_id; > +} PACKED ccwr_qp_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 sq_depth; > + u32 rq_depth; > + u32 send_sgl_depth; > + u32 recv_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 ord; > + u32 ird; > + u32 sq_msg_size; > + u32 sq_mq_index; > + u32 sq_mq_start; > + u32 rq_msg_size; > + u32 rq_mq_index; > + u32 rq_mq_start; > + u32 qp_handle; > +} PACKED ccwr_qp_create_rep_t; > + > +typedef union { > + ccwr_qp_create_req_t req; > + ccwr_qp_create_rep_t rep; > +} PACKED ccwr_qp_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > +} PACKED ccwr_qp_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; > + u32 rnic_handle; > + u32 sq_depth; > + u32 rq_depth; > + u32 send_sgl_depth; > + u32 rdma_write_sgl_depth; > + u32 recv_sgl_depth; > + u32 ord; > + u32 ird; > + u16 qp_state; > + u16 flags; 
/* see ccwr_qp_flags_t */ > + u32 qp_id; > + u32 local_addr; > + u32 remote_addr; > + u16 local_port; > + u16 remote_port; > + u32 terminate_msg_length; /* 0 if not present */ > + u8 data[0]; > + /* Terminate Message in-line here. */ > +} PACKED ccwr_qp_query_rep_t; > + > +typedef union { > + ccwr_qp_query_req_t req; > + ccwr_qp_query_rep_t rep; > +} PACKED ccwr_qp_query_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 stream_msg; > + u32 stream_msg_length; > + u32 rnic_handle; > + u32 qp_handle; > + u32 next_qp_state; > + u32 ord; > + u32 ird; > + u32 sq_depth; > + u32 rq_depth; > + u32 llp_ep_handle; > +} PACKED ccwr_qp_modify_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 ord; > + u32 ird; > + u32 sq_depth; > + u32 rq_depth; > + u32 sq_msg_size; > + u32 sq_mq_index; > + u32 sq_mq_start; > + u32 rq_msg_size; > + u32 rq_mq_index; > + u32 rq_mq_start; > +} PACKED ccwr_qp_modify_rep_t; > + > +typedef union { > + ccwr_qp_modify_req_t req; > + ccwr_qp_modify_rep_t rep; > +} PACKED ccwr_qp_modify_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > +} PACKED ccwr_qp_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_qp_destroy_rep_t; > + > +typedef union { > + ccwr_qp_destroy_req_t req; > + ccwr_qp_destroy_rep_t rep; > +} PACKED ccwr_qp_destroy_t; > + > +/* > + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can > + * only be posted when a QP is in IDLE state. After the connect request is > + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. > + * No synchronous reply from adapter to this WR. The results of > + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS > + * See ccwr_ae_active_connect_results_t > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; > + u32 remote_addr; > + u16 remote_port; > + u16 pad; > + u32 private_data_length; > + u8 private_data[0]; /* Private data in-line. 
*/ > +} PACKED ccwr_qp_connect_req_t; > + > +typedef struct { > + ccwr_qp_connect_req_t req; > + /* no synchronous reply. */ > +} PACKED ccwr_qp_connect_t; > + > + > +/* > + *------------------------ MM ------------------------ > + */ > + > +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pbl_depth; > + u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > +} PACKED ccwr_nsmr_stag_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_stag_alloc_rep_t; > + > +typedef union { > + ccwr_nsmr_stag_alloc_req_t req; > + ccwr_nsmr_stag_alloc_rep_t rep; > +} PACKED ccwr_nsmr_stag_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 pd_id; > + u32 pbl_depth; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_register_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_register_rep_t; > + > +typedef union { > + ccwr_nsmr_register_req_t req; > + ccwr_nsmr_register_rep_t rep; > +} PACKED ccwr_nsmr_register_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 flags; /* See ccwr_mr_flags_t */ > + u32 stag_index; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_pbl_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_nsmr_pbl_rep_t; > + > +typedef union { > + ccwr_nsmr_pbl_req_t req; > + ccwr_nsmr_pbl_rep_t rep; > +} PACKED ccwr_nsmr_pbl_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_mr_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u8 stag_key; > + u8 pad[3]; > + 
u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > + u32 pbl_depth; > +} PACKED ccwr_mr_query_rep_t; > + > +typedef union { > + ccwr_mr_query_req_t req; > + ccwr_mr_query_rep_t rep; > +} PACKED ccwr_mr_query_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_mw_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u8 stag_key; > + u8 pad[3]; > + u32 pd_id; > + u32 flags; /* See ccwr_mr_flags_t */ > +} PACKED ccwr_mw_query_rep_t; > + > +typedef union { > + ccwr_mw_query_req_t req; > + ccwr_mw_query_rep_t rep; > +} PACKED ccwr_mw_query_t; > + > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 stag_index; > +} PACKED ccwr_stag_dealloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_stag_dealloc_rep_t; > + > +typedef union { > + ccwr_stag_dealloc_req_t req; > + ccwr_stag_dealloc_rep_t rep; > +} PACKED ccwr_stag_dealloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 stag_index; > + u32 pd_id; > + u32 pbl_depth; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + u32 pad1; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_reregister_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 pbl_depth; > + u32 stag_index; > +} PACKED ccwr_nsmr_reregister_rep_t; > + > +typedef union { > + ccwr_nsmr_reregister_req_t req; > + ccwr_nsmr_reregister_rep_t rep; > +} PACKED ccwr_nsmr_reregister_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 va; > + u32 rnic_handle; > + u16 flags; /* See ccwr_mr_flags_t */ > + u8 stag_key; > + u8 pad; > + u32 stag_index; > + u32 pd_id; > +} PACKED ccwr_smr_register_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 stag_index; > +} PACKED ccwr_smr_register_rep_t; > + > +typedef union { > + ccwr_smr_register_req_t req; > + 
ccwr_smr_register_rep_t rep; > +} PACKED ccwr_smr_register_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 pd_id; > +} PACKED ccwr_mw_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 stag_index; > +} PACKED ccwr_mw_alloc_rep_t; > + > +typedef union { > + ccwr_mw_alloc_req_t req; > + ccwr_mw_alloc_rep_t rep; > +} PACKED ccwr_mw_alloc_t; > + > +/* > + *------------------------ WRs ----------------------- > + */ > + > +typedef struct { > + ccwr_hdr_t hdr; /* Has status and WR Type */ > +} PACKED ccwr_user_hdr_t; > + > +/* Completion queue entry. */ > +typedef struct { > + ccwr_hdr_t hdr; /* Has status and WR Type */ > + u64 qp_user_context;/* cc_user_qp_t * */ > + u32 qp_state; /* Current QP State */ > + u32 handle; /* QPID or EP Handle */ > + u32 bytes_rcvd; /* valid for RECV WCs */ > + u32 stag; > +} PACKED ccwr_ce_t; > + > + > +/* > + * Flags used for all post-sq WRs. These must fit in the flags > + * field of the ccwr_hdr_t (eight bits). > + */ > +typedef enum { > + SQ_SIGNALED = 0x01, > + SQ_READ_FENCE = 0x02, > + SQ_FENCE = 0x04, > +} PACKED cc_sq_flags_t; > + > +/* > + * Common fields for all post-sq WRs. Namely the standard header and a > + * secondary header with fields common to all post-sq WRs. > + */ > +typedef struct { > + ccwr_user_hdr_t user_hdr; > +} PACKED cc_sq_hdr_t; > + > +/* > + * Same as above but for post-rq WRs. > + */ > +typedef struct { > + ccwr_user_hdr_t user_hdr; > +} PACKED cc_rq_hdr_t; > + > +/* > + * use the same struct for all sends. 
> + */ > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u32 sge_len; > + u32 remote_stag; > + u8 data[0]; /* SGE array */ > +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, > ccwr_send_se_inv_req_t; > + > +typedef ccwr_ce_t ccwr_send_rep_t; > + > +typedef union { > + ccwr_send_req_t req; > + ccwr_send_rep_t rep; > +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, ccwr_send_se_inv_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 remote_to; > + u32 remote_stag; > + u32 sge_len; > + u8 data[0]; /* SGE array */ > +} PACKED ccwr_rdma_write_req_t; > + > +typedef ccwr_ce_t ccwr_rdma_write_rep_t; > + > +typedef union { > + ccwr_rdma_write_req_t req; > + ccwr_rdma_write_rep_t rep; > +} PACKED ccwr_rdma_write_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 local_to; > + u64 remote_to; > + u32 local_stag; > + u32 remote_stag; > + u32 length; > +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t; > + > +typedef ccwr_ce_t ccwr_rdma_read_rep_t; > + > +typedef union { > + ccwr_rdma_read_req_t req; > + ccwr_rdma_read_rep_t rep; > +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 va; > + u8 stag_key; > + u8 pad[3]; > + u32 mw_stag_index; > + u32 mr_stag_index; > + u32 length; > + u32 flags; /* see ccwr_mr_flags_t; */ > +} PACKED ccwr_mw_bind_req_t; > + > +typedef ccwr_ce_t ccwr_mw_bind_rep_t; > + > +typedef union { > + ccwr_mw_bind_req_t req; > + ccwr_mw_bind_rep_t rep; > +} PACKED ccwr_mw_bind_t; > + > +typedef struct { > + cc_sq_hdr_t sq_hdr; > + u64 va; > + u8 stag_key; > + u8 pad[3]; > + u32 stag_index; > + u32 pbe_size; > + u32 fbo; > + u32 length; > + u32 addrs_length; > + /* array of paddrs (must be aligned on a 64bit boundary) */ > + u64 paddrs[0]; > +} PACKED ccwr_nsmr_fastreg_req_t; > + > +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t; > + > +typedef union { > + ccwr_nsmr_fastreg_req_t req; > + ccwr_nsmr_fastreg_rep_t rep; > +} PACKED ccwr_nsmr_fastreg_t; > + > +typedef 
struct { > + cc_sq_hdr_t sq_hdr; > + u8 stag_key; > + u8 pad[3]; > + u32 stag_index; > +} PACKED ccwr_stag_invalidate_req_t; > + > +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t; > + > +typedef union { > + ccwr_stag_invalidate_req_t req; > + ccwr_stag_invalidate_rep_t rep; > +} PACKED ccwr_stag_invalidate_t; > + > +typedef union { > + cc_sq_hdr_t sq_hdr; > + ccwr_send_req_t send; > + ccwr_send_se_req_t send_se; > + ccwr_send_inv_req_t send_inv; > + ccwr_send_se_inv_req_t send_se_inv; > + ccwr_rdma_write_req_t rdma_write; > + ccwr_rdma_read_req_t rdma_read; > + ccwr_mw_bind_req_t mw_bind; > + ccwr_nsmr_fastreg_req_t nsmr_fastreg; > + ccwr_stag_invalidate_req_t stag_inv; > +} PACKED ccwr_sqwr_t; > + > + > +/* > + * RQ WRs > + */ > +typedef struct { > + cc_rq_hdr_t rq_hdr; > + u8 data[0]; /* array of SGEs */ > +} PACKED ccwr_rqwr_t, ccwr_recv_req_t; > + > +typedef ccwr_ce_t ccwr_recv_rep_t; > + > +typedef union { > + ccwr_recv_req_t req; > + ccwr_recv_rep_t rep; > +} PACKED ccwr_recv_t; > + > +/* > + * All AEs start with this header. Most AEs only need to convey the > + * information in the header. Some, like LLP connection events, need > + * more info. The union typdef ccwr_ae_t has all the possible AEs. > + * > + * hdr.context is the user_context from the rnic_open WR. NULL If this > + * is not affiliated with an rnic > + * > + * hdr.id is the AE identifier (eg; CCAE_REMOTE_SHUTDOWN, > + * CCAE_LLP_CLOSE_COMPLETE) > + * > + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, CC_RES_IND_SRQ > + * > + * user_context is the context passed down when the host created the resource. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; /* user context for this res. 
*/ > + u32 resource_type; /* see cc_resource_indicator_t */ > + u32 resource; /* handle for resource */ > + u32 qp_state; /* current QP State */ > +} PACKED PACKED ccwr_ae_hdr_t; > + > +/* > + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, > + * the adapter moves the QP into RTS state > + */ > +typedef struct { > + ccwr_ae_hdr_t ae_hdr; > + u32 laddr; > + u32 raddr; > + u16 lport; > + u16 rport; > + u32 private_data_length; > + u8 private_data[0]; /* data is in-line in the msg. */ > +} PACKED ccwr_ae_active_connect_results_t; > + > +/* > + * When connections are established by the stack (and the private data > + * MPA frame is received), the adapter will generate an event to the host. > + * The details of the connection, any private data, and the new connection > + * request handle is passed up via the CCAE_CONNECTION_REQUEST msg on the > + * AE queue: > + */ > +typedef struct { > + ccwr_ae_hdr_t ae_hdr; > + u32 cr_handle; /* connreq handle (sock ptr) */ > + u32 laddr; > + u32 raddr; > + u16 lport; > + u16 rport; > + u32 private_data_length; > + u8 private_data[0]; /* data is in-line in the msg. */ > +} PACKED ccwr_ae_connection_request_t; > + > +typedef union { > + ccwr_ae_hdr_t ae_generic; > + ccwr_ae_active_connect_results_t ae_active_connect_results; > + ccwr_ae_connection_request_t ae_connection_request; > +} PACKED ccwr_ae_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u64 hint_count; > + u64 q0_host_shared; > + u64 q1_host_shared; > + u64 q1_host_msg_pool; > + u64 q2_host_shared; > + u64 q2_host_msg_pool; > +} PACKED ccwr_init_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_init_rep_t; > + > +typedef union { > + ccwr_init_req_t req; > + ccwr_init_rep_t rep; > +} PACKED ccwr_init_t; > + > +/* > + * For upgrading flash. 
> + */ > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > +} PACKED ccwr_flash_init_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 adapter_flash_buf_offset; > + u32 adapter_flash_len; > +} PACKED ccwr_flash_init_rep_t; > + > +typedef union { > + ccwr_flash_init_req_t req; > + ccwr_flash_init_rep_t rep; > +} PACKED ccwr_flash_init_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 len; > +} PACKED ccwr_flash_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 status; > +} PACKED ccwr_flash_rep_t; > + > +typedef union { > + ccwr_flash_req_t req; > + ccwr_flash_rep_t rep; > +} PACKED ccwr_flash_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 size; > +} PACKED ccwr_buf_alloc_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 offset; /* 0 if mem not available */ > + u32 size; /* 0 if mem not available */ > +} PACKED ccwr_buf_alloc_rep_t; > + > +typedef union { > + ccwr_buf_alloc_req_t req; > + ccwr_buf_alloc_rep_t rep; > +} PACKED ccwr_buf_alloc_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 offset; /* Must match value from alloc */ > + u32 size; /* Must match value from alloc */ > +} PACKED ccwr_buf_free_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_buf_free_rep_t; > + > +typedef union { > + ccwr_buf_free_req_t req; > + ccwr_buf_free_rep_t rep; > +} PACKED ccwr_buf_free_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 offset; > + u32 size; > + u32 type; > + u32 flags; > +} PACKED ccwr_flash_write_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 status; > +} PACKED ccwr_flash_write_rep_t; > + > +typedef union { > + ccwr_flash_write_req_t req; > + ccwr_flash_write_rep_t rep; > +} PACKED ccwr_flash_write_t; > + > +/* > + * Messages for LLP connection setup. > + */ > + > +/* > + * Listen Request. This allocates a listening endpoint to allow passive > + * connection setup. 
Newly established LLP connections are passed up > + * via an AE. See ccwr_ae_connection_request_t > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u64 user_context; /* returned in AEs. */ > + u32 rnic_handle; > + u32 local_addr; /* local addr, or 0 */ > + u16 local_port; /* 0 means "pick one" */ > + u16 pad; > + u32 backlog; /* tradional tcp listen bl */ > +} PACKED ccwr_ep_listen_create_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 ep_handle; /* handle to new listening ep */ > + u16 local_port; /* resulting port... */ > + u16 pad; > +} PACKED ccwr_ep_listen_create_rep_t; > + > +typedef union { > + ccwr_ep_listen_create_req_t req; > + ccwr_ep_listen_create_rep_t rep; > +} PACKED ccwr_ep_listen_create_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; > +} PACKED ccwr_ep_listen_destroy_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_ep_listen_destroy_rep_t; > + > +typedef union { > + ccwr_ep_listen_destroy_req_t req; > + ccwr_ep_listen_destroy_rep_t rep; > +} PACKED ccwr_ep_listen_destroy_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; > +} PACKED ccwr_ep_query_req_t; > + > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 local_addr; > + u32 remote_addr; > + u16 local_port; > + u16 remote_port; > +} PACKED ccwr_ep_query_rep_t; > + > +typedef union { > + ccwr_ep_query_req_t req; > + ccwr_ep_query_rep_t rep; > +} PACKED ccwr_ep_query_t; > + > + > +/* > + * The host passes this down to indicate acceptance of a pending iWARP > + * connection. The cr_handle was obtained from the CONNECTION_REQUEST > + * AE passed up by the adapter. See ccwr_ae_connection_request_t. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 qp_handle; /* QP to bind to this LLP conn */ > + u32 ep_handle; /* LLP handle to accept */ > + u32 private_data_length; > + u8 private_data[0]; /* data in-line in msg. 
*/ > +} PACKED ccwr_cr_accept_req_t; > + > +/* > + * adapter sends reply when private data is successfully submitted to > + * the LLP. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cr_accept_rep_t; > + > +typedef union { > + ccwr_cr_accept_req_t req; > + ccwr_cr_accept_rep_t rep; > +} PACKED ccwr_cr_accept_t; > + > +/* > + * The host sends this down if a given iWARP connection request was > + * rejected by the consumer. The cr_handle was obtained from a > + * previous ccwr_ae_connection_request_t AE sent by the adapter. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > + u32 rnic_handle; > + u32 ep_handle; /* LLP handle to reject */ > +} PACKED ccwr_cr_reject_req_t; > + > +/* > + * Dunno if this is needed, but we'll add it for now. The adapter will > + * send the reject_reply after the LLP endpoint has been destroyed. > + */ > +typedef struct { > + ccwr_hdr_t hdr; > +} PACKED ccwr_cr_reject_rep_t; > + > +typedef union { > + ccwr_cr_reject_req_t req; > + ccwr_cr_reject_rep_t rep; > +} PACKED ccwr_cr_reject_t; > + > +/* > + * console command. Used to implement a debug console over the verbs > + * request and reply queues. > + */ > + > +/* > + * Console request message. It contains: > + * - message hdr with id = CCWR_CONSOLE > + * - the physaddr/len of host memory to be used for the reply. > + * - the command string. eg: "netstat -s" or "zoneinfo" > + */ > +typedef struct { > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > + u64 reply_buf; /* pinned host buf for reply */ > + u32 reply_buf_len; /* length of reply buffer */ > + u8 command[0]; /* NUL terminated ascii string */ > + /* containing the command req */ > +} PACKED ccwr_console_req_t; > + > +/* > + * flags used in the console reply. > + */ > +typedef enum { > + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ > +} PACKED cc_console_flags_t; > + > +/* > + * Console reply message. 
> + * hdr.result contains the cc_status_t error if the reply was _not_ generated, > + * or CC_OK if the reply was generated. > + */ > +typedef struct { > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > + u32 flags; /* see cc_console_flags_t */ > +} PACKED ccwr_console_rep_t; > + > +typedef union { > + ccwr_console_req_t req; > + ccwr_console_rep_t rep; > +} PACKED ccwr_console_t; > + > + > +/* > + * Giant union with all WRs. Makes life easier... > + */ > +typedef union { > + ccwr_hdr_t hdr; > + ccwr_user_hdr_t user_hdr; > + ccwr_rnic_open_t rnic_open; > + ccwr_rnic_query_t rnic_query; > + ccwr_rnic_getconfig_t rnic_getconfig; > + ccwr_rnic_setconfig_t rnic_setconfig; > + ccwr_rnic_close_t rnic_close; > + ccwr_cq_create_t cq_create; > + ccwr_cq_modify_t cq_modify; > + ccwr_cq_destroy_t cq_destroy; > + ccwr_pd_alloc_t pd_alloc; > + ccwr_pd_dealloc_t pd_dealloc; > + ccwr_srq_create_t srq_create; > + ccwr_srq_destroy_t srq_destroy; > + ccwr_qp_create_t qp_create; > + ccwr_qp_query_t qp_query; > + ccwr_qp_modify_t qp_modify; > + ccwr_qp_destroy_t qp_destroy; > + ccwr_qp_connect_t qp_connect; > + ccwr_nsmr_stag_alloc_t nsmr_stag_alloc; > + ccwr_nsmr_register_t nsmr_register; > + ccwr_nsmr_pbl_t nsmr_pbl; > + ccwr_mr_query_t mr_query; > + ccwr_mw_query_t mw_query; > + ccwr_stag_dealloc_t stag_dealloc; > + ccwr_sqwr_t sqwr; > + ccwr_rqwr_t rqwr; > + ccwr_ce_t ce; > + ccwr_ae_t ae; > + ccwr_init_t init; > + ccwr_ep_listen_create_t ep_listen_create; > + ccwr_ep_listen_destroy_t ep_listen_destroy; > + ccwr_cr_accept_t cr_accept; > + ccwr_cr_reject_t cr_reject; > + ccwr_console_t console; > + ccwr_flash_init_t flash_init; > + ccwr_flash_t flash; > + ccwr_buf_alloc_t buf_alloc; > + ccwr_buf_free_t buf_free; > + ccwr_flash_write_t flash_write; > +} PACKED ccwr_t; > + > + > +/* > + * Accessors for the wr fields that are packed together tightly to > + * reduce the wr message size. 
The wr arguments are void* so that > + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types > + * in the ccwr_t union can be passed in. > + */ > +static __inline__ u8 > +c2_wr_get_id(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->id; > +} > +static __inline__ void > +c2_wr_set_id(void *wr, u8 id) > +{ > + ((ccwr_hdr_t *)wr)->id = id; > +} > +static __inline__ u8 > +c2_wr_get_result(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->result; > +} > +static __inline__ void > +c2_wr_set_result(void *wr, u8 result) > +{ > + ((ccwr_hdr_t *)wr)->result = result; > +} > +static __inline__ u8 > +c2_wr_get_flags(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->flags; > +} > +static __inline__ void > +c2_wr_set_flags(void *wr, u8 flags) > +{ > + ((ccwr_hdr_t *)wr)->flags = flags; > +} > +static __inline__ u8 > +c2_wr_get_sge_count(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->sge_count; > +} > +static __inline__ void > +c2_wr_set_sge_count(void *wr, u8 sge_count) > +{ > + ((ccwr_hdr_t *)wr)->sge_count = sge_count; > +} > +static __inline__ u32 > +c2_wr_get_wqe_count(void *wr) > +{ > + return ((ccwr_hdr_t *)wr)->wqe_count; > +} > +static __inline__ void > +c2_wr_set_wqe_count(void *wr, u32 wqe_count) > +{ > + ((ccwr_hdr_t *)wr)->wqe_count = wqe_count; > +} > + > +#undef PACKED > + > +#ifdef _MSC_VER > +#pragma pack(pop) > +#endif > + > +#endif /* _CC_WR_H_ */ > Index: hw/amso1100/c2_cm.c > =================================================================== > --- hw/amso1100/c2_cm.c (revision 0) > +++ hw/amso1100/c2_cm.c (revision 0) > @@ -0,0 +1,415 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > +#include "c2.h" > +#include "c2_vq.h" > +#include > + > +int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > +{ > + struct c2_dev *c2dev = to_c2dev(cm_id->device); > + struct c2_qp *qp = to_c2qp(cm_id->qp); > + ccwr_qp_connect_req_t *wr; /* variable size needs a malloc. */ > + struct c2_vq_req *vq_req; > + int err; > + > + /* > + * only support the max private_data length > + */ > + if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { > + return -EINVAL; > + } > + > + /* > + * Create and send a WR_QP_CONNECT... 
> + */ > + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); > + if (!wr) { > + return -ENOMEM; > + } > + > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + err = -ENOMEM; > + goto bail0; > + } > + > + c2_wr_set_id(wr, CCWR_QP_CONNECT); > + wr->hdr.context = 0; > + wr->rnic_handle = c2dev->adapter_handle; > + wr->qp_handle = qp->adapter_handle; > + > + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; > + wr->remote_port = cm_id->remote_addr.sin_port; > + > + /* > + * Move any private data from the callers's buf into > + * the WR. > + */ > + if (pdata) { > + wr->private_data_length = cpu_to_be32(pdata_len); > + memcpy(&wr->private_data[0], pdata, pdata_len); > + } else { > + wr->private_data_length = 0; > + } > + > + /* > + * Send WR to adapter. NOTE: There is no synch reply from > + * the adapter. > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > + vq_req_free(c2dev, vq_req); > +bail0: > + kfree(wr); > + return err; > +} > + > +int > +c2_llp_service_create(struct iw_cm_id* cm_id, int backlog) > +{ > + struct c2_dev *c2dev; > + ccwr_ep_listen_create_req_t wr; > + ccwr_ep_listen_create_rep_t *reply; > + struct c2_vq_req *vq_req; > + int err; > + > + c2dev = to_c2dev(cm_id->device); > + if (c2dev == NULL) > + return -EINVAL; > + > + /* > + * Allocate verbs request. > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) > + return -ENOMEM; > + > + /* > + * Build the WR > + */ > + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); > + wr.hdr.context = (u64)(unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; > + wr.local_port = cm_id->local_addr.sin_port; > + wr.backlog = cpu_to_be32(backlog); > + wr.user_context = (u64)(unsigned long)cm_id; > + > + /* > + * Reference the request struct. Dereferenced in the int handler. 
> + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_ep_listen_create_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail1; > + } > + > + if ( (err = c2_errno(reply)) != 0) { > + goto bail1; > + } > + > + /* > + * get the adapter handle > + */ > + cm_id->provider_id = reply->ep_handle; > + > + /* > + * free vq stuff > + */ > + vq_repbuf_free(c2dev, reply); > + vq_req_free(c2dev, vq_req); > + > + return 0; > + > +bail1: > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > + > +int > +c2_llp_service_destroy(struct iw_cm_id* cm_id) > +{ > + > + struct c2_dev *c2dev; > + ccwr_ep_listen_destroy_req_t wr; > + ccwr_ep_listen_destroy_rep_t *reply; > + struct c2_vq_req *vq_req; > + int err; > + > + c2dev = to_c2dev(cm_id->device); > + if (c2dev == NULL) > + return -EINVAL; > + > + /* > + * Allocate verbs request. > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + return -ENOMEM; > + } > + > + /* > + * Build the WR > + */ > + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.ep_handle = cm_id->provider_id; > + > + /* > + * reference the request struct. dereferenced in the int handler. 
> + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_ep_listen_destroy_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + if ( (err = c2_errno(reply)) != 0) { > + goto bail1; > + } > + > +bail1: > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > + > +int > +c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > +{ > + struct c2_dev *c2dev = to_c2dev(cm_id->device); > + struct c2_qp *qp = to_c2qp(cm_id->qp); > + ccwr_cr_accept_req_t *wr; /* variable length WR */ > + struct c2_vq_req *vq_req; > + ccwr_cr_accept_rep_t *reply; /* VQ Reply msg ptr. */ > + int err; > + > + /* Make sure there's a bound QP */ > + if (qp == 0) > + return -EINVAL; > + > + /* > + * only support the max private_data length > + */ > + if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { > + return -EINVAL; > + } > + > + /* > + * Allocate verbs request. > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + return -ENOMEM; > + } > + > + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); > + if (!wr) { > + err = -ENOMEM; > + goto bail0; > + } > + > + /* > + * Build the WR > + */ > + c2_wr_set_id(wr, CCWR_CR_ACCEPT); > + wr->hdr.context = (unsigned long)vq_req; > + wr->rnic_handle = c2dev->adapter_handle; > + wr->ep_handle = (u32)cm_id->provider_id; > + wr->qp_handle = qp->adapter_handle; > + if (pdata) { > + wr->private_data_length = cpu_to_be32(pdata_len); > + memcpy(&wr->private_data[0], pdata, pdata_len); > + } else { > + wr->private_data_length = 0; > + } > + > + /* > + * reference the request struct. dereferenced in the int handler. 
> + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail1; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail1; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_cr_accept_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail1; > + } > + > + err = c2_errno(reply); > + vq_repbuf_free(c2dev, reply); > + > +bail1: > + kfree(wr); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +int > +c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > +{ > + struct c2_dev *c2dev; > + ccwr_cr_reject_req_t wr; > + struct c2_vq_req *vq_req; > + ccwr_cr_reject_rep_t *reply; > + int err; > + > + c2dev = to_c2dev(cm_id->device); > + > + /* > + * Allocate verbs request. > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + return -ENOMEM; > + } > + > + /* > + * Build the WR > + */ > + c2_wr_set_id(&wr, CCWR_CR_REJECT); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.ep_handle = (u32)cm_id->provider_id; > + > + /* > + * reference the request struct. dereferenced in the int handler. 
> + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_cr_reject_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + err = c2_errno(reply); > + > + /* > + * free vq stuff > + */ > + vq_repbuf_free(c2dev, reply); > + > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > Index: hw/amso1100/c2_provider.h > =================================================================== > --- hw/amso1100/c2_provider.h (revision 0) > +++ hw/amso1100/c2_provider.h (revision 0) > @@ -0,0 +1,174 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef C2_PROVIDER_H > +#define C2_PROVIDER_H > + > +#include > +#include > + > +#include "c2_mq.h" > +#include > + > +#define C2_MPT_FLAG_ATOMIC (1 << 14) > +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) > +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) > +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) > +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) > + > +struct c2_buf_list { > + void *buf; > + DECLARE_PCI_UNMAP_ADDR(mapping) > +}; > + > + > +/* The user context keeps track of objects allocated for a > + * particular user-mode client. */ > +struct c2_ucontext { > + struct ib_ucontext ibucontext; > + > + int index; /* rnic index (minor) */ > + int port; /* Which GigE port */ > + > + /* > + * Shared HT pages for user-accessible MQs. > + */ > + int hthead; /* index of first free entry */ > + void* htpages; /* kernel vaddr */ > + int htlen; /* length of htpages memory */ > + void* htuva; /* user mapped vaddr */ > + spinlock_t htlock; /* serialize allocation */ > + u64 adapter_hint_uva; /* Activity FIFO */ > +}; > + > +struct c2_mtt; > + > +/* All objects associated with a PD are kept in the > + * associated user context if present. 
> + */ > +struct c2_pd { > + struct ib_pd ibpd; > + u32 pd_id; > + atomic_t sqp_count; > +}; > + > +struct c2_mr { > + struct ib_mr ibmr; > + struct c2_pd *pd; > +}; > + > +struct c2_av; > + > +enum c2_ah_type { > + C2_AH_ON_HCA, > + C2_AH_PCI_POOL, > + C2_AH_KMALLOC > +}; > + > +struct c2_ah { > + struct ib_ah ibah; > +}; > + > +struct c2_cq { > + struct ib_cq ibcq; > + spinlock_t lock; > + atomic_t refcount; > + int cqn; > + int is_kernel; > + wait_queue_head_t wait; > + > + u32 adapter_handle; > + struct c2_mq mq; > +}; > + > +struct c2_wq { > + spinlock_t lock; > +}; > +struct iw_cm_id; > +struct c2_qp { > + struct ib_qp ibqp; > + struct iw_cm_id* cm_id; > + spinlock_t lock; > + atomic_t refcount; > + wait_queue_head_t wait; > + int qpn; > + > + u32 adapter_handle; > + u32 send_sgl_depth; > + u32 recv_sgl_depth; > + u32 rdma_write_sgl_depth; > + u8 state; > + > + struct c2_mq sq_mq; > + struct c2_mq rq_mq; > +}; > + > +struct c2_cr_query_attrs { > + u32 local_addr; > + u32 remote_addr; > + u16 local_port; > + u16 remote_port; > +}; > + > +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) > +{ > + return container_of(ibpd, struct c2_pd, ibpd); > +} > + > +static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext) > +{ > + return container_of(ibucontext, struct c2_ucontext, ibucontext); > +} > + > +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) > +{ > + return container_of(ibmr, struct c2_mr, ibmr); > +} > + > + > +static inline struct c2_ah *to_c2ah(struct ib_ah *ibah) > +{ > + return container_of(ibah, struct c2_ah, ibah); > +} > + > +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) > +{ > + return container_of(ibcq, struct c2_cq, ibcq); > +} > + > +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) > +{ > + return container_of(ibqp, struct c2_qp, ibqp); > +} > +#endif /* C2_PROVIDER_H */ > Index: hw/amso1100/c2_pd.c > =================================================================== > --- hw/amso1100/c2_pd.c 
(revision 0) > +++ hw/amso1100/c2_pd.c (revision 0) > @@ -0,0 +1,73 @@ > +/* > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > + > +#include > +#include > + > +#include "c2.h" > +#include "c2_provider.h" > + > +int c2_pd_alloc(struct c2_dev *dev, int privileged, struct c2_pd *pd) > +{ > + int err = 0; > + > + might_sleep(); > + > + atomic_set(&pd->sqp_count, 0); > + pd->pd_id = c2_alloc(&dev->pd_table.alloc); > + if (pd->pd_id == -1) > + return -ENOMEM; > + > + return err; > +} > + > +void c2_pd_free(struct c2_dev *dev, struct c2_pd *pd) > +{ > + might_sleep(); > + c2_free(&dev->pd_table.alloc, pd->pd_id); > +} > + > +int __devinit c2_init_pd_table(struct c2_dev *dev) > +{ > + return c2_alloc_init(&dev->pd_table.alloc, > + dev->max_pd, > + 0); > +} > + > +void __devexit c2_cleanup_pd_table(struct c2_dev *dev) > +{ > + /* XXX check if any PDs are still allocated? */ > + c2_alloc_cleanup(&dev->pd_table.alloc); > +} > Index: hw/amso1100/c2_cq.c > =================================================================== > --- hw/amso1100/c2_cq.c (revision 0) > +++ hw/amso1100/c2_cq.c (revision 0) > @@ -0,0 +1,401 @@ > +/* > + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. > + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. > + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + * > + */ > +#include "c2.h" > +#include "c2_vq.h" > +#include "cc_status.h" > + > +#define C2_CQ_MSG_SIZE ((sizeof(ccwr_ce_t) + 32-1) & ~(32-1)) > + > +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) > +{ > + struct c2_cq *cq; > + > + cq = c2dev->qptr_array[mq_index]; > + > + if (!cq) { > + dprintk("Completion event for bogus CQ %08x\n", mq_index); > + return; > + } > + > + assert(cq->ibcq.comp_handler); > + (*cq->ibcq.comp_handler)(&cq->ibcq, cq->ibcq.cq_context); > +} > + > +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) > +{ > + struct c2_cq *cq; > + struct c2_mq *q; > + > + cq = c2dev->qptr_array[mq_index]; > + if (!cq) > + return; > + > + spin_lock_irq(&cq->lock); > + > + q = &cq->mq; > + if (q && !c2_mq_empty(q)) { > + u16 priv = q->priv; > + ccwr_ce_t *msg; > + > + while (priv != cpu_to_be16(*q->shared)) { > + msg = (ccwr_ce_t *)(q->msg_pool + priv * q->msg_size); > + if (msg->qp_user_context == (u64)(unsigned long)qp) { > + msg->qp_user_context = (u64)0; > + } > + BUMP(q, priv); > + } > + } > + > + spin_unlock_irq(&cq->lock); > +} > + > +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) > +{ > + switch (status) { > + case CC_OK: return IB_WC_SUCCESS; > + case CCERR_FLUSHED: return IB_WC_WR_FLUSH_ERR; > + case CCERR_BASE_AND_BOUNDS_VIOLATION: return IB_WC_LOC_PROT_ERR; > + case CCERR_ACCESS_VIOLATION: return IB_WC_LOC_ACCESS_ERR; > + case CCERR_TOTAL_LENGTH_TOO_BIG: return IB_WC_LOC_LEN_ERR; > + case CCERR_INVALID_WINDOW: return IB_WC_MW_BIND_ERR; > + default: return IB_WC_GENERAL_ERR; > + } > +} > + > + > +static inline int c2_poll_one(struct c2_dev *c2dev, > + struct c2_cq *cq, > + struct ib_wc *entry) > +{ > + ccwr_ce_t *ce; > + struct c2_qp *qp; > + int is_recv = 0; > + > + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); > + if (!ce) { > + return -EAGAIN; > + } > + > + /* > + * if the qp returned is null then this qp has already > + * been freed and we are unable to process the completion.
> + * try pulling the next message > + */ > + while ( (qp = (struct c2_qp *)(unsigned long)ce->qp_user_context) == NULL) { > + c2_mq_free(&cq->mq); > + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); > + if (!ce) > + return -EAGAIN; > + } > + > + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); > + entry->wr_id = ce->hdr.context; > + entry->qp_num = ce->handle; > + entry->wc_flags = 0; > + entry->slid = 0; > + entry->sl = 0; > + entry->src_qp = 0; > + entry->dlid_path_bits = 0; > + entry->pkey_index = 0; > + > + switch (c2_wr_get_id(ce)) { > + case CC_WR_TYPE_SEND: > + entry->opcode = IB_WC_SEND; > + break; > + case CC_WR_TYPE_RDMA_WRITE: > + entry->opcode = IB_WC_RDMA_WRITE; > + break; > + case CC_WR_TYPE_RDMA_READ: > + entry->opcode = IB_WC_RDMA_READ; > + break; > + case CC_WR_TYPE_BIND_MW: > + entry->opcode = IB_WC_BIND_MW; > + break; > + case CC_WR_TYPE_RECV: > + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); > + entry->opcode = IB_WC_RECV; > + is_recv = 1; > + break; > + default: > + break; > + } > + > + /* consume the WQEs */ > + if (is_recv) > + c2_mq_lconsume(&qp->rq_mq, 1); > + else > + c2_mq_lconsume(&qp->sq_mq, be32_to_cpu(c2_wr_get_wqe_count(ce))+1); > + > + /* free the message */ > + c2_mq_free(&cq->mq); > + > + return 0; > +} > + > +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, > + struct ib_wc *entry) > +{ > + struct c2_dev *c2dev = to_c2dev(ibcq->device); > + struct c2_cq *cq = to_c2cq(ibcq); > + unsigned long flags; > + int npolled, err; > + > + spin_lock_irqsave(&cq->lock, flags); > + > + for (npolled = 0; npolled < num_entries; ++npolled) { > + > + err = c2_poll_one(c2dev, cq, entry + npolled); > + if (err) > + break; > + } > + > + spin_unlock_irqrestore(&cq->lock, flags); > + > + return npolled; > +} > + > +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +{ > + struct c2_mq_shared volatile *shared; > + struct c2_cq *cq; > + > + cq = to_c2cq(ibcq); > + shared = cq->mq.peer; > + > + if (notify == IB_CQ_NEXT_COMP) > + 
shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT; > + else if (notify == IB_CQ_SOLICITED) > + shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT_SE; > + else > + return -EINVAL; > + > + shared->armed = CQ_WAIT_FOR_DMA|CQ_ARMED; > + > + /* > + * Now read back shared->armed to make the PCI > + * write synchronous. This is necessary for > + * correct cq notification semantics. > + */ > + { > + volatile char c; > + c = shared->armed; > + } > + > + return 0; > +} > + > +static void c2_free_cq_buf(struct c2_mq *mq) > +{ > + int npages; > + > + npages = ((mq->q_size * mq->msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; > + free_pages((unsigned long)mq->msg_pool, npages); > +} > + > +static int c2_alloc_cq_buf(struct c2_mq *mq, int q_size, int msg_size) > +{ > + unsigned long pool_start; > + int npages; > + > + npages = ( (q_size * msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; > + > + pool_start = __get_free_pages(GFP_KERNEL, npages); > + if (!pool_start) > + return -ENOMEM; > + > + c2_mq_init(mq, > + 0, /* index (currently unknown) */ > + q_size, > + msg_size, > + (u8 *)pool_start, > + 0, /* peer (currently unknown) */ > + C2_MQ_HOST_TARGET); > + > + return 0; > +} > + > +int c2_init_cq(struct c2_dev *c2dev, int entries, > + struct c2_ucontext *ctx, struct c2_cq *cq) > +{ > + ccwr_cq_create_req_t wr; > + ccwr_cq_create_rep_t* reply; > + unsigned long peer_pa; > + struct c2_vq_req *vq_req; > + int err; > + > + might_sleep(); > + > + cq->ibcq.cqe = entries - 1; > + cq->is_kernel = !ctx; > + > + /* Allocate a shared pointer */ > + cq->mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > + if (!cq->mq.shared) > + return -ENOMEM; > + > + /* Allocate pages for the message pool */ > + err = c2_alloc_cq_buf(&cq->mq, entries+1, C2_CQ_MSG_SIZE); > + if (err) > + goto bail0; > + > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + err = -ENOMEM; > + goto bail1; > + } > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_CQ_CREATE); > + wr.hdr.context = (unsigned 
long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.msg_size = cpu_to_be32(cq->mq.msg_size); > + wr.depth = cpu_to_be32(cq->mq.q_size); > + wr.shared_ht = cpu_to_be64(__pa(cq->mq.shared)); > + wr.msg_pool = cpu_to_be64(__pa(cq->mq.msg_pool)); > + wr.user_context = (u64)(unsigned long)(cq); > + > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail2; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) > + goto bail2; > + > + reply = (ccwr_cq_create_rep_t*)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail2; > + } > + > + if ( (err = c2_errno(reply)) != 0) > + goto bail3; > + > + cq->adapter_handle = reply->cq_handle; > + cq->mq.index = be32_to_cpu(reply->mq_index); > + > + peer_pa = (unsigned long)(c2dev->pa + be32_to_cpu(reply->adapter_shared)); > + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); > + if (!cq->mq.peer) { > + err = -ENOMEM; > + goto bail3; > + } > + > + vq_repbuf_free(c2dev, reply); > + vq_req_free(c2dev, vq_req); > + > + spin_lock_init(&cq->lock); > + atomic_set(&cq->refcount, 1); > + init_waitqueue_head(&cq->wait); > + > + /* > + * Use the MQ index allocated by the adapter to > + * store the CQ in the qptr_array > + */ > + /* XXX qptr_array lock? 
*/ > + cq->cqn = cq->mq.index; > + c2dev->qptr_array[cq->cqn] = cq; > + > + return 0; > + > +bail3: > + vq_repbuf_free(c2dev, reply); > +bail2: > + vq_req_free(c2dev, vq_req); > +bail1: > + c2_free_cq_buf(&cq->mq); > +bail0: > + c2_free_mqsp(cq->mq.shared); > + > + return err; > +} > + > +void c2_free_cq(struct c2_dev *c2dev, > + struct c2_cq *cq) > +{ > + int err; > + struct c2_vq_req *vq_req; > + ccwr_cq_destroy_req_t wr; > + ccwr_cq_destroy_rep_t *reply; > + > + might_sleep(); > + > + atomic_dec(&cq->refcount); > + wait_event(cq->wait, !atomic_read(&cq->refcount)); > + > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + goto bail0; > + } > + > + memset(&wr, 0, sizeof(wr)); > + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); > + wr.hdr.context = (unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.cq_handle = cq->adapter_handle; > + > + vq_req_get(c2dev, vq_req); > + > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail1; > + } > + > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) > + goto bail1; > + > + reply = (ccwr_cq_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); > + > +//bail2: > + vq_repbuf_free(c2dev, reply); > +bail1: > + vq_req_free(c2dev, vq_req); > +bail0: > + if (cq->is_kernel) { > + c2_free_cq_buf(&cq->mq); > + } > + > + return; > +} > + > Index: hw/amso1100/Makefile > =================================================================== > --- hw/amso1100/Makefile (revision 0) > +++ hw/amso1100/Makefile (revision 0) > @@ -0,0 +1,22 @@ > +EXTRA_CFLAGS += -Idrivers/infiniband/include > + > +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG > +EXTRA_CFLAGS += -DC2_DEBUG > +endif > + > +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o > + > +iw_c2-y := \ > + c2.o \ > + c2_provider.o \ > + c2_rnic.o \ > + c2_alloc.o \ > + c2_mq.o \ > + c2_ae.o \ > + c2_vq.o \ > + c2_intr.o \ > + c2_cq.o \ > + c2_qp.o \ > + c2_cm.o \ > + c2_mm.o \ > + c2_pd.o > Index: hw/amso1100/c2_mm.c > 
=================================================================== > --- hw/amso1100/c2_mm.c (revision 0) > +++ hw/amso1100/c2_mm.c (revision 0) > @@ -0,0 +1,376 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#include "c2.h" > +#include "c2_vq.h" > + > +#define PBL_VIRT 1 > +#define PBL_PHYS 2 > + > +/* > + * Send all the PBL messages to convey the remainder of the PBL > + * Wait for the adapter's reply on the last one. > + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. 
> + * > + * NOTE: vq_req is _not_ freed by this function. The VQ Host > + * Reply buffer _is_ freed by this function. > + */ > +static int > +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, > + unsigned long va, u32 pbl_depth, > + struct c2_vq_req *vq_req, int pbl_type) > +{ > + u32 pbe_count; /* amt that fits in a PBL msg */ > + u32 count; /* amt in this PBL MSG. */ > + ccwr_nsmr_pbl_req_t *wr; /* PBL WR ptr */ > + ccwr_nsmr_pbl_rep_t *reply; /* reply ptr */ > + int err, pbl_virt, i; > + > + switch (pbl_type) { > + case PBL_VIRT: > + pbl_virt = 1; > + break; > + case PBL_PHYS: > + pbl_virt = 0; > + break; > + default: > + return -EINVAL; > + break; > + } > + > + pbe_count = (c2dev->req_vq.msg_size - > + sizeof(ccwr_nsmr_pbl_req_t)) / sizeof(u64); > + wr = (ccwr_nsmr_pbl_req_t*)kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); > + if (!wr) { > + return -ENOMEM; > + } > + c2_wr_set_id(wr, CCWR_NSMR_PBL); > + > + /* > + * Only the last PBL message will generate a reply from the verbs, > + * so we set the context to 0 indicating there is no kernel verbs > + * handler blocked awaiting this reply. > + */ > + wr->hdr.context = 0; > + wr->rnic_handle = c2dev->adapter_handle; > + wr->stag_index = stag_index; /* already swapped */ > + wr->flags = 0; > + while (pbl_depth) { > + count = min(pbe_count, pbl_depth); > + wr->addrs_length = cpu_to_be32(count); > + > + /* > + * If this is the last message, then reference the > + * vq request struct cuz we're gonna wait for a reply. > + * also make this PBL msg as the last one. > + */ > + if (count == pbl_depth) { > + /* > + * reference the request struct. dereferenced in the > + * int handler. > + */ > + vq_req_get(c2dev, vq_req); > + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); > + > + /* > + * This is the last PBL message. > + * Set the context to our VQ Request Object so we can > + * wait for the reply. 
> + */ > + wr->hdr.context = (unsigned long)vq_req; > + } > + > + /* > + * if pbl_virt is set then va is a virtual address that describes a > + * virtually contiguous memory allocation. the wr needs the start of > + * each virtual page to be converted to the corresponding physical > + * address of the page. > + * > + * if pbl_virt is not set then va is an array of physical addresses and > + * there is no conversion to do. just fill in the wr with what is in > + * the array. > + */ > + for (i=0; i < count; i++) { > + if (pbl_virt) { > + /* XXX */ //wr->paddrs[i] = cpu_to_be64(user_virt_to_phys(va)); > + va += PAGE_SIZE; > + } else { > + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)((void **)va)[i]); > + } > + } > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > + if (err) { > + if (count <= pbe_count) { > + vq_req_put(c2dev, vq_req); > + } > + goto bail0; > + } > + pbl_depth -= count; > + } > + > + /* > + * Now wait for the reply... > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_nsmr_pbl_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + err = c2_errno(reply); > + > + vq_repbuf_free(c2dev, reply); > +bail0: > + kfree(wr); > + return err; > +} > + > +#define CC_PBL_MAX_DEPTH 131072 > +int > +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, > + int pbl_depth, u32 length, u64 *va, > + cc_acf_t acf, struct c2_mr *mr) > +{ > + struct c2_vq_req *vq_req; > + ccwr_nsmr_register_req_t *wr; > + ccwr_nsmr_register_rep_t *reply; > + u16 flags; > + int i, pbe_count, count; > + int err; > + > + if (!va || !length || !addr_list || !pbl_depth) > + return -EINTR; > + > + /* > + * Verify PBL depth is within rnic max > + */ > + if (pbl_depth > CC_PBL_MAX_DEPTH) { > + return -EINTR; > + } > + > + /* > + * allocate verbs request object > + */ > + vq_req = vq_req_alloc(c2dev); > + 
if (!vq_req) > + return -ENOMEM; > + > + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); > + if (!wr) { > + err = -ENOMEM; > + goto bail0; > + } > + > + /* > + * build the WR > + */ > + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); > + wr->hdr.context = (unsigned long)vq_req; > + wr->rnic_handle = c2dev->adapter_handle; > + > + flags = (acf | MEM_VA_BASED | MEM_REMOTE); > + > + /* > + * compute how many pbes can fit in the message > + */ > + pbe_count = (c2dev->req_vq.msg_size - > + sizeof(ccwr_nsmr_register_req_t)) / > + sizeof(u64); > + > + if (pbl_depth <= pbe_count) { > + flags |= MEM_PBL_COMPLETE; > + } > + wr->flags = cpu_to_be16(flags); > + wr->stag_key = 0; //stag_key; > + wr->va = cpu_to_be64(*va); > + wr->pd_id = mr->pd->pd_id; > + wr->pbe_size = cpu_to_be32(PAGE_SIZE); > + wr->length = cpu_to_be32(length); > + wr->pbl_depth = cpu_to_be32(pbl_depth); > + wr->fbo = cpu_to_be32(0); > + count = min(pbl_depth, pbe_count); > + wr->addrs_length = cpu_to_be32(count); > + > + /* > + * fill out the PBL for this message > + */ > + for (i = 0; i < count; i++) { > + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)addr_list[i]); > + } > + > + /* > + * reference the request struct > + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * send the WR to the adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t *)wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail1; > + } > + > + /* > + * wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail1; > + } > + > + /* > + * process reply > + */ > + reply = (ccwr_nsmr_register_rep_t *)(unsigned long)(vq_req->reply_msg); > + if (!reply) { > + err = -ENOMEM; > + goto bail1; > + } > + if ( (err = c2_errno(reply))) { > + goto bail2; > + } > + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); > + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); > + vq_repbuf_free(c2dev, reply); > + > + /* > + * if there are still more PBEs we need to send them to > + * the adapter 
and wait for a reply on the final one. > + * reuse vq_req for this purpose. > + */ > + pbl_depth -= count; > + if (pbl_depth) { > + > + vq_req->reply_msg = (unsigned long)NULL; > + atomic_set(&vq_req->reply_ready, 0); > + err = send_pbl_messages(c2dev, > + cpu_to_be32(mr->ibmr.lkey), > + (unsigned long)&addr_list[i], > + pbl_depth, vq_req, PBL_PHYS); > + if (err) { > + goto bail1; > + } > + } > + > + vq_req_free(c2dev, vq_req); > + kfree(wr); > + > + return err; > + > +bail2: > + vq_repbuf_free(c2dev, reply); > +bail1: > + kfree(wr); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > +int > +c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) > +{ > + struct c2_vq_req *vq_req; /* verbs request object */ > + ccwr_stag_dealloc_req_t wr; /* work request */ > + ccwr_stag_dealloc_rep_t *reply; /* WR reply */ > + int err; > + > + > + /* > + * allocate verbs request object > + */ > + vq_req = vq_req_alloc(c2dev); > + if (!vq_req) { > + return -ENOMEM; > + } > + > + /* > + * Build the WR > + */ > + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); > + wr.hdr.context = (u64)(unsigned long)vq_req; > + wr.rnic_handle = c2dev->adapter_handle; > + wr.stag_index = cpu_to_be32(stag_index); > + > + /* > + * reference the request struct. dereferenced in the int handler. 
> + */ > + vq_req_get(c2dev, vq_req); > + > + /* > + * Send WR to adapter > + */ > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > + if (err) { > + vq_req_put(c2dev, vq_req); > + goto bail0; > + } > + > + /* > + * Wait for reply from adapter > + */ > + err = vq_wait_for_reply(c2dev, vq_req); > + if (err) { > + goto bail0; > + } > + > + /* > + * Process reply > + */ > + reply = (ccwr_stag_dealloc_rep_t*)(unsigned long)vq_req->reply_msg; > + if (!reply) { > + err = -ENOMEM; > + goto bail0; > + } > + > + err = c2_errno(reply); > + > + vq_repbuf_free(c2dev, reply); > +bail0: > + vq_req_free(c2dev, vq_req); > + return err; > +} > + > + > Index: hw/amso1100/cc_status.h > =================================================================== > --- hw/amso1100/cc_status.h (revision 0) > +++ hw/amso1100/cc_status.h (revision 0) > @@ -0,0 +1,163 @@ > +/* > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#ifndef _CC_STATUS_H_ > +#define _CC_STATUS_H_ > + > +/* > + * Verbs Status Codes > + */ > +typedef enum { > + CC_OK = 0, /* This must be zero */ > + CCERR_INSUFFICIENT_RESOURCES = 1, > + CCERR_INVALID_MODIFIER = 2, > + CCERR_INVALID_MODE = 3, > + CCERR_IN_USE = 4, > + CCERR_INVALID_RNIC = 5, > + CCERR_INTERRUPTED_OPERATION = 6, > + CCERR_INVALID_EH = 7, > + CCERR_INVALID_CQ = 8, > + CCERR_CQ_EMPTY = 9, > + CCERR_NOT_IMPLEMENTED = 10, > + CCERR_CQ_DEPTH_TOO_SMALL = 11, > + CCERR_PD_IN_USE = 12, > + CCERR_INVALID_PD = 13, > + CCERR_INVALID_SRQ = 14, > + CCERR_INVALID_ADDRESS = 15, > + CCERR_INVALID_NETMASK = 16, > + CCERR_INVALID_QP = 17, > + CCERR_INVALID_QP_STATE = 18, > + CCERR_TOO_MANY_WRS_POSTED = 19, > + CCERR_INVALID_WR_TYPE = 20, > + CCERR_INVALID_SGL_LENGTH = 21, > + CCERR_INVALID_SQ_DEPTH = 22, > + CCERR_INVALID_RQ_DEPTH = 23, > + CCERR_INVALID_ORD = 24, > + CCERR_INVALID_IRD = 25, > + CCERR_QP_ATTR_CANNOT_CHANGE = 26, > + CCERR_INVALID_STAG = 27, > + CCERR_QP_IN_USE = 28, > + CCERR_OUTSTANDING_WRS = 29, > + CCERR_STAG_IN_USE = 30, > + CCERR_INVALID_STAG_INDEX = 31, > + CCERR_INVALID_SGL_FORMAT = 32, > + CCERR_ADAPTER_TIMEOUT = 33, > + CCERR_INVALID_CQ_DEPTH = 34, > + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, > + CCERR_INVALID_EP = 36, > + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, > + CCERR_FLUSHED = 38, > + CCERR_INVALID_WQE = 39, > + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, > + CCERR_REMOTE_TERMINATION_ERROR = 41, > + 
CCERR_BASE_AND_BOUNDS_VIOLATION = 42, > + CCERR_ACCESS_VIOLATION = 43, > + CCERR_INVALID_PD_ID = 44, > + CCERR_WRAP_ERROR = 45, > + CCERR_INV_STAG_ACCESS_ERROR = 46, > + CCERR_ZERO_RDMA_READ_RESOURCES = 47, > + CCERR_QP_NOT_PRIVILEGED = 48, > + CCERR_STAG_STATE_NOT_INVALID = 49, > + CCERR_INVALID_PAGE_SIZE = 50, > + CCERR_INVALID_BUFFER_SIZE = 51, > + CCERR_INVALID_PBE = 52, > + CCERR_INVALID_FBO = 53, > + CCERR_INVALID_LENGTH = 54, > + CCERR_INVALID_ACCESS_RIGHTS = 55, > + CCERR_PBL_TOO_BIG = 56, > + CCERR_INVALID_VA = 57, > + CCERR_INVALID_REGION = 58, > + CCERR_INVALID_WINDOW = 59, > + CCERR_TOTAL_LENGTH_TOO_BIG = 60, > + CCERR_INVALID_QP_ID = 61, > + CCERR_ADDR_IN_USE = 62, > + CCERR_ADDR_NOT_AVAIL = 63, > + CCERR_NET_DOWN = 64, > + CCERR_NET_UNREACHABLE = 65, > + CCERR_CONN_ABORTED = 66, > + CCERR_CONN_RESET = 67, > + CCERR_NO_BUFS = 68, > + CCERR_CONN_TIMEDOUT = 69, > + CCERR_CONN_REFUSED = 70, > + CCERR_HOST_UNREACHABLE = 71, > + CCERR_INVALID_SEND_SGL_DEPTH = 72, > + CCERR_INVALID_RECV_SGL_DEPTH = 73, > + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, > + CCERR_INSUFFICIENT_PRIVILEGES = 75, > + CCERR_STACK_ERROR = 76, > + CCERR_INVALID_VERSION = 77, > + CCERR_INVALID_MTU = 78, > + CCERR_INVALID_IMAGE = 79, > + CCERR_PENDING = 98, /* not an error; used internally by adapter */ > + CCERR_DEFER = 99, /* not an error; used internally by adapter */ > + CCERR_FAILED_WRITE = 100, > + CCERR_FAILED_ERASE = 101, > + CCERR_FAILED_VERIFICATION = 102, > + CCERR_NOT_FOUND = 103, > + > +} cc_status_t; > + > +/* > + * Verbs and Completion Status Code types... > + */ > +typedef cc_status_t cc_verbs_status_t; > +typedef cc_status_t cc_wc_status_t; > + > +/* > + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. 
> + */ > +typedef enum { > + CC_CONN_STATUS_SUCCESS = CC_OK, > + CC_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, > + CC_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, > + CC_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, > + CC_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, > + CC_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, > + CC_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, > + CC_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, > + CC_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, > + CC_CONN_STATUS_REJECTED = CCERR_CONN_RESET, > +} cc_connect_status_t; > + > +/* > + * Flash programming status codes. > + */ > +typedef enum { > + CC_FLASH_STATUS_SUCCESS = 0x0000, > + CC_FLASH_STATUS_VERIFY_ERR = 0x0002, > + CC_FLASH_STATUS_IMAGE_ERR = 0x0004, > + CC_FLASH_STATUS_ECLBS = 0x0400, > + CC_FLASH_STATUS_PSLBS = 0x0800, > + CC_FLASH_STATUS_VPENS = 0x1000, > +} cc_flash_status_t; > + > +#endif /* _CC_STATUS_H_ */ > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From cap at nsc.liu.se Tue Jan 24 04:25:14 2006 From: cap at nsc.liu.se (Peter Kjellström) Date: Tue, 24 Jan 2006 13:25:14 +0100 Subject: [openib-general] Advice In-Reply-To: <43D076BA.8010109@ucla.edu> References: <1137734191.4338.8953.camel@hal.voltaire.com> <43D076BA.8010109@ucla.edu> Message-ID: <200601241325.22785.cap@nsc.liu.se> On Friday 20 January 2006 06:35, Scott A. Friedman wrote: > >>> Not sure what IBGD is. The OpenIB svn code matches with that shipping > >>> in the current kernels. > >> > >> Sorry, I mean the Mellanox Gold distribution thing. Using this is kinda > >> a problem for me since I need to use a recent kernel - to support some > >> non Infiniband stuff. > > > > You don't necessarily need to use IBGD. There is overlap. It depends on > > what you need/use out of this.
Mellanox is working on an IB Gold 2 based > > on OpenIB too. > > I will keep an eye out for it. However, they tend to only support > particular kernel versions which doesn't work for us. Huh? I've used IBGD from 0.5.0 to the now-current 1.8.1 on whatever kernel I had spinning, and it has almost never failed and worked automagically (./install...). That said, the openib installation is lighter and works fine for us too (IP, ScaMPI, OpenMPI, mvapich). Just my experience, Peter > ... -- ------------------------------------------------------------ Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se From tom at opengridcomputing.com Tue Jan 24 05:11:38 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 24 Jan 2006 07:11:38 -0600 Subject: [openib-general] [PATCH] RFC: AMSO1100 iWARP Driver In-Reply-To: References: Message-ID: <1138108299.675.15.camel@strider.opengridcomputing.com> Thanks for the review. Good catch on the free(pd). On Tue, 2006-01-24 at 16:12 +0530, Krishna Kumar2 wrote: > Hi Tom, > > - c2_create_qp() should kfree(qp) on error and not pd. > > Some very (very) MINOR nits : > > - c2_pd_alloc() should be called c2_pd_id_alloc() ? And why is > might_sleep() required for this and > c2_pd_free() ? Shouldn't that be in c2_alloc_pd() before the kmalloc() ? > > - netevent_notifier : why is it using KERN_ERR and not KERN_INFO ?
> > - c2_mq_init() does a return at the end of routine, can be removed : > > + return; > > - Remove typecasts of void *, eg : > > + reply_vq = (struct c2_mq *)c2dev->qptr_array[mq_index]; > > - Change (for consistency and to be clear) : > + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, > GFP_KERNEL); > to > + rx_ring->start = kmalloc(sizeof(*rx_ring->start) * rx_ring->count, > GFP_KERNEL); > > - In c2_tx_clean, you can do : > > + if (netif_queue_stopped(c2_port->netdev) && c2_port->tx_avail > > MAX_SKB_FRAGS + 1) > + netif_wake_queue(c2_port->netdev); > > - Lots of > + if (err) { > + break; > + } > > (braces for one line, not a big deal but can remove) > > - c2_init_qp_table() can be written : > + if (err) > + c2_alloc_cleanup(&c2dev->qp_table.alloc); > + return err; > > removing some redundant returns. > > Thanks, > > - KK > > > openib-general-bounces at openib.org wrote on 01/24/2006 10:45:52 AM: > > > > > > > Given some of the discussion re: support for the AMSO1100, enclosed is a > > patch for an OpenIB provider in support of the AMSO1100. While we use > > these devices extensively for testing of iWARP support at OGC, the > > driver has not seen anywhere near the kind of attention that the mthca > > driver has. > > > > This patch requires the previously submitted iWARP core support and CMA > > patch. > > > > Please review and offer suggestions as to what we can do to improve it. > > There are some known issues with ULP that do not filter based on node > > type and can become confused and crash when loading and unloading this > > driver. > > > > Patches are available for these ULP add_one and remove_one handlers, but > > these are trivial and can be considered separately. 
> > > > Index: Kconfig > > =================================================================== > > --- Kconfig (revision 5098) > > +++ Kconfig (working copy) > > @@ -32,6 +32,8 @@ > > > > source "drivers/infiniband/hw/mthca/Kconfig" > > > > +source "drivers/infiniband/hw/amso1100/Kconfig" > > + > > source "drivers/infiniband/hw/ehca/Kconfig" > > > > source "drivers/infiniband/ulp/ipoib/Kconfig" > > Index: Makefile > > =================================================================== > > --- Makefile (revision 5098) > > +++ Makefile (working copy) > > @@ -1,6 +1,7 @@ > > obj-$(CONFIG_INFINIBAND) += core/ > > obj-$(CONFIG_IPATH_CORE) += hw/ipath/ > > obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ > > +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ > > obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ > > obj-$(CONFIG_INFINIBAND_SDP) += ulp/sdp/ > > obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ > > Index: hw/amso1100/cc_ae.h > > =================================================================== > > --- hw/amso1100/cc_ae.h (revision 0) > > +++ hw/amso1100/cc_ae.h (revision 0) > > @@ -0,0 +1,108 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. 
> > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _CC_AE_H_ > > +#define _CC_AE_H_ > > + > > +/* > > + * WARNING: If you change this file, also bump CC_IVN_BASE > > + * in common/include/clustercore/cc_ivn.h. > > + */ > > + > > +/* > > + * Asynchronous Event Identifiers > > + * > > + * These start at 0x80 only so it's obvious from inspection that > > + * they are not work-request statuses. This isn't critical. > > + * > > + * NOTE: these event id's must fit in eight bits. 
> > + */ > > +typedef enum { > > + CCAE_REMOTE_SHUTDOWN = 0x80, > > + CCAE_ACTIVE_CONNECT_RESULTS, > > + CCAE_CONNECTION_REQUEST, > > + CCAE_LLP_CLOSE_COMPLETE, > > + CCAE_TERMINATE_MESSAGE_RECEIVED, > > + CCAE_LLP_CONNECTION_RESET, > > + CCAE_LLP_CONNECTION_LOST, > > + CCAE_LLP_SEGMENT_SIZE_INVALID, > > + CCAE_LLP_INVALID_CRC, > > + CCAE_LLP_BAD_FPDU, > > + CCAE_INVALID_DDP_VERSION, > > + CCAE_INVALID_RDMA_VERSION, > > + CCAE_UNEXPECTED_OPCODE, > > + CCAE_INVALID_DDP_QUEUE_NUMBER, > > + CCAE_RDMA_READ_NOT_ENABLED, > > + CCAE_RDMA_WRITE_NOT_ENABLED, > > + CCAE_RDMA_READ_TOO_SMALL, > > + CCAE_NO_L_BIT, > > + CCAE_TAGGED_INVALID_STAG, > > + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, > > + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, > > + CCAE_TAGGED_INVALID_PD, > > + CCAE_WRAP_ERROR, > > + CCAE_BAD_CLOSE, > > + CCAE_BAD_LLP_CLOSE, > > + CCAE_INVALID_MSN_RANGE, > > + CCAE_INVALID_MSN_GAP, > > + CCAE_IRRQ_OVERFLOW, > > + CCAE_IRRQ_MSN_GAP, > > + CCAE_IRRQ_MSN_RANGE, > > + CCAE_IRRQ_INVALID_STAG, > > + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, > > + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, > > + CCAE_IRRQ_INVALID_PD, > > + CCAE_IRRQ_WRAP_ERROR, > > + CCAE_CQ_SQ_COMPLETION_OVERFLOW, > > + CCAE_CQ_RQ_COMPLETION_ERROR, > > + CCAE_QP_SRQ_WQE_ERROR, > > + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, > > + CCAE_CQ_OVERFLOW, > > + CCAE_CQ_OPERATION_ERROR, > > + CCAE_SRQ_LIMIT_REACHED, > > + CCAE_QP_RQ_LIMIT_REACHED, > > + CCAE_SRQ_CATASTROPHIC_ERROR, > > + CCAE_RNIC_CATASTROPHIC_ERROR > > + /* WARNING If you add more id's, make sure their values fit in > eight bits. 
*/ > > +} cc_event_id_t; > > + > > +/* > > + * Resource Indicators and Identifiers > > + */ > > +typedef enum { > > + CC_RES_IND_QP = 1, > > + CC_RES_IND_EP, > > + CC_RES_IND_CQ, > > + CC_RES_IND_SRQ, > > +} cc_resource_indicator_t; > > + > > +#endif /* _CC_AE_H_ */ > > Index: hw/amso1100/Kconfig > > =================================================================== > > --- hw/amso1100/Kconfig (revision 0) > > +++ hw/amso1100/Kconfig (revision 0) > > @@ -0,0 +1,15 @@ > > +config INFINIBAND_AMSO1100 > > + tristate "Ammasso 1100 HCA support" > > + depends on PCI && INFINIBAND > > + ---help--- > > + This is a low-level driver for the Ammasso 1100 host > > + channel adapter (HCA). > > + > > +config INFINIBAND_AMSO1100_DEBUG > > + bool "Verbose debugging output" > > + depends on INFINIBAND_AMSO1100 > > + default n > > + ---help--- > > + This option causes the amso1100 driver to produce a bunch of > > + debug messages. Select this if you are developing the driver > > + or trying to diagnose a problem. > > Index: hw/amso1100/c2_intr.c > > =================================================================== > > --- hw/amso1100/c2_intr.c (revision 0) > > +++ hw/amso1100/c2_intr.c (revision 0) > > @@ -0,0 +1,177 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > +#include "c2.h" > > +#include "c2_vq.h" > > + > > +static void handle_mq(struct c2_dev *c2dev, u32 index); > > +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); > > + > > +/* > > + * Handle RNIC interrupts > > + */ > > +void > > +c2_rnic_interrupt(struct c2_dev *c2dev) > > +{ > > + unsigned int mq_index; > > + > > + while (c2dev->hints_read != be16_to_cpu(c2dev->hint_count)) { > > + mq_index = c2_read32(c2dev->regs + PCI_BAR0_HOST_HINT); > > + if (mq_index & 0x80000000) { > > + break; > > + } > > + > > + c2dev->hints_read++; > > + handle_mq(c2dev, mq_index); > > + } > > + > > +} > > + > > +/* > > + * Top level MQ handler > > + */ > > +static void > > +handle_mq(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + if (c2dev->qptr_array[mq_index] == NULL) { > > + dprintk(KERN_INFO "handle_mq: stray activity for mq_index=%d\n", > mq_index); > > + return; > > + } > > + > > + switch (mq_index) { > > + case (0): > > + /* > > + * An index of 0 in the activity queue > > + * indicates the req vq now has messages > > + * available... > > + * > > + * Wake up any waiters waiting on req VQ > > + * message availability. > > + */ > > + wake_up(&c2dev->req_vq_wo); > > + break; > > + case (1): > > + handle_vq(c2dev, mq_index); > > + break; > > + case (2): > > + spin_lock(&c2dev->aeq_lock); > > + c2_ae_event(c2dev, mq_index); > > + spin_unlock(&c2dev->aeq_lock); > > + break; > > + default: > > + c2_cq_event(c2dev, mq_index); > > + break; > > + } > > + > > + return; > > +} > > + > > +/* > > + * Handles verbs WR replies. > > + */ > > +static void > > +handle_vq(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + void *adapter_msg, *reply_msg; > > + ccwr_hdr_t *host_msg; > > + ccwr_hdr_t tmp; > > + struct c2_mq *reply_vq; > > + struct c2_vq_req* req; > > + > > + reply_vq = (struct c2_mq *)c2dev->qptr_array[mq_index]; > > + > > + { > > + > > + /* > > + * get next msg from mq_index into adapter_msg. > > + * don't free it yet. 
> > + */ > > + adapter_msg = c2_mq_consume(reply_vq); > > + dprintk("handle_vq: adapter_msg=%p\n", adapter_msg); > > + if (adapter_msg == NULL) { > > + return; > > + } > > + > > + host_msg = vq_repbuf_alloc(c2dev); > > + > > + /* > > + * If we can't get a host buffer, then we'll still > > + * wakeup the waiter, we just won't give him the msg. > > + * It is assumed the waiter will deal with this... > > + */ > > + if (!host_msg) { > > + dprintk("handle_vq: no repbufs!\n"); > > + > > + /* > > + * just copy the WR header into a local variable. > > + * this allows us to still demux on the context > > + */ > > + host_msg = &tmp; > > + memcpy(host_msg, adapter_msg, sizeof(tmp)); > > + reply_msg = NULL; > > + } else { > > + memcpy(host_msg, adapter_msg, reply_vq->msg_size); > > + reply_msg = host_msg; > > + } > > + > > + /* > > + * consume the msg from the MQ > > + */ > > + c2_mq_free(reply_vq); > > + > > + /* > > + * wakeup the waiter. > > + */ > > + req = (struct c2_vq_req *)(unsigned long)host_msg->context; > > + if (req == NULL) { > > + /* > > + * We should never get here, as the adapter should > > + * never send us a reply that we're not expecting. > > + */ > > + vq_repbuf_free(c2dev, host_msg); > > + dprintk("handle_vq: UNEXPECTEDLY got NULL req\n"); > > + return; > > + } > > + req->reply_msg = (u64)(unsigned long)(reply_msg); > > + atomic_set(&req->reply_ready, 1); > > + dprintk("handle_vq: wakeup req %p\n", req); > > + wake_up(&req->wait_object); > > + > > + /* > > + * If the request was cancelled, then this put will > > + * free the vq_req memory...and reply_msg!!! > > + */ > > + vq_req_put(c2dev, req); > > + } > > + > > +} > > + > > Index: hw/amso1100/c2_mq.c > > =================================================================== > > --- hw/amso1100/c2_mq.c (revision 0) > > +++ hw/amso1100/c2_mq.c (revision 0) > > @@ -0,0 +1,182 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. 
All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > +#include "c2.h" > > +#include "c2_mq.h" > > + > > +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size > > +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % > (q)->q_size) > > + > > +void * > > +c2_mq_alloc(struct c2_mq *q) > > +{ > > + assert(q); > > + assert(q->magic == C2_MQ_MAGIC); > > + assert(q->type == C2_MQ_ADAPTER_TARGET); > > + > > + if (c2_mq_full(q)) { > > + return NULL; > > + } else { > > +#ifdef C2_DEBUG > > + ccwr_hdr_t *m = (ccwr_hdr_t*)(q->msg_pool + q->priv * > q->msg_size); > > +#ifdef CCMSGMAGIC > > + assert(m->magic == be32_to_cpu(~CCWR_MAGIC)); > > + m->magic = cpu_to_be32(CCWR_MAGIC); > > +#endif > > + dprintk("c2_mq_alloc %p\n", m); > > + return m; > > +#else > > + return q->msg_pool + q->priv * q->msg_size; > > +#endif > > + } > > +} > > + > > +void > > +c2_mq_produce(struct c2_mq *q) > > +{ > > + assert(q); > > + assert(q->magic == C2_MQ_MAGIC); > > + assert(q->type == C2_MQ_ADAPTER_TARGET); > > + > > + if (!c2_mq_full(q)) { > > + BUMP(q, q->priv); > > + q->hint_count++; > > + /* Update peer's offset. 
*/ > > + q->peer->shared = cpu_to_be16(q->priv); > > + } > > +} > > + > > +void * > > +c2_mq_consume(struct c2_mq *q) > > +{ > > + assert(q); > > + assert(q->magic == C2_MQ_MAGIC); > > + assert(q->type == C2_MQ_HOST_TARGET); > > + > > + if (c2_mq_empty(q)) { > > + return NULL; > > + } else { > > +#ifdef C2_DEBUG > > + ccwr_hdr_t *m = (ccwr_hdr_t*) > > + (q->msg_pool + q->priv * q->msg_size); > > +#ifdef CCMSGMAGIC > > + assert(m->magic == be32_to_cpu(CCWR_MAGIC)); > > +#endif > > + dprintk("c2_mq_consume %p\n", m); > > + return m; > > +#else > > + return q->msg_pool + q->priv * q->msg_size; > > +#endif > > + } > > +} > > + > > +void > > +c2_mq_free(struct c2_mq *q) > > +{ > > + assert(q); > > + assert(q->magic == C2_MQ_MAGIC); > > + assert(q->type == C2_MQ_HOST_TARGET); > > + > > + if (!c2_mq_empty(q)) { > > +#ifdef C2_DEBUG > > +{ > > + dprintk("c2_mq_free %p\n", (ccwr_hdr_t*)(q->msg_pool + q->priv * > q->msg_size)); > > +} > > +#endif > > + > > +#ifdef CCMSGMAGIC > > +{ > > + ccwr_hdr_t *m = (ccwr_hdr_t*) > > + (q->msg_pool + q->priv * q->msg_size); > > + m->magic = cpu_to_be32(~CCWR_MAGIC); > > +} > > +#endif > > + BUMP(q, q->priv); > > + /* Update peer's offset. 
*/ > > + q->peer->shared = cpu_to_be16(q->priv); > > + } > > +} > > + > > + > > +void > > +c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) > > +{ > > + assert(q); > > + assert(q->magic == C2_MQ_MAGIC); > > + assert(q->type == C2_MQ_ADAPTER_TARGET); > > + > > + while (wqe_count--) { > > + assert(!c2_mq_empty(q)); > > + BUMP_SHARED(q, *q->shared); > > + } > > +} > > + > > + > > +u32 > > +c2_mq_count(struct c2_mq *q) > > +{ > > + s32 count; > > + > > + assert(q); > > + if (q->type == C2_MQ_HOST_TARGET) { > > + count = be16_to_cpu(*q->shared) - q->priv; > > + } else { > > + count = q->priv - be16_to_cpu(*q->shared); > > + } > > + > > + if (count < 0) { > > + count += q->q_size; > > + } > > + > > + return (u32)count; > > +} > > + > > +void > > +c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, > > + u32 msg_size, u8 *pool_start, u16 *peer, > > + u32 type) > > +{ > > + assert(q->shared); > > + > > + /* This code assumes the byte swapping has already been done! */ > > + q->index = index; > > + q->q_size = q_size; > > + q->msg_size = msg_size; > > + q->msg_pool = pool_start; > > + q->peer = (struct c2_mq_shared *)peer; > > + q->magic = C2_MQ_MAGIC; > > + q->type = type; > > + q->priv = 0; > > + q->hint_count = 0; > > + return; > > +} > > + > > Index: hw/amso1100/cc_wr.h > > =================================================================== > > --- hw/amso1100/cc_wr.h (revision 0) > > +++ hw/amso1100/cc_wr.h (revision 0) > > @@ -0,0 +1,1340 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _CC_WR_H_ > > +#define _CC_WR_H_ > > +#include "cc_types.h" > > +/* > > + * WARNING: If you change this file, also bump CC_IVN_BASE > > + * in common/include/clustercore/cc_ivn.h. > > + */ > > + > > +#ifdef CCDEBUG > > +#define CCWR_MAGIC 0xb07700b0 > > +#endif > > + > > +#define WR_BUILD_STR_LEN 64 > > + > > +#ifdef _MSC_VER > > +#define PACKED > > +#pragma pack(push) > > +#pragma pack(1) > > +#define __inline__ __inline > > +#else > > +#define PACKED __attribute__ ((packed)) > > +#endif > > + > > +/* > > + * WARNING: All of these structs need to align any 64bit types on > > + * 64 bit boundaries! 
64bit types include u64 and u64. > > + */ > > + > > +/* > > + * Clustercore Work Request Header. Be sensitive to field layout > > + * and alignment. > > + */ > > +typedef struct { > > + /* wqe_count is part of the cqe. It is put here so the > > + * adapter can write to it while the wr is pending without > > + * clobbering part of the wr. This word need not be dma'd > > + * from the host to adapter by libccil, but we copy it anyway > > + * to make the memcpy to the adapter better aligned. > > + */ > > + u32 wqe_count; > > + > > + /* Put these fields next so that later 32- and 64-bit > > + * quantities are naturally aligned. > > + */ > > + u8 id; > > + u8 result; /* adapter -> host */ > > + u8 sge_count; /* host -> adapter */ > > + u8 flags; /* host -> adapter */ > > + > > + u64 context; > > +#ifdef CCMSGMAGIC > > + u32 magic; > > + u32 pad; > > +#endif > > +} PACKED ccwr_hdr_t; > > + > > +/* > > + *------------------------ RNIC ------------------------ > > + */ > > + > > +/* > > + * WR_RNIC_OPEN > > + */ > > + > > +/* > > + * Flags for the RNIC WRs > > + */ > > +typedef enum { > > + RNIC_IRD_STATIC = 0x0001, > > + RNIC_ORD_STATIC = 0x0002, > > + RNIC_QP_STATIC = 0x0004, > > + RNIC_SRQ_SUPPORTED = 0x0008, > > + RNIC_PBL_BLOCK_MODE = 0x0010, > > + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, > > + RNIC_CQ_OVF_DETECTED = 0x0040, > > + RNIC_PRIV_MODE = 0x0080 > > +} PACKED cc_rnic_flags_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 user_context; > > + u16 flags; /* See cc_rnic_flags_t */ > > + u16 port_num; > > +} PACKED ccwr_rnic_open_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > +} PACKED ccwr_rnic_open_rep_t; > > + > > +typedef union { > > + ccwr_rnic_open_req_t req; > > + ccwr_rnic_open_rep_t rep; > > +} PACKED ccwr_rnic_open_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > +} PACKED ccwr_rnic_query_req_t; > > + > > +/* > > + * WR_RNIC_QUERY > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; 
> > + u64 user_context; > > + u32 vendor_id; > > + u32 part_number; > > + u32 hw_version; > > + u32 fw_ver_major; > > + u32 fw_ver_minor; > > + u32 fw_ver_patch; > > + char fw_ver_build_str[WR_BUILD_STR_LEN]; > > + u32 max_qps; > > + u32 max_qp_depth; > > + u32 max_srq_depth; > > + u32 max_send_sgl_depth; > > + u32 max_rdma_sgl_depth; > > + u32 max_cqs; > > + u32 max_cq_depth; > > + u32 max_cq_event_handlers; > > + u32 max_mrs; > > + u32 max_pbl_depth; > > + u32 max_pds; > > + u32 max_global_ird; > > + u32 max_global_ord; > > + u32 max_qp_ird; > > + u32 max_qp_ord; > > + u32 flags; /* See cc_rnic_flags_t */ > > + u32 max_mws; > > + u32 pbe_range_low; > > + u32 pbe_range_high; > > + u32 max_srqs; > > + u32 page_size; > > +} PACKED ccwr_rnic_query_rep_t; > > + > > +typedef union { > > + ccwr_rnic_query_req_t req; > > + ccwr_rnic_query_rep_t rep; > > +} PACKED ccwr_rnic_query_t; > > + > > +/* > > + * WR_RNIC_GETCONFIG > > + */ > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 option; /* see cc_getconfig_cmd_t */ > > + u64 reply_buf; > > + u32 reply_buf_len; > > +} PACKED ccwr_rnic_getconfig_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 option; /* see cc_getconfig_cmd_t */ > > + u32 count_len; /* length of the number of addresses configured */ > > +} PACKED ccwr_rnic_getconfig_rep_t; > > + > > +typedef union { > > + ccwr_rnic_getconfig_req_t req; > > + ccwr_rnic_getconfig_rep_t rep; > > +} PACKED ccwr_rnic_getconfig_t; > > + > > +/* > > + * WR_RNIC_SETCONFIG > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 option; /* See cc_setconfig_cmd_t */ > > + /* variable data and pad See cc_netaddr_t and > > + * cc_route_t > > + */ > > + u8 data[0]; > > +} PACKED ccwr_rnic_setconfig_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_rnic_setconfig_rep_t; > > + > > +typedef union { > > + ccwr_rnic_setconfig_req_t req; > > + ccwr_rnic_setconfig_rep_t rep; > > +} 
PACKED ccwr_rnic_setconfig_t; > > + > > +/* > > + * WR_RNIC_CLOSE > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > +} PACKED ccwr_rnic_close_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_rnic_close_rep_t; > > + > > +typedef union { > > + ccwr_rnic_close_req_t req; > > + ccwr_rnic_close_rep_t rep; > > +} PACKED ccwr_rnic_close_t; > > + > > +/* > > + *------------------------ CQ ------------------------ > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 shared_ht; > > + u64 user_context; > > + u64 msg_pool; > > + u32 rnic_handle; > > + u32 msg_size; > > + u32 depth; > > +} PACKED ccwr_cq_create_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 mq_index; > > + u32 adapter_shared; > > + u32 cq_handle; > > +} PACKED ccwr_cq_create_rep_t; > > + > > +typedef union { > > + ccwr_cq_create_req_t req; > > + ccwr_cq_create_rep_t rep; > > +} PACKED ccwr_cq_create_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 cq_handle; > > + u32 new_depth; > > + u64 new_msg_pool; > > +} PACKED ccwr_cq_modify_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_cq_modify_rep_t; > > + > > +typedef union { > > + ccwr_cq_modify_req_t req; > > + ccwr_cq_modify_rep_t rep; > > +} PACKED ccwr_cq_modify_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 cq_handle; > > +} PACKED ccwr_cq_destroy_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_cq_destroy_rep_t; > > + > > +typedef union { > > + ccwr_cq_destroy_req_t req; > > + ccwr_cq_destroy_rep_t rep; > > +} PACKED ccwr_cq_destroy_t; > > + > > +/* > > + *------------------------ PD ------------------------ > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 pd_id; > > +} PACKED ccwr_pd_alloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_pd_alloc_rep_t; > > + > > +typedef union { > 
> + ccwr_pd_alloc_req_t req; > > + ccwr_pd_alloc_rep_t rep; > > +} PACKED ccwr_pd_alloc_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 pd_id; > > +} PACKED ccwr_pd_dealloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_pd_dealloc_rep_t; > > + > > +typedef union { > > + ccwr_pd_dealloc_req_t req; > > + ccwr_pd_dealloc_rep_t rep; > > +} PACKED ccwr_pd_dealloc_t; > > + > > +/* > > + *------------------------ SRQ ------------------------ > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 shared_ht; > > + u64 user_context; > > + u32 rnic_handle; > > + u32 srq_depth; > > + u32 srq_limit; > > + u32 sgl_depth; > > + u32 pd_id; > > +} PACKED ccwr_srq_create_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 srq_depth; > > + u32 sgl_depth; > > + u32 msg_size; > > + u32 mq_index; > > + u32 mq_start; > > + u32 srq_handle; > > +} PACKED ccwr_srq_create_rep_t; > > + > > +typedef union { > > + ccwr_srq_create_req_t req; > > + ccwr_srq_create_rep_t rep; > > +} PACKED ccwr_srq_create_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 srq_handle; > > +} PACKED ccwr_srq_destroy_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_srq_destroy_rep_t; > > + > > +typedef union { > > + ccwr_srq_destroy_req_t req; > > + ccwr_srq_destroy_rep_t rep; > > +} PACKED ccwr_srq_destroy_t; > > + > > +/* > > + *------------------------ QP ------------------------ > > + */ > > +typedef enum { > > + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ > > + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ > > + QP_MW_BIND = 0x00000004, /* MWs enabled */ > > + QP_ZERO_STAG = 0x00000008, /* enabled? */ > > + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated > */ > > + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ > > + /* enabled? 
*/ > > +} PACKED ccwr_qp_flags_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 shared_sq_ht; > > + u64 shared_rq_ht; > > + u64 user_context; > > + u32 rnic_handle; > > + u32 sq_cq_handle; > > + u32 rq_cq_handle; > > + u32 sq_depth; > > + u32 rq_depth; > > + u32 srq_handle; > > + u32 srq_limit; > > + u32 flags; /* see ccwr_qp_flags_t */ > > + u32 send_sgl_depth; > > + u32 recv_sgl_depth; > > + u32 rdma_write_sgl_depth; > > + u32 ord; > > + u32 ird; > > + u32 pd_id; > > +} PACKED ccwr_qp_create_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 sq_depth; > > + u32 rq_depth; > > + u32 send_sgl_depth; > > + u32 recv_sgl_depth; > > + u32 rdma_write_sgl_depth; > > + u32 ord; > > + u32 ird; > > + u32 sq_msg_size; > > + u32 sq_mq_index; > > + u32 sq_mq_start; > > + u32 rq_msg_size; > > + u32 rq_mq_index; > > + u32 rq_mq_start; > > + u32 qp_handle; > > +} PACKED ccwr_qp_create_rep_t; > > + > > +typedef union { > > + ccwr_qp_create_req_t req; > > + ccwr_qp_create_rep_t rep; > > +} PACKED ccwr_qp_create_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 qp_handle; > > +} PACKED ccwr_qp_query_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 user_context; > > + u32 rnic_handle; > > + u32 sq_depth; > > + u32 rq_depth; > > + u32 send_sgl_depth; > > + u32 rdma_write_sgl_depth; > > + u32 recv_sgl_depth; > > + u32 ord; > > + u32 ird; > > + u16 qp_state; > > + u16 flags; /* see ccwr_qp_flags_t */ > > + u32 qp_id; > > + u32 local_addr; > > + u32 remote_addr; > > + u16 local_port; > > + u16 remote_port; > > + u32 terminate_msg_length; /* 0 if not present */ > > + u8 data[0]; > > + /* Terminate Message in-line here. 
*/ > > +} PACKED ccwr_qp_query_rep_t; > > + > > +typedef union { > > + ccwr_qp_query_req_t req; > > + ccwr_qp_query_rep_t rep; > > +} PACKED ccwr_qp_query_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 stream_msg; > > + u32 stream_msg_length; > > + u32 rnic_handle; > > + u32 qp_handle; > > + u32 next_qp_state; > > + u32 ord; > > + u32 ird; > > + u32 sq_depth; > > + u32 rq_depth; > > + u32 llp_ep_handle; > > +} PACKED ccwr_qp_modify_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 ord; > > + u32 ird; > > + u32 sq_depth; > > + u32 rq_depth; > > + u32 sq_msg_size; > > + u32 sq_mq_index; > > + u32 sq_mq_start; > > + u32 rq_msg_size; > > + u32 rq_mq_index; > > + u32 rq_mq_start; > > +} PACKED ccwr_qp_modify_rep_t; > > + > > +typedef union { > > + ccwr_qp_modify_req_t req; > > + ccwr_qp_modify_rep_t rep; > > +} PACKED ccwr_qp_modify_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 qp_handle; > > +} PACKED ccwr_qp_destroy_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_qp_destroy_rep_t; > > + > > +typedef union { > > + ccwr_qp_destroy_req_t req; > > + ccwr_qp_destroy_rep_t rep; > > +} PACKED ccwr_qp_destroy_t; > > + > > +/* > > + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It > can > > + * only be posted when a QP is in IDLE state. After the connect > request is > > + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING > state. > > + * No synchronous reply from adapter to this WR. The results of > > + * connection are passed back in an async event > CCAE_ACTIVE_CONNECT_RESULTS > > + * See ccwr_ae_active_connect_results_t > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 qp_handle; > > + u32 remote_addr; > > + u16 remote_port; > > + u16 pad; > > + u32 private_data_length; > > + u8 private_data[0]; /* Private data in-line. 
*/ > > +} PACKED ccwr_qp_connect_req_t; > > + > > +typedef struct { > > + ccwr_qp_connect_req_t req; > > + /* no synchronous reply. */ > > +} PACKED ccwr_qp_connect_t; > > + > > + > > +/* > > + *------------------------ MM ------------------------ > > + */ > > + > > +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */ > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 pbl_depth; > > + u32 pd_id; > > + u32 flags; /* See ccwr_mr_flags_t */ > > +} PACKED ccwr_nsmr_stag_alloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 pbl_depth; > > + u32 stag_index; > > +} PACKED ccwr_nsmr_stag_alloc_rep_t; > > + > > +typedef union { > > + ccwr_nsmr_stag_alloc_req_t req; > > + ccwr_nsmr_stag_alloc_rep_t rep; > > +} PACKED ccwr_nsmr_stag_alloc_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 va; > > + u32 rnic_handle; > > + u16 flags; /* See ccwr_mr_flags_t */ > > + u8 stag_key; > > + u8 pad; > > + u32 pd_id; > > + u32 pbl_depth; > > + u32 pbe_size; > > + u32 fbo; > > + u32 length; > > + u32 addrs_length; > > + /* array of paddrs (must be aligned on a 64bit boundary) */ > > + u64 paddrs[0]; > > +} PACKED ccwr_nsmr_register_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 pbl_depth; > > + u32 stag_index; > > +} PACKED ccwr_nsmr_register_rep_t; > > + > > +typedef union { > > + ccwr_nsmr_register_req_t req; > > + ccwr_nsmr_register_rep_t rep; > > +} PACKED ccwr_nsmr_register_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 flags; /* See ccwr_mr_flags_t */ > > + u32 stag_index; > > + u32 addrs_length; > > + /* array of paddrs (must be aligned on a 64bit boundary) */ > > + u64 paddrs[0]; > > +} PACKED ccwr_nsmr_pbl_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_nsmr_pbl_rep_t; > > + > > +typedef union { > > + ccwr_nsmr_pbl_req_t req; > > + ccwr_nsmr_pbl_rep_t rep; > > +} PACKED ccwr_nsmr_pbl_t; > > + > > +typedef struct { > > + 
ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 stag_index; > > +} PACKED ccwr_mr_query_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u8 stag_key; > > + u8 pad[3]; > > + u32 pd_id; > > + u32 flags; /* See ccwr_mr_flags_t */ > > + u32 pbl_depth; > > +} PACKED ccwr_mr_query_rep_t; > > + > > +typedef union { > > + ccwr_mr_query_req_t req; > > + ccwr_mr_query_rep_t rep; > > +} PACKED ccwr_mr_query_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 stag_index; > > +} PACKED ccwr_mw_query_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u8 stag_key; > > + u8 pad[3]; > > + u32 pd_id; > > + u32 flags; /* See ccwr_mr_flags_t */ > > +} PACKED ccwr_mw_query_rep_t; > > + > > +typedef union { > > + ccwr_mw_query_req_t req; > > + ccwr_mw_query_rep_t rep; > > +} PACKED ccwr_mw_query_t; > > + > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 stag_index; > > +} PACKED ccwr_stag_dealloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_stag_dealloc_rep_t; > > + > > +typedef union { > > + ccwr_stag_dealloc_req_t req; > > + ccwr_stag_dealloc_rep_t rep; > > +} PACKED ccwr_stag_dealloc_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 va; > > + u32 rnic_handle; > > + u16 flags; /* See ccwr_mr_flags_t */ > > + u8 stag_key; > > + u8 pad; > > + u32 stag_index; > > + u32 pd_id; > > + u32 pbl_depth; > > + u32 pbe_size; > > + u32 fbo; > > + u32 length; > > + u32 addrs_length; > > + u32 pad1; > > + /* array of paddrs (must be aligned on a 64bit boundary) */ > > + u64 paddrs[0]; > > +} PACKED ccwr_nsmr_reregister_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 pbl_depth; > > + u32 stag_index; > > +} PACKED ccwr_nsmr_reregister_rep_t; > > + > > +typedef union { > > + ccwr_nsmr_reregister_req_t req; > > + ccwr_nsmr_reregister_rep_t rep; > > +} PACKED ccwr_nsmr_reregister_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 va; > > 
+ u32 rnic_handle; > > + u16 flags; /* See ccwr_mr_flags_t */ > > + u8 stag_key; > > + u8 pad; > > + u32 stag_index; > > + u32 pd_id; > > +} PACKED ccwr_smr_register_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 stag_index; > > +} PACKED ccwr_smr_register_rep_t; > > + > > +typedef union { > > + ccwr_smr_register_req_t req; > > + ccwr_smr_register_rep_t rep; > > +} PACKED ccwr_smr_register_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 pd_id; > > +} PACKED ccwr_mw_alloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 stag_index; > > +} PACKED ccwr_mw_alloc_rep_t; > > + > > +typedef union { > > + ccwr_mw_alloc_req_t req; > > + ccwr_mw_alloc_rep_t rep; > > +} PACKED ccwr_mw_alloc_t; > > + > > +/* > > + *------------------------ WRs ----------------------- > > + */ > > + > > +typedef struct { > > + ccwr_hdr_t hdr; /* Has status and WR Type */ > > +} PACKED ccwr_user_hdr_t; > > + > > +/* Completion queue entry. */ > > +typedef struct { > > + ccwr_hdr_t hdr; /* Has status and WR Type */ > > + u64 qp_user_context;/* cc_user_qp_t * */ > > + u32 qp_state; /* Current QP State */ > > + u32 handle; /* QPID or EP Handle */ > > + u32 bytes_rcvd; /* valid for RECV WCs */ > > + u32 stag; > > +} PACKED ccwr_ce_t; > > + > > + > > +/* > > + * Flags used for all post-sq WRs. These must fit in the flags > > + * field of the ccwr_hdr_t (eight bits). > > + */ > > +typedef enum { > > + SQ_SIGNALED = 0x01, > > + SQ_READ_FENCE = 0x02, > > + SQ_FENCE = 0x04, > > +} PACKED cc_sq_flags_t; > > + > > +/* > > + * Common fields for all post-sq WRs. Namely the standard header and a > > > + * secondary header with fields common to all post-sq WRs. > > + */ > > +typedef struct { > > + ccwr_user_hdr_t user_hdr; > > +} PACKED cc_sq_hdr_t; > > + > > +/* > > + * Same as above but for post-rq WRs. 
> > + */ > > +typedef struct { > > + ccwr_user_hdr_t user_hdr; > > +} PACKED cc_rq_hdr_t; > > + > > +/* > > + * use the same struct for all sends. > > + */ > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u32 sge_len; > > + u32 remote_stag; > > + u8 data[0]; /* SGE array */ > > +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, > > ccwr_send_se_inv_req_t; > > + > > +typedef ccwr_ce_t ccwr_send_rep_t; > > + > > +typedef union { > > + ccwr_send_req_t req; > > + ccwr_send_rep_t rep; > > +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, > ccwr_send_se_inv_t; > > + > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u64 remote_to; > > + u32 remote_stag; > > + u32 sge_len; > > + u8 data[0]; /* SGE array */ > > +} PACKED ccwr_rdma_write_req_t; > > + > > +typedef ccwr_ce_t ccwr_rdma_write_rep_t; > > + > > +typedef union { > > + ccwr_rdma_write_req_t req; > > + ccwr_rdma_write_rep_t rep; > > +} PACKED ccwr_rdma_write_t; > > + > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u64 local_to; > > + u64 remote_to; > > + u32 local_stag; > > + u32 remote_stag; > > + u32 length; > > +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t; > > + > > +typedef ccwr_ce_t ccwr_rdma_read_rep_t; > > + > > +typedef union { > > + ccwr_rdma_read_req_t req; > > + ccwr_rdma_read_rep_t rep; > > +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t; > > + > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u64 va; > > + u8 stag_key; > > + u8 pad[3]; > > + u32 mw_stag_index; > > + u32 mr_stag_index; > > + u32 length; > > + u32 flags; /* see ccwr_mr_flags_t; */ > > +} PACKED ccwr_mw_bind_req_t; > > + > > +typedef ccwr_ce_t ccwr_mw_bind_rep_t; > > + > > +typedef union { > > + ccwr_mw_bind_req_t req; > > + ccwr_mw_bind_rep_t rep; > > +} PACKED ccwr_mw_bind_t; > > + > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u64 va; > > + u8 stag_key; > > + u8 pad[3]; > > + u32 stag_index; > > + u32 pbe_size; > > + u32 fbo; > > + u32 length; > > + u32 addrs_length; > > + /* 
array of paddrs (must be aligned on a 64bit boundary) */ > > + u64 paddrs[0]; > > +} PACKED ccwr_nsmr_fastreg_req_t; > > + > > +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t; > > + > > +typedef union { > > + ccwr_nsmr_fastreg_req_t req; > > + ccwr_nsmr_fastreg_rep_t rep; > > +} PACKED ccwr_nsmr_fastreg_t; > > + > > +typedef struct { > > + cc_sq_hdr_t sq_hdr; > > + u8 stag_key; > > + u8 pad[3]; > > + u32 stag_index; > > +} PACKED ccwr_stag_invalidate_req_t; > > + > > +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t; > > + > > +typedef union { > > + ccwr_stag_invalidate_req_t req; > > + ccwr_stag_invalidate_rep_t rep; > > +} PACKED ccwr_stag_invalidate_t; > > + > > +typedef union { > > + cc_sq_hdr_t sq_hdr; > > + ccwr_send_req_t send; > > + ccwr_send_se_req_t send_se; > > + ccwr_send_inv_req_t send_inv; > > + ccwr_send_se_inv_req_t send_se_inv; > > + ccwr_rdma_write_req_t rdma_write; > > + ccwr_rdma_read_req_t rdma_read; > > + ccwr_mw_bind_req_t mw_bind; > > + ccwr_nsmr_fastreg_req_t nsmr_fastreg; > > + ccwr_stag_invalidate_req_t stag_inv; > > +} PACKED ccwr_sqwr_t; > > + > > + > > +/* > > + * RQ WRs > > + */ > > +typedef struct { > > + cc_rq_hdr_t rq_hdr; > > + u8 data[0]; /* array of SGEs */ > > +} PACKED ccwr_rqwr_t, ccwr_recv_req_t; > > + > > +typedef ccwr_ce_t ccwr_recv_rep_t; > > + > > +typedef union { > > + ccwr_recv_req_t req; > > + ccwr_recv_rep_t rep; > > +} PACKED ccwr_recv_t; > > + > > +/* > > + * All AEs start with this header. Most AEs only need to convey the > > + * information in the header. Some, like LLP connection events, need > > + * more info. The union typedef ccwr_ae_t has all the possible AEs. > > + * > > + * hdr.context is the user_context from the rnic_open WR. 
NULL If this > > > + * is not affiliated with an rnic > > + * > > + * hdr.id is the AE identifier (eg; CCAE_REMOTE_SHUTDOWN, > > + * CCAE_LLP_CLOSE_COMPLETE) > > + * > > + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, > CC_RES_IND_SRQ > > + * > > + * user_context is the context passed down when the host created the > resource. > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 user_context; /* user context for this res. */ > > + u32 resource_type; /* see cc_resource_indicator_t */ > > + u32 resource; /* handle for resource */ > > + u32 qp_state; /* current QP State */ > > +} PACKED PACKED ccwr_ae_hdr_t; > > + > > +/* > > + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, > > > + * the adapter moves the QP into RTS state > > + */ > > +typedef struct { > > + ccwr_ae_hdr_t ae_hdr; > > + u32 laddr; > > + u32 raddr; > > + u16 lport; > > + u16 rport; > > + u32 private_data_length; > > + u8 private_data[0]; /* data is in-line in the msg. */ > > +} PACKED ccwr_ae_active_connect_results_t; > > + > > +/* > > + * When connections are established by the stack (and the private data > > + * MPA frame is received), the adapter will generate an event to the > host. > > + * The details of the connection, any private data, and the new > connection > > + * request handle is passed up via the CCAE_CONNECTION_REQUEST msg on > the > > + * AE queue: > > + */ > > +typedef struct { > > + ccwr_ae_hdr_t ae_hdr; > > + u32 cr_handle; /* connreq handle (sock ptr) */ > > + u32 laddr; > > + u32 raddr; > > + u16 lport; > > + u16 rport; > > + u32 private_data_length; > > + u8 private_data[0]; /* data is in-line in the msg. 
*/ > > +} PACKED ccwr_ae_connection_request_t; > > + > > +typedef union { > > + ccwr_ae_hdr_t ae_generic; > > + ccwr_ae_active_connect_results_t ae_active_connect_results; > > + ccwr_ae_connection_request_t ae_connection_request; > > +} PACKED ccwr_ae_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 hint_count; > > + u64 q0_host_shared; > > + u64 q1_host_shared; > > + u64 q1_host_msg_pool; > > + u64 q2_host_shared; > > + u64 q2_host_msg_pool; > > +} PACKED ccwr_init_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_init_rep_t; > > + > > +typedef union { > > + ccwr_init_req_t req; > > + ccwr_init_rep_t rep; > > +} PACKED ccwr_init_t; > > + > > +/* > > + * For upgrading flash. > > + */ > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > +} PACKED ccwr_flash_init_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 adapter_flash_buf_offset; > > + u32 adapter_flash_len; > > +} PACKED ccwr_flash_init_rep_t; > > + > > +typedef union { > > + ccwr_flash_init_req_t req; > > + ccwr_flash_init_rep_t rep; > > +} PACKED ccwr_flash_init_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 len; > > +} PACKED ccwr_flash_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 status; > > +} PACKED ccwr_flash_rep_t; > > + > > +typedef union { > > + ccwr_flash_req_t req; > > + ccwr_flash_rep_t rep; > > +} PACKED ccwr_flash_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 size; > > +} PACKED ccwr_buf_alloc_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 offset; /* 0 if mem not available */ > > + u32 size; /* 0 if mem not available */ > > +} PACKED ccwr_buf_alloc_rep_t; > > + > > +typedef union { > > + ccwr_buf_alloc_req_t req; > > + ccwr_buf_alloc_rep_t rep; > > +} PACKED ccwr_buf_alloc_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 offset; /* Must match value from alloc 
*/ > > + u32 size; /* Must match value from alloc */ > > +} PACKED ccwr_buf_free_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_buf_free_rep_t; > > + > > +typedef union { > > + ccwr_buf_free_req_t req; > > + ccwr_buf_free_rep_t rep; > > +} PACKED ccwr_buf_free_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 offset; > > + u32 size; > > + u32 type; > > + u32 flags; > > +} PACKED ccwr_flash_write_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 status; > > +} PACKED ccwr_flash_write_rep_t; > > + > > +typedef union { > > + ccwr_flash_write_req_t req; > > + ccwr_flash_write_rep_t rep; > > +} PACKED ccwr_flash_write_t; > > + > > +/* > > + * Messages for LLP connection setup. > > + */ > > + > > +/* > > + * Listen Request. This allocates a listening endpoint to allow > passive > > + * connection setup. Newly established LLP connections are passed up > > + * via an AE. See ccwr_ae_connection_request_t > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u64 user_context; /* returned in AEs. */ > > + u32 rnic_handle; > > + u32 local_addr; /* local addr, or 0 */ > > + u16 local_port; /* 0 means "pick one" */ > > + u16 pad; > > + u32 backlog; /* traditional tcp listen bl */ > > +} PACKED ccwr_ep_listen_create_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 ep_handle; /* handle to new listening ep */ > > + u16 local_port; /* resulting port... 
*/ > > + u16 pad; > > +} PACKED ccwr_ep_listen_create_rep_t; > > + > > +typedef union { > > + ccwr_ep_listen_create_req_t req; > > + ccwr_ep_listen_create_rep_t rep; > > +} PACKED ccwr_ep_listen_create_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 ep_handle; > > +} PACKED ccwr_ep_listen_destroy_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_ep_listen_destroy_rep_t; > > + > > +typedef union { > > + ccwr_ep_listen_destroy_req_t req; > > + ccwr_ep_listen_destroy_rep_t rep; > > +} PACKED ccwr_ep_listen_destroy_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 ep_handle; > > +} PACKED ccwr_ep_query_req_t; > > + > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 local_addr; > > + u32 remote_addr; > > + u16 local_port; > > + u16 remote_port; > > +} PACKED ccwr_ep_query_rep_t; > > + > > +typedef union { > > + ccwr_ep_query_req_t req; > > + ccwr_ep_query_rep_t rep; > > +} PACKED ccwr_ep_query_t; > > + > > + > > +/* > > + * The host passes this down to indicate acceptance of a pending iWARP > > + * connection. The cr_handle was obtained from the CONNECTION_REQUEST > > + * AE passed up by the adapter. See ccwr_ae_connection_request_t. > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 qp_handle; /* QP to bind to this LLP conn */ > > + u32 ep_handle; /* LLP handle to accept */ > > + u32 private_data_length; > > + u8 private_data[0]; /* data in-line in msg. */ > > +} PACKED ccwr_cr_accept_req_t; > > + > > +/* > > + * adapter sends reply when private data is successfully submitted to > > + * the LLP. 
> > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_cr_accept_rep_t; > > + > > +typedef union { > > + ccwr_cr_accept_req_t req; > > + ccwr_cr_accept_rep_t rep; > > +} PACKED ccwr_cr_accept_t; > > + > > +/* > > + * The host sends this down if a given iWARP connection request was > > + * rejected by the consumer. The cr_handle was obtained from a > > + * previous ccwr_ae_connection_request_t AE sent by the adapter. > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > + u32 rnic_handle; > > + u32 ep_handle; /* LLP handle to reject */ > > +} PACKED ccwr_cr_reject_req_t; > > + > > +/* > > + * Dunno if this is needed, but we'll add it for now. The adapter will > > + * send the reject_reply after the LLP endpoint has been destroyed. > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; > > +} PACKED ccwr_cr_reject_rep_t; > > + > > +typedef union { > > + ccwr_cr_reject_req_t req; > > + ccwr_cr_reject_rep_t rep; > > +} PACKED ccwr_cr_reject_t; > > + > > +/* > > + * console command. Used to implement a debug console over the verbs > > + * request and reply queues. > > + */ > > + > > +/* > > + * Console request message. It contains: > > + * - message hdr with id = CCWR_CONSOLE > > + * - the physaddr/len of host memory to be used for the reply. > > + * - the command string. eg: "netstat -s" or "zoneinfo" > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > > + u64 reply_buf; /* pinned host buf for reply */ > > + u32 reply_buf_len; /* length of reply buffer */ > > + u8 command[0]; /* NUL terminated ascii string */ > > + /* containing the command req */ > > +} PACKED ccwr_console_req_t; > > + > > +/* > > + * flags used in the console reply. > > + */ > > +typedef enum { > > + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ > > +} PACKED cc_console_flags_t; > > + > > +/* > > + * Console reply message. 
> > + * hdr.result contains the cc_status_t error if the reply was _not_ > generated, > > + * or CC_OK if the reply was generated. > > + */ > > +typedef struct { > > + ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */ > > + u32 flags; /* see cc_console_flags_t */ > > +} PACKED ccwr_console_rep_t; > > + > > +typedef union { > > + ccwr_console_req_t req; > > + ccwr_console_rep_t rep; > > +} PACKED ccwr_console_t; > > + > > + > > +/* > > + * Giant union with all WRs. Makes life easier... > > + */ > > +typedef union { > > + ccwr_hdr_t hdr; > > + ccwr_user_hdr_t user_hdr; > > + ccwr_rnic_open_t rnic_open; > > + ccwr_rnic_query_t rnic_query; > > + ccwr_rnic_getconfig_t rnic_getconfig; > > + ccwr_rnic_setconfig_t rnic_setconfig; > > + ccwr_rnic_close_t rnic_close; > > + ccwr_cq_create_t cq_create; > > + ccwr_cq_modify_t cq_modify; > > + ccwr_cq_destroy_t cq_destroy; > > + ccwr_pd_alloc_t pd_alloc; > > + ccwr_pd_dealloc_t pd_dealloc; > > + ccwr_srq_create_t srq_create; > > + ccwr_srq_destroy_t srq_destroy; > > + ccwr_qp_create_t qp_create; > > + ccwr_qp_query_t qp_query; > > + ccwr_qp_modify_t qp_modify; > > + ccwr_qp_destroy_t qp_destroy; > > + ccwr_qp_connect_t qp_connect; > > + ccwr_nsmr_stag_alloc_t nsmr_stag_alloc; > > + ccwr_nsmr_register_t nsmr_register; > > + ccwr_nsmr_pbl_t nsmr_pbl; > > + ccwr_mr_query_t mr_query; > > + ccwr_mw_query_t mw_query; > > + ccwr_stag_dealloc_t stag_dealloc; > > + ccwr_sqwr_t sqwr; > > + ccwr_rqwr_t rqwr; > > + ccwr_ce_t ce; > > + ccwr_ae_t ae; > > + ccwr_init_t init; > > + ccwr_ep_listen_create_t ep_listen_create; > > + ccwr_ep_listen_destroy_t ep_listen_destroy; > > + ccwr_cr_accept_t cr_accept; > > + ccwr_cr_reject_t cr_reject; > > + ccwr_console_t console; > > + ccwr_flash_init_t flash_init; > > + ccwr_flash_t flash; > > + ccwr_buf_alloc_t buf_alloc; > > + ccwr_buf_free_t buf_free; > > + ccwr_flash_write_t flash_write; > > +} PACKED ccwr_t; > > + > > + > > +/* > > + * Accessors for the wr fields that are packed together tightly to > > + * 
reduce the wr message size. The wr arguments are void* so that > > + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types > > + * in the ccwr_t union can be passed in. > > + */ > > +static __inline__ u8 > > +cc_wr_get_id(void *wr) > > +{ > > + return ((ccwr_hdr_t *)wr)->id; > > +} > > +static __inline__ void > > +c2_wr_set_id(void *wr, u8 id) > > +{ > > + ((ccwr_hdr_t *)wr)->id = id; > > +} > > +static __inline__ u8 > > +cc_wr_get_result(void *wr) > > +{ > > + return ((ccwr_hdr_t *)wr)->result; > > +} > > +static __inline__ void > > +cc_wr_set_result(void *wr, u8 result) > > +{ > > + ((ccwr_hdr_t *)wr)->result = result; > > +} > > +static __inline__ u8 > > +cc_wr_get_flags(void *wr) > > +{ > > + return ((ccwr_hdr_t *)wr)->flags; > > +} > > +static __inline__ void > > +cc_wr_set_flags(void *wr, u8 flags) > > +{ > > + ((ccwr_hdr_t *)wr)->flags = flags; > > +} > > +static __inline__ u8 > > +cc_wr_get_sge_count(void *wr) > > +{ > > + return ((ccwr_hdr_t *)wr)->sge_count; > > +} > > +static __inline__ void > > +cc_wr_set_sge_count(void *wr, u8 sge_count) > > +{ > > + ((ccwr_hdr_t *)wr)->sge_count = sge_count; > > +} > > +static __inline__ u32 > > +cc_wr_get_wqe_count(void *wr) > > +{ > > + return ((ccwr_hdr_t *)wr)->wqe_count; > > +} > > +static __inline__ void > > +cc_wr_set_wqe_count(void *wr, u32 wqe_count) > > +{ > > + ((ccwr_hdr_t *)wr)->wqe_count = wqe_count; > > +} > > + > > +#undef PACKED > > + > > +#ifdef _MSC_VER > > +#pragma pack(pop) > > +#endif > > + > > +#endif /* _CC_WR_H_ */ > > Index: hw/amso1100/c2.c > > =================================================================== > > --- hw/amso1100/c2.c (revision 0) > > +++ hw/amso1100/c2.c (revision 0) > > @@ -0,0 +1,1221 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include > > +#include > > +#include > > + > > +#include > > +#include "c2.h" > > +#include "c2_provider.h" > > + > > +MODULE_AUTHOR("Tom Tucker "); > > +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); > > +MODULE_LICENSE("Dual BSD/GPL"); > > +MODULE_VERSION(DRV_VERSION); > > + > > +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | > NETIF_MSG_LINK > > + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; > > + > > +static int debug = -1; /* defaults above */ > > +module_param(debug, int, 0); > > +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); > > + > > +char *rnic_ip_addr = "192.168.69.169"; > > +module_param(rnic_ip_addr, charp, S_IRUGO); > > +MODULE_PARM_DESC(rnic_ip_addr, "IP Address for the AMSO1100 Adapter"); > > + > > +static int c2_up(struct net_device *netdev); > > +static int c2_down(struct net_device *netdev); > > +static int c2_xmit_frame(struct sk_buff *skb, struct net_device > *netdev); > > +static void c2_tx_interrupt(struct net_device *netdev); > > +static void c2_rx_interrupt(struct net_device *netdev); > > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs > *regs); > > +static void c2_tx_timeout(struct net_device *netdev); > > +static int c2_change_mtu(struct net_device *netdev, int new_mtu); > > +static void c2_reset(struct c2_port *c2_port); > > +static struct net_device_stats* c2_get_stats(struct net_device > *netdev); > > + > > +extern void c2_rnic_interrupt(struct c2_dev *c2dev); > > + > > +static struct pci_device_id c2_pci_table[] = { > > + { 0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID }, > > + { 0 } > > +}; > > + > > +MODULE_DEVICE_TABLE(pci, c2_pci_table); > > + > > +static void c2_print_macaddr(struct net_device *netdev) > > +{ > > + dprintk(KERN_INFO PFX "%s: MAC 
%02X:%02X:%02X:%02X:%02X:%02X, " > > + "IRQ %u\n", netdev->name, > > + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], > > + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], > > + netdev->irq); > > +} > > + > > +static void c2_set_rxbufsize(struct c2_port *c2_port) > > +{ > > + struct net_device *netdev = c2_port->netdev; > > + > > + assert(netdev != NULL); > > + > > + if (netdev->mtu > RX_BUF_SIZE) > > + c2_port->rx_buf_size = netdev->mtu + ETH_HLEN + sizeof(struct > > c2_rxp_hdr) + NET_IP_ALIGN; > > + else > > + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; > > +} > > + > > +/* > > + * Allocate TX ring elements and chain them together. > > + * One-to-one association of adapter descriptors with ring elements. > > + */ > > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, > dma_addr_t base, > > + void __iomem *mmio_txp_ring) > > +{ > > + struct c2_tx_desc *tx_desc; > > + struct c2_txp_desc *txp_desc; > > + struct c2_element *elem; > > + int i; > > + > > + tx_ring->start = kmalloc(sizeof(*elem)*tx_ring->count, GFP_KERNEL); > > + if (!tx_ring->start) > > + return -ENOMEM; > > + > > + for (i = 0, elem = tx_ring->start, tx_desc = vaddr, txp_desc = > mmio_txp_ring; > > + i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) > > + { > > + tx_desc->len = 0; > > + tx_desc->status = 0; > > + > > + /* Set TXP_HTXD_UNINIT */ > > + c2_write64((void *)txp_desc + C2_TXP_ADDR, > cpu_to_be64(0x1122334455667788ULL)); > > + c2_write16((void *)txp_desc + C2_TXP_LEN, cpu_to_be16(0)); > > + c2_write16((void *)txp_desc + C2_TXP_FLAGS, > cpu_to_be16(TXP_HTXD_UNINIT)); > > + > > + elem->skb = NULL; > > + elem->ht_desc = tx_desc; > > + elem->hw_desc = txp_desc; > > + > > + if (i == tx_ring->count - 1) { > > + elem->next = tx_ring->start; > > + tx_desc->next_offset = base; > > + } else { > > + elem->next = elem + 1; > > + tx_desc->next_offset = base + (i + 1) * sizeof(*tx_desc); > > + } > > + } > > + > > + tx_ring->to_use = 
tx_ring->to_clean = tx_ring->start; > > + > > + return 0; > > +} > > + > > +/* > > + * Allocate RX ring elements and chain them together. > > + * One-to-one association of adapter descriptors with ring elements. > > + */ > > +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, > dma_addr_t base, > > + void __iomem *mmio_rxp_ring) > > +{ > > + struct c2_rx_desc *rx_desc; > > + struct c2_rxp_desc *rxp_desc; > > + struct c2_element *elem; > > + int i; > > + > > + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, > GFP_KERNEL); > > + if (!rx_ring->start) > > + return -ENOMEM; > > + > > + for (i = 0, elem = rx_ring->start, rx_desc = vaddr, rxp_desc = > mmio_rxp_ring; > > + i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) > > + { > > + rx_desc->len = 0; > > + rx_desc->status = 0; > > + > > + /* Set RXP_HRXD_UNINIT */ > > + c2_write16((void *)rxp_desc + C2_RXP_STATUS, > cpu_to_be16(RXP_HRXD_OK)); > > + c2_write16((void *)rxp_desc + C2_RXP_COUNT, cpu_to_be16(0)); > > + c2_write16((void *)rxp_desc + C2_RXP_LEN, cpu_to_be16(0)); > > + c2_write64((void *)rxp_desc + C2_RXP_ADDR, > cpu_to_be64(0x99aabbccddeeffULL)); > > + c2_write16((void *)rxp_desc + C2_RXP_FLAGS, > cpu_to_be16(RXP_HRXD_UNINIT)); > > + > > + elem->skb = NULL; > > + elem->ht_desc = rx_desc; > > + elem->hw_desc = rxp_desc; > > + > > + if (i == rx_ring->count - 1) { > > + elem->next = rx_ring->start; > > + rx_desc->next_offset = base; > > + } else { > > + elem->next = elem + 1; > > + rx_desc->next_offset = base + (i + 1) * sizeof(*rx_desc); > > + } > > + } > > + > > + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; > > + > > + return 0; > > +} > > + > > +/* Setup buffer for receiving */ > > +static inline int c2_rx_alloc(struct c2_port *c2_port, struct > c2_element *elem) > > +{ > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_rx_desc *rx_desc = elem->ht_desc; > > + struct sk_buff *skb; > > + dma_addr_t mapaddr; > > + u32 maplen; > > + struct c2_rxp_hdr *rxp_hdr; > > 
+ > > + skb = dev_alloc_skb(c2_port->rx_buf_size); > > + if (unlikely(!skb)) { > > + dprintk(KERN_ERR PFX "%s: out of memory for receive\n", > > + c2_port->netdev->name); > > + return -ENOMEM; > > + } > > + > > + /* Zero out the rxp hdr in the sk_buff */ > > + memset(skb->data, 0, sizeof(*rxp_hdr)); > > + > > + skb->dev = c2_port->netdev; > > + > > + maplen = c2_port->rx_buf_size; > > + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, > PCI_DMA_FROMDEVICE); > > + > > + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ > > + rxp_hdr = (struct c2_rxp_hdr *) skb->data; > > + rxp_hdr->flags = RXP_HRXD_READY; > > + > > + /* c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); */ > > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)maplen - > > sizeof(*rxp_hdr))); > > + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(mapaddr)); > > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, > cpu_to_be16(RXP_HRXD_READY)); > > + > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + rx_desc->len = maplen; > > + > > + return 0; > > +} > > + > > +/* > > + * Allocate buffers for the Rx ring > > + * For receive: rx_ring.to_clean is next received frame > > + */ > > +static int c2_rx_fill(struct c2_port *c2_port) > > +{ > > + struct c2_ring *rx_ring = &c2_port->rx_ring; > > + struct c2_element *elem; > > + int ret = 0; > > + > > + elem = rx_ring->start; > > + do { > > + if (c2_rx_alloc(c2_port, elem)) { > > + ret = 1; > > + break; > > + } > > + } while ((elem = elem->next) != rx_ring->start); > > + > > + rx_ring->to_clean = rx_ring->start; > > + return ret; > > +} > > + > > +/* Free all buffers in RX ring, assumes receiver stopped */ > > +static void c2_rx_clean(struct c2_port *c2_port) > > +{ > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *rx_ring = &c2_port->rx_ring; > > + struct c2_element *elem; > > + struct c2_rx_desc *rx_desc; > > + > > + elem 
= rx_ring->start; > > + do { > > + rx_desc = elem->ht_desc; > > + rx_desc->len = 0; > > + > > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > > + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); > > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16(0)); > > + c2_write64(elem->hw_desc + C2_RXP_ADDR, > cpu_to_be64(0x99aabbccddeeffULL)); > > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, > cpu_to_be16(RXP_HRXD_UNINIT)); > > + > > + if (elem->skb) { > > + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, > > + PCI_DMA_FROMDEVICE); > > + dev_kfree_skb(elem->skb); > > + elem->skb = NULL; > > + } > > + } while ((elem = elem->next) != rx_ring->start); > > +} > > + > > +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element > *elem) > > +{ > > + struct c2_tx_desc *tx_desc = elem->ht_desc; > > + > > + tx_desc->len = 0; > > + > > + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, > PCI_DMA_TODEVICE); > > + > > + if (elem->skb) { > > + dev_kfree_skb_any(elem->skb); > > + elem->skb = NULL; > > + } > > + > > + return 0; > > +} > > + > > +/* Free all buffers in TX ring, assumes transmitter stopped */ > > +static void c2_tx_clean(struct c2_port *c2_port) > > +{ > > + struct c2_ring *tx_ring = &c2_port->tx_ring; > > + struct c2_element *elem; > > + struct c2_txp_desc txp_htxd; > > + int retry; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&c2_port->tx_lock, flags); > > + > > + elem = tx_ring->start; > > + > > + do { > > + retry = 0; > > + do { > > + txp_htxd.flags = c2_read16(elem->hw_desc + C2_TXP_FLAGS); > > + > > + if (txp_htxd.flags == TXP_HTXD_READY) { > > + retry = 1; > > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); > > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0)); > > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, > cpu_to_be16(TXP_HTXD_DONE)); > > + c2_port->netstats.tx_dropped++; > > + break; > > + } else { > > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); > > + 
c2_write64(elem->hw_desc + C2_TXP_ADDR, > > cpu_to_be64(0x1122334455667788ULL)); > > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, > cpu_to_be16(TXP_HTXD_UNINIT)); > > + } > > + > > + c2_tx_free(c2_port->c2dev, elem); > > + > > + } while ((elem = elem->next) != tx_ring->start); > > + } while (retry); > > + > > + c2_port->tx_avail = c2_port->tx_ring.count - 1; > > + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; > > + > > + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) > > + netif_wake_queue(c2_port->netdev); > > + > > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > > +} > > + > > +/* > > + * Process transmit descriptors marked 'DONE' by the firmware, > > + * freeing up their unneeded sk_buffs. > > + */ > > +static void c2_tx_interrupt(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *tx_ring = &c2_port->tx_ring; > > + struct c2_element *elem; > > + struct c2_txp_desc txp_htxd; > > + > > + spin_lock(&c2_port->tx_lock); > > + > > + for(elem = tx_ring->to_clean; elem != tx_ring->to_use; elem = > elem->next) > > + { > > + txp_htxd.flags = be16_to_cpu(c2_read16(elem->hw_desc + > C2_TXP_FLAGS)); > > + > > + if (txp_htxd.flags != TXP_HTXD_DONE) > > + break; > > + > > + if (netif_msg_tx_done(c2_port)) { > > + /* PCI reads are expensive in fast path */ > > + //txp_htxd.addr = be64_to_cpu(c2_read64(elem->hw_desc + > C2_TXP_ADDR)); > > + txp_htxd.len = be16_to_cpu(c2_read16(elem->hw_desc + > C2_TXP_LEN)); > > + dprintk(KERN_INFO PFX > > + "%s: tx done slot %3Zu status 0x%x len %5u bytes\n", > > + netdev->name, elem - tx_ring->start, > > + txp_htxd.flags, txp_htxd.len); > > + } > > + > > + c2_tx_free(c2dev, elem); > > + ++(c2_port->tx_avail); > > + } > > + > > + tx_ring->to_clean = elem; > > + > > + if (netif_queue_stopped(netdev) && c2_port->tx_avail > MAX_SKB_FRAGS > + 1) > > + netif_wake_queue(netdev); > > + > > + spin_unlock(&c2_port->tx_lock); > > +} > > + > 
> +static void c2_rx_error(struct c2_port *c2_port, struct c2_element > *elem) > > +{ > > + struct c2_rx_desc *rx_desc = elem->ht_desc; > > + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; > > + > > + if (rxp_hdr->status != RXP_HRXD_OK || > > + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { > > + dprintk(KERN_ERR PFX "BAD RXP_HRXD\n"); > > + dprintk(KERN_ERR PFX " rx_desc : %p\n", rx_desc); > > + dprintk(KERN_ERR PFX " index : %Zu\n", elem - > c2_port->rx_ring.start); > > + dprintk(KERN_ERR PFX " len : %u\n", rx_desc->len); > > + dprintk(KERN_ERR PFX " rxp_hdr : %p [PA %p]\n", rxp_hdr, > > + (void *)__pa((unsigned long)rxp_hdr)); > > + dprintk(KERN_ERR PFX " flags : 0x%x\n", rxp_hdr->flags); > > + dprintk(KERN_ERR PFX " status: 0x%x\n", rxp_hdr->status); > > + dprintk(KERN_ERR PFX " len : %u\n", rxp_hdr->len); > > + dprintk(KERN_ERR PFX " rsvd : 0x%x\n", rxp_hdr->rsvd); > > + } > > + > > + /* Setup the skb for reuse since we're dropping this pkt */ > > + elem->skb->tail = elem->skb->data = elem->skb->head; > > + > > + /* Zero out the rxp hdr in the sk_buff */ > > + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); > > + > > + /* Write the descriptor to the adapter's rx ring */ > > + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); > > + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); > > + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)elem->maplen > - > > sizeof(*rxp_hdr))); > > + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(elem->mapaddr)); > > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, > cpu_to_be16(RXP_HRXD_READY)); > > + > > + dprintk(KERN_INFO PFX "packet dropped\n"); > > + c2_port->netstats.rx_dropped++; > > +} > > + > > +static void c2_rx_interrupt(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *rx_ring = &c2_port->rx_ring; > > + struct c2_element *elem; > > + struct c2_rx_desc *rx_desc; > > + 
struct c2_rxp_hdr *rxp_hdr; > > + struct sk_buff *skb; > > + dma_addr_t mapaddr; > > + u32 maplen, buflen; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&c2dev->lock, flags); > > + > > + /* Begin where we left off */ > > + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; > > + > > + for(elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; elem > = elem->next) > > + { > > + rx_desc = elem->ht_desc; > > + mapaddr = elem->mapaddr; > > + maplen = elem->maplen; > > + skb = elem->skb; > > + rxp_hdr = (struct c2_rxp_hdr *)skb->data; > > + > > + if (rxp_hdr->flags != RXP_HRXD_DONE) > > + break; > > + > > + if (netif_msg_rx_status(c2_port)) > > + dprintk(KERN_INFO PFX "%s: rx done slot %3Zu status 0x%x len > %5u bytes\n", > > + netdev->name, elem - rx_ring->start, > > + rxp_hdr->flags, rxp_hdr->len); > > + > > + buflen = rxp_hdr->len; > > + > > + /* Sanity check the RXP header */ > > + if (rxp_hdr->status != RXP_HRXD_OK || > > + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { > > + c2_rx_error(c2_port, elem); > > + continue; > > + } > > + > > + /* Allocate and map a new skb for replenishing the host RX desc > */ > > + if (c2_rx_alloc(c2_port, elem)) { > > + c2_rx_error(c2_port, elem); > > + continue; > > + } > > + > > + /* Unmap the old skb */ > > + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, > PCI_DMA_FROMDEVICE); > > + > > + /* > > + * Skip past the leading 8 bytes comprising of the > > + * "struct c2_rxp_hdr", prepended by the adapter > > + * to the usual Ethernet header ("struct ethhdr"), > > + * to the start of the raw Ethernet packet. > > + * > > + * Fix up the various fields in the sk_buff before > > + * passing it up to netif_rx(). The transfer size > > + * (in bytes) specified by the adapter len field of > > + * the "struct rxp_hdr_t" does NOT include the > > + * "sizeof(struct c2_rxp_hdr)". 
> > + */ > > + skb->data += sizeof(*rxp_hdr); > > + skb->tail = skb->data + buflen; > > + skb->len = buflen; > > + skb->dev = netdev; > > + skb->protocol = eth_type_trans(skb, netdev); > > + > > + netif_rx(skb); > > + > > + netdev->last_rx = jiffies; > > + c2_port->netstats.rx_packets++; > > + c2_port->netstats.rx_bytes += buflen; > > + } > > + > > + /* Save where we left off */ > > + rx_ring->to_clean = elem; > > + c2dev->cur_rx = elem - rx_ring->start; > > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > > + > > + spin_unlock_irqrestore(&c2dev->lock, flags); > > +} > > + > > +/* > > + * Handle netisr0 TX & RX interrupts. > > + */ > > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs > *regs) > > +{ > > + unsigned int netisr0, dmaisr; > > + int handled = 0; > > + struct c2_dev *c2dev = (struct c2_dev *)dev_id; > > + > > + assert(c2dev != NULL); > > + > > + /* Process CCILNET interrupts */ > > + netisr0 = c2_read32(c2dev->regs + C2_NISR0); > > + if (netisr0) { > > + > > + /* > > + * There is an issue with the firmware that always > > + * provides the status of RX for both TX & RX > > + * interrupts. So process both queues here. 
> > + */ > > + c2_rx_interrupt(c2dev->netdev); > > + c2_tx_interrupt(c2dev->netdev); > > + > > + /* Clear the interrupt */ > > + c2_write32(c2dev->regs + C2_NISR0, netisr0); > > + handled++; > > + } > > + > > + /* Process RNIC interrupts */ > > + dmaisr = c2_read32(c2dev->regs + C2_DISR); > > + if (dmaisr) { > > + c2_write32(c2dev->regs + C2_DISR, dmaisr); > > + c2_rnic_interrupt(c2dev); > > + handled++; > > + } > > + > > + if (handled) { > > + return IRQ_HANDLED; > > + } else { > > + return IRQ_NONE; > > + } > > +} > > + > > +static int c2_up(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_element *elem; > > + struct c2_rxp_hdr *rxp_hdr; > > + size_t rx_size, tx_size; > > + int ret, i; > > + unsigned int netimr0; > > + > > + assert(c2dev != NULL); > > + > > + if (netif_msg_ifup(c2_port)) > > + dprintk(KERN_INFO PFX "%s: enabling interface\n", netdev->name); > > + > > + /* Set the Rx buffer size based on MTU */ > > + c2_set_rxbufsize(c2_port); > > + > > + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ > > + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); > > + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); > > + > > + c2_port->mem_size = tx_size + rx_size; > > + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, > c2_port->mem_size, > > + &c2_port->dma); > > + if (c2_port->mem == NULL) { > > + dprintk(KERN_ERR PFX "Unable to allocate memory for host > descriptor rings\n"); > > + return -ENOMEM; > > + } > > + > > + memset(c2_port->mem, 0, c2_port->mem_size); > > + > > + /* Create the Rx host descriptor ring */ > > + if ((ret = c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, > c2_port->dma, > > + c2dev->mmio_rxp_ring))) { > > + dprintk(KERN_ERR PFX "Unable to create RX ring\n"); > > + goto bail0; > > + } > > + > > + /* Allocate Rx buffers for the host descriptor ring */ > > + if (c2_rx_fill(c2_port)) { > > + 
dprintk(KERN_ERR PFX "Unable to fill RX ring\n"); > > + goto bail1; > > + } > > + > > + /* Create the Tx host descriptor ring */ > > + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + > rx_size, > > + c2_port->dma + rx_size, c2dev->mmio_txp_ring))) > { > > + dprintk(KERN_ERR PFX "Unable to create TX ring\n"); > > + goto bail1; > > + } > > + > > + /* Set the TX pointer to where we left off */ > > + c2_port->tx_avail = c2_port->tx_ring.count - 1; > > + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = > c2_port->tx_ring. > > start + c2dev->cur_tx; > > + > > + /* missing: Initialize MAC */ > > + > > + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); > > + > > + /* Reset the adapter, ensures the driver is in sync with the RXP */ > > + c2_reset(c2_port); > > + > > + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ > > + for(i = 0, elem = c2_port->rx_ring.start; i < > c2_port->rx_ring.count; > > + i++, elem++) > > + { > > + rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; > > + rxp_hdr->flags = 0; > > + c2_write16(elem->hw_desc + C2_RXP_FLAGS, > cpu_to_be16(RXP_HRXD_READY)); > > + } > > + > > + /* Enable network packets */ > > + netif_start_queue(netdev); > > + > > + /* Enable IRQ */ > > + c2_write32(c2dev->regs + C2_IDIS, 0); > > + netimr0 = c2_read32(c2dev->regs + C2_NIMR0); > > + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); > > + c2_write32(c2dev->regs + C2_NIMR0, netimr0); > > + > > + return 0; > > + > > + bail1: > > + c2_rx_clean(c2_port); > > + kfree(c2_port->rx_ring.start); > > + > > + bail0: > > + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, > c2_port->dma); > > + > > + return ret; > > +} > > + > > +static int c2_down(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + struct c2_dev *c2dev = c2_port->c2dev; > > + > > + if (netif_msg_ifdown(c2_port)) > > + dprintk(KERN_INFO PFX "%s: disabling interface\n", netdev->name); > > + > > + /* Wait for all 
the queued packets to get sent */ > > + c2_tx_interrupt(netdev); > > + > > + /* Disable network packets */ > > + netif_stop_queue(netdev); > > + > > + /* Disable IRQs by clearing the interrupt mask */ > > + c2_write32(c2dev->regs + C2_IDIS, 1); > > + c2_write32(c2dev->regs + C2_NIMR0, 0); > > + > > + /* missing: Stop transmitter */ > > + > > + /* missing: Stop receiver */ > > + > > + /* Reset the adapter, ensures the driver is in sync with the RXP */ > > + c2_reset(c2_port); > > + > > + /* missing: Turn off LEDs here */ > > + > > + /* Free all buffers in the host descriptor rings */ > > + c2_tx_clean(c2_port); > > + c2_rx_clean(c2_port); > > + > > + /* Free the host descriptor rings */ > > + kfree(c2_port->rx_ring.start); > > + kfree(c2_port->tx_ring.start); > > + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, > c2_port->dma); > > + > > + return 0; > > +} > > + > > +static void c2_reset(struct c2_port *c2_port) > > +{ > > + struct c2_dev *c2dev = c2_port->c2dev; > > + unsigned int cur_rx = c2dev->cur_rx; > > + > > + /* Tell the hardware to quiesce */ > > + C2_SET_CUR_RX(c2dev, cur_rx|C2_PCI_HRX_QUI); > > + > > + /* > > + * The hardware will reset the C2_PCI_HRX_QUI bit once > > + * the RXP is quiesced. Wait 2 seconds for this. 
> > + */ > > + ssleep(2); > > + > > + cur_rx = C2_GET_CUR_RX(c2dev); > > + > > + if (cur_rx & C2_PCI_HRX_QUI) > > + dprintk(KERN_ERR PFX "c2_reset: failed to quiesce the > hardware!\n"); > > + > > + cur_rx &= ~C2_PCI_HRX_QUI; > > + > > + c2dev->cur_rx = cur_rx; > > + > > + dprintk("Current RX: %u\n", c2dev->cur_rx); > > +} > > + > > +static int c2_xmit_frame(struct sk_buff *skb, struct net_device > *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *tx_ring = &c2_port->tx_ring; > > + struct c2_element *elem; > > + dma_addr_t mapaddr; > > + u32 maplen; > > + unsigned long flags; > > + unsigned int i; > > + > > + spin_lock_irqsave(&c2_port->tx_lock, flags); > > + > > + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { > > + netif_stop_queue(netdev); > > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > > + > > + dprintk(KERN_WARNING PFX "%s: Tx ring full when queue awake!\n", > > + netdev->name); > > + return NETDEV_TX_BUSY; > > + } > > + > > + maplen = skb_headlen(skb); > > + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, > PCI_DMA_TODEVICE); > > + > > + elem = tx_ring->to_use; > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); > > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); > > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, > cpu_to_be16(TXP_HTXD_READY)); > > + > > + c2_port->netstats.tx_packets++; > > + c2_port->netstats.tx_bytes += maplen; > > + > > + /* Loop thru additional data fragments and queue them */ > > + if (skb_shinfo(skb)->nr_frags) { > > + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > > + { > > + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; > > + maplen = frag->size; > > + mapaddr = pci_map_page(c2dev->pcidev, frag->page, > frag->page_offset, > > + maplen, PCI_DMA_TODEVICE); > > + 
> > + elem = elem->next; > > + elem->skb = NULL; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); > > + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); > > + c2_write16(elem->hw_desc + C2_TXP_FLAGS, > cpu_to_be16(TXP_HTXD_READY)); > > + > > + c2_port->netstats.tx_packets++; > > + c2_port->netstats.tx_bytes += maplen; > > + } > > + } > > + > > + tx_ring->to_use = elem->next; > > + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); > > + > > + if (netif_msg_tx_queued(c2_port)) > > + dprintk(KERN_DEBUG PFX "%s: tx queued, slot %3Zu, len %5u bytes, > avail = %u\n", > > + netdev->name, elem - tx_ring->start, maplen, > c2_port->tx_avail); > > + > > + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { > > + netif_stop_queue(netdev); > > + if (netif_msg_tx_queued(c2_port)) > > + dprintk(KERN_INFO PFX "%s: transmit queue full\n", > netdev->name); > > + } > > + > > + spin_unlock_irqrestore(&c2_port->tx_lock, flags); > > + > > + netdev->trans_start = jiffies; > > + > > + return NETDEV_TX_OK; > > +} > > + > > +static struct net_device_stats *c2_get_stats(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + > > + return &c2_port->netstats; > > +} > > + > > +static int c2_set_mac_address(struct net_device *netdev, void *p) > > +{ > > + return -1; > > +} > > + > > +static void c2_tx_timeout(struct net_device *netdev) > > +{ > > + struct c2_port *c2_port = netdev_priv(netdev); > > + > > + if (netif_msg_timer(c2_port)) > > + dprintk(KERN_DEBUG PFX "%s: tx timeout\n", netdev->name); > > + > > + c2_tx_clean(c2_port); > > +} > > + > > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > > +{ > > + int ret = 0; > > + > > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > > + return -EINVAL; > > + > > + netdev->mtu = new_mtu; > > + > > + if (netif_running(netdev)) { > > + c2_down(netdev); > > + > > + 
c2_up(netdev); > > + } > > + > > + return ret; > > +} > > + > > +/* Initialize network device */ > > +static struct net_device *c2_devinit(struct c2_dev *c2dev, void __iomem > *mmio_addr) > > +{ > > + struct c2_port *c2_port = NULL; > > + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); > > + > > + if (!netdev) { > > + dprintk(KERN_ERR PFX "c2_port etherdev alloc failed"); > > + return NULL; > > + } > > + > > + SET_MODULE_OWNER(netdev); > > + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); > > + > > + netdev->open = c2_up; > > + netdev->stop = c2_down; > > + netdev->hard_start_xmit = c2_xmit_frame; > > + netdev->get_stats = c2_get_stats; > > + netdev->tx_timeout = c2_tx_timeout; > > + netdev->set_mac_address = c2_set_mac_address; > > + netdev->change_mtu = c2_change_mtu; > > + netdev->watchdog_timeo = C2_TX_TIMEOUT; > > + netdev->irq = c2dev->pcidev->irq; > > + > > + c2_port = netdev_priv(netdev); > > + c2_port->netdev = netdev; > > + c2_port->c2dev = c2dev; > > + c2_port->msg_enable = netif_msg_init(debug, default_msg); > > + c2_port->tx_ring.count = C2_NUM_TX_DESC; > > + c2_port->rx_ring.count = C2_NUM_RX_DESC; > > + > > + spin_lock_init(&c2_port->tx_lock); > > + > > + /* Copy our 48-bit ethernet hardware address */ > > +#if 1 > > + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); > > +#else > > + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_RDMA_ENADDR, 6); > > +#endif > > + /* Validate the MAC address */ > > + if(!is_valid_ether_addr(netdev->dev_addr)) { > > + dprintk(KERN_ERR PFX "Invalid MAC Address\n"); > > + c2_print_macaddr(netdev); > > + free_netdev(netdev); > > + return NULL; > > + } > > + > > + c2dev->netdev = netdev; > > + > > + return netdev; > > +} > > + > > +static int __devinit c2_probe(struct pci_dev *pcidev, const struct > pci_device_id *ent) > > +{ > > + int ret = 0, i; > > + unsigned long reg0_start, reg0_flags, reg0_len; > > + unsigned long reg2_start, reg2_flags, reg2_len; > > + unsigned long reg4_start, 
reg4_flags, reg4_len; > > + unsigned kva_map_size; > > + struct net_device *netdev = NULL; > > + struct c2_dev *c2dev = NULL; > > + void __iomem *mmio_regs = NULL; > > + > > + assert(pcidev != NULL); > > + assert(ent != NULL); > > + > > + dprintk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s > loaded\n", > > + DRV_VERSION); > > + > > + /* Enable PCI device */ > > + ret = pci_enable_device(pcidev); > > + if (ret) { > > + dprintk(KERN_ERR PFX "%s: Unable to enable PCI device\n", > pci_name(pcidev)); > > + goto bail0; > > + } > > + > > + reg0_start = pci_resource_start(pcidev, BAR_0); > > + reg0_len = pci_resource_len(pcidev, BAR_0); > > + reg0_flags = pci_resource_flags(pcidev, BAR_0); > > + > > + reg2_start = pci_resource_start(pcidev, BAR_2); > > + reg2_len = pci_resource_len(pcidev, BAR_2); > > + reg2_flags = pci_resource_flags(pcidev, BAR_2); > > + > > + reg4_start = pci_resource_start(pcidev, BAR_4); > > + reg4_len = pci_resource_len(pcidev, BAR_4); > > + reg4_flags = pci_resource_flags(pcidev, BAR_4); > > + > > + dprintk(KERN_INFO PFX "BAR0 size = 0x%lX bytes\n", reg0_len); > > + dprintk(KERN_INFO PFX "BAR2 size = 0x%lX bytes\n", reg2_len); > > + dprintk(KERN_INFO PFX "BAR4 size = 0x%lX bytes\n", reg4_len); > > + > > + /* Make sure PCI base addr are MMIO */ > > + if (!(reg0_flags & IORESOURCE_MEM) || > > + !(reg2_flags & IORESOURCE_MEM) || > > + !(reg4_flags & IORESOURCE_MEM)) { > > + dprintk (KERN_ERR PFX "PCI regions not an MMIO resource\n"); > > + ret = -ENODEV; > > + goto bail1; > > + } > > + > > + /* Check for weird/broken PCI region reporting */ > > + if ((reg0_len < C2_REG0_SIZE) || > > + (reg2_len < C2_REG2_SIZE) || > > + (reg4_len < C2_REG4_SIZE)) { > > + dprintk (KERN_ERR PFX "Invalid PCI region sizes\n"); > > + ret = -ENODEV; > > + goto bail1; > > + } > > + > > + /* Reserve PCI I/O and memory resources */ > > + ret = pci_request_regions(pcidev, DRV_NAME); > > + if (ret) { > > + dprintk(KERN_ERR PFX "%s: Unable to request regions\n", > 
pci_name(pcidev)); > > + goto bail1; > > + } > > + > > + if ((sizeof(dma_addr_t) > 4)) { > > + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); > > + if (ret < 0) { > > + dprintk(KERN_ERR PFX "64b DMA configuration failed\n"); > > + goto bail2; > > + } > > + } else { > > + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); > > + if (ret < 0) { > > + dprintk(KERN_ERR PFX "32b DMA configuration failed\n"); > > + goto bail2; > > + } > > + } > > + > > + /* Enables bus-mastering on the device */ > > + pci_set_master(pcidev); > > + > > + /* Remap the adapter PCI registers in BAR4 */ > > + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, > > + sizeof(struct c2_adapter_pci_regs)); > > + if (mmio_regs == 0UL) { > > + dprintk(KERN_ERR PFX "Unable to remap adapter PCI registers in > BAR4\n"); > > + ret = -EIO; > > + goto bail2; > > + } > > + > > + /* Validate PCI regs magic */ > > + for (i = 0; i < sizeof(c2_magic); i++) > > + { > > + if (c2_magic[i] != c2_read8(mmio_regs + C2_REGS_MAGIC + i)) { > > + dprintk(KERN_ERR PFX > > + "Invalid PCI regs magic [%d/%Zd: got 0x%x, exp 0x%x]\n", > > + i + 1, sizeof(c2_magic), > > + c2_read8(mmio_regs + C2_REGS_MAGIC + i), c2_magic[i]); > > + dprintk(KERN_ERR PFX "Adapter not claimed\n"); > > + iounmap(mmio_regs); > > + ret = -EIO; > > + goto bail2; > > + } > > + } > > + > > + /* Validate the adapter version */ > > + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)) != C2_VERSION) > { > > + dprintk(KERN_ERR PFX "Version mismatch [fw=%u, c2=%u], Adapter > not claimed\n", > > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), > C2_VERSION); > > + ret = -EINVAL; > > + iounmap(mmio_regs); > > + goto bail2; > > + } > > + > > + /* Validate the adapter IVN */ > > + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)) != C2_IVN) { > > + dprintk(KERN_ERR PFX "IVN mismatch [fw=0x%x, c2=0x%x], Adapter > not claimed\n", > > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); > > + ret = -EINVAL; > > + iounmap(mmio_regs); > > + goto 
bail2; > > + } > > + > > + /* Allocate hardware structure */ > > + c2dev = (struct c2_dev*)ib_alloc_device(sizeof *c2dev); > > + if (!c2dev) { > > + dprintk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", > > + pci_name(pcidev)); > > + ret = -ENOMEM; > > + iounmap(mmio_regs); > > + goto bail2; > > + } > > + > > + memset(c2dev, 0, sizeof(*c2dev)); > > + spin_lock_init(&c2dev->lock); > > + c2dev->pcidev = pcidev; > > + c2dev->cur_tx = 0; > > + > > + /* Get the last RX index */ > > + c2dev->cur_rx = (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_HRX_CUR)) > - > > 0xffffc000) / sizeof(struct c2_rxp_desc); > > + > > + /* Request an interrupt line for the driver */ > > + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, > c2dev); > > + if (ret) { > > + dprintk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", > > + pci_name(pcidev), pcidev->irq); > > + iounmap(mmio_regs); > > + goto bail3; > > + } > > + > > + /* Set driver specific data */ > > + pci_set_drvdata(pcidev, c2dev); > > + > > + /* Initialize network device */ > > + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { > > + iounmap(mmio_regs); > > + goto bail4; > > + } > > + > > + /* Save off the actual size prior to unmapping mmio_regs */ > > + kva_map_size = be32_to_cpu(c2_read32(mmio_regs + > C2_REGS_PCI_WINSIZE)); > > + > > + /* Unmap the adapter PCI registers in BAR4 */ > > + iounmap(mmio_regs); > > + > > + /* Register network device */ > > + ret = register_netdev(netdev); > > + if (ret) { > > + dprintk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", > ret); > > + goto bail5; > > + } > > + > > + /* Disable network packets */ > > + netif_stop_queue(netdev); > > + > > + /* Remap the adapter HRXDQ PA space to kernel VA space */ > > + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + > C2_RXP_HRXDQ_OFFSET, > > + C2_RXP_HRXDQ_SIZE); > > + if (c2dev->mmio_rxp_ring == 0UL) { > > + dprintk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); > > + ret = -EIO; > > + goto bail6; > > + } > > + 
> > + /* Remap the adapter HTXDQ PA space to kernel VA space */ > > + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + > C2_TXP_HTXDQ_OFFSET, > > + C2_TXP_HTXDQ_SIZE); > > + if (c2dev->mmio_txp_ring == 0UL) { > > + dprintk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); > > + ret = -EIO; > > + goto bail7; > > + } > > + > > + /* Save off the current RX index in the last 4 bytes of the TXP Ring > */ > > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > > + > > + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ > > + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); > > + if (c2dev->regs == 0UL) { > > + dprintk(KERN_ERR PFX "Unable to remap BAR0\n"); > > + ret = -EIO; > > + goto bail8; > > + } > > + > > + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ > > + c2dev->pa = (void *)(reg4_start + C2_PCI_REGS_OFFSET); > > + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, > kva_map_size); > > + if (c2dev->kva == 0UL) { > > + dprintk(KERN_ERR PFX "Unable to remap BAR4\n"); > > + ret = -EIO; > > + goto bail9; > > + } > > + > > + /* Print out the MAC address */ > > + c2_print_macaddr(netdev); > > + > > + ret = c2_rnic_init(c2dev); > > + if (ret) { > > + dprintk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); > > + goto bail10; > > + } > > + > > + c2_register_device(c2dev); > > + > > + return 0; > > + > > + bail10: > > + iounmap(c2dev->kva); > > + > > + bail9: > > + iounmap(c2dev->regs); > > + > > + bail8: > > + iounmap(c2dev->mmio_txp_ring); > > + > > + bail7: > > + iounmap(c2dev->mmio_rxp_ring); > > + > > + bail6: > > + unregister_netdev(netdev); > > + > > + bail5: > > + free_netdev(netdev); > > + > > + bail4: > > + free_irq(pcidev->irq, c2dev); > > + > > + bail3: > > + ib_dealloc_device(&c2dev->ibdev); > > + > > + bail2: > > + pci_release_regions(pcidev); > > + > > + bail1: > > + pci_disable_device(pcidev); > > + > > + bail0: > > + return ret; > > +} > > + > > +static void __devexit c2_remove(struct pci_dev *pcidev) > > +{ > > 
+ struct c2_dev *c2dev = pci_get_drvdata(pcidev); > > + struct net_device *netdev = c2dev->netdev; > > + > > + assert(netdev != NULL); > > + > > + /* Unregister with OpenIB */ > > + ib_unregister_device(&c2dev->ibdev); > > + > > + /* Clean up the RNIC resources */ > > + c2_rnic_term(c2dev); > > + > > + /* Remove network device from the kernel */ > > + unregister_netdev(netdev); > > + > > + /* Free network device */ > > + free_netdev(netdev); > > + > > + /* Free the interrupt line */ > > + free_irq(pcidev->irq, c2dev); > > + > > + /* missing: Turn LEDs off here */ > > + > > + /* Unmap adapter PA space */ > > + iounmap(c2dev->kva); > > + iounmap(c2dev->regs); > > + iounmap(c2dev->mmio_txp_ring); > > + iounmap(c2dev->mmio_rxp_ring); > > + > > + /* Free the hardware structure */ > > + ib_dealloc_device(&c2dev->ibdev); > > + > > + /* Release reserved PCI I/O and memory resources */ > > + pci_release_regions(pcidev); > > + > > + /* Disable PCI device */ > > + pci_disable_device(pcidev); > > + > > + /* Clear driver specific data */ > > + pci_set_drvdata(pcidev, NULL); > > +} > > + > > +static struct pci_driver c2_pci_driver = { > > + .name = DRV_NAME, > > + .id_table = c2_pci_table, > > + .probe = c2_probe, > > + .remove = __devexit_p(c2_remove), > > +}; > > + > > +static int __init c2_init_module(void) > > +{ > > + return pci_module_init(&c2_pci_driver); > > +} > > + > > +static void __exit c2_exit_module(void) > > +{ > > + pci_unregister_driver(&c2_pci_driver); > > +} > > + > > +module_init(c2_init_module); > > +module_exit(c2_exit_module); > > Index: hw/amso1100/c2_qp.c > > =================================================================== > > --- hw/amso1100/c2_qp.c (revision 0) > > +++ hw/amso1100/c2_qp.c (revision 0) > > @@ -0,0 +1,840 @@ > > +/* > > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. 
> > + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + * > > + */ > > + > > +#include "c2.h" > > +#include "c2_vq.h" > > +#include "cc_status.h" > > + > > +#define C2_MAX_ORD_PER_QP 128 > > +#define C2_MAX_IRD_PER_QP 128 > > + > > +#define CC_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | > hint_count) > > +#define CC_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) > > +#define CC_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) > > + > > +enum c2_qp_state { > > + C2_QP_STATE_IDLE = 0x01, > > + C2_QP_STATE_CONNECTING = 0x02, > > + C2_QP_STATE_RTS = 0x04, > > + C2_QP_STATE_CLOSING = 0x08, > > + C2_QP_STATE_TERMINATE = 0x10, > > + C2_QP_STATE_ERROR = 0x20, > > +}; > > + > > +#define NO_SUPPORT -1 > > +static const u8 c2_opcode[] = { > > + [IB_WR_SEND] = CC_WR_TYPE_SEND, > > + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, > > + [IB_WR_RDMA_WRITE] = CC_WR_TYPE_RDMA_WRITE, > > + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, > > + [IB_WR_RDMA_READ] = CC_WR_TYPE_RDMA_READ, > > + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, > > + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, > > +}; > > + > > +void c2_qp_event(struct c2_dev *c2dev, u32 qpn, > > + enum ib_event_type event_type) > > +{ > > + struct c2_qp *qp; > > + struct ib_event event; > > + > > + spin_lock(&c2dev->qp_table.lock); > > + qp = c2_array_get(&c2dev->qp_table.qp, qpn & (c2dev->max_qp - 1)); > > + if (qp) > > + atomic_inc(&qp->refcount); > > + spin_unlock(&c2dev->qp_table.lock); > > + > > + if (!qp) { > > + dprintk("Async event for bogus QP %08x\n", qpn); > > + return; > > + } > > + > > + event.device = &c2dev->ibdev; > > + event.event = event_type; > > + event.element.qp = &qp->ibqp; > > + if (qp->ibqp.event_handler) > > + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); > > + > > + if (atomic_dec_and_test(&qp->refcount)) > > + wake_up(&qp->wait); > > +} > > + > > +static int to_c2_state(enum ib_qp_state ib_state) > > +{ > > + switch (ib_state) { > > + case IB_QPS_RESET: return C2_QP_STATE_IDLE; > > + case IB_QPS_RTS: return C2_QP_STATE_RTS; > > + case IB_QPS_SQD: return 
C2_QP_STATE_CLOSING; > > + case IB_QPS_SQE: return C2_QP_STATE_CLOSING; > > + case IB_QPS_ERR: return C2_QP_STATE_ERROR; > > + default: return -1; > > + } > > +} > > + > > +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF > > + > > +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, > > + struct ib_qp_attr *attr, int attr_mask) > > +{ > > + ccwr_qp_modify_req_t wr; > > + ccwr_qp_modify_rep_t *reply; > > + struct c2_vq_req *vq_req; > > + int err; > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) > > + return -ENOMEM; > > + > > + c2_wr_set_id(&wr, CCWR_QP_MODIFY); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.qp_handle = qp->adapter_handle; > > + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > > + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > > + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > > + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); > > + > > + if (attr_mask & IB_QP_STATE) { > > + > > + /* Ensure the state is valid */ > > + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) > > + return -EINVAL; > > + > > + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); > > + > > + } else if (attr_mask & IB_QP_CUR_STATE) { > > + > > + if (attr->cur_qp_state != IB_QPS_RTR && > > + attr->cur_qp_state != IB_QPS_RTS && > > + attr->cur_qp_state != IB_QPS_SQD && > > + attr->cur_qp_state != IB_QPS_SQE) > > + return -EINVAL; > > + else > > + wr.next_qp_state = > cpu_to_be32(to_c2_state(attr->cur_qp_state)); > > + } else { > > + err = 0; > > + goto bail0; > > + } > > + > > + /* reference the request struct */ > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, (ccwr_t *)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) > > + goto bail0; > > + > > + reply = (ccwr_qp_modify_rep_t *)(unsigned long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + 
> > + err = c2_errno(reply); > > + > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +static int destroy_qp(struct c2_dev *c2dev, > > + struct c2_qp *qp) > > +{ > > + struct c2_vq_req *vq_req; > > + ccwr_qp_destroy_req_t wr; > > + ccwr_qp_destroy_rep_t *reply; > > + int err; > > + > > + /* > > + * Allocate a verb request message > > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + return -ENOMEM; > > + } > > + > > + /* > > + * Initialize the WR > > + */ > > + c2_wr_set_id(&wr, CCWR_QP_DESTROY); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.qp_handle = qp->adapter_handle; > > + > > + /* > > + * reference the request struct. dereferenced in the int handler. > > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_qp_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + if ( (err = c2_errno(reply)) != 0) { > > + // XXX print error > > + } > > + > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +int c2_alloc_qp(struct c2_dev *c2dev, > > + struct c2_pd *pd, > > + struct ib_qp_init_attr *qp_attrs, > > + struct c2_qp *qp) > > +{ > > + ccwr_qp_create_req_t wr; > > + ccwr_qp_create_rep_t *reply; > > + struct c2_vq_req *vq_req; > > + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); > > + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); > > + unsigned long peer_pa; > > + u32 q_size, msg_size, mmap_size; > > + void *mmap; > > + 
int err; > > + > > + qp->qpn = c2_alloc(&c2dev->qp_table.alloc); > > + if (qp->qpn == -1) > > + return -ENOMEM; > > + > > + /* Allocate the SQ and RQ shared pointers */ > > + qp->sq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + if (!qp->sq_mq.shared) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + qp->rq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + if (!qp->rq_mq.shared) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + /* Allocate the verbs request */ > > + vq_req = vq_req_alloc(c2dev); > > + if (vq_req == NULL) { > > + err = -ENOMEM; > > + goto bail2; > > + } > > + > > + /* Initialize the work request */ > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_QP_CREATE); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.sq_cq_handle = send_cq->adapter_handle; > > + wr.rq_cq_handle = recv_cq->adapter_handle; > > + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr+1); > > + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr+1); > > + wr.srq_handle = 0; > > + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | > QP_MW_BIND | > > + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); > > + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); > > + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); > > + wr.rdma_write_sgl_depth = > cpu_to_be32(qp_attrs->cap.max_send_sge); // > > XXX no write depth? 
> > + wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared)); > > + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared)); > > + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); > > + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); > > + wr.pd_id = pd->pd_id; > > + wr.user_context = (unsigned long)qp; > > + > > + vq_req_get(c2dev, vq_req); > > + > > + /* Send the WR to the adapter */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail3; > > + } > > + > > + /* Wait for the verb reply */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail3; > > + } > > + > > + /* Process the reply */ > > + reply = (ccwr_qp_create_rep_t*)(unsigned long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail3; > > + } > > + > > + if ( (err = c2_wr_get_result(reply)) != 0) { > > + goto bail4; > > + } > > + > > + /* Fill in the kernel QP struct */ > > + atomic_set(&qp->refcount, 1); > > + qp->adapter_handle = reply->qp_handle; > > + qp->state = IB_QPS_RESET; > > + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; > > + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; > > + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; > > + > > + /* Initialize the SQ MQ */ > > + q_size = be32_to_cpu(reply->sq_depth); > > + msg_size = be32_to_cpu(reply->sq_msg_size); > > + peer_pa = (unsigned long)(c2dev->pa + > be32_to_cpu(reply->sq_mq_start)); > > + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * > q_size); > > + mmap = ioremap_nocache(peer_pa, mmap_size); > > + if (!mmap) { > > + err = -ENOMEM; > > + goto bail5; > > + } > > + > > + c2_mq_init(&qp->sq_mq, > > + be32_to_cpu(reply->sq_mq_index), > > + q_size, > > + msg_size, > > + mmap + sizeof(struct c2_mq_shared), /* pool start */ > > + mmap, /* peer */ > > + C2_MQ_ADAPTER_TARGET); > > + > > + /* Initialize the RQ mq */ > > + q_size = be32_to_cpu(reply->rq_depth); > > + msg_size = be32_to_cpu(reply->rq_msg_size); > > + peer_pa = (unsigned 
long)(c2dev->pa + > be32_to_cpu(reply->rq_mq_start)); > > + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * > q_size); > > + mmap = ioremap_nocache(peer_pa, mmap_size); > > + if (!mmap) { > > + err = -ENOMEM; > > + goto bail6; > > + } > > + > > + c2_mq_init(&qp->rq_mq, > > + be32_to_cpu(reply->rq_mq_index), > > + q_size, > > + msg_size, > > + mmap + sizeof(struct c2_mq_shared), /* pool start */ > > + mmap, /* peer */ > > + C2_MQ_ADAPTER_TARGET); > > + > > + vq_repbuf_free(c2dev, reply); > > + vq_req_free(c2dev, vq_req); > > + > > + spin_lock_irq(&c2dev->qp_table.lock); > > + c2_array_set(&c2dev->qp_table.qp, > > + qp->qpn & (c2dev->max_qp - 1), qp); > > + spin_unlock_irq(&c2dev->qp_table.lock); > > + > > + return 0; > > + > > +bail6: > > + iounmap(qp->sq_mq.peer); > > +bail5: > > + destroy_qp(c2dev, qp); > > +bail4: > > + vq_repbuf_free(c2dev, reply); > > +bail3: > > + vq_req_free(c2dev, vq_req); > > +bail2: > > + c2_free_mqsp(qp->rq_mq.shared); > > +bail1: > > + c2_free_mqsp(qp->sq_mq.shared); > > +bail0: > > + c2_free(&c2dev->qp_table.alloc, qp->qpn); > > + return err; > > +} > > + > > +void c2_free_qp(struct c2_dev *c2dev, > > + struct c2_qp *qp) > > +{ > > + struct c2_cq *send_cq; > > + struct c2_cq *recv_cq; > > + > > + send_cq = to_c2cq(qp->ibqp.send_cq); > > + recv_cq = to_c2cq(qp->ibqp.recv_cq); > > + > > + /* > > + * Lock CQs here, so that CQ polling code can do QP lookup > > + * without taking a lock. > > + */ > > + spin_lock_irq(&send_cq->lock); > > + if (send_cq != recv_cq) > > + spin_lock(&recv_cq->lock); > > + > > + spin_lock(&c2dev->qp_table.lock); > > + c2_array_clear(&c2dev->qp_table.qp, > > + qp->qpn & (c2dev->max_qp - 1)); > > + spin_unlock(&c2dev->qp_table.lock); > > + > > + if (send_cq != recv_cq) > > + spin_unlock(&recv_cq->lock); > > + spin_unlock_irq(&send_cq->lock); > > + > > + atomic_dec(&qp->refcount); > > + wait_event(qp->wait, !atomic_read(&qp->refcount)); > > + > > + /* > > + * Destory qp in the rnic... 
> > + */ > > + destroy_qp(c2dev, qp); > > + > > + /* > > + * Mark any unreaped CQEs as null and void. > > + */ > > + c2_cq_clean(c2dev, qp, send_cq->cqn); > > + if (send_cq != recv_cq) > > + c2_cq_clean(c2dev, qp, recv_cq->cqn); > > + /* > > + * Unmap the MQs and return the shared pointers > > + * to the message pool. > > + */ > > + iounmap(qp->sq_mq.peer); > > + iounmap(qp->rq_mq.peer); > > + c2_free_mqsp(qp->sq_mq.shared); > > + c2_free_mqsp(qp->rq_mq.shared); > > + > > + c2_free(&c2dev->qp_table.alloc, qp->qpn); > > +} > > + > > +/* > > + * Function: move_sgl > > + * > > + * Description: > > + * Move an SGL from the user's work request struct into a CCIL Work > Request > > + * message, swapping to WR byte order and ensure the total length > doesn't > > + * overflow. > > + * > > + * IN: > > + * dst - ptr to CCIL Work Request message SGL memory. > > + * src - ptr to the consumers SGL memory. > > + * > > + * OUT: none > > + * > > + * Return: > > + * CCIL status codes. > > + */ > > +static int > > +move_sgl(cc_data_addr_t *dst, struct ib_sge *src, int count, u32 > *p_len, u8 > > *actual_count) > > +{ > > + u32 tot = 0; /* running total */ > > + u8 acount = 0; /* running total non-0 len sge's */ > > + > > + while (count > 0) { > > + /* > > + * If the addition of this SGE causes the > > + * total SGL length to exceed 2^32-1, then > > + * fail-n-bail. > > + * > > + * If the current total plus the next element length > > + * wraps, then it will go negative and be less than the > > + * current total... 
> > + */ > > + if ((tot+src->length) < tot) { > > + return -EINVAL; > > + } > > + /* > > + * Bug: 1456 (as well as 1498 & 1643) > > + * Skip over any sge's supplied with len=0 > > + */ > > + if (src->length) { > > + tot += src->length; > > + dst->stag = cpu_to_be32(src->lkey); > > + dst->to = cpu_to_be64(src->addr); > > + dst->length = cpu_to_be32(src->length); > > + dst++; > > + acount++; > > + } > > + src++; > > + count--; > > + } > > + > > + if (acount == 0) { > > + /* > > + * Bug: 1476 (as well as 1498, 1456 and 1643) > > + * Setup the SGL in the WR to make it easier for the RNIC. > > + * This way, the FW doesn't have to deal with special cases. > > + * Setting length=0 should be sufficient. > > + */ > > + dst->stag = 0; > > + dst->to = 0; > > + dst->length = 0; > > + } > > + > > + *p_len = tot; > > + *actual_count = acount; > > + return 0; > > +} > > + > > +/* > > + * Function: c2_activity (private function) > > + * > > + * Description: > > + * Post an mq index to the host->adapter activity fifo. > > + * > > + * IN: > > + * c2dev - ptr to c2dev structure > > + * mq_index - mq index to post > > + * shared - value most recently written to shared > > + * > > + * OUT: > > + * > > + * Return: > > + * none > > + */ > > +static inline void > > +c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) > > +{ > > + /* > > + * First read the register to see if the FIFO is full, and if so, > > + * spin until it's not. This isn't perfect -- there is no > > + * synchronization among the clients of the register, but in > > + * practice it prevents multiple CPU from hammering the bus > > + * with PCI RETRY. Note that when this does happen, the card > > + * cannot get on the bus and the card and system hang in a > > + * deadlock -- thus the need for this code. 
[TOT] > > + */ > > + while (c2_read32(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) > { > > + set_current_state(TASK_UNINTERRUPTIBLE); > > + schedule_timeout(0); > > + } > > + > > + c2_write32(c2dev->regs + PCI_BAR0_ADAPTER_HINT, > CC_HINT_MAKE(mq_index, shared)); > > +} > > + > > +/* > > + * Function: qp_wr_post > > + * > > + * Description: > > + * This in-line function allocates a MQ msg, then moves the host-copy > of > > + * the completed WR into msg. Then it posts the message. > > + * > > + * IN: > > + * q - ptr to user MQ. > > + * wr - ptr to host-copy of the WR. > > + * qp - ptr to user qp > > + * size - Number of bytes to post. Assumed to be divisible by 4. > > + * > > + * OUT: none > > + * > > + * Return: > > + * CCIL status codes. > > + */ > > +static int > > +qp_wr_post(struct c2_mq *q, ccwr_t *wr, struct c2_qp *qp, u32 size) > > +{ > > + ccwr_t *msg; > > + > > + msg = c2_mq_alloc(q); > > + if (msg == NULL) { > > + return -EINVAL; > > + } > > + > > +#ifdef CCMSGMAGIC > > + ((ccwr_hdr_t *)wr)->magic = cpu_to_be32(CCWR_MAGIC); > > +#endif > > + > > + /* > > + * Since all header fields in the WR are the same as the > > + * CQE, set the following so the adapter need not. 
> > + */ > > + c2_wr_set_result(wr, CCERR_PENDING); > > + > > + /* > > + * Copy the wr down to the adapter > > + */ > > + memcpy((void *)msg, (void *)wr, size); > > + > > + c2_mq_produce(q); > > + return 0; > > +} > > + > > + > > +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, > > + struct ib_send_wr **bad_wr) > > +{ > > + struct c2_dev *c2dev = to_c2dev(ibqp->device); > > + struct c2_qp *qp = to_c2qp(ibqp); > > + ccwr_t wr; > > + int err = 0; > > + > > + u32 flags; > > + u32 tot_len; > > + u8 actual_sge_count; > > + u32 msg_size; > > + > > + if (qp->state > IB_QPS_RTS) > > + return -EINVAL; > > + > > + while (ib_wr) { > > + > > + flags = 0; > > + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; > > + if (ib_wr->send_flags & IB_SEND_SIGNALED) { > > + flags |= SQ_SIGNALED; > > + } > > + > > + switch (ib_wr->opcode) { > > + case IB_WR_SEND: > > + if (ib_wr->send_flags & IB_SEND_SOLICITED) { > > + c2_wr_set_id(&wr, CC_WR_TYPE_SEND_SE); > > + msg_size = sizeof(ccwr_send_se_req_t); > > + } else { > > + c2_wr_set_id(&wr, CC_WR_TYPE_SEND); > > + msg_size = sizeof(ccwr_send_req_t); > > + } > > + > > + wr.sqwr.send.remote_stag = 0; > > + msg_size += sizeof(cc_data_addr_t) * ib_wr->num_sge; > > + if (ib_wr->num_sge > qp->send_sgl_depth) { > > + err = -EINVAL; > > + break; > > + } > > + if (ib_wr->send_flags & IB_SEND_FENCE) { > > + flags |= SQ_READ_FENCE; > > + } > > + err = move_sgl((cc_data_addr_t*)&(wr.sqwr.send.data), > > + ib_wr->sg_list, > > + ib_wr->num_sge, > > + &tot_len, > > + &actual_sge_count); > > + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); > > + c2_wr_set_sge_count(&wr, actual_sge_count); > > + break; > > + case IB_WR_RDMA_WRITE: > > + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_WRITE); > > + msg_size = sizeof(ccwr_rdma_write_req_t) + > > + (sizeof(cc_data_addr_t) * ib_wr->num_sge); > > + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { > > + err = -EINVAL; > > + break; > > + } > > + if (ib_wr->send_flags & IB_SEND_FENCE) { > > + flags |= 
SQ_READ_FENCE; > > + } > > + wr.sqwr.rdma_write.remote_stag = > cpu_to_be32(ib_wr->wr.rdma.rkey); > > + wr.sqwr.rdma_write.remote_to = > cpu_to_be64(ib_wr->wr.rdma.remote_addr); > > + err = move_sgl((cc_data_addr_t*) > > + &(wr.sqwr.rdma_write.data), > > + ib_wr->sg_list, > > + ib_wr->num_sge, > > + &tot_len, > > + &actual_sge_count); > > + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); > > + c2_wr_set_sge_count(&wr, actual_sge_count); > > + break; > > + case IB_WR_RDMA_READ: > > + c2_wr_set_id(&wr, CC_WR_TYPE_RDMA_READ); > > + msg_size = sizeof(ccwr_rdma_read_req_t); > > + > > + /* IWarp only suppots 1 sge for RDMA reads */ > > + if (ib_wr->num_sge > 1) { > > + err = -EINVAL; > > + break; > > + } > > + > > + /* > > + * Move the local and remote stag/to/len into the WR. > > + */ > > + wr.sqwr.rdma_read.local_stag = > > + cpu_to_be32(ib_wr->sg_list->lkey); > > + wr.sqwr.rdma_read.local_to = > > + cpu_to_be64(ib_wr->sg_list->addr); > > + wr.sqwr.rdma_read.remote_stag = > > + cpu_to_be32(ib_wr->wr.rdma.rkey); > > + wr.sqwr.rdma_read.remote_to = > > + cpu_to_be64(ib_wr->wr.rdma.remote_addr); > > + wr.sqwr.rdma_read.length = > > + cpu_to_be32(ib_wr->sg_list->length); > > + break; > > + default: > > + /* error */ > > + msg_size = 0; > > + err = -EINVAL; > > + break; > > + } > > + > > + /* > > + * If we had an error on the last wr build, then > > + * break out. Possible errors include bogus WR > > + * type, and a bogus SGL length... > > + */ > > + if (err) { > > + break; > > + } > > + > > + /* > > + * Store flags > > + */ > > + c2_wr_set_flags(&wr, flags); > > + > > + /* > > + * Post the puppy! > > + */ > > + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); > > + if (err) { > > + break; > > + } > > + > > + /* > > + * Enqueue mq index to activity FIFO. 
> > + */ > > + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); > > + > > + ib_wr = ib_wr->next; > > + } > > + > > + if (err) > > + *bad_wr = ib_wr; > > + return err; > > +} > > + > > +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, > > + struct ib_recv_wr **bad_wr) > > +{ > > + struct c2_dev *c2dev = to_c2dev(ibqp->device); > > + struct c2_qp *qp = to_c2qp(ibqp); > > + ccwr_t wr; > > + int err = 0; > > + > > + if (qp->state > IB_QPS_RTS) > > + return -EINVAL; > > + > > + /* > > + * Try and post each work request > > + */ > > + while (ib_wr) { > > + u32 tot_len; > > + u8 actual_sge_count; > > + > > + if (ib_wr->num_sge > qp->recv_sgl_depth) { > > + err = -EINVAL; > > + break; > > + } > > + > > + /* > > + * Create local host-copy of the WR > > + */ > > + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; > > + c2_wr_set_id(&wr, CCWR_RECV); > > + c2_wr_set_flags(&wr, 0); > > + > > + /* sge_count is limited to eight bits. */ > > + assert(ib_wr->num_sge < 256); > > + err = move_sgl((cc_data_addr_t*)&(wr.rqwr.data), > > + ib_wr->sg_list, > > + ib_wr->num_sge, > > + &tot_len, > > + &actual_sge_count); > > + c2_wr_set_sge_count(&wr, actual_sge_count); > > + > > + /* > > + * If we had an error on the last wr build, then > > + * break out. Possible errors include bogus WR > > + * type, and a bogus SGL length... 
> > + */ > > + if (err) { > > + break; > > + } > > + > > + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); > > + if (err) { > > + break; > > + } > > + > > + /* > > + * Enqueue mq index to activity FIFO > > + */ > > + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); > > + > > + ib_wr = ib_wr->next; > > + } > > + > > + if (err) > > + *bad_wr = ib_wr; > > + return err; > > +} > > + > > +int __devinit c2_init_qp_table(struct c2_dev *c2dev) > > +{ > > + int err; > > + > > + spin_lock_init(&c2dev->qp_table.lock); > > + > > + err = c2_alloc_init(&c2dev->qp_table.alloc, > > + c2dev->max_qp, > > + 0); > > + if (err) > > + return err; > > + > > + err = c2_array_init(&c2dev->qp_table.qp, > > + c2dev->max_qp); > > + if (err) { > > + c2_alloc_cleanup(&c2dev->qp_table.alloc); > > + return err; > > + } > > + > > + return 0; > > +} > > + > > +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) > > +{ > > + c2_alloc_cleanup(&c2dev->qp_table.alloc); > > +} > > Index: hw/amso1100/cc_ivn.h > > =================================================================== > > --- hw/amso1100/cc_ivn.h (revision 0) > > +++ hw/amso1100/cc_ivn.h (revision 0) > > @@ -0,0 +1,57 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. 
> > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _CC_IVN_H_ > > +#define _CC_IVN_H_ > > + > > +/* > > + * The following value must be incremented each time structures shared > > + * between the firmware and host drivers are changed. This includes > > + * structures, types, and Max number of queue pairs.. > > + */ > > +#define CC_IVN_BASE 18 > > + > > +/* Used to mask of the CCMSGMAGIC bit */ > > +#define CC_IVN_MASK 0x7fffffff > > + > > + > > +/* > > + * The high order bit indicates a CCMSGMAGIC build, which changes the > > + * adapter<->host message formats. > > + */ > > +#ifdef CCMSGMAGIC > > +#define CC_IVN (CC_IVN_BASE | 0x80000000) > > +#else > > +#define CC_IVN (CC_IVN_BASE & 0x7fffffff) > > +#endif > > + > > +#endif /* _CC_IVN_H_ */ > > Index: hw/amso1100/c2_mq.h > > =================================================================== > > --- hw/amso1100/c2_mq.h (revision 0) > > +++ hw/amso1100/c2_mq.h (revision 0) > > @@ -0,0 +1,104 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > + > > +#ifndef _C2_MQ_H_ > > +#define _C2_MQ_H_ > > +#include > > +#include "c2_wr.h" > > + > > +enum c2_shared_regs { > > + > > + C2_SHARED_ARMED = 0x10, > > + C2_SHARED_NOTIFY = 0x18, > > + C2_SHARED_SHARED = 0x40, > > +}; > > + > > +struct c2_mq_shared { > > + u16 unused1; > > + u8 armed; > > + u8 notification_type; > > + u32 unused2; > > + u16 shared; > > + /* Pad to 64 bytes. */ > > + u8 pad[64-sizeof(u16)-2*sizeof(u8)-sizeof(u32)-sizeof(u16)]; > > +}; > > + > > +enum c2_mq_type { > > + C2_MQ_HOST_TARGET = 1, > > + C2_MQ_ADAPTER_TARGET = 2, > > +}; > > + > > +/* > > + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. 
> > + * c2_user_mq_t (which is the same format) is for user-mode MQs... > > + */ > > +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ > > +struct c2_mq { > > + u32 magic; > > + u8* msg_pool; > > + u16 hint_count; > > + u16 priv; > > + struct c2_mq_shared *peer; > > + u16* shared; > > + u32 q_size; > > + u32 msg_size; > > + u32 index; > > + enum c2_mq_type type; > > +}; > > + > > +#define BUMP(q,p) (p) = ((p)+1) % (q)->q_size > > +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % > (q)->q_size) > > + > > +static __inline__ int > > +c2_mq_empty(struct c2_mq *q) > > +{ > > + return q->priv == be16_to_cpu(*q->shared); > > +} > > + > > +static __inline__ int > > +c2_mq_full(struct c2_mq *q) > > +{ > > + return q->priv == (be16_to_cpu(*q->shared) + q->q_size-1) % > q->q_size; > > +} > > + > > +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); > > +extern void * c2_mq_alloc(struct c2_mq *q); > > +extern void c2_mq_produce(struct c2_mq *q); > > +extern void * c2_mq_consume(struct c2_mq *q); > > +extern void c2_mq_free(struct c2_mq *q); > > +extern u32 c2_mq_count(struct c2_mq *q); > > +extern void c2_mq_init(struct c2_mq *q, u32 index, u32 q_size, > > + u32 msg_size, u8 *pool_start, > > + u16 *peer, u32 type); > > + > > +#endif /* _C2_MQ_H_ */ > > Index: hw/amso1100/c2_user.h > > =================================================================== > > --- hw/amso1100/c2_user.h (revision 0) > > +++ hw/amso1100/c2_user.h (revision 0) > > @@ -0,0 +1,82 @@ > > +/* > > + * Copyright (c) 2005 Topspin Communications. All rights reserved. > > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + * > > + */ > > + > > +#ifndef C2_USER_H > > +#define C2_USER_H > > + > > +#include > > + > > +/* > > + * Make sure that all structs defined in this file remain laid out so > > + * that they pack the same way on 32-bit and 64-bit architectures (to > > + * avoid incompatibility between 32-bit userspace and 64-bit kernels). > > + * In particular do not use pointer types -- pass pointers in __u64 > > + * instead. 
> > + */ > > + > > +struct c2_alloc_ucontext_resp { > > + __u32 qp_tab_size; > > + __u32 uarc_size; > > +}; > > + > > +struct c2_alloc_pd_resp { > > + __u32 pdn; > > + __u32 reserved; > > +}; > > + > > +struct c2_create_cq { > > + __u32 lkey; > > + __u32 pdn; > > + __u64 arm_db_page; > > + __u64 set_db_page; > > + __u32 arm_db_index; > > + __u32 set_db_index; > > +}; > > + > > +struct c2_create_cq_resp { > > + __u32 cqn; > > + __u32 reserved; > > +}; > > + > > +struct c2_create_qp { > > + __u32 lkey; > > + __u32 reserved; > > + __u64 sq_db_page; > > + __u64 rq_db_page; > > + __u32 sq_db_index; > > + __u32 rq_db_index; > > +}; > > + > > +#endif /* C2_USER_H */ > > Index: hw/amso1100/c2_ae.c > > =================================================================== > > --- hw/amso1100/c2_ae.c (revision 0) > > +++ hw/amso1100/c2_ae.c (revision 0) > > @@ -0,0 +1,216 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. 
> > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#include "c2.h" > > +#include > > +#include "cc_status.h" > > +#include "cc_ae.h" > > + > > +static int c2_convert_cm_status(u32 cc_status) > > +{ > > + switch (cc_status) { > > + case CC_CONN_STATUS_SUCCESS: > > + return 0; > > + case CC_CONN_STATUS_REJECTED: > > + return -ENETRESET; > > + case CC_CONN_STATUS_REFUSED: > > + return -ECONNREFUSED; > > + case CC_CONN_STATUS_TIMEDOUT: > > + return -ETIMEDOUT; > > + case CC_CONN_STATUS_NETUNREACH: > > + return -ENETUNREACH; > > + case CC_CONN_STATUS_HOSTUNREACH: > > + return -EHOSTUNREACH; > > + case CC_CONN_STATUS_INVALID_RNIC: > > + return -EINVAL; > > + case CC_CONN_STATUS_INVALID_QP: > > + return -EINVAL; > > + case CC_CONN_STATUS_INVALID_QP_STATE: > > + return -EINVAL; > > + default: > > + panic("Unable to convert CM status: %d\n", cc_status); > > + break; > > + } > > +} > > + > > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + struct c2_mq *mq = c2dev->qptr_array[mq_index]; > > + ccwr_t *wr; > > + void *resource_user_context; > > + struct iw_cm_event cm_event; > > + struct ib_event ib_event; > > + cc_resource_indicator_t resource_indicator; > > + cc_event_id_t event_id; > > + u8 *pdata = NULL; > > + > > + /* > > + * retreive the message > > + */ > > + wr = c2_mq_consume(mq); > > + if (!wr) > > + return; > > + > > + memset(&cm_event, 0, sizeof(cm_event)); > > + > > + event_id = c2_wr_get_id(wr); > > + resource_indicator = 
be32_to_cpu(wr->ae.ae_generic.resource_type); > > + resource_user_context = (void *)(unsigned > long)wr->ae.ae_generic.user_context; > > + > > + cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); > > + > > + switch (resource_indicator) { > > + case CC_RES_IND_QP: { > > + > > + struct c2_qp *qp = (struct c2_qp *)resource_user_context; > > + > > + switch (event_id) { > > + case CCAE_ACTIVE_CONNECT_RESULTS: > > + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; > > + cm_event.local_addr.sin_addr.s_addr = > > + wr->ae.ae_active_connect_results.laddr; > > + cm_event.remote_addr.sin_addr.s_addr = > > + wr->ae.ae_active_connect_results.raddr; > > + cm_event.local_addr.sin_port = > > + wr->ae.ae_active_connect_results.lport; > > + cm_event.remote_addr.sin_port = > > + wr->ae.ae_active_connect_results.rport; > > + cm_event.private_data_len = > > + be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); > > + > > + if (cm_event.private_data_len) { > > + /* XXX */ > > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > > + if (!pdata) { > > + /* Ignore the request, maybe the remote peer > > + * will retry */ > > + dprintk("Ignored connect request -- no memory for pdata" > > + "private_data_len=%d\n", cm_event.private_data_len); > > + goto ignore_it; > > + } > > + > > + memcpy(pdata, > > + wr->ae.ae_active_connect_results.private_data, > > + cm_event.private_data_len); > > + > > + cm_event.private_data = pdata; > > + } > > + if (qp->cm_id->event_handler) > > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > > + > > + break; > > + > > + case CCAE_TERMINATE_MESSAGE_RECEIVED: > > + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > > + ib_event.device = &c2dev->ibdev; > > + ib_event.element.qp = &qp->ibqp; > > + ib_event.event = IB_EVENT_QP_REQ_ERR; > > + > > + if(qp->ibqp.event_handler) > > + (*qp->ibqp.event_handler)(&ib_event, > > + qp->ibqp.qp_context); > > + case CCAE_BAD_CLOSE: > > + case CCAE_LLP_CLOSE_COMPLETE: > > + case CCAE_LLP_CONNECTION_RESET: > > + 
case CCAE_LLP_CONNECTION_LOST: > > + default: > > + cm_event.event = IW_CM_EVENT_CLOSE; > > + if (qp->cm_id->event_handler) > > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > > + > > + } > > + break; > > + } > > + > > + case CC_RES_IND_EP: { > > + > > + struct iw_cm_id* cm_id = (struct iw_cm_id*)resource_user_context; > > + > > + dprintk("CC_RES_IND_EP event_id=%d\n", event_id); > > + if (event_id != CCAE_CONNECTION_REQUEST) { > > + dprintk("%s: Invalid event_id: %d\n", __FUNCTION__, event_id); > > + break; > > + } > > + > > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > > + cm_event.provider_id = > > + wr->ae.ae_connection_request.cr_handle; > > + cm_event.local_addr.sin_addr.s_addr = > > + wr->ae.ae_connection_request.laddr; > > + cm_event.remote_addr.sin_addr.s_addr = > > + wr->ae.ae_connection_request.raddr; > > + cm_event.local_addr.sin_port = > > + wr->ae.ae_connection_request.lport; > > + cm_event.remote_addr.sin_port = > > + wr->ae.ae_connection_request.rport; > > + cm_event.private_data_len = > > + be32_to_cpu(wr->ae.ae_connection_request.private_data_length); > > + > > + if (cm_event.private_data_len) { > > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > > + if (!pdata) { > > + /* Ignore the request, maybe the remote peer > > + * will retry */ > > + dprintk("Ignored connect request -- no memory for pdata" > > + "private_data_len=%d\n", cm_event.private_data_len); > > + goto ignore_it; > > + } > > + memcpy(pdata, > > + wr->ae.ae_connection_request.private_data, > > + cm_event.private_data_len); > > + > > + cm_event.private_data = pdata; > > + } > > + if (cm_id->event_handler) > > + cm_id->event_handler(cm_id, &cm_event); > > + break; > > + } > > + > > + case CC_RES_IND_CQ: { > > + struct c2_cq *cq = (struct c2_cq *)resource_user_context; > > + > > + dprintk("IB_EVENT_CQ_ERR\n"); > > + ib_event.device = &c2dev->ibdev; > > + ib_event.element.cq = &cq->ibcq; > > + ib_event.event = IB_EVENT_CQ_ERR; > > + > > + if (cq->ibcq.event_handler) > 
> + cq->ibcq.event_handler(&ib_event, cq->ibcq.cq_context); > > + } > > + > > + default: > > + break; > > + } > > + > > + ignore_it: > > + c2_mq_free(mq); > > +} > > Index: hw/amso1100/c2.h > > =================================================================== > > --- hw/amso1100/c2.h (revision 0) > > +++ hw/amso1100/c2.h (revision 0) > > @@ -0,0 +1,617 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > + > > +#ifndef __C2_H > > +#define __C2_H > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "c2_provider.h" > > +#include "c2_mq.h" > > +#include "cc_status.h" > > + > > +#define DRV_NAME "c2" > > +#define DRV_VERSION "1.1" > > +#define PFX DRV_NAME ": " > > + > > +#ifdef C2_DEBUG > > +#define assert(expr) \ > > + if(!(expr)) { \ > > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ > > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > > + } > > +#define dprintk(fmt, args...) do {printk(KERN_INFO PFX fmt, ##args);} > while (0) > > +#else > > +#define assert(expr) do {} while (0) > > +#define dprintk(fmt, args...) do {} while (0) > > +#endif /* C2_DEBUG */ > > + > > +#define BAR_0 0 > > +#define BAR_2 2 > > +#define BAR_4 4 > > + > > +#define RX_BUF_SIZE (1536 + 8) > > +#define ETH_JUMBO_MTU 9000 > > +#define C2_MAGIC "CEPHEUS" > > +#define C2_VERSION 4 > > +#define C2_IVN (18 & 0x7fffffff) > > + > > +#define C2_REG0_SIZE (16 * 1024) > > +#define C2_REG2_SIZE (2 * 1024 * 1024) > > +#define C2_REG4_SIZE (256 * 1024 * 1024) > > +#define C2_NUM_TX_DESC 341 > > +#define C2_NUM_RX_DESC 256 > > +#define C2_PCI_REGS_OFFSET (0x10000) > > +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) > > +#define C2_RXP_HRXDQ_SIZE (4096) > > +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) > > +#define C2_TXP_HTXDQ_SIZE (4096) > > +#define C2_TX_TIMEOUT (6*HZ) > > + > > +/* CEPHEUS */ > > +static const u8 c2_magic[] = { > > + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 > > + }; > > + > > +enum adapter_pci_regs { > > + C2_REGS_MAGIC = 0x0000, > > + C2_REGS_VERS = 0x0008, > > + C2_REGS_IVN = 0x000C, > > + C2_REGS_PCI_WINSIZE = 0x0010, > > + C2_REGS_Q0_QSIZE = 0x0014, > > + C2_REGS_Q0_MSGSIZE = 0x0018, > > + C2_REGS_Q0_POOLSTART = 0x001C, > > + C2_REGS_Q0_SHARED = 0x0020, > > + C2_REGS_Q1_QSIZE = 0x0024, > > + C2_REGS_Q1_MSGSIZE = 0x0028, > > + C2_REGS_Q1_SHARED = 0x0030, > > + 
C2_REGS_Q2_QSIZE = 0x0034, > > + C2_REGS_Q2_MSGSIZE = 0x0038, > > + C2_REGS_Q2_SHARED = 0x0040, > > + C2_REGS_ENADDR = 0x004C, > > + C2_REGS_RDMA_ENADDR = 0x0054, > > + C2_REGS_HRX_CUR = 0x006C, > > +}; > > + > > +struct c2_adapter_pci_regs { > > + char reg_magic[8]; > > + u32 version; > > + u32 ivn; > > + u32 pci_window_size; > > + u32 q0_q_size; > > + u32 q0_msg_size; > > + u32 q0_pool_start; > > + u32 q0_shared; > > + u32 q1_q_size; > > + u32 q1_msg_size; > > + u32 q1_pool_start; > > + u32 q1_shared; > > + u32 q2_q_size; > > + u32 q2_msg_size; > > + u32 q2_pool_start; > > + u32 q2_shared; > > + u32 log_start; > > + u32 log_size; > > + u8 host_enaddr[8]; > > + u8 rdma_enaddr[8]; > > + u32 crash_entry; > > + u32 crash_ready[2]; > > + u32 fw_txd_cur; > > + u32 fw_hrxd_cur; > > + u32 fw_rxd_cur; > > +}; > > + > > +enum pci_regs { > > + C2_HISR = 0x0000, > > + C2_DISR = 0x0004, > > + C2_HIMR = 0x0008, > > + C2_DIMR = 0x000C, > > + C2_NISR0 = 0x0010, > > + C2_NISR1 = 0x0014, > > + C2_NIMR0 = 0x0018, > > + C2_NIMR1 = 0x001C, > > + C2_IDIS = 0x0020, > > +}; > > + > > +enum { > > + C2_PCI_HRX_INT = 1<<8, > > + C2_PCI_HTX_INT = 1<<17, > > + C2_PCI_HRX_QUI = 1<<31, > > +}; > > + > > +/* > > + * Cepheus registers in BAR0. 
> > + */ > > +struct c2_pci_regs { > > + u32 hostisr; > > + u32 dmaisr; > > + u32 hostimr; > > + u32 dmaimr; > > + u32 netisr0; > > + u32 netisr1; > > + u32 netimr0; > > + u32 netimr1; > > + u32 int_disable; > > +}; > > + > > +/* TXP flags */ > > +enum c2_txp_flags { > > + TXP_HTXD_DONE = 0, > > + TXP_HTXD_READY = 1<<0, > > + TXP_HTXD_UNINIT = 1<<1, > > +}; > > + > > +/* RXP flags */ > > +enum c2_rxp_flags { > > + RXP_HRXD_UNINIT = 0, > > + RXP_HRXD_READY = 1<<0, > > + RXP_HRXD_DONE = 1<<1, > > +}; > > + > > +/* RXP status */ > > +enum c2_rxp_status { > > + RXP_HRXD_ZERO = 0, > > + RXP_HRXD_OK = 1<<0, > > + RXP_HRXD_BUF_OV = 1<<1, > > +}; > > + > > +/* TXP descriptor fields */ > > +enum txp_desc { > > + C2_TXP_FLAGS = 0x0000, > > + C2_TXP_LEN = 0x0002, > > + C2_TXP_ADDR = 0x0004, > > +}; > > + > > +/* RXP descriptor fields */ > > +enum rxp_desc { > > + C2_RXP_FLAGS = 0x0000, > > + C2_RXP_STATUS = 0x0002, > > + C2_RXP_COUNT = 0x0004, > > + C2_RXP_LEN = 0x0006, > > + C2_RXP_ADDR = 0x0008, > > +}; > > + > > +struct c2_txp_desc { > > + u16 flags; > > + u16 len; > > + u64 addr; > > +} __attribute__ ((packed)); > > + > > +struct c2_rxp_desc { > > + u16 flags; > > + u16 status; > > + u16 count; > > + u16 len; > > + u64 addr; > > +} __attribute__ ((packed)); > > + > > +struct c2_rxp_hdr { > > + u16 flags; > > + u16 status; > > + u16 len; > > + u16 rsvd; > > +} __attribute__ ((packed)); > > + > > +struct c2_tx_desc { > > + u32 len; > > + u32 status; > > + dma_addr_t next_offset; > > +}; > > + > > +struct c2_rx_desc { > > + u32 len; > > + u32 status; > > + dma_addr_t next_offset; > > +}; > > + > > +struct c2_alloc { > > + u32 last; > > + u32 max; > > + spinlock_t lock; > > + unsigned long *table; > > +}; > > + > > +struct c2_array { > > + struct { > > + void **page; > > + int used; > > + } *page_list; > > +}; > > + > > +/* > > + * The MQ shared pointer pool is organized as a linked list of > > + * chunks. 
Each chunk contains a linked list of free shared pointers > > + * that can be allocated to a given user mode client. > > + * > > + */ > > +struct sp_chunk { > > + struct sp_chunk* next; > > + u32 gfp_mask; > > + u16 head; > > + u16 shared_ptr[0]; > > +}; > > + > > +struct c2_pd_table { > > + struct c2_alloc alloc; > > + struct c2_array pd; > > +}; > > + > > +struct c2_qp_table { > > + struct c2_alloc alloc; > > + u32 rdb_base; > > + int rdb_shift; > > + int sqp_start; > > + spinlock_t lock; > > + struct c2_array qp; > > + struct c2_icm_table *qp_table; > > + struct c2_icm_table *eqp_table; > > + struct c2_icm_table *rdb_table; > > +}; > > + > > +struct c2_element { > > + struct c2_element *next; > > + void *ht_desc; /* host descriptor */ > > + void *hw_desc; /* hardware descriptor */ > > + struct sk_buff *skb; > > + dma_addr_t mapaddr; > > + u32 maplen; > > +}; > > + > > +struct c2_ring { > > + struct c2_element *to_clean; > > + struct c2_element *to_use; > > + struct c2_element *start; > > + unsigned long count; > > +}; > > + > > +struct c2_dev { > > + struct ib_device ibdev; > > + void __iomem *regs; > > + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw > rings */ > > + void __iomem *mmio_rxp_ring; > > + spinlock_t lock; > > + struct pci_dev *pcidev; > > + struct net_device *netdev; > > + unsigned int cur_tx; > > + unsigned int cur_rx; > > + u64 fw_ver; > > + u32 adapter_handle; > > + u32 hw_rev; > > + u32 device_cap_flags; > > + u32 vendor_id; > > + u32 vendor_part_id; > > + void __iomem *kva; /* KVA device memory */ > > + void __iomem *pa; /* PA device memory */ > > + void **qptr_array; > > + > > + kmem_cache_t* host_msg_cache; > > + //kmem_cache_t* ae_msg_cache; > > + > > + struct list_head cca_link; /* adapter list */ > > + struct list_head eh_wakeup_list; /* event wakeup list */ > > + wait_queue_head_t req_vq_wo; > > + > > + /* RNIC Limits */ > > + u32 max_mr; > > + u32 max_mr_size; > > + u32 max_qp; > > + u32 max_qp_wr; > > + u32 max_sge; > > 
+ u32 max_cq; > > + u32 max_cqe; > > + u32 max_pd; > > + > > + struct c2_pd_table pd_table; > > + struct c2_qp_table qp_table; > > +#if 0 > > + struct c2_mr_table mr_table; > > +#endif > > + int ports; /* num of GigE ports */ > > + int devnum; > > + spinlock_t vqlock; /* sync vbs req MQ */ > > + > > + /* Verbs Queues */ > > + struct c2_mq req_vq; /* Verbs Request MQ */ > > + struct c2_mq rep_vq; /* Verbs Reply MQ */ > > + struct c2_mq aeq; /* Async Events MQ */ > > + > > + /* Kernel client MQs */ > > + struct sp_chunk* kern_mqsp_pool; > > + > > + /* Device updates these values when posting messages to a host > > + * target queue */ > > + u16 req_vq_shared; > > + u16 rep_vq_shared; > > + u16 aeq_shared; > > + u16 irq_claimed; > > + > > + /* > > + * Shared host target pages for user-accessible MQs. > > + */ > > + int hthead; /* index of first free entry */ > > + void* htpages; /* kernel vaddr */ > > + int htlen; /* length of htpages memory */ > > + void* htuva; /* user mapped vaddr */ > > + spinlock_t htlock; /* serialize allocation */ > > + > > + u64 adapter_hint_uva; /* access to the activity FIFO */ > > + > > + spinlock_t aeq_lock; > > + spinlock_t rnic_lock; > > + > > + > > + u16 hint_count; > > + u16 hints_read; > > + > > + int init; /* TRUE if it's ready */ > > + char ae_cache_name[16]; > > + char vq_cache_name[16]; > > +}; > > + > > +struct c2_port { > > + u32 msg_enable; > > + struct c2_dev *c2dev; > > + struct net_device *netdev; > > + > > + spinlock_t tx_lock; > > + u32 tx_avail; > > + struct c2_ring tx_ring; > > + struct c2_ring rx_ring; > > + > > + void *mem; /* PCI memory for host rings */ > > + dma_addr_t dma; > > + unsigned long mem_size; > > + > > + u32 rx_buf_size; > > + > > + struct net_device_stats netstats; > > +}; > > + > > +/* > > + * Activity FIFO registers in BAR0. > > + */ > > +#define PCI_BAR0_HOST_HINT 0x100 > > +#define PCI_BAR0_ADAPTER_HINT 0x2000 > > + > > +/* > > + * Ammasso PCI vendor id and Cepheus PCI device id. 
> > + */ > > +#define CQ_ARMED 0x01 > > +#define CQ_WAIT_FOR_DMA 0x80 > > + > > +/* > > + * The format of a hint is as follows: > > + * Lower 16 bits are the count of hints for the queue. > > + * Next 15 bits are the qp_index > > + * Upper most bit depends on who reads it: > > + * If read by producer, then it means Full (1) or Not-Full (0) > > + * If read by consumer, then it means Empty (1) or Not-Empty (0) > > + */ > > +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | > hint_count) > > +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) > > +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) > > + > > + > > +/* > > + * The following defines the offset in SDRAM for the > cc_adapter_pci_regs_t > > + * struct. > > + */ > > +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 > > + > > +#ifndef readq > > +static inline u64 readq(const void __iomem *addr) > > +{ > > + u64 ret = readl(addr + 4); > > + ret <<= 32; > > + ret |= readl(addr); > > + > > + return ret; > > +} > > +#endif > > + > > +#ifndef writeq > > +static inline void writeq(u64 val, void __iomem *addr) > > +{ > > + writel((u32) (val), addr); > > + writel((u32) (val >> 32), (addr + 4)); > > +} > > +#endif > > + > > +/* Read from memory-mapped device */ > > +static inline u64 c2_read64(const void __iomem *addr) > > +{ > > + return readq(addr); > > +} > > + > > +static inline u32 c2_read32(const void __iomem *addr) > > +{ > > + return readl(addr); > > +} > > + > > +static inline u16 c2_read16(const void __iomem *addr) > > +{ > > + return readw(addr); > > +} > > + > > +static inline u8 c2_read8(const void __iomem *addr) > > +{ > > + return readb(addr); > > +} > > + > > +/* Write to memory-mapped device */ > > +static inline void c2_write64(void __iomem *addr, u64 val) > > +{ > > + writeq(val, addr); > > +} > > + > > +static inline void c2_write32(void __iomem *addr, u32 val) > > +{ > > + writel(val, addr); > > +} > > + > > +static inline void c2_write16(void __iomem *addr, u16 val) > > +{ > 
> + writew(val, addr); > > +} > > + > > +static inline void c2_write8(void __iomem *addr, u8 val) > > +{ > > + writeb(val, addr); > > +} > > + > > +#define C2_SET_CUR_RX(c2dev, cur_rx) \ > > + c2_write32(c2dev->mmio_txp_ring + 4092, cpu_to_be32(cur_rx)) > > + > > +#define C2_GET_CUR_RX(c2dev) \ > > + be32_to_cpu(c2_read32(c2dev->mmio_txp_ring + 4092)) > > + > > +static inline struct c2_dev *to_c2dev(struct ib_device* ibdev) > > +{ > > + return container_of(ibdev, struct c2_dev, ibdev); > > +} > > + > > +static inline int c2_errno(void *reply) > > +{ > > + switch(c2_wr_get_result(reply)) { > > + case CC_OK: > > + return 0; > > + case CCERR_NO_BUFS: > > + case CCERR_INSUFFICIENT_RESOURCES: > > + case CCERR_ZERO_RDMA_READ_RESOURCES: > > + return -ENOMEM; > > + case CCERR_MR_IN_USE: > > + case CCERR_QP_IN_USE: > > + return -EBUSY; > > + case CCERR_ADDR_IN_USE: > > + return -EADDRINUSE; > > + case CCERR_ADDR_NOT_AVAIL: > > + return -EADDRNOTAVAIL; > > + case CCERR_CONN_RESET: > > + return -ECONNRESET; > > + case CCERR_NOT_IMPLEMENTED: > > + case CCERR_INVALID_WQE: > > + return -ENOSYS; > > + case CCERR_QP_NOT_PRIVILEGED: > > + return -EPERM; > > + case CCERR_STACK_ERROR: > > + return -EPROTO; > > + case CCERR_ACCESS_VIOLATION: > > + case CCERR_BASE_AND_BOUNDS_VIOLATION: > > + return -EFAULT; > > + case CCERR_STAG_STATE_NOT_INVALID: > > + case CCERR_INVALID_ADDRESS: > > + case CCERR_INVALID_CQ: > > + case CCERR_INVALID_EP: > > + case CCERR_INVALID_MODIFIER: > > + case CCERR_INVALID_MTU: > > + case CCERR_INVALID_PD_ID: > > + case CCERR_INVALID_QP: > > + case CCERR_INVALID_RNIC: > > + case CCERR_INVALID_STAG: > > + return -EINVAL; > > + default: > > + return -EAGAIN; > > + } > > +} > > + > > +/* Device */ > > +extern int c2_register_device(struct c2_dev *c2dev); > > +extern void c2_unregister_device(struct c2_dev *c2dev); > > +extern int c2_rnic_init(struct c2_dev* c2dev); > > +extern void c2_rnic_term(struct c2_dev* c2dev); > > + > > +/* QPs */ > > +extern int 
c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, > > + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); > > +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); > > +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, > > + struct ib_qp_attr *attr, int attr_mask); > > +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, > > + struct ib_send_wr **bad_wr); > > +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr > *ib_wr, > > + struct ib_recv_wr **bad_wr); > > +extern int __devinit c2_init_qp_table(struct c2_dev *c2dev); > > +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); > > + > > +/* PDs */ > > +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct > c2_pd *pd); > > +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); > > +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); > > +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); > > + > > +/* CQs */ > > +extern int c2_init_cq(struct c2_dev *c2dev, int entries, struct > c2_ucontext *ctx, > > + struct c2_cq *cq); > > +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); > > +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); > > +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 > mq_index); > > +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc > *entry); > > +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); > > + > > +/* CM */ > > +extern int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 > pdata_len); > > +extern int c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 > pdata_len); > > +extern int c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 > pdata_len); > > +extern int c2_llp_service_create(struct iw_cm_id* cm_id, int backlog); > > +extern int c2_llp_service_destroy(struct iw_cm_id* cm_id); > > + > > +/* MM */ > > +extern int c2_nsmr_register_phys_kern(struct c2_dev 
*c2dev, u64 > **addr_list, > > + int pbl_depth, u32 length, u64 *va, > > + cc_acf_t acf, struct c2_mr *mr); > > +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); > > + > > +/* AE */ > > +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); > > + > > +/* Allocators */ > > +extern u32 c2_alloc(struct c2_alloc *alloc); > > +extern void c2_free(struct c2_alloc *alloc, u32 obj); > > +extern int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 > reserved); > > +extern void c2_alloc_cleanup(struct c2_alloc *alloc); > > +extern int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** > root); > > +extern void c2_free_mqsp_pool(struct sp_chunk* root); > > +extern u16* c2_alloc_mqsp(struct sp_chunk* head); > > +extern void c2_free_mqsp(u16* mqsp); > > +extern int c2_array_init(struct c2_array *array, int nent); > > +extern void c2_array_clear(struct c2_array *array, int index); > > +extern int c2_array_set(struct c2_array *array, int index, void > *value); > > +extern void *c2_array_get(struct c2_array *array, int index); > > + > > +#endif > > + > > Index: hw/amso1100/c2_vq.c > > =================================================================== > > --- hw/amso1100/c2_vq.c (revision 0) > > +++ hw/amso1100/c2_vq.c (revision 0) > > @@ -0,0 +1,272 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#include > > +#include > > + > > +#include "c2_vq.h" > > + > > +/* > > + * Verbs Request Objects: > > + * > > + * VQ Request Objects are allocated by the kernel verbs handlers. > > + * They contain a wait object, a refcnt, an atomic bool indicating that > the > > + * adapter has replied, and a copy of the verb reply work request. > > + * A pointer to the VQ Request Object is passed down in the context > > + * field of the work request message, and reflected back by the adapter > > + * in the verbs reply message. 
The function handle_vq() in the > interrupt > > + * path will use this pointer to: > > + * 1) append a copy of the verbs reply message > > + * 2) mark that the reply is ready > > + * 3) wake up the kernel verbs handler blocked awaiting the reply. > > + * > > + * > > + * The kernel verbs handlers do a "get" to put a 2nd reference on the > > + * VQ Request object. If the kernel verbs handler exits before the > adapter > > + * can respond, this extra reference will keep the VQ Request object > around > > + * until the adapter's reply can be processed. The reason we need this > is > > + * because a pointer to this object is stuffed into the context field > of > > + * the verbs work request message, and reflected back in the reply > message. > > + * It is used in the interrupt handler (handle_vq()) to wake up the > appropriate > > + * kernel verb handler that is blocked awaiting the verb reply. > > + * So handle_vq() will do a "put" on the object when it's done > accessing it. > > + * NOTE: If we guarantee that the kernel verb handler will never bail > before > > + * getting the reply, then we don't need these refcnts. > > + * > > + * > > + * VQ Request objects are freed by the kernel verbs handlers only > > + * after the verb has been processed, or when the adapter fails and > > + * does not reply. > > + * > > + * > > + * Verbs Reply Buffers: > > + * > > + * VQ Reply bufs are local host memory copies of an outstanding Verb > Request reply > > + * message. They are always allocated by the kernel verbs handlers, and > _may_ be > > + * freed by either the kernel verbs handler -or- the interrupt handler. > The > > + * kernel verbs handler _must_ free the repbuf, then free the vq > request object > > + * in that order.
> > + */ > > + > > +int > > +vq_init(struct c2_dev* c2dev) > > +{ > > + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", (char ) ('0' + > c2dev->devnum)); > > + c2dev->host_msg_cache = kmem_cache_create(c2dev->vq_cache_name, > > + c2dev->rep_vq.msg_size, 0, > > + SLAB_HWCACHE_ALIGN, NULL, NULL); > > + if (c2dev->host_msg_cache == NULL) { > > + return -ENOMEM; > > + } > > + return 0; > > +} > > + > > +void > > +vq_term(struct c2_dev* c2dev) > > +{ > > + kmem_cache_destroy(c2dev->host_msg_cache); > > +} > > + > > +/* vq_req_alloc - allocate a VQ Request Object and initialize it. > > + * The refcnt is set to 1. > > + */ > > +struct c2_vq_req * > > +vq_req_alloc(struct c2_dev *c2dev) > > +{ > > + struct c2_vq_req *r; > > + > > + r = (struct c2_vq_req *)kmalloc(sizeof(struct c2_vq_req), > GFP_KERNEL); > > + if (r) { > > + init_waitqueue_head(&r->wait_object); > > + r->reply_msg = (u64)NULL; > > + atomic_set(&r->refcnt, 1); > > + atomic_set(&r->reply_ready, 0); > > + } > > + return r; > > +} > > + > > + > > +/* vq_req_free - free the VQ Request Object. It is assumed the verbs > handler > > + * has already freed the VQ Reply Buffer if it existed. > > + */ > > +void > > +vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) > > +{ > > + r->reply_msg = (u64)NULL; > > + if (atomic_dec_and_test(&r->refcnt)) { > > + kfree(r); > > + } > > +} > > + > > +/* vq_req_get - reference a VQ Request Object. Done > > + * only in the kernel verbs handlers. > > + */ > > +void > > +vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) > > +{ > > + atomic_inc(&r->refcnt); > > +} > > + > > + > > +/* vq_req_put - dereference and potentially free a VQ Request Object. > > + * > > + * This is only called by handle_vq() on the interrupt when it is done > processing > > + * a verb reply message. If the associated kernel verbs handler has > already bailed, > > + * then this put will actually free the VQ Request object _and_ the VQ > Reply Buffer > > + * if it exists.
> > + */ > > +void > > +vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) > > +{ > > + if (atomic_dec_and_test(&r->refcnt)) { > > + if (r->reply_msg != (u64)NULL) > > + vq_repbuf_free(c2dev, (void *)(unsigned long)r->reply_msg); > > + kfree(r); > > + } > > +} > > + > > + > > +/* > > + * vq_repbuf_alloc - allocate a VQ Reply Buffer. > > + */ > > +void * > > +vq_repbuf_alloc(struct c2_dev *c2dev) > > +{ > > + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); > > +} > > + > > +/* > > + * vq_send_wr - post a verbs request message to the Verbs Request > Queue. > > + * If a message is not available in the MQ, then block until one is > available. > > + * NOTE: handle_mq() on the interrupt context will wake up threads > blocked here. > > + * When the adapter drains the Verbs Request Queue, it inserts MQ index > 0 into the > > + * adapter->host activity fifo and interrupts the host. > > + */ > > +int > > +vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr) > > +{ > > + void *msg; > > + wait_queue_t __wait; > > + > > + /* > > + * grab adapter vq lock > > + */ > > + spin_lock(&c2dev->vqlock); > > + > > + /* > > + * allocate msg > > + */ > > + msg = c2_mq_alloc(&c2dev->req_vq); > > + > > + /* > > + * If we cannot get a msg, then we'll wait > > + * When messages are available, the int handler will wake_up() > > + * any waiters. > > + */ > > + while (msg == NULL) { > > + init_waitqueue_entry(&__wait, current); > > + add_wait_queue(&c2dev->req_vq_wo, &__wait); > > + spin_unlock(&c2dev->vqlock); > > + for (;;) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + if (!c2_mq_full(&c2dev->req_vq)) { > > + break; > > + } > > + if (!signal_pending(current)) { > > + schedule_timeout(1*HZ); /* 1 second...
*/ > > + continue; > > + } > > + set_current_state(TASK_RUNNING); > > + remove_wait_queue(&c2dev->req_vq_wo, &__wait); > > + return -EINTR; > > + } > > + set_current_state(TASK_RUNNING); > > + remove_wait_queue(&c2dev->req_vq_wo, &__wait); > > + spin_lock(&c2dev->vqlock); > > + msg = c2_mq_alloc(&c2dev->req_vq); > > + } > > + > > + /* > > + * copy wr into adapter msg > > + */ > > + memcpy(msg, wr, c2dev->req_vq.msg_size); > > + > > + /* > > + * post msg > > + */ > > + c2_mq_produce(&c2dev->req_vq); > > + > > + /* > > + * release adapter vq lock > > + */ > > + spin_unlock(&c2dev->vqlock); > > + return 0; > > +} > > + > > + > > +/* > > + * vq_wait_for_reply - block until the adapter posts a Verb Reply > Message. > > + */ > > +int > > +vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) > > +{ > > + wait_queue_t __wait; > > + int rc = 0; > > + > > + /* > > + * Add this request to the wait queue. > > + */ > > + init_waitqueue_entry(&__wait, current); > > + add_wait_queue(&req->wait_object, &__wait); > > + for (;;) { > > + set_current_state(TASK_UNINTERRUPTIBLE); > > + if (atomic_read(&req->reply_ready)) { > > + break; > > + } > > + if (schedule_timeout(60*HZ) == 0) { > > + rc = -ETIMEDOUT; > > + break; > > + } > > + } > > + set_current_state(TASK_RUNNING); > > + remove_wait_queue(&req->wait_object, &__wait); > > + return rc; > > +} > > + > > +/* > > + * vq_repbuf_free - Free a Verbs Reply Buffer. > > + */ > > +void > > +vq_repbuf_free(struct c2_dev *c2dev, void *reply) > > +{ > > + kmem_cache_free(c2dev->host_msg_cache, reply); > > +} > > Index: hw/amso1100/README > > =================================================================== > > --- hw/amso1100/README (revision 0) > > +++ hw/amso1100/README (revision 0) > > @@ -0,0 +1,11 @@ > > + > > +This is the OpenIB iWARP driver for the AMSO1100 HCA from > > +Open Grid Computing. The adapter is a 1Gb RDMA capable PCI-X RNIC. > > + > > +The driver implements an iWARP CM Provider and OpenIB verbs > > +provider. 
The company that created the device (Ammasso, Inc.) > > +is no longer in business, however, limited quantities of the cards > > +are available for development purposes from Open Grid Computing. > > + > > +Please contact 512-343-9196 x 108 or e-mail tom at opengridcomputing.com > > +for more information. > > Index: hw/amso1100/c2_provider.c > > =================================================================== > > --- hw/amso1100/c2_provider.c (revision 0) > > +++ hw/amso1100/c2_provider.c (revision 0) > > @@ -0,0 +1,704 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + * > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include > > +#include > > +#include > > + > > +#include > > +#include "c2.h" > > +#include "c2_provider.h" > > +#include "c2_user.h" > > + > > +static int c2_query_device(struct ib_device *ibdev, > > + struct ib_device_attr *props) > > +{ > > + struct c2_dev* c2dev = to_c2dev(ibdev); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + memset(props, 0, sizeof *props); > > + > > + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); > > + memcpy(&props->node_guid, c2dev->netdev->dev_addr, 6); > > + > > + props->fw_ver = c2dev->fw_ver; > > + props->device_cap_flags = c2dev->device_cap_flags; > > + props->vendor_id = c2dev->vendor_id; > > + props->vendor_part_id = c2dev->vendor_part_id; > > + props->hw_ver = c2dev->hw_rev; > > + props->max_mr_size = ~0ull; > > + props->max_qp = c2dev->max_qp; > > + props->max_qp_wr = c2dev->max_qp_wr; > > + props->max_sge = c2dev->max_sge; > > + props->max_cq = c2dev->max_cq; > > + props->max_cqe = c2dev->max_cqe; > > + props->max_mr = c2dev->max_mr; > > + props->max_pd = c2dev->max_pd; > > + props->max_qp_rd_atom = 0; > > + props->max_qp_init_rd_atom = 0; > > + props->local_ca_ack_delay = 0; > > + > > + return 0; > > +} > > + > > +static int c2_query_port(struct ib_device *ibdev, > > + u8 port, struct ib_port_attr *props) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + props->max_mtu = IB_MTU_4096; > > + props->lid = 0; > > + props->lmc = 
0; > > + props->sm_lid = 0; > > + props->sm_sl = 0; > > + props->state = IB_PORT_ACTIVE; > > + props->phys_state = 0; > > + props->port_cap_flags = > > + IB_PORT_CM_SUP | > > + IB_PORT_SNMP_TUNNEL_SUP | > > + IB_PORT_REINIT_SUP | > > + IB_PORT_DEVICE_MGMT_SUP | > > + IB_PORT_VENDOR_CLASS_SUP| > > + IB_PORT_BOOT_MGMT_SUP; > > + props->gid_tbl_len = 128; > > + props->pkey_tbl_len = 1; > > + props->qkey_viol_cntr = 0; > > + props->active_width = 1; > > + props->active_speed = 1; > > + > > + return 0; > > +} > > + > > +static int c2_modify_port(struct ib_device *ibdev, > > + u8 port, int port_modify_mask, > > + struct ib_port_modify *props) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return 0; > > +} > > + > > +static int c2_query_pkey(struct ib_device *ibdev, > > + u8 port, u16 index, u16 *pkey) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + *pkey = 0; > > + return 0; > > +} > > + > > +static int c2_query_gid(struct ib_device *ibdev, u8 port, > > + int index, union ib_gid *gid) > > +{ > > + struct c2_dev* c2dev = to_c2dev(ibdev); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + memcpy(&(gid->raw[0]),c2dev->netdev->dev_addr, MAX_ADDR_LEN); > > + > > + return 0; > > +} > > + > > +/* Allocate the user context data structure. This keeps track > > + * of all objects associated with a particular user-mode client. 
> > + */ > > +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, > > + struct ib_udata *udata) > > +{ > > + struct c2_alloc_ucontext_resp uresp; > > + struct c2_ucontext *context; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + memset(&uresp, 0, sizeof uresp); > > + > > + uresp.qp_tab_size = to_c2dev(ibdev)->max_qp; > > + > > + context = kmalloc(sizeof *context, GFP_KERNEL); > > + if (!context) > > + return ERR_PTR(-ENOMEM); > > + > > + /* The OpenIB user context is logically similar to the RNIC > > + * Instance of our existing driver > > + */ > > + /* context->rnic_p = rnic_open */ > > + > > + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { > > + kfree(context); > > + return ERR_PTR(-EFAULT); > > + } > > + > > + return &context->ibucontext; > > +} > > + > > +static int c2_dealloc_ucontext(struct ib_ucontext *context) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static int c2_mmap_uar(struct ib_ucontext *context, > > + struct vm_area_struct *vma) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, > > + struct ib_ucontext *context, > > + struct ib_udata *udata) > > +{ > > + struct c2_pd* pd; > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + pd = kmalloc(sizeof *pd, GFP_KERNEL); > > + if (!pd) > > + return ERR_PTR(-ENOMEM); > > + > > + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); > > + if (err) { > > + kfree(pd); > > + return ERR_PTR(err); > > + } > > + > > + if (context) { > > + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof (__u32))) { > > + c2_pd_free(to_c2dev(ibdev), pd); > > + kfree(pd); > > + return ERR_PTR(-EFAULT); > > + } > > + } > > + > > + return &pd->ibpd; > > +} > > + > > +static int c2_dealloc_pd(struct ib_pd *pd) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, 
__LINE__); > > + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); > > + kfree(pd); > > + > > + return 0; > > +} > > + > > +static struct ib_ah *c2_ah_create(struct ib_pd *pd, > > + struct ib_ah_attr *ah_attr) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return ERR_PTR(-ENOSYS); > > +} > > + > > +static int c2_ah_destroy(struct ib_ah *ah) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static struct ib_qp *c2_create_qp(struct ib_pd *pd, > > + struct ib_qp_init_attr *init_attr, > > + struct ib_udata *udata) > > +{ > > + struct c2_qp *qp; > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + switch(init_attr->qp_type) { > > + case IB_QPT_RC: > > + qp = kmalloc(sizeof(*qp), GFP_KERNEL); > > + if (!qp) { > > + dprintk("%s: Unable to allocate QP\n", __FUNCTION__); > > + return ERR_PTR(-ENOMEM); > > + } > > + > > + if (pd->uobject) { > > + /* XXX userspace specific */ > > + } > > + > > + err = c2_alloc_qp(to_c2dev(pd->device), > > + to_c2pd(pd), > > + init_attr, > > + qp); > > + if (err && pd->uobject) { > > + /* XXX userspace specific */ > > + } > > + > > + break; > > + default: > > + dprintk("%s: Invalid QP type: %d\n", __FUNCTION__, > init_attr->qp_type); > > + return ERR_PTR(-EINVAL); > > + break; > > + } > > + > > + if (err) { > > + kfree(pd); > > + return ERR_PTR(err); > > + } > > + > > + return &qp->ibqp; > > +} > > + > > +static int c2_destroy_qp(struct ib_qp *ib_qp) > > +{ > > + struct c2_qp *qp = to_c2qp(ib_qp); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + c2_free_qp(to_c2dev(ib_qp->device), qp); > > + kfree(qp); > > + > > + return 0; > > +} > > + > > +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, > > + struct ib_ucontext *context, > > + struct ib_udata *udata) > > +{ > > + struct c2_cq *cq; > > + int err; > > + > > + cq = kmalloc(sizeof(*cq), GFP_KERNEL); > > + if (!cq) 
{ > > + dprintk("%s: Unable to allocate CQ\n", __FUNCTION__); > > + return ERR_PTR(-ENOMEM); > > + } > > + > > + err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); > > + if (err) { > > + dprintk("%s: error initializing CQ\n", __FUNCTION__); > > + kfree(cq); > > + return ERR_PTR(err); > > + } > > + > > + return &cq->ibcq; > > +} > > + > > +static int c2_destroy_cq(struct ib_cq *ib_cq) > > +{ > > + struct c2_cq *cq = to_c2cq(ib_cq); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + c2_free_cq(to_c2dev(ib_cq->device), cq); > > + kfree(cq); > > + > > + return 0; > > +} > > + > > +static inline u32 c2_convert_access(int acc) > > +{ > > + return (acc & IB_ACCESS_REMOTE_WRITE ? CC_ACF_REMOTE_WRITE : 0) | > > + (acc & IB_ACCESS_REMOTE_READ ? CC_ACF_REMOTE_READ : 0) | > > + (acc & IB_ACCESS_LOCAL_WRITE ? CC_ACF_LOCAL_WRITE : 0) | > > + CC_ACF_LOCAL_READ | CC_ACF_WINDOW_BIND; > > +} > > + > > +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, > > + struct ib_phys_buf *buffer_list, > > + int num_phys_buf, > > + int acc, > > + u64 *iova_start) > > +{ > > + struct c2_mr *mr; > > + u64 **page_list; > > + u32 total_len; > > + int err, i, j, k, pbl_depth; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + pbl_depth = 0; > > + total_len = 0; > > + > > + for (i = 0; i < num_phys_buf; i++) { > > + > > + int size; > > + > > + if (buffer_list[i].addr & ~PAGE_MASK) { > > + dprintk("Unaligned Memory Buffer: 0x%x\n", > > + (unsigned int)buffer_list[i].addr); > > + return ERR_PTR(-EINVAL); > > + } > > + > > + if (!buffer_list[i].size) { > > + dprintk("Invalid Buffer Size\n"); > > + return ERR_PTR(-EINVAL); > > + } > > + > > + size = buffer_list[i].size; > > + total_len += size; > > + while (size) { > > + pbl_depth++; > > + size -= PAGE_SIZE; > > + } > > + } > > + > > + page_list = kmalloc(sizeof(u64 *) * pbl_depth, GFP_KERNEL); > > + if (!page_list) > > + return ERR_PTR(-ENOMEM); > > + > > + for (i = 0, j = 0; i < 
num_phys_buf; i++) { > > + > > + int naddrs; > > + > > + naddrs = (u32)buffer_list[i].size % ~PAGE_MASK; > > + for (k = 0; k < naddrs; k++) > > + page_list[j++] = > > + (u64 *)(unsigned long)(buffer_list[i].addr + (k << > PAGE_SHIFT)); > > + } > > + > > + mr = kmalloc(sizeof(*mr), GFP_KERNEL); > > + if (!mr) > > + return ERR_PTR(-ENOMEM); > > + > > + mr->pd = to_c2pd(ib_pd); > > + > > + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, > > + pbl_depth, total_len, iova_start, > > + c2_convert_access(acc), mr); > > + kfree(page_list); > > + if (err) { > > + kfree(mr); > > + return ERR_PTR(err); > > + } > > + > > + return &mr->ibmr; > > +} > > + > > +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) > > +{ > > + struct ib_phys_buf bl; > > + u64 kva; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + bl.size = 4096; > > + kva = (u64)(unsigned long)kmalloc(bl.size, GFP_KERNEL); > > + if (!kva) > > + return ERR_PTR(-ENOMEM); > > + > > + bl.addr = __pa(kva); > > + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); > > +} > > + > > +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem > *region, > > + int acc, struct ib_udata *udata) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return ERR_PTR(-ENOSYS); > > +} > > + > > +static int c2_dereg_mr(struct ib_mr *ib_mr) > > +{ > > + struct c2_mr *mr = to_c2mr(ib_mr); > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); > > + if (err) > > + dprintk("c2_stag_dealloc failed: %d\n", err); > > + else > > + kfree(mr); > > + > > + return err; > > +} > > + > > +static ssize_t show_rev(struct class_device *cdev, char *buf) > > +{ > > + struct c2_dev *dev = container_of(cdev, struct c2_dev, > ibdev.class_dev); > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return sprintf(buf, "%x\n", dev->hw_rev); > > +} > > + > > 
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf) > > +{ > > + struct c2_dev *dev = container_of(cdev, struct c2_dev, > ibdev.class_dev); > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return sprintf(buf, "%x.%x.%x\n", > > + (int)(dev->fw_ver >> 32), > > + (int)(dev->fw_ver >> 16) & 0xffff, > > + (int)(dev->fw_ver & 0xffff)); > > +} > > + > > +static ssize_t show_hca(struct class_device *cdev, char *buf) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return sprintf(buf, "AMSO1100\n"); > > +} > > + > > +static ssize_t show_board(struct class_device *cdev, char *buf) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); > > +} > > + > > +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); > > +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); > > +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); > > +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); > > + > > +static struct class_device_attribute *c2_class_attributes[] = { > > + &class_device_attr_hw_rev, > > + &class_device_attr_fw_ver, > > + &class_device_attr_hca_type, > > + &class_device_attr_board_id > > +}; > > + > > +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, > int attr_mask) > > +{ > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + err = c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, > attr_mask); > > + > > + return err; > > +} > > + > > +static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, > u16 lid) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, > u16 lid) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static int 
c2_process_mad(struct ib_device *ibdev, > > + int mad_flags, > > + u8 port_num, > > + struct ib_wc *in_wc, > > + struct ib_grh *in_grh, > > + struct ib_mad *in_mad, > > + struct ib_mad *out_mad) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return -ENOSYS; > > +} > > + > > +static int c2_connect(struct iw_cm_id* cm_id, > > + const void* pdata, u8 pdata_len) > > +{ > > + int err; > > + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + if (cm_id->qp == NULL) > > + return -EINVAL; > > + > > + /* Cache the cm_id in the qp */ > > + qp->cm_id = cm_id; > > + > > + err = c2_llp_connect(cm_id, pdata, pdata_len); > > + > > + return err; > > +} > > + > > +static int c2_disconnect(struct iw_cm_id* cm_id, int abrupt) > > +{ > > + struct ib_qp_attr attr; > > + struct ib_qp *ib_qp = cm_id->qp; > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + if (ib_qp == 0) > > + /* If this is a listening endpoint, there is no QP */ > > + return 0; > > + > > + memset(&attr, 0, sizeof(struct ib_qp_attr)); > > + if (abrupt) > > + attr.qp_state = IB_QPS_ERR; > > + else > > + attr.qp_state = IB_QPS_SQD; > > + > > + err = c2_modify_qp(ib_qp, &attr, IB_QP_STATE); > > + return err; > > +} > > + > > +static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 > pdata_len) > > +{ > > + int err; > > + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + /* Cache the cm_id in the qp */ > > + qp->cm_id = cm_id; > > + > > + err = c2_llp_accept(cm_id, pdata, pdata_len); > > + > > + return err; > > +} > > + > > +static int c2_reject(struct iw_cm_id* cm_id, const void* pdata, u8 > pdata_len) > > +{ > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + err = c2_llp_reject(cm_id, pdata, pdata_len); > > + 
return err; > > +} > > + > > +static int c2_getpeername(struct iw_cm_id* cm_id, > > + struct sockaddr_in* local_addr, > > + struct sockaddr_in* remote_addr ) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + *local_addr = cm_id->local_addr; > > + *remote_addr = cm_id->remote_addr; > > + return 0; > > +} > > + > > +static int c2_service_create(struct iw_cm_id* cm_id, int backlog) > > +{ > > + int err; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + err = c2_llp_service_create(cm_id, backlog); > > + return err; > > +} > > + > > +static int c2_service_destroy(struct iw_cm_id* cm_id) > > +{ > > + int err; > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + > > + err = c2_llp_service_destroy(cm_id); > > + > > + return err; > > +} > > + > > +int c2_register_device(struct c2_dev *dev) > > +{ > > + int ret; > > + int i; > > + > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); > > + dev->ibdev.owner = THIS_MODULE; > > + > > + dev->ibdev.node_type = IB_NODE_RNIC; > > + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); > > + memcpy(&dev->ibdev.node_guid, dev->netdev->dev_addr, 6); > > + dev->ibdev.phys_port_cnt = 1; > > + dev->ibdev.dma_device = &dev->pcidev->dev; > > + dev->ibdev.class_dev.dev = &dev->pcidev->dev; > > + dev->ibdev.query_device = c2_query_device; > > + dev->ibdev.query_port = c2_query_port; > > + dev->ibdev.modify_port = c2_modify_port; > > + dev->ibdev.query_pkey = c2_query_pkey; > > + dev->ibdev.query_gid = c2_query_gid; > > + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; > > + dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; > > + dev->ibdev.mmap = c2_mmap_uar; > > + dev->ibdev.alloc_pd = c2_alloc_pd; > > + dev->ibdev.dealloc_pd = c2_dealloc_pd; > > + dev->ibdev.create_ah = c2_ah_create; > > + dev->ibdev.destroy_ah = c2_ah_destroy; > > + dev->ibdev.create_qp = c2_create_qp; > > + 
dev->ibdev.modify_qp = c2_modify_qp; > > + dev->ibdev.destroy_qp = c2_destroy_qp; > > + dev->ibdev.create_cq = c2_create_cq; > > + dev->ibdev.destroy_cq = c2_destroy_cq; > > + dev->ibdev.poll_cq = c2_poll_cq; > > + dev->ibdev.get_dma_mr = c2_get_dma_mr; > > + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; > > + dev->ibdev.reg_user_mr = c2_reg_user_mr; > > + dev->ibdev.dereg_mr = c2_dereg_mr; > > + > > + dev->ibdev.alloc_fmr = 0; > > + dev->ibdev.unmap_fmr = 0; > > + dev->ibdev.dealloc_fmr = 0; > > + dev->ibdev.map_phys_fmr = 0; > > + > > + dev->ibdev.attach_mcast = c2_multicast_attach; > > + dev->ibdev.detach_mcast = c2_multicast_detach; > > + dev->ibdev.process_mad = c2_process_mad; > > + > > + dev->ibdev.req_notify_cq = c2_arm_cq; > > + dev->ibdev.post_send = c2_post_send; > > + dev->ibdev.post_recv = c2_post_receive; > > + > > + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), > GFP_KERNEL); > > + dev->ibdev.iwcm->connect = c2_connect; > > + dev->ibdev.iwcm->disconnect = c2_disconnect; > > + dev->ibdev.iwcm->accept = c2_accept; > > + dev->ibdev.iwcm->reject = c2_reject; > > + dev->ibdev.iwcm->getpeername = c2_getpeername; > > + dev->ibdev.iwcm->create_listen = c2_service_create; > > + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; > > + > > + ret = ib_register_device(&dev->ibdev); > > + if (ret) > > + return ret; > > + > > + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { > > + ret = class_device_create_file(&dev->ibdev.class_dev, > > + c2_class_attributes[i]); > > + if (ret) { > > + ib_unregister_device(&dev->ibdev); > > + return ret; > > + } > > + } > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + return 0; > > +} > > + > > +void c2_unregister_device(struct c2_dev *dev) > > +{ > > + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + ib_unregister_device(&dev->ibdev); > > +} > > Index: hw/amso1100/c2_alloc.c > > =================================================================== > > --- hw/amso1100/c2_alloc.c 
(revision 0) > > +++ hw/amso1100/c2_alloc.c (revision 0) > > @@ -0,0 +1,255 @@ > > +/* > > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > + > > +#include > > +#include > > +#include > > + > > +#include "c2.h" > > + > > +/* Trivial bitmap-based allocator */ > > +u32 c2_alloc(struct c2_alloc *alloc) > > +{ > > + u32 obj; > > + > > + spin_lock(&alloc->lock); > > + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); > > + if (obj < alloc->max) { > > + set_bit(obj, alloc->table); > > + alloc->last = obj; > > + } else > > + obj = -1; > > + > > + spin_unlock(&alloc->lock); > > + > > + return obj; > > +} > > + > > +void c2_free(struct c2_alloc *alloc, u32 obj) > > +{ > > + spin_lock(&alloc->lock); > > + clear_bit(obj, alloc->table); > > + alloc->last = min(alloc->last, obj); > > + spin_unlock(&alloc->lock); > > +} > > + > > +int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved) > > +{ > > + int i; > > + > > + alloc->last = 0; > > + alloc->max = num; > > + spin_lock_init(&alloc->lock); > > + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), > > + GFP_KERNEL); > > + if (!alloc->table) > > + return -ENOMEM; > > + > > + bitmap_zero(alloc->table, num); > > + for (i = 0; i < reserved; ++i) > > + set_bit(i, alloc->table); > > + > > + return 0; > > +} > > + > > +void c2_alloc_cleanup(struct c2_alloc *alloc) > > +{ > > + kfree(alloc->table); > > +} > > + > > +/* > > + * Array of pointers with lazy allocation of leaf pages. Callers of > > + * _get, _set and _clear methods must use a lock or otherwise > > + * serialize access to the array. > > + */ > > + > > +void *c2_array_get(struct c2_array *array, int index) > > +{ > > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > > + > > + if (array->page_list[p].page) { > > + int i = index & (PAGE_SIZE / sizeof (void *) - 1); > > + return array->page_list[p].page[i]; > > + } else > > + return NULL; > > +} > > + > > +int c2_array_set(struct c2_array *array, int index, void *value) > > +{ > > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > > + > > + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
> */ > > + if (!array->page_list[p].page) > > + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); > > + > > + if (!array->page_list[p].page) > > + return -ENOMEM; > > + > > + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] > = > > + value; > > + ++array->page_list[p].used; > > + > > + return 0; > > +} > > + > > +void c2_array_clear(struct c2_array *array, int index) > > +{ > > + int p = (index * sizeof (void *)) >> PAGE_SHIFT; > > + > > + if (--array->page_list[p].used == 0) { > > + free_page((unsigned long) array->page_list[p].page); > > + array->page_list[p].page = NULL; > > + } > > + > > + if (array->page_list[p].used < 0) > > + pr_debug("Array %p index %d page %d with ref count %d < 0\n", > > + array, index, p, array->page_list[p].used); > > +} > > + > > +int c2_array_init(struct c2_array *array, int nent) > > +{ > > + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; > > + int i; > > + > > + array->page_list = kmalloc(npage * sizeof *array->page_list, > GFP_KERNEL); > > + if (!array->page_list) > > + return -ENOMEM; > > + > > + for (i = 0; i < npage; ++i) { > > + array->page_list[i].page = NULL; > > + array->page_list[i].used = 0; > > + } > > + > > + return 0; > > +} > > + > > +void c2_array_cleanup(struct c2_array *array, int nent) > > +{ > > + int i; > > + > > + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / > PAGE_SIZE; ++i) > > + free_page((unsigned long) array->page_list[i].page); > > + > > + kfree(array->page_list); > > +} > > + > > +static int c2_alloc_mqsp_chunk(unsigned int gfp_mask, struct sp_chunk** > head) > > +{ > > + int i; > > + struct sp_chunk* new_head; > > + > > + new_head = (struct sp_chunk*)__get_free_page(gfp_mask|GFP_DMA); > > + if (new_head == NULL) > > + return -ENOMEM; > > + > > + new_head->next = NULL; > > + new_head->head = 0; > > + new_head->gfp_mask = gfp_mask; > > + > > + /* build list where each index is the next free slot */ > > + for (i = 0; > > + i < 
(PAGE_SIZE-sizeof(struct sp_chunk*)-sizeof(u16)) / > sizeof(u16)-1; > > + i++) { > > + new_head->shared_ptr[i] = i+1; > > + } > > + /* terminate list */ > > + new_head->shared_ptr[i] = 0xFFFF; > > + > > + *head = new_head; > > + return 0; > > +} > > + > > +int c2_init_mqsp_pool(unsigned int gfp_mask, struct sp_chunk** root) { > > + return c2_alloc_mqsp_chunk(gfp_mask, root); > > +} > > + > > +void c2_free_mqsp_pool(struct sp_chunk* root) > > +{ > > + struct sp_chunk* next; > > + > > + while (root) { > > + next = root->next; > > + __free_page((struct page*)root); > > + root = next; > > + } > > +} > > + > > +u16* c2_alloc_mqsp(struct sp_chunk* head) > > +{ > > + u16 mqsp; > > + > > + while (head) { > > + mqsp = head->head; > > + if (mqsp != 0xFFFF) { > > + head->head = head->shared_ptr[mqsp]; > > + break; > > + } else if (head->next == NULL) { > > + if (c2_alloc_mqsp_chunk(head->gfp_mask, &head->next) == 0) { > > + head = head->next; > > + mqsp = head->head; > > + head->head = > > + head->shared_ptr[mqsp]; > > + break; > > + } > > + else > > + return 0; > > + } > > + else > > + head = head->next; > > + } > > + if (head) > > + return &(head->shared_ptr[mqsp]); > > + return 0; > > +} > > + > > +void c2_free_mqsp(u16* mqsp) > > +{ > > + struct sp_chunk* head; > > + u16 idx; > > + > > + /* The chunk containing this ptr begins at the page boundary */ > > + head = (struct sp_chunk*)((unsigned long)mqsp & PAGE_MASK); > > + > > + /* Link head to new mqsp */ > > + *mqsp = head->head; > > + > > + /* Compute the shared_ptr index */ > > + idx = ((unsigned long)mqsp & ~PAGE_MASK) >> 1; > > + idx -= (unsigned long)&(((struct sp_chunk*)0)->shared_ptr[0]) >> 1; > > + > > + /* Point this index at the head */ > > + head->shared_ptr[idx] = head->head; > > + > > + /* Point head at this index */ > > + head->head = idx; > > +} > > Index: hw/amso1100/cc_types.h > > =================================================================== > > --- hw/amso1100/cc_types.h (revision 0) > > +++ 
hw/amso1100/cc_types.h (revision 0) > > @@ -0,0 +1,297 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > +#ifndef _CC_TYPES_H_ > > +#define _CC_TYPES_H_ > > + > > +#include > > + > > +#ifndef NULL > > +#define NULL 0 > > +#endif > > +#ifndef TRUE > > +#define TRUE 1 > > +#endif > > +#ifndef FALSE > > +#define FALSE 0 > > +#endif > > + > > +#define PTR_TO_CTX(p) (u64)(u32)(p) > > + > > +#define CC_PTR_TO_64(p) (u64)(u32)(p) > > +#define CC_64_TO_PTR(c) (void*)(u32)(c) > > + > > + > > + > > +/* > > + * not really a "type" however this needs > > + * to be common between adapter and host. > > + * this is the best place to put it. > > + */ > > +#define CC_QP_NO_ATTR_CHANGE 0xFFFFFFFF > > + > > +/* Maximum allowed size in bytes of private_data exchange > > + * on connect. > > + */ > > +#define CC_MAX_PRIVATE_DATA_SIZE 200 > > + > > +/* > > + * These types are shared among the adapter, host, and CCIL consumer. > Thus > > + * they are placed here since everyone includes cc_types.h... > > + */ > > +typedef enum { > > + CC_CQ_NOTIFICATION_TYPE_NONE = 1, > > + CC_CQ_NOTIFICATION_TYPE_NEXT, > > + CC_CQ_NOTIFICATION_TYPE_NEXT_SE > > +} cc_cq_notification_type_t; > > + > > +typedef enum { > > + CC_CFG_ADD_ADDR = 1, > > + CC_CFG_DEL_ADDR = 2, > > + CC_CFG_ADD_ROUTE = 3, > > + CC_CFG_DEL_ROUTE = 4 > > +} cc_setconfig_cmd_t; > > + > > +typedef enum { > > + CC_GETCONFIG_ROUTES = 1, > > + CC_GETCONFIG_ADDRS > > +} cc_getconfig_cmd_t; > > + > > +/* > > + * CCIL Work Request Identifiers > > + */ > > +typedef enum { > > + CCWR_RNIC_OPEN = 1, > > + CCWR_RNIC_QUERY, > > + CCWR_RNIC_SETCONFIG, > > + CCWR_RNIC_GETCONFIG, > > + CCWR_RNIC_CLOSE, > > + CCWR_CQ_CREATE, > > + CCWR_CQ_QUERY, > > + CCWR_CQ_MODIFY, > > + CCWR_CQ_DESTROY, > > + CCWR_QP_CONNECT, > > + CCWR_PD_ALLOC, > > + CCWR_PD_DEALLOC, > > + CCWR_SRQ_CREATE, > > + CCWR_SRQ_QUERY, > > + CCWR_SRQ_MODIFY, > > + CCWR_SRQ_DESTROY, > > + CCWR_QP_CREATE, > > + CCWR_QP_QUERY, > > + CCWR_QP_MODIFY, > > + CCWR_QP_DESTROY, > > + CCWR_NSMR_STAG_ALLOC, > > + CCWR_NSMR_REGISTER, > > + CCWR_NSMR_PBL, > > + CCWR_STAG_DEALLOC, > > + 
CCWR_NSMR_REREGISTER, > > + CCWR_SMR_REGISTER, > > + CCWR_MR_QUERY, > > + CCWR_MW_ALLOC, > > + CCWR_MW_QUERY, > > + CCWR_EP_CREATE, > > + CCWR_EP_GETOPT, > > + CCWR_EP_SETOPT, > > + CCWR_EP_DESTROY, > > + CCWR_EP_BIND, > > + CCWR_EP_CONNECT, > > + CCWR_EP_LISTEN, > > + CCWR_EP_SHUTDOWN, > > + CCWR_EP_LISTEN_CREATE, > > + CCWR_EP_LISTEN_DESTROY, > > + CCWR_EP_QUERY, > > + CCWR_CR_ACCEPT, > > + CCWR_CR_REJECT, > > + CCWR_CONSOLE, > > + CCWR_TERM, > > + CCWR_FLASH_INIT, > > + CCWR_FLASH, > > + CCWR_BUF_ALLOC, > > + CCWR_BUF_FREE, > > + CCWR_FLASH_WRITE, > > + CCWR_INIT, /* WARNING: Don't move this ever again! */ > > + > > + > > + > > + /* Add new IDs here */ > > + > > + > > + > > + /* > > + * WARNING: CCWR_LAST must always be the last verbs id defined! > > > + * All the preceding IDs are fixed, and must not > change. > > + * You can add new IDs, but must not remove or reorder > > + * any IDs. If you do, YOU will ruin any hope of > > + * compatibility between versions. > > + */ > > + CCWR_LAST, > > + > > + /* > > + * Start over at 1 so that arrays indexed by user wr id's > > + * begin at 1. This is OK since the verbs and user wr id's > > + * are always used on disjoint sets of queues. > > + */ > > +#if 0 > > + CCWR_SEND = 1, > > + CCWR_SEND_SE, > > + CCWR_SEND_INV, > > + CCWR_SEND_SE_INV, > > +#else > > + /* > > + * The order of the CCWR_SEND_XX verbs must > > + * match the order of the RDMA_OPs > > + */ > > + CCWR_SEND = 1, > > + CCWR_SEND_INV, > > + CCWR_SEND_SE, > > + CCWR_SEND_SE_INV, > > +#endif > > + CCWR_RDMA_WRITE, > > + CCWR_RDMA_READ, > > + CCWR_RDMA_READ_INV, > > + CCWR_MW_BIND, > > + CCWR_NSMR_FASTREG, > > + CCWR_STAG_INVALIDATE, > > + CCWR_RECV, > > + CCWR_NOP, > > + CCWR_UNIMPL, /* WARNING: This must always be the last user wr > id defined!
*/ > > +} ccwr_ids_t; > > +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) > > + > > +/* > > + * SQ/RQ Work Request Types > > + */ > > +typedef enum { > > + CC_WR_TYPE_SEND = CCWR_SEND, > > + CC_WR_TYPE_SEND_SE = CCWR_SEND_SE, > > + CC_WR_TYPE_SEND_INV = CCWR_SEND_INV, > > + CC_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, > > + CC_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, > > + CC_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, > > + CC_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, > > + CC_WR_TYPE_BIND_MW = CCWR_MW_BIND, > > + CC_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, > > + CC_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, > > + CC_WR_TYPE_RECV = CCWR_RECV, > > + CC_WR_TYPE_NOP = CCWR_NOP, > > +} cc_wr_type_t; > > + > > +/* > > + * These are used as bitfields for efficient comparison of multiple > possible > > + * states. > > + */ > > +typedef enum { > > + CC_QP_STATE_IDLE = 0x01, /* initial state */ > > + CC_QP_STATE_CONNECTING = 0x02, /* LLP is connecting */ > > + CC_QP_STATE_RTS = 0x04, /* RDDP/RDMAP enabled */ > > + CC_QP_STATE_CLOSING = 0x08, /* LLP is shutting down */ > > + CC_QP_STATE_TERMINATE = 0x10, /* Connection > Terminat[ing|ed] */ > > + CC_QP_STATE_ERROR = 0x20, /* Error state to flush > everything */ > > +} cc_qp_state_t; > > + > > +typedef struct _cc_netaddr_s { > > + u32 ip_addr; > > + u32 netmask; > > + u32 mtu; > > +} cc_netaddr_t; > > + > > +typedef struct _cc_route_s { > > + u32 ip_addr; /* 0 indicates the default route */ > > + u32 netmask; /* netmask associated with dst */ > > + u32 flags; > > + union { > > + u32 ipaddr; /* address of the nexthop interface */ > > + u8 enaddr[6]; > > + } nexthop; > > +} cc_route_t; > > + > > +/* > > + * A Scatter Gather Entry. > > + */ > > +typedef u32 cc_stag_t; > > + > > +typedef struct { > > + cc_stag_t stag; > > + u32 length; > > + u64 to; > > +} cc_data_addr_t; > > + > > +/* > > + * MR and MW flags used by the consumer, RI, and RNIC. > > + */ > > +typedef enum { > > + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ > > + MEM_VA_BASED = 0x0002, /* Not Zero-based */ > > + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg > */ > > + MEM_LOCAL_READ = 0x0008, /* allow local reads */ > > + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ > > + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ > > + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ > > + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ > > + MEM_SHARED = 0x0100, /* set if MR is shared */ > > + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ > > +} cc_mm_flags_t; > > + > > +/* > > + * CCIL API ACF flags defined in terms of the low level mem flags. > > + * This minimizes translation needed in the user API > > + */ > > +typedef enum { > > + CC_ACF_LOCAL_READ = MEM_LOCAL_READ, > > + CC_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, > > + CC_ACF_REMOTE_READ = MEM_REMOTE_READ, > > + CC_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, > > + CC_ACF_WINDOW_BIND = MEM_WINDOW_BIND > > +} cc_acf_t; > > + > > +/* > > + * Image types of objects written to flash > > + */ > > +#define CC_FLASH_IMG_BITFILE 1 > > +#define CC_FLASH_IMG_OPTION_ROM 2 > > +#define CC_FLASH_IMG_VPD 3 > > + > > +/* > > + * To fix bug 1815 we define the max size allowable of the > > + * terminate message (per the IETF spec). Refer to the IETF > > + * protocol specification, section 12.1.6, page 64. > > + * The message is prefixed by 20 bytes of DDP info. > > + * > > + * Then the message has 6 bytes for the terminate control > > + * and DDP segment length info plus a DDP header (either > > + * 14 or 18 bytes) plus 28 bytes for the RDMA header. > > + * Thus the max size is: > > + * 20 + (6 + 18 + 28) = 72 > > + */ > > +#define CC_MAX_TERMINATE_MESSAGE_SIZE (72) > > +#endif > > Index: hw/amso1100/c2_rnic.c > > =================================================================== > > --- hw/amso1100/c2_rnic.c (revision 0) > > +++ hw/amso1100/c2_rnic.c (revision 0) > > @@ -0,0 +1,581 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
> > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + * > > + */ > > + > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include > > +#ifdef NETEVENT_NOTIFIER > > +#include > > +#include > > +#include > > +#endif > > + > > + > > +#include > > +#include > > +#include > > +#include > > +#include "c2.h" > > +#include "c2_vq.h" > > + > > +#define C2_MAX_MRS 32768 > > +#define C2_MAX_QPS 16000 > > +#define C2_MAX_WQE_SZ 256 > > +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) > > +#define C2_MAX_SGES 4 > > +#define C2_MAX_CQS 32768 > > +#define C2_MAX_CQES 4096 > > +#define C2_MAX_PDS 16384 > > + > > +/* > > + * Send the adapter INIT message to the amso1100 > > + */ > > +static int c2_adapter_init(struct c2_dev *c2dev) > > +{ > > + ccwr_init_req_t wr; > > + int err; > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_INIT); > > + wr.hdr.context = 0; > > + wr.hint_count = cpu_to_be64(__pa(&c2dev->hint_count)); > > + wr.q0_host_shared = > > + cpu_to_be64(__pa(c2dev->req_vq.shared)); > > + wr.q1_host_shared = > > + cpu_to_be64(__pa(c2dev->rep_vq.shared)); > > + wr.q1_host_msg_pool = > > + cpu_to_be64(__pa(c2dev->rep_vq.msg_pool)); > > + wr.q2_host_shared = > > + cpu_to_be64(__pa(c2dev->aeq.shared)); > > + wr.q2_host_msg_pool = > > + cpu_to_be64(__pa(c2dev->aeq.msg_pool)); > > + > > + /* Post the init message */ > > + err = vq_send_wr(c2dev, (ccwr_t *)&wr); > > + > > + return err; > > +} > > + > > +/* > > + * Send the adapter TERM message to the amso1100 > > + */ > > +static void c2_adapter_term(struct c2_dev *c2dev) > > +{ > > + ccwr_init_req_t wr; > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_TERM); > > + wr.hdr.context = 0; > > + > > + /* Post the init message */ > > + vq_send_wr(c2dev, (ccwr_t *)&wr); > > + c2dev->init = 0; > > + > > + return; > > +} > > + > > +/* > > + * 
Hack to hard code an ip address > > + */ > > +extern char *rnic_ip_addr; > > +static int c2_setconfig_hack(struct c2_dev *c2dev) > > +{ > > + struct c2_vq_req *vq_req; > > + ccwr_rnic_setconfig_req_t *wr; > > + ccwr_rnic_setconfig_rep_t *reply; > > + cc_netaddr_t netaddr; > > + int err, len; > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) > > + return -ENOMEM; > > + > > + len = sizeof(cc_netaddr_t); > > + wr = kmalloc(sizeof(*wr) + len, GFP_KERNEL); > > + if (!wr) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); > > + wr->hdr.context = (unsigned long)vq_req; > > + wr->rnic_handle = c2dev->adapter_handle; > > + wr->option = cpu_to_be32(CC_CFG_ADD_ADDR); > > + > > + netaddr.ip_addr = in_aton(rnic_ip_addr); > > + netaddr.netmask = htonl(0xFFFFFF00); > > + netaddr.mtu = 0; > > + > > + memcpy(wr->data, &netaddr, len); > > + > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, (ccwr_t *)wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail1; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) > > + goto bail1; > > + > > + reply = (ccwr_rnic_setconfig_rep_t *)(unsigned > long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + err = c2_errno(reply); > > + vq_repbuf_free(c2dev, reply); > > + > > +bail1: > > + kfree(wr); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +/* > > + * Open a single RNIC instance to use with all > > + * low level openib calls > > + */ > > +static int c2_rnic_open(struct c2_dev *c2dev) > > +{ > > + struct c2_vq_req *vq_req; > > + ccwr_t wr; > > + ccwr_rnic_open_rep_t* reply; > > + int err; > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (vq_req == NULL) { > > + return -ENOMEM; > > + } > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); > > + wr.rnic_open.req.hdr.context = (unsigned long)(vq_req); > > + 
wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); > > + wr.rnic_open.req.port_num = cpu_to_be16(0); > > + wr.rnic_open.req.user_context = (unsigned long)c2dev; > > + > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, &wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + reply = (ccwr_rnic_open_rep_t*)(unsigned long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + if ( (err = c2_errno(reply)) != 0) { > > + goto bail1; > > + } > > + > > + c2dev->adapter_handle = reply->rnic_handle; > > + > > +bail1: > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +/* > > + * Close the RNIC instance > > + */ > > +static int c2_rnic_close(struct c2_dev *c2dev) > > +{ > > + struct c2_vq_req *vq_req; > > + ccwr_t wr; > > + ccwr_rnic_close_rep_t *reply; > > + int err; > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (vq_req == NULL) { > > + return -ENOMEM; > > + } > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); > > + wr.rnic_close.req.hdr.context = (unsigned long)vq_req; > > + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; > > + > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, &wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + reply = (ccwr_rnic_close_rep_t*)(unsigned long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + if ( (err = c2_errno(reply)) != 0) { > > + goto bail1; > > + } > > + > > + c2dev->adapter_handle = 0; > > + > > +bail1: > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > +#ifdef 
NETEVENT_NOTIFIER > > +static int netevent_notifier(struct notifier_block *self, unsigned long > > > event, void* data) > > +{ > > + int i; > > + u8* ha; > > + struct neighbour* neigh = data; > > + struct netevent_redirect* redir = data; > > + struct netevent_route_change* rev = data; > > + > > + switch (event) { > > + case NETEVENT_ROUTE_UPDATE: > > + printk(KERN_ERR "NETEVENT_ROUTE_UPDATE:\n"); > > + printk(KERN_ERR "fib_flags : %d\n", > > + rev->fib_info->fib_flags); > > + printk(KERN_ERR "fib_protocol : %d\n", > > + rev->fib_info->fib_protocol); > > + printk(KERN_ERR "fib_prefsrc : %08x\n", > > + rev->fib_info->fib_prefsrc); > > + printk(KERN_ERR "fib_priority : %d\n", > > + rev->fib_info->fib_priority); > > + break; > > + > > + case NETEVENT_NEIGH_UPDATE: > > + printk(KERN_ERR "NETEVENT_NEIGH_UPDATE:\n"); > > + printk(KERN_ERR "nud_state : %d\n", neigh->nud_state); > > + printk(KERN_ERR "refcnt : %d\n", neigh->refcnt); > > + printk(KERN_ERR "used : %d\n", neigh->used); > > + printk(KERN_ERR "confirmed : %d\n", neigh->confirmed); > > + printk(KERN_ERR " ha: "); > > + for (i=0; i < neigh->dev->addr_len; i+=4) { > > + ha = &neigh->ha[i]; > > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > > + } > > + printk("\n"); > > + > > + printk(KERN_ERR "%8s: ", neigh->dev->name); > > + for (i=0; i < neigh->dev->addr_len; i+=4) { > > + ha = &neigh->ha[i]; > > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > > + } > > + printk("\n"); > > + break; > > + > > + case NETEVENT_REDIRECT: > > + printk(KERN_ERR "NETEVENT_REDIRECT:\n"); > > + printk(KERN_ERR "old: "); > > + for (i=0; i < redir->old->neighbour->dev->addr_len; i+=4) { > > + ha = &redir->old->neighbour->ha[i]; > > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], ha[3]); > > + } > > + printk("\n"); > > + > > + printk(KERN_ERR "new: "); > > + for (i=0; i < redir->new->neighbour->dev->addr_len; i+=4) { > > + ha = &redir->new->neighbour->ha[i]; > > + printk("%02x:%02x:%02x:%02x:", ha[0], 
ha[1], ha[2], ha[3]); > > + } > > + printk("\n"); > > + break; > > + > > + default: > > + printk(KERN_ERR "NETEVENT_WTFO:\n"); > > + } > > + > > + return NOTIFY_DONE; > > +} > > + > > +static struct notifier_block nb = { > > + .notifier_call = netevent_notifier, > > +}; > > +#endif > > +/* > > + * Called by c2_probe to initialize the RNIC. This principally > > + * involves initializing the various limits and resource pools that > > + * comprise the RNIC instance. > > + */ > > +int c2_rnic_init(struct c2_dev* c2dev) > > +{ > > + int err; > > + u32 qsize, msgsize; > > + void *q1_pages; > > + void *q2_pages; > > + void __iomem *mmio_regs; > > + > > + /* Initialize the adapter limits */ > > + c2dev->max_mr = C2_MAX_MRS; > > + c2dev->max_mr_size = ~0; > > + c2dev->max_qp = C2_MAX_QPS; > > + c2dev->max_qp_wr = C2_MAX_QP_WR; > > + c2dev->max_sge = C2_MAX_SGES; > > + c2dev->max_cq = C2_MAX_CQS; > > + c2dev->max_cqe = C2_MAX_CQES; > > + c2dev->max_pd = C2_MAX_PDS; > > + > > + /* Device capabilities */ > > + c2dev->device_cap_flags = > > + ( > > + IB_DEVICE_RESIZE_MAX_WR | > > + IB_DEVICE_CURR_QP_STATE_MOD | > > + IB_DEVICE_SYS_IMAGE_GUID | > > + IB_DEVICE_ZERO_STAG | > > + IB_DEVICE_SEND_W_INV | > > + IB_DEVICE_MW | > > + IB_DEVICE_ARP > > + ); > > + > > + /* Allocate the qptr_array */ > > + c2dev->qptr_array = vmalloc(C2_MAX_CQS*sizeof(void *)); > > + if (!c2dev->qptr_array) { > > + return -ENOMEM; > > + } > > + > > + /* Initialize the qptr_array */ > > + memset(c2dev->qptr_array, 0, C2_MAX_CQS*sizeof(void *)); > > + c2dev->qptr_array[0] = (void *)&c2dev->req_vq; > > + c2dev->qptr_array[1] = (void *)&c2dev->rep_vq; > > + c2dev->qptr_array[2] = (void *)&c2dev->aeq; > > + > > + /* Initialize data structures */ > > + init_waitqueue_head(&c2dev->req_vq_wo); > > + spin_lock_init(&c2dev->vqlock); > > + spin_lock_init(&c2dev->aeq_lock); > > + > > + > > + /* Allocate MQ shared pointer pool for kernel clients.
User > > + * mode client pools are hung off the user context > > + */ > > + err = c2_init_mqsp_pool(GFP_KERNEL, &c2dev->kern_mqsp_pool); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* Allocate shared pointers for Q0, Q1, and Q2 from > > + * the shared pointer pool. > > + */ > > + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + c2dev->aeq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + if (!c2dev->req_vq.shared || > > + !c2dev->rep_vq.shared || > > + !c2dev->aeq.shared) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + mmio_regs = c2dev->kva; > > + /* Initialize the Verbs Request Queue */ > > + c2_mq_init(&c2dev->req_vq, 0, > > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_QSIZE)), > > + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q0_MSGSIZE)), > > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + > C2_REGS_Q0_POOLSTART)), > > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + > C2_REGS_Q0_SHARED)), > > + C2_MQ_ADAPTER_TARGET); > > + > > + /* Initialize the Verbs Reply Queue */ > > + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_QSIZE)); > > + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q1_MSGSIZE)); > > + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); > > + if (!q1_pages) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + c2_mq_init(&c2dev->rep_vq, > > + 1, > > + qsize, > > + msgsize, > > + q1_pages, > > + mmio_regs + be32_to_cpu(c2_read32(mmio_regs + > C2_REGS_Q1_SHARED)), > > + C2_MQ_HOST_TARGET); > > + > > + /* Initialize the Asynchronous Event Queue */ > > + qsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_QSIZE)); > > + msgsize = be32_to_cpu(c2_read32(mmio_regs + C2_REGS_Q2_MSGSIZE)); > > + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); > > + if (!q2_pages) { > > + err = -ENOMEM; > > + goto bail2; > > + } > > + c2_mq_init(&c2dev->aeq, > > + 2, > > + qsize, > > + msgsize, > > + q2_pages, > > + mmio_regs + 
be32_to_cpu(c2_read32(mmio_regs + > C2_REGS_Q2_SHARED)), > > + C2_MQ_HOST_TARGET); > > + > > + /* Initialize the verbs request allocator */ > > + err = vq_init(c2dev); > > + if (err) { > > + goto bail3; > > + } > > + > > + /* Enable interrupts on the adapter */ > > + c2_write32(c2dev->regs + C2_IDIS, 0); > > + > > + /* create the WR init message */ > > + err = c2_adapter_init(c2dev); > > + if (err) { > > + goto bail4; > > + } > > + c2dev->init++; > > + > > + /* open an adapter instance */ > > + err = c2_rnic_open(c2dev); > > + if (err) { > > + goto bail4; > > + } > > + > > + /* Initialize the PD pool */ > > + err = c2_init_pd_table(c2dev); > > + if (err) > > + goto bail5; > > + > > + /* Initialize the QP pool */ > > + err = c2_init_qp_table(c2dev); > > + if (err) > > + goto bail6; > > + > > + /* XXX hardcode an address */ > > + err = c2_setconfig_hack(c2dev); > > + if (err) > > + goto bail7; > > + > > +#ifdef NETEVENT_NOTIFIER > > + register_netevent_notifier(&nb); > > +#endif > > + return 0; > > + > > +bail7: > > + c2_cleanup_qp_table(c2dev); > > +bail6: > > + c2_cleanup_pd_table(c2dev); > > +bail5: > > + c2_rnic_close(c2dev); > > +bail4: > > + vq_term(c2dev); > > +bail3: > > + kfree(q2_pages); > > +bail2: > > + kfree(q1_pages); > > +bail1: > > + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); > > +bail0: > > + vfree(c2dev->qptr_array); > > + > > + return err; > > +} > > + > > +/* > > + * Called by c2_remove to cleanup the RNIC resources. 
> > + */ > > +void c2_rnic_term(struct c2_dev* c2dev) > > +{ > > +#ifdef NETEVENT_NOTIFIER > > + unregister_netevent_notifier(&nb); > > +#endif > > + > > + /* Close the open adapter instance */ > > + c2_rnic_close(c2dev); > > + > > + /* Send the TERM message to the adapter */ > > + c2_adapter_term(c2dev); > > + > > + /* Disable interrupts on the adapter */ > > + c2_write32(c2dev->regs + C2_IDIS, 1); > > + > > + /* Free the QP pool */ > > + c2_cleanup_qp_table(c2dev); > > + > > + /* Free the PD pool */ > > + c2_cleanup_pd_table(c2dev); > > + > > + /* Free the verbs request allocator */ > > + vq_term(c2dev); > > + > > + /* Free the asynchronus event queue */ > > + kfree(c2dev->aeq.msg_pool); > > + > > + /* Free the verbs reply queue */ > > + kfree(c2dev->rep_vq.msg_pool); > > + > > + /* Free the MQ shared pointer pool */ > > + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); > > + > > + /* Free the qptr_array */ > > + vfree(c2dev->qptr_array); > > + > > + return; > > +} > > Index: hw/amso1100/c2_vq.h > > =================================================================== > > --- hw/amso1100/c2_vq.h (revision 0) > > +++ hw/amso1100/c2_vq.h (revision 0) > > @@ -0,0 +1,60 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. 
> > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _C2_VQ_H_ > > +#define _C2_VQ_H_ > > +#include > > + > > +#include "c2.h" > > +#include "c2_wr.h" > > + > > +struct c2_vq_req{ > > + u64 reply_msg; /* ptr to reply msg */ > > + wait_queue_head_t wait_object; /* wait object for vq reqs */ > > + atomic_t reply_ready; /* set when reply is ready */ > > + atomic_t refcnt; /* used to cancel WRs... 
*/ > > +}; > > + > > +extern int vq_init(struct c2_dev* c2dev); > > +extern void vq_term(struct c2_dev* c2dev); > > + > > +extern struct c2_vq_req* vq_req_alloc(struct c2_dev *c2dev); > > +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); > > +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); > > +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); > > +extern int vq_send_wr(struct c2_dev *c2dev, ccwr_t *wr); > > + > > +extern void* vq_repbuf_alloc(struct c2_dev *c2dev); > > +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); > > + > > +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req > *req); > > +#endif /* _C2_VQ_H_ */ > > Index: hw/amso1100/c2_wr.h > > =================================================================== > > --- hw/amso1100/c2_wr.h (revision 0) > > +++ hw/amso1100/c2_wr.h (revision 0) > > @@ -0,0 +1,1343 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. 
> > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _CC_WR_H_ > > +#define _CC_WR_H_ > > +#include "cc_types.h" > > +/* > > + * WARNING: If you change this file, also bump CC_IVN_BASE > > + * in common/include/clustercore/cc_ivn.h. > > + */ > > + > > +#ifdef CCDEBUG > > +#define CCWR_MAGIC 0xb07700b0 > > +#endif > > + > > +/* > > + * Build String Length. It must be the same as CC_BUILD_STR_LEN in > ccil_api.h > > + */ > > +#define WR_BUILD_STR_LEN 64 > > + > > +#ifdef _MSC_VER > > +#define PACKED > > +#pragma pack(push) > > +#pragma pack(1) > > +#define __inline__ __inline > > +#else > > +#define PACKED __attribute__ ((packed)) > > +#endif > > + > > +/* > > + * WARNING: All of these structs need to align any 64bit types on > > + * 64 bit boundaries! 64bit types include u64 and u64. > > + */ > > + > > +/* > > + * Clustercore Work Request Header. Be sensitive to field layout > > + * and alignment. > > + */ > > +typedef struct { > > + /* wqe_count is part of the cqe. It is put here so the > > + * adapter can write to it while the wr is pending without > > + * clobbering part of the wr. This word need not be dma'd > > + * from the host to adapter by libccil, but we copy it anyway > > + * to make the memcpy to the adapter better aligned. > > + */ > > + u32 wqe_count; > > + > > + /* Put these fields next so that later 32- and 64-bit > > + * quantities are naturally aligned. 
> > +	 */
> > +	u8 id;
> > +	u8 result; /* adapter -> host */
> > +	u8 sge_count; /* host -> adapter */
> > +	u8 flags; /* host -> adapter */
> > +
> > +	u64 context;
> > +#ifdef CCMSGMAGIC
> > +	u32 magic;
> > +	u32 pad;
> > +#endif
> > +} PACKED ccwr_hdr_t;
> > +
> > +/*
> > + *------------------------ RNIC ------------------------
> > + */
> > +
> > +/*
> > + * WR_RNIC_OPEN
> > + */
> > +
> > +/*
> > + * Flags for the RNIC WRs
> > + */
> > +typedef enum {
> > +	RNIC_IRD_STATIC = 0x0001,
> > +	RNIC_ORD_STATIC = 0x0002,
> > +	RNIC_QP_STATIC = 0x0004,
> > +	RNIC_SRQ_SUPPORTED = 0x0008,
> > +	RNIC_PBL_BLOCK_MODE = 0x0010,
> > +	RNIC_SRQ_MODEL_ARRIVAL = 0x0020,
> > +	RNIC_CQ_OVF_DETECTED = 0x0040,
> > +	RNIC_PRIV_MODE = 0x0080
> > +} PACKED cc_rnic_flags_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 user_context;
> > +	u16 flags; /* See cc_rnic_flags_t */
> > +	u16 port_num;
> > +} PACKED ccwr_rnic_open_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +} PACKED ccwr_rnic_open_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rnic_open_req_t req;
> > +	ccwr_rnic_open_rep_t rep;
> > +} PACKED ccwr_rnic_open_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +} PACKED ccwr_rnic_query_req_t;
> > +
> > +/*
> > + * WR_RNIC_QUERY
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 user_context;
> > +	u32 vendor_id;
> > +	u32 part_number;
> > +	u32 hw_version;
> > +	u32 fw_ver_major;
> > +	u32 fw_ver_minor;
> > +	u32 fw_ver_patch;
> > +	char fw_ver_build_str[WR_BUILD_STR_LEN];
> > +	u32 max_qps;
> > +	u32 max_qp_depth;
> > +	u32 max_srq_depth;
> > +	u32 max_send_sgl_depth;
> > +	u32 max_rdma_sgl_depth;
> > +	u32 max_cqs;
> > +	u32 max_cq_depth;
> > +	u32 max_cq_event_handlers;
> > +	u32 max_mrs;
> > +	u32 max_pbl_depth;
> > +	u32 max_pds;
> > +	u32 max_global_ird;
> > +	u32 max_global_ord;
> > +	u32 max_qp_ird;
> > +	u32 max_qp_ord;
> > +	u32 flags; /* See cc_rnic_flags_t */
> > +	u32 max_mws;
> > +	u32 pbe_range_low;
> > +	u32 pbe_range_high;
> > +	u32 max_srqs;
> > +	u32 page_size;
> > +} PACKED ccwr_rnic_query_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rnic_query_req_t req;
> > +	ccwr_rnic_query_rep_t rep;
> > +} PACKED ccwr_rnic_query_t;
> > +
> > +/*
> > + * WR_RNIC_GETCONFIG
> > + */
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 option; /* see cc_getconfig_cmd_t */
> > +	u64 reply_buf;
> > +	u32 reply_buf_len;
> > +} PACKED ccwr_rnic_getconfig_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 option; /* see cc_getconfig_cmd_t */
> > +	u32 count_len; /* length of the number of addresses configured */
> > +} PACKED ccwr_rnic_getconfig_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rnic_getconfig_req_t req;
> > +	ccwr_rnic_getconfig_rep_t rep;
> > +} PACKED ccwr_rnic_getconfig_t;
> > +
> > +/*
> > + * WR_RNIC_SETCONFIG
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 option; /* See cc_setconfig_cmd_t */
> > +	/* variable data and pad See cc_netaddr_t and
> > +	 * cc_route_t
> > +	 */
> > +	u8 data[0];
> > +} PACKED ccwr_rnic_setconfig_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_rnic_setconfig_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rnic_setconfig_req_t req;
> > +	ccwr_rnic_setconfig_rep_t rep;
> > +} PACKED ccwr_rnic_setconfig_t;
> > +
> > +/*
> > + * WR_RNIC_CLOSE
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +} PACKED ccwr_rnic_close_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_rnic_close_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rnic_close_req_t req;
> > +	ccwr_rnic_close_rep_t rep;
> > +} PACKED ccwr_rnic_close_t;
> > +
> > +/*
> > + *------------------------ CQ ------------------------
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 shared_ht;
> > +	u64 user_context;
> > +	u64 msg_pool;
> > +	u32 rnic_handle;
> > +	u32 msg_size;
> > +	u32 depth;
> > +} PACKED ccwr_cq_create_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 mq_index;
> > +	u32 adapter_shared;
> > +	u32 cq_handle;
> > +} PACKED ccwr_cq_create_rep_t;
> > +
> > +typedef union {
> > +	ccwr_cq_create_req_t req;
> > +	ccwr_cq_create_rep_t rep;
> > +} PACKED ccwr_cq_create_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 cq_handle;
> > +	u32 new_depth;
> > +	u64 new_msg_pool;
> > +} PACKED ccwr_cq_modify_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_cq_modify_rep_t;
> > +
> > +typedef union {
> > +	ccwr_cq_modify_req_t req;
> > +	ccwr_cq_modify_rep_t rep;
> > +} PACKED ccwr_cq_modify_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 cq_handle;
> > +} PACKED ccwr_cq_destroy_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_cq_destroy_rep_t;
> > +
> > +typedef union {
> > +	ccwr_cq_destroy_req_t req;
> > +	ccwr_cq_destroy_rep_t rep;
> > +} PACKED ccwr_cq_destroy_t;
> > +
> > +/*
> > + *------------------------ PD ------------------------
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 pd_id;
> > +} PACKED ccwr_pd_alloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_pd_alloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_pd_alloc_req_t req;
> > +	ccwr_pd_alloc_rep_t rep;
> > +} PACKED ccwr_pd_alloc_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 pd_id;
> > +} PACKED ccwr_pd_dealloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_pd_dealloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_pd_dealloc_req_t req;
> > +	ccwr_pd_dealloc_rep_t rep;
> > +} PACKED ccwr_pd_dealloc_t;
> > +
> > +/*
> > + *------------------------ SRQ ------------------------
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 shared_ht;
> > +	u64 user_context;
> > +	u32 rnic_handle;
> > +	u32 srq_depth;
> > +	u32 srq_limit;
> > +	u32 sgl_depth;
> > +	u32 pd_id;
> > +} PACKED ccwr_srq_create_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 srq_depth;
> > +	u32 sgl_depth;
> > +	u32 msg_size;
> > +	u32 mq_index;
> > +	u32 mq_start;
> > +	u32 srq_handle;
> > +} PACKED ccwr_srq_create_rep_t;
> > +
> > +typedef union {
> > +	ccwr_srq_create_req_t req;
> > +	ccwr_srq_create_rep_t rep;
> > +} PACKED ccwr_srq_create_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 srq_handle;
> > +} PACKED ccwr_srq_destroy_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_srq_destroy_rep_t;
> > +
> > +typedef union {
> > +	ccwr_srq_destroy_req_t req;
> > +	ccwr_srq_destroy_rep_t rep;
> > +} PACKED ccwr_srq_destroy_t;
> > +
> > +/*
> > + *------------------------ QP ------------------------
> > + */
> > +typedef enum {
> > +	QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */
> > +	QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */
> > +	QP_MW_BIND = 0x00000004, /* MWs enabled */
> > +	QP_ZERO_STAG = 0x00000008, /* enabled? */
> > +	QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */
> > +	QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */
> > +	/* enabled? */
> > +} PACKED ccwr_qp_flags_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 shared_sq_ht;
> > +	u64 shared_rq_ht;
> > +	u64 user_context;
> > +	u32 rnic_handle;
> > +	u32 sq_cq_handle;
> > +	u32 rq_cq_handle;
> > +	u32 sq_depth;
> > +	u32 rq_depth;
> > +	u32 srq_handle;
> > +	u32 srq_limit;
> > +	u32 flags; /* see ccwr_qp_flags_t */
> > +	u32 send_sgl_depth;
> > +	u32 recv_sgl_depth;
> > +	u32 rdma_write_sgl_depth;
> > +	u32 ord;
> > +	u32 ird;
> > +	u32 pd_id;
> > +} PACKED ccwr_qp_create_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 sq_depth;
> > +	u32 rq_depth;
> > +	u32 send_sgl_depth;
> > +	u32 recv_sgl_depth;
> > +	u32 rdma_write_sgl_depth;
> > +	u32 ord;
> > +	u32 ird;
> > +	u32 sq_msg_size;
> > +	u32 sq_mq_index;
> > +	u32 sq_mq_start;
> > +	u32 rq_msg_size;
> > +	u32 rq_mq_index;
> > +	u32 rq_mq_start;
> > +	u32 qp_handle;
> > +} PACKED ccwr_qp_create_rep_t;
> > +
> > +typedef union {
> > +	ccwr_qp_create_req_t req;
> > +	ccwr_qp_create_rep_t rep;
> > +} PACKED ccwr_qp_create_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 qp_handle;
> > +} PACKED ccwr_qp_query_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 user_context;
> > +	u32 rnic_handle;
> > +	u32 sq_depth;
> > +	u32 rq_depth;
> > +	u32 send_sgl_depth;
> > +	u32 rdma_write_sgl_depth;
> > +	u32 recv_sgl_depth;
> > +	u32 ord;
> > +	u32 ird;
> > +	u16 qp_state;
> > +	u16 flags; /* see ccwr_qp_flags_t */
> > +	u32 qp_id;
> > +	u32 local_addr;
> > +	u32 remote_addr;
> > +	u16 local_port;
> > +	u16 remote_port;
> > +	u32 terminate_msg_length; /* 0 if not present */
> > +	u8 data[0];
> > +	/* Terminate Message in-line here. */
> > +} PACKED ccwr_qp_query_rep_t;
> > +
> > +typedef union {
> > +	ccwr_qp_query_req_t req;
> > +	ccwr_qp_query_rep_t rep;
> > +} PACKED ccwr_qp_query_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 stream_msg;
> > +	u32 stream_msg_length;
> > +	u32 rnic_handle;
> > +	u32 qp_handle;
> > +	u32 next_qp_state;
> > +	u32 ord;
> > +	u32 ird;
> > +	u32 sq_depth;
> > +	u32 rq_depth;
> > +	u32 llp_ep_handle;
> > +} PACKED ccwr_qp_modify_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 ord;
> > +	u32 ird;
> > +	u32 sq_depth;
> > +	u32 rq_depth;
> > +	u32 sq_msg_size;
> > +	u32 sq_mq_index;
> > +	u32 sq_mq_start;
> > +	u32 rq_msg_size;
> > +	u32 rq_mq_index;
> > +	u32 rq_mq_start;
> > +} PACKED ccwr_qp_modify_rep_t;
> > +
> > +typedef union {
> > +	ccwr_qp_modify_req_t req;
> > +	ccwr_qp_modify_rep_t rep;
> > +} PACKED ccwr_qp_modify_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 qp_handle;
> > +} PACKED ccwr_qp_destroy_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_qp_destroy_rep_t;
> > +
> > +typedef union {
> > +	ccwr_qp_destroy_req_t req;
> > +	ccwr_qp_destroy_rep_t rep;
> > +} PACKED ccwr_qp_destroy_t;
> > +
> > +/*
> > + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can
> > + * only be posted when a QP is in IDLE state. After the connect request is
> > + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state.
> > + * No synchronous reply from adapter to this WR. The results of
> > + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS
> > + * See ccwr_ae_active_connect_results_t
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 qp_handle;
> > +	u32 remote_addr;
> > +	u16 remote_port;
> > +	u16 pad;
> > +	u32 private_data_length;
> > +	u8 private_data[0]; /* Private data in-line. */
> > +} PACKED ccwr_qp_connect_req_t;
> > +
> > +typedef struct {
> > +	ccwr_qp_connect_req_t req;
> > +	/* no synchronous reply. */
> > +} PACKED ccwr_qp_connect_t;
> > +
> > +
> > +/*
> > + *------------------------ MM ------------------------
> > + */
> > +
> > +typedef cc_mm_flags_t ccwr_mr_flags_t; /* cc_types.h */
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 pbl_depth;
> > +	u32 pd_id;
> > +	u32 flags; /* See ccwr_mr_flags_t */
> > +} PACKED ccwr_nsmr_stag_alloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 pbl_depth;
> > +	u32 stag_index;
> > +} PACKED ccwr_nsmr_stag_alloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_nsmr_stag_alloc_req_t req;
> > +	ccwr_nsmr_stag_alloc_rep_t rep;
> > +} PACKED ccwr_nsmr_stag_alloc_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 va;
> > +	u32 rnic_handle;
> > +	u16 flags; /* See ccwr_mr_flags_t */
> > +	u8 stag_key;
> > +	u8 pad;
> > +	u32 pd_id;
> > +	u32 pbl_depth;
> > +	u32 pbe_size;
> > +	u32 fbo;
> > +	u32 length;
> > +	u32 addrs_length;
> > +	/* array of paddrs (must be aligned on a 64bit boundary) */
> > +	u64 paddrs[0];
> > +} PACKED ccwr_nsmr_register_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 pbl_depth;
> > +	u32 stag_index;
> > +} PACKED ccwr_nsmr_register_rep_t;
> > +
> > +typedef union {
> > +	ccwr_nsmr_register_req_t req;
> > +	ccwr_nsmr_register_rep_t rep;
> > +} PACKED ccwr_nsmr_register_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 flags; /* See ccwr_mr_flags_t */
> > +	u32 stag_index;
> > +	u32 addrs_length;
> > +	/* array of paddrs (must be aligned on a 64bit boundary) */
> > +	u64 paddrs[0];
> > +} PACKED ccwr_nsmr_pbl_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_nsmr_pbl_rep_t;
> > +
> > +typedef union {
> > +	ccwr_nsmr_pbl_req_t req;
> > +	ccwr_nsmr_pbl_rep_t rep;
> > +} PACKED ccwr_nsmr_pbl_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 stag_index;
> > +} PACKED ccwr_mr_query_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u8 stag_key;
> > +	u8 pad[3];
> > +	u32 pd_id;
> > +	u32 flags; /* See ccwr_mr_flags_t */
> > +	u32 pbl_depth;
> > +} PACKED ccwr_mr_query_rep_t;
> > +
> > +typedef union {
> > +	ccwr_mr_query_req_t req;
> > +	ccwr_mr_query_rep_t rep;
> > +} PACKED ccwr_mr_query_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 stag_index;
> > +} PACKED ccwr_mw_query_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u8 stag_key;
> > +	u8 pad[3];
> > +	u32 pd_id;
> > +	u32 flags; /* See ccwr_mr_flags_t */
> > +} PACKED ccwr_mw_query_rep_t;
> > +
> > +typedef union {
> > +	ccwr_mw_query_req_t req;
> > +	ccwr_mw_query_rep_t rep;
> > +} PACKED ccwr_mw_query_t;
> > +
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 stag_index;
> > +} PACKED ccwr_stag_dealloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_stag_dealloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_stag_dealloc_req_t req;
> > +	ccwr_stag_dealloc_rep_t rep;
> > +} PACKED ccwr_stag_dealloc_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 va;
> > +	u32 rnic_handle;
> > +	u16 flags; /* See ccwr_mr_flags_t */
> > +	u8 stag_key;
> > +	u8 pad;
> > +	u32 stag_index;
> > +	u32 pd_id;
> > +	u32 pbl_depth;
> > +	u32 pbe_size;
> > +	u32 fbo;
> > +	u32 length;
> > +	u32 addrs_length;
> > +	u32 pad1;
> > +	/* array of paddrs (must be aligned on a 64bit boundary) */
> > +	u64 paddrs[0];
> > +} PACKED ccwr_nsmr_reregister_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 pbl_depth;
> > +	u32 stag_index;
> > +} PACKED ccwr_nsmr_reregister_rep_t;
> > +
> > +typedef union {
> > +	ccwr_nsmr_reregister_req_t req;
> > +	ccwr_nsmr_reregister_rep_t rep;
> > +} PACKED ccwr_nsmr_reregister_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 va;
> > +	u32 rnic_handle;
> > +	u16 flags; /* See ccwr_mr_flags_t */
> > +	u8 stag_key;
> > +	u8 pad;
> > +	u32 stag_index;
> > +	u32 pd_id;
> > +} PACKED ccwr_smr_register_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 stag_index;
> > +} PACKED ccwr_smr_register_rep_t;
> > +
> > +typedef union {
> > +	ccwr_smr_register_req_t req;
> > +	ccwr_smr_register_rep_t rep;
> > +} PACKED ccwr_smr_register_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 pd_id;
> > +} PACKED ccwr_mw_alloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 stag_index;
> > +} PACKED ccwr_mw_alloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_mw_alloc_req_t req;
> > +	ccwr_mw_alloc_rep_t rep;
> > +} PACKED ccwr_mw_alloc_t;
> > +
> > +/*
> > + *------------------------ WRs -----------------------
> > + */
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr; /* Has status and WR Type */
> > +} PACKED ccwr_user_hdr_t;
> > +
> > +/* Completion queue entry. */
> > +typedef struct {
> > +	ccwr_hdr_t hdr; /* Has status and WR Type */
> > +	u64 qp_user_context;/* cc_user_qp_t * */
> > +	u32 qp_state; /* Current QP State */
> > +	u32 handle; /* QPID or EP Handle */
> > +	u32 bytes_rcvd; /* valid for RECV WCs */
> > +	u32 stag;
> > +} PACKED ccwr_ce_t;
> > +
> > +
> > +/*
> > + * Flags used for all post-sq WRs. These must fit in the flags
> > + * field of the ccwr_hdr_t (eight bits).
> > + */
> > +typedef enum {
> > +	SQ_SIGNALED = 0x01,
> > +	SQ_READ_FENCE = 0x02,
> > +	SQ_FENCE = 0x04,
> > +} PACKED cc_sq_flags_t;
> > +
> > +/*
> > + * Common fields for all post-sq WRs. Namely the standard header and a
> > + * secondary header with fields common to all post-sq WRs.
> > + */
> > +typedef struct {
> > +	ccwr_user_hdr_t user_hdr;
> > +} PACKED cc_sq_hdr_t;
> > +
> > +/*
> > + * Same as above but for post-rq WRs.
> > + */
> > +typedef struct {
> > +	ccwr_user_hdr_t user_hdr;
> > +} PACKED cc_rq_hdr_t;
> > +
> > +/*
> > + * use the same struct for all sends.
> > + */
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u32 sge_len;
> > +	u32 remote_stag;
> > +	u8 data[0]; /* SGE array */
> > +} PACKED ccwr_send_req_t, ccwr_send_se_req_t, ccwr_send_inv_req_t, ccwr_send_se_inv_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_send_rep_t;
> > +
> > +typedef union {
> > +	ccwr_send_req_t req;
> > +	ccwr_send_rep_t rep;
> > +} PACKED ccwr_send_t, ccwr_send_se_t, ccwr_send_inv_t, ccwr_send_se_inv_t;
> > +
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u64 remote_to;
> > +	u32 remote_stag;
> > +	u32 sge_len;
> > +	u8 data[0]; /* SGE array */
> > +} PACKED ccwr_rdma_write_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_rdma_write_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rdma_write_req_t req;
> > +	ccwr_rdma_write_rep_t rep;
> > +} PACKED ccwr_rdma_write_t;
> > +
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u64 local_to;
> > +	u64 remote_to;
> > +	u32 local_stag;
> > +	u32 remote_stag;
> > +	u32 length;
> > +} PACKED ccwr_rdma_read_req_t,ccwr_rdma_read_inv_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_rdma_read_rep_t;
> > +
> > +typedef union {
> > +	ccwr_rdma_read_req_t req;
> > +	ccwr_rdma_read_rep_t rep;
> > +} PACKED ccwr_rdma_read_t, ccwr_rdma_read_inv_t;
> > +
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u64 va;
> > +	u8 stag_key;
> > +	u8 pad[3];
> > +	u32 mw_stag_index;
> > +	u32 mr_stag_index;
> > +	u32 length;
> > +	u32 flags; /* see ccwr_mr_flags_t; */
> > +} PACKED ccwr_mw_bind_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_mw_bind_rep_t;
> > +
> > +typedef union {
> > +	ccwr_mw_bind_req_t req;
> > +	ccwr_mw_bind_rep_t rep;
> > +} PACKED ccwr_mw_bind_t;
> > +
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u64 va;
> > +	u8 stag_key;
> > +	u8 pad[3];
> > +	u32 stag_index;
> > +	u32 pbe_size;
> > +	u32 fbo;
> > +	u32 length;
> > +	u32 addrs_length;
> > +	/* array of paddrs (must be aligned on a 64bit boundary) */
> > +	u64 paddrs[0];
> > +} PACKED ccwr_nsmr_fastreg_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_nsmr_fastreg_rep_t;
> > +
> > +typedef union {
> > +	ccwr_nsmr_fastreg_req_t req;
> > +	ccwr_nsmr_fastreg_rep_t rep;
> > +} PACKED ccwr_nsmr_fastreg_t;
> > +
> > +typedef struct {
> > +	cc_sq_hdr_t sq_hdr;
> > +	u8 stag_key;
> > +	u8 pad[3];
> > +	u32 stag_index;
> > +} PACKED ccwr_stag_invalidate_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_stag_invalidate_rep_t;
> > +
> > +typedef union {
> > +	ccwr_stag_invalidate_req_t req;
> > +	ccwr_stag_invalidate_rep_t rep;
> > +} PACKED ccwr_stag_invalidate_t;
> > +
> > +typedef union {
> > +	cc_sq_hdr_t sq_hdr;
> > +	ccwr_send_req_t send;
> > +	ccwr_send_se_req_t send_se;
> > +	ccwr_send_inv_req_t send_inv;
> > +	ccwr_send_se_inv_req_t send_se_inv;
> > +	ccwr_rdma_write_req_t rdma_write;
> > +	ccwr_rdma_read_req_t rdma_read;
> > +	ccwr_mw_bind_req_t mw_bind;
> > +	ccwr_nsmr_fastreg_req_t nsmr_fastreg;
> > +	ccwr_stag_invalidate_req_t stag_inv;
> > +} PACKED ccwr_sqwr_t;
> > +
> > +
> > +/*
> > + * RQ WRs
> > + */
> > +typedef struct {
> > +	cc_rq_hdr_t rq_hdr;
> > +	u8 data[0]; /* array of SGEs */
> > +} PACKED ccwr_rqwr_t, ccwr_recv_req_t;
> > +
> > +typedef ccwr_ce_t ccwr_recv_rep_t;
> > +
> > +typedef union {
> > +	ccwr_recv_req_t req;
> > +	ccwr_recv_rep_t rep;
> > +} PACKED ccwr_recv_t;
> > +
> > +/*
> > + * All AEs start with this header. Most AEs only need to convey the
> > + * information in the header. Some, like LLP connection events, need
> > + * more info. The union typedef ccwr_ae_t has all the possible AEs.
> > + *
> > + * hdr.context is the user_context from the rnic_open WR. NULL If this
> > + * is not affiliated with an rnic
> > + *
> > + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN,
> > + * CCAE_LLP_CLOSE_COMPLETE)
> > + *
> > + * resource_type is one of: CC_RES_IND_QP, CC_RES_IND_CQ, CC_RES_IND_SRQ
> > + *
> > + * user_context is the context passed down when the host created the resource.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 user_context; /* user context for this res. */
> > +	u32 resource_type; /* see cc_resource_indicator_t */
> > +	u32 resource; /* handle for resource */
> > +	u32 qp_state; /* current QP State */
> > +} PACKED ccwr_ae_hdr_t;
> > +
> > +/*
> > + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ,
> > + * the adapter moves the QP into RTS state
> > + */
> > +typedef struct {
> > +	ccwr_ae_hdr_t ae_hdr;
> > +	u32 laddr;
> > +	u32 raddr;
> > +	u16 lport;
> > +	u16 rport;
> > +	u32 private_data_length;
> > +	u8 private_data[0]; /* data is in-line in the msg. */
> > +} PACKED ccwr_ae_active_connect_results_t;
> > +
> > +/*
> > + * When connections are established by the stack (and the private data
> > + * MPA frame is received), the adapter will generate an event to the host.
> > + * The details of the connection, any private data, and the new connection
> > + * request handle is passed up via the CCAE_CONNECTION_REQUEST msg on the
> > + * AE queue:
> > + */
> > +typedef struct {
> > +	ccwr_ae_hdr_t ae_hdr;
> > +	u32 cr_handle; /* connreq handle (sock ptr) */
> > +	u32 laddr;
> > +	u32 raddr;
> > +	u16 lport;
> > +	u16 rport;
> > +	u32 private_data_length;
> > +	u8 private_data[0]; /* data is in-line in the msg. */
> > +} PACKED ccwr_ae_connection_request_t;
> > +
> > +typedef union {
> > +	ccwr_ae_hdr_t ae_generic;
> > +	ccwr_ae_active_connect_results_t ae_active_connect_results;
> > +	ccwr_ae_connection_request_t ae_connection_request;
> > +} PACKED ccwr_ae_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 hint_count;
> > +	u64 q0_host_shared;
> > +	u64 q1_host_shared;
> > +	u64 q1_host_msg_pool;
> > +	u64 q2_host_shared;
> > +	u64 q2_host_msg_pool;
> > +} PACKED ccwr_init_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_init_rep_t;
> > +
> > +typedef union {
> > +	ccwr_init_req_t req;
> > +	ccwr_init_rep_t rep;
> > +} PACKED ccwr_init_t;
> > +
> > +/*
> > + * For upgrading flash.
> > + */
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +} PACKED ccwr_flash_init_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 adapter_flash_buf_offset;
> > +	u32 adapter_flash_len;
> > +} PACKED ccwr_flash_init_rep_t;
> > +
> > +typedef union {
> > +	ccwr_flash_init_req_t req;
> > +	ccwr_flash_init_rep_t rep;
> > +} PACKED ccwr_flash_init_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 len;
> > +} PACKED ccwr_flash_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 status;
> > +} PACKED ccwr_flash_rep_t;
> > +
> > +typedef union {
> > +	ccwr_flash_req_t req;
> > +	ccwr_flash_rep_t rep;
> > +} PACKED ccwr_flash_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 size;
> > +} PACKED ccwr_buf_alloc_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 offset; /* 0 if mem not available */
> > +	u32 size; /* 0 if mem not available */
> > +} PACKED ccwr_buf_alloc_rep_t;
> > +
> > +typedef union {
> > +	ccwr_buf_alloc_req_t req;
> > +	ccwr_buf_alloc_rep_t rep;
> > +} PACKED ccwr_buf_alloc_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 offset; /* Must match value from alloc */
> > +	u32 size; /* Must match value from alloc */
> > +} PACKED ccwr_buf_free_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_buf_free_rep_t;
> > +
> > +typedef union {
> > +	ccwr_buf_free_req_t req;
> > +	ccwr_buf_free_rep_t rep;
> > +} PACKED ccwr_buf_free_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 offset;
> > +	u32 size;
> > +	u32 type;
> > +	u32 flags;
> > +} PACKED ccwr_flash_write_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 status;
> > +} PACKED ccwr_flash_write_rep_t;
> > +
> > +typedef union {
> > +	ccwr_flash_write_req_t req;
> > +	ccwr_flash_write_rep_t rep;
> > +} PACKED ccwr_flash_write_t;
> > +
> > +/*
> > + * Messages for LLP connection setup.
> > + */
> > +
> > +/*
> > + * Listen Request. This allocates a listening endpoint to allow passive
> > + * connection setup. Newly established LLP connections are passed up
> > + * via an AE. See ccwr_ae_connection_request_t
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u64 user_context; /* returned in AEs. */
> > +	u32 rnic_handle;
> > +	u32 local_addr; /* local addr, or 0 */
> > +	u16 local_port; /* 0 means "pick one" */
> > +	u16 pad;
> > +	u32 backlog; /* traditional tcp listen bl */
> > +} PACKED ccwr_ep_listen_create_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 ep_handle; /* handle to new listening ep */
> > +	u16 local_port; /* resulting port... */
> > +	u16 pad;
> > +} PACKED ccwr_ep_listen_create_rep_t;
> > +
> > +typedef union {
> > +	ccwr_ep_listen_create_req_t req;
> > +	ccwr_ep_listen_create_rep_t rep;
> > +} PACKED ccwr_ep_listen_create_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 ep_handle;
> > +} PACKED ccwr_ep_listen_destroy_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_ep_listen_destroy_rep_t;
> > +
> > +typedef union {
> > +	ccwr_ep_listen_destroy_req_t req;
> > +	ccwr_ep_listen_destroy_rep_t rep;
> > +} PACKED ccwr_ep_listen_destroy_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 ep_handle;
> > +} PACKED ccwr_ep_query_req_t;
> > +
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 local_addr;
> > +	u32 remote_addr;
> > +	u16 local_port;
> > +	u16 remote_port;
> > +} PACKED ccwr_ep_query_rep_t;
> > +
> > +typedef union {
> > +	ccwr_ep_query_req_t req;
> > +	ccwr_ep_query_rep_t rep;
> > +} PACKED ccwr_ep_query_t;
> > +
> > +
> > +/*
> > + * The host passes this down to indicate acceptance of a pending iWARP
> > + * connection. The cr_handle was obtained from the CONNECTION_REQUEST
> > + * AE passed up by the adapter. See ccwr_ae_connection_request_t.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 qp_handle; /* QP to bind to this LLP conn */
> > +	u32 ep_handle; /* LLP handle to accept */
> > +	u32 private_data_length;
> > +	u8 private_data[0]; /* data in-line in msg. */
> > +} PACKED ccwr_cr_accept_req_t;
> > +
> > +/*
> > + * adapter sends reply when private data is successfully submitted to
> > + * the LLP.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_cr_accept_rep_t;
> > +
> > +typedef union {
> > +	ccwr_cr_accept_req_t req;
> > +	ccwr_cr_accept_rep_t rep;
> > +} PACKED ccwr_cr_accept_t;
> > +
> > +/*
> > + * The host sends this down if a given iWARP connection request was
> > + * rejected by the consumer. The cr_handle was obtained from a
> > + * previous ccwr_ae_connection_request_t AE sent by the adapter.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +	u32 rnic_handle;
> > +	u32 ep_handle; /* LLP handle to reject */
> > +} PACKED ccwr_cr_reject_req_t;
> > +
> > +/*
> > + * Dunno if this is needed, but we'll add it for now. The adapter will
> > + * send the reject_reply after the LLP endpoint has been destroyed.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr;
> > +} PACKED ccwr_cr_reject_rep_t;
> > +
> > +typedef union {
> > +	ccwr_cr_reject_req_t req;
> > +	ccwr_cr_reject_rep_t rep;
> > +} PACKED ccwr_cr_reject_t;
> > +
> > +/*
> > + * console command. Used to implement a debug console over the verbs
> > + * request and reply queues.
> > + */
> > +
> > +/*
> > + * Console request message. It contains:
> > + * - message hdr with id = CCWR_CONSOLE
> > + * - the physaddr/len of host memory to be used for the reply.
> > + * - the command string. eg: "netstat -s" or "zoneinfo"
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */
> > +	u64 reply_buf; /* pinned host buf for reply */
> > +	u32 reply_buf_len; /* length of reply buffer */
> > +	u8 command[0]; /* NUL terminated ascii string */
> > +	/* containing the command req */
> > +} PACKED ccwr_console_req_t;
> > +
> > +/*
> > + * flags used in the console reply.
> > + */
> > +typedef enum {
> > +	CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */
> > +} PACKED cc_console_flags_t;
> > +
> > +/*
> > + * Console reply message.
> > + * hdr.result contains the cc_status_t error if the reply was _not_ generated,
> > + * or CC_OK if the reply was generated.
> > + */
> > +typedef struct {
> > +	ccwr_hdr_t hdr; /* id = CCWR_CONSOLE */
> > +	u32 flags; /* see cc_console_flags_t */
> > +} PACKED ccwr_console_rep_t;
> > +
> > +typedef union {
> > +	ccwr_console_req_t req;
> > +	ccwr_console_rep_t rep;
> > +} PACKED ccwr_console_t;
> > +
> > +
> > +/*
> > + * Giant union with all WRs. Makes life easier...
> > + */
> > +typedef union {
> > +	ccwr_hdr_t hdr;
> > +	ccwr_user_hdr_t user_hdr;
> > +	ccwr_rnic_open_t rnic_open;
> > +	ccwr_rnic_query_t rnic_query;
> > +	ccwr_rnic_getconfig_t rnic_getconfig;
> > +	ccwr_rnic_setconfig_t rnic_setconfig;
> > +	ccwr_rnic_close_t rnic_close;
> > +	ccwr_cq_create_t cq_create;
> > +	ccwr_cq_modify_t cq_modify;
> > +	ccwr_cq_destroy_t cq_destroy;
> > +	ccwr_pd_alloc_t pd_alloc;
> > +	ccwr_pd_dealloc_t pd_dealloc;
> > +	ccwr_srq_create_t srq_create;
> > +	ccwr_srq_destroy_t srq_destroy;
> > +	ccwr_qp_create_t qp_create;
> > +	ccwr_qp_query_t qp_query;
> > +	ccwr_qp_modify_t qp_modify;
> > +	ccwr_qp_destroy_t qp_destroy;
> > +	ccwr_qp_connect_t qp_connect;
> > +	ccwr_nsmr_stag_alloc_t nsmr_stag_alloc;
> > +	ccwr_nsmr_register_t nsmr_register;
> > +	ccwr_nsmr_pbl_t nsmr_pbl;
> > +	ccwr_mr_query_t mr_query;
> > +	ccwr_mw_query_t mw_query;
> > +	ccwr_stag_dealloc_t stag_dealloc;
> > +	ccwr_sqwr_t sqwr;
> > +	ccwr_rqwr_t rqwr;
> > +	ccwr_ce_t ce;
> > +	ccwr_ae_t ae;
> > +	ccwr_init_t init;
> > +	ccwr_ep_listen_create_t ep_listen_create;
> > +	ccwr_ep_listen_destroy_t ep_listen_destroy;
> > +	ccwr_cr_accept_t cr_accept;
> > +	ccwr_cr_reject_t cr_reject;
> > +	ccwr_console_t console;
> > +	ccwr_flash_init_t flash_init;
> > +	ccwr_flash_t flash;
> > +	ccwr_buf_alloc_t buf_alloc;
> > +	ccwr_buf_free_t buf_free;
> > +	ccwr_flash_write_t flash_write;
> > +} PACKED ccwr_t;
> > +
> > +
> > +/*
> > + * Accessors for the wr fields that are packed together tightly to
> > + * reduce the wr message size. The wr arguments are void* so that
> > + * either a ccwr_t*, a ccwr_hdr_t*, or a pointer to any of the types
> > + * in the ccwr_t union can be passed in.
> > + */
> > +static __inline__ u8
> > +c2_wr_get_id(void *wr)
> > +{
> > +	return ((ccwr_hdr_t *)wr)->id;
> > +}
> > +static __inline__ void
> > +c2_wr_set_id(void *wr, u8 id)
> > +{
> > +	((ccwr_hdr_t *)wr)->id = id;
> > +}
> > +static __inline__ u8
> > +c2_wr_get_result(void *wr)
> > +{
> > +	return ((ccwr_hdr_t *)wr)->result;
> > +}
> > +static __inline__ void
> > +c2_wr_set_result(void *wr, u8 result)
> > +{
> > +	((ccwr_hdr_t *)wr)->result = result;
> > +}
> > +static __inline__ u8
> > +c2_wr_get_flags(void *wr)
> > +{
> > +	return ((ccwr_hdr_t *)wr)->flags;
> > +}
> > +static __inline__ void
> > +c2_wr_set_flags(void *wr, u8 flags)
> > +{
> > +	((ccwr_hdr_t *)wr)->flags = flags;
> > +}
> > +static __inline__ u8
> > +c2_wr_get_sge_count(void *wr)
> > +{
> > +	return ((ccwr_hdr_t *)wr)->sge_count;
> > +}
> > +static __inline__ void
> > +c2_wr_set_sge_count(void *wr, u8 sge_count)
> > +{
> > +	((ccwr_hdr_t *)wr)->sge_count = sge_count;
> > +}
> > +static __inline__ u32
> > +c2_wr_get_wqe_count(void *wr)
> > +{
> > +	return ((ccwr_hdr_t *)wr)->wqe_count;
> > +}
> > +static __inline__ void
> > +c2_wr_set_wqe_count(void *wr, u32 wqe_count)
> > +{
> > +	((ccwr_hdr_t *)wr)->wqe_count = wqe_count;
> > +}
> > +
> > +#undef PACKED
> > +
> > +#ifdef _MSC_VER
> > +#pragma pack(pop)
> > +#endif
> > +
> > +#endif /* _CC_WR_H_ */
> > Index: hw/amso1100/c2_cm.c
> > ===================================================================
> > --- hw/amso1100/c2_cm.c (revision 0)
> > +++ hw/amso1100/c2_cm.c (revision 0)
> > @@ -0,0 +1,415 @@
> > +/*
> > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
> > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
> > + *
> > + * This software is available to you under a choice of one of two
> > + * licenses. You may choose to be licensed under the terms of the GNU
> > + * General Public License (GPL) Version 2, available from the file
> > + * COPYING in the main directory of this source tree, or the
> > + * OpenIB.org BSD license below:
> > + *
> > + *     Redistribution and use in source and binary forms, with or
> > + *     without modification, are permitted provided that the following
> > + *     conditions are met:
> > + *
> > + *      - Redistributions of source code must retain the above
> > + *        copyright notice, this list of conditions and the following
> > + *        disclaimer.
> > + *
> > + *      - Redistributions in binary form must reproduce the above
> > + *        copyright notice, this list of conditions and the following
> > + *        disclaimer in the documentation and/or other materials
> > + *        provided with the distribution.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + */
> > +#include "c2.h"
> > +#include "c2_vq.h"
> > +#include
> > +
> > +int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len)
> > +{
> > +	struct c2_dev *c2dev = to_c2dev(cm_id->device);
> > +	struct c2_qp *qp = to_c2qp(cm_id->qp);
> > +	ccwr_qp_connect_req_t *wr; /* variable size needs a malloc. */
> > +	struct c2_vq_req *vq_req;
> > +	int err;
> > +
> > +	/*
> > +	 * only support the max private_data length
> > +	 */
> > +	if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) {
> > +		return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * Create and send a WR_QP_CONNECT...
> > + */ > > + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); > > + if (!wr) { > > + return -ENOMEM; > > + } > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + c2_wr_set_id(wr, CCWR_QP_CONNECT); > > + wr->hdr.context = 0; > > + wr->rnic_handle = c2dev->adapter_handle; > > + wr->qp_handle = qp->adapter_handle; > > + > > + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; > > + wr->remote_port = cm_id->remote_addr.sin_port; > > + > > + /* > > + * Move any private data from the callers's buf into > > + * the WR. > > + */ > > + if (pdata) { > > + wr->private_data_length = cpu_to_be32(pdata_len); > > + memcpy(&wr->private_data[0], pdata, pdata_len); > > + } else { > > + wr->private_data_length = 0; > > + } > > + > > + /* > > + * Send WR to adapter. NOTE: There is no synch reply from > > + * the adapter. > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > > + vq_req_free(c2dev, vq_req); > > +bail0: > > + kfree(wr); > > + return err; > > +} > > + > > +int > > +c2_llp_service_create(struct iw_cm_id* cm_id, int backlog) > > +{ > > + struct c2_dev *c2dev; > > + ccwr_ep_listen_create_req_t wr; > > + ccwr_ep_listen_create_rep_t *reply; > > + struct c2_vq_req *vq_req; > > + int err; > > + > > + c2dev = to_c2dev(cm_id->device); > > + if (c2dev == NULL) > > + return -EINVAL; > > + > > + /* > > + * Allocate verbs request. > > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) > > + return -ENOMEM; > > + > > + /* > > + * Build the WR > > + */ > > + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); > > + wr.hdr.context = (u64)(unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; > > + wr.local_port = cm_id->local_addr.sin_port; > > + wr.backlog = cpu_to_be32(backlog); > > + wr.user_context = (u64)(unsigned long)cm_id; > > + > > + /* > > + * Reference the request struct. Dereferenced in the int handler. 
> > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_ep_listen_create_rep_t*)(unsigned > long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + if ( (err = c2_errno(reply)) != 0) { > > + goto bail1; > > + } > > + > > + /* > > + * get the adapter handle > > + */ > > + cm_id->provider_id = reply->ep_handle; > > + > > + /* > > + * free vq stuff > > + */ > > + vq_repbuf_free(c2dev, reply); > > + vq_req_free(c2dev, vq_req); > > + > > + return 0; > > + > > +bail1: > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > + > > +int > > +c2_llp_service_destroy(struct iw_cm_id* cm_id) > > +{ > > + > > + struct c2_dev *c2dev; > > + ccwr_ep_listen_destroy_req_t wr; > > + ccwr_ep_listen_destroy_rep_t *reply; > > + struct c2_vq_req *vq_req; > > + int err; > > + > > + c2dev = to_c2dev(cm_id->device); > > + if (c2dev == NULL) > > + return -EINVAL; > > + > > + /* > > + * Allocate verbs request. > > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + return -ENOMEM; > > + } > > + > > + /* > > + * Build the WR > > + */ > > + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.ep_handle = cm_id->provider_id; > > + > > + /* > > + * reference the request struct. dereferenced in the int handler. 
> > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_ep_listen_destroy_rep_t*)(unsigned > long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + if ( (err = c2_errno(reply)) != 0) { > > + goto bail1; > > + } > > + > > +bail1: > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > + > > +int > > +c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > > +{ > > + struct c2_dev *c2dev = to_c2dev(cm_id->device); > > + struct c2_qp *qp = to_c2qp(cm_id->qp); > > + ccwr_cr_accept_req_t *wr; /* variable length WR */ > > + struct c2_vq_req *vq_req; > > + ccwr_cr_accept_rep_t *reply; /* VQ Reply msg ptr. */ > > + int err; > > + > > + /* Make sure there's a bound QP */ > > + if (qp == 0) > > + return -EINVAL; > > + > > + /* > > + * only support the max private_data length > > + */ > > + if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { > > + return -EINVAL; > > + } > > + > > + /* > > + * Allocate verbs request. 
> > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + return -ENOMEM; > > + } > > + > > + wr = kmalloc(sizeof(*wr) + pdata_len, GFP_KERNEL); > > + if (!wr) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + /* > > + * Build the WR > > + */ > > + c2_wr_set_id(wr, CCWR_CR_ACCEPT); > > + wr->hdr.context = (unsigned long)vq_req; > > + wr->rnic_handle = c2dev->adapter_handle; > > + wr->ep_handle = (u32)cm_id->provider_id; > > + wr->qp_handle = qp->adapter_handle; > > + if (pdata) { > > + wr->private_data_length = cpu_to_be32(pdata_len); > > + memcpy(&wr->private_data[0], pdata, pdata_len); > > + } else { > > + wr->private_data_length = 0; > > + } > > + > > + /* > > + * reference the request struct. dereferenced in the int handler. > > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail1; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail1; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_cr_accept_rep_t*)(unsigned long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + err = c2_errno(reply); > > + vq_repbuf_free(c2dev, reply); > > + > > +bail1: > > + kfree(wr); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +int > > +c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > > +{ > > + struct c2_dev *c2dev; > > + ccwr_cr_reject_req_t wr; > > + struct c2_vq_req *vq_req; > > + ccwr_cr_reject_rep_t *reply; > > + int err; > > + > > + c2dev = to_c2dev(cm_id->device); > > + > > + /* > > + * Allocate verbs request. 
> > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + return -ENOMEM; > > + } > > + > > + /* > > + * Build the WR > > + */ > > + c2_wr_set_id(&wr, CCWR_CR_REJECT); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.ep_handle = (u32)cm_id->provider_id; > > + > > + /* > > + * reference the request struct. dereferenced in the int handler. > > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_cr_reject_rep_t*)(unsigned long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + err = c2_errno(reply); > > + > > + /* > > + * free vq stuff > > + */ > > + vq_repbuf_free(c2dev, reply); > > + > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > Index: hw/amso1100/c2_provider.h > > =================================================================== > > --- hw/amso1100/c2_provider.h (revision 0) > > +++ hw/amso1100/c2_provider.h (revision 0) > > @@ -0,0 +1,174 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + * > > + */ > > + > > +#ifndef C2_PROVIDER_H > > +#define C2_PROVIDER_H > > + > > +#include > > +#include > > + > > +#include "c2_mq.h" > > +#include > > + > > +#define C2_MPT_FLAG_ATOMIC (1 << 14) > > +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) > > +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) > > +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) > > +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) > > + > > +struct c2_buf_list { > > + void *buf; > > + DECLARE_PCI_UNMAP_ADDR(mapping) > > +}; > > + > > + > > +/* The user context keeps track of objects allocated for a > > + * particular user-mode client. 
*/ > > +struct c2_ucontext { > > + struct ib_ucontext ibucontext; > > + > > + int index; /* rnic index (minor) */ > > + int port; /* Which GigE port */ > > + > > + /* > > + * Shared HT pages for user-accessible MQs. > > + */ > > + int hthead; /* index of first > free entry */ > > + void* htpages; /* kernel vaddr */ > > + int htlen; /* length of htpages memory */ > > + void* htuva; /* user mapped vaddr */ > > + spinlock_t htlock; /* serialize allocation */ > > + u64 adapter_hint_uva; /* Activity FIFO */ > > +}; > > + > > +struct c2_mtt; > > + > > +/* All objects associated with a PD are kept in the > > + * associated user context if present. > > + */ > > +struct c2_pd { > > + struct ib_pd ibpd; > > + u32 pd_id; > > + atomic_t sqp_count; > > +}; > > + > > +struct c2_mr { > > + struct ib_mr ibmr; > > + struct c2_pd *pd; > > +}; > > + > > +struct c2_av; > > + > > +enum c2_ah_type { > > + C2_AH_ON_HCA, > > + C2_AH_PCI_POOL, > > + C2_AH_KMALLOC > > +}; > > + > > +struct c2_ah { > > + struct ib_ah ibah; > > +}; > > + > > +struct c2_cq { > > + struct ib_cq ibcq; > > + spinlock_t lock; > > + atomic_t refcount; > > + int cqn; > > + int is_kernel; > > + wait_queue_head_t wait; > > + > > + u32 adapter_handle; > > + struct c2_mq mq; > > +}; > > + > > +struct c2_wq { > > + spinlock_t lock; > > +}; > > +struct iw_cm_id; > > +struct c2_qp { > > + struct ib_qp ibqp; > > + struct iw_cm_id* cm_id; > > + spinlock_t lock; > > + atomic_t refcount; > > + wait_queue_head_t wait; > > + int qpn; > > + > > + u32 adapter_handle; > > + u32 send_sgl_depth; > > + u32 recv_sgl_depth; > > + u32 rdma_write_sgl_depth; > > + u8 state; > > + > > + struct c2_mq sq_mq; > > + struct c2_mq rq_mq; > > +}; > > + > > +struct c2_cr_query_attrs { > > + u32 local_addr; > > + u32 remote_addr; > > + u16 local_port; > > + u16 remote_port; > > +}; > > + > > +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) > > +{ > > + return container_of(ibpd, struct c2_pd, ibpd); > > +} > > + > > +static inline struct 
c2_ucontext *to_c2ucontext(struct ib_ucontext > *ibucontext) > > +{ > > + return container_of(ibucontext, struct c2_ucontext, ibucontext); > > +} > > + > > +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) > > +{ > > + return container_of(ibmr, struct c2_mr, ibmr); > > +} > > + > > + > > +static inline struct c2_ah *to_c2ah(struct ib_ah *ibah) > > +{ > > + return container_of(ibah, struct c2_ah, ibah); > > +} > > + > > +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) > > +{ > > + return container_of(ibcq, struct c2_cq, ibcq); > > +} > > + > > +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) > > +{ > > + return container_of(ibqp, struct c2_qp, ibqp); > > +} > > +#endif /* C2_PROVIDER_H */ > > Index: hw/amso1100/c2_pd.c > > =================================================================== > > --- hw/amso1100/c2_pd.c (revision 0) > > +++ hw/amso1100/c2_pd.c (revision 0) > > @@ -0,0 +1,73 @@ > > +/* > > + * Copyright (c) 2004 Topspin Communications. All rights reserved. > > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. 
> > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > + > > +#include > > +#include > > + > > +#include "c2.h" > > +#include "c2_provider.h" > > + > > +int c2_pd_alloc(struct c2_dev *dev, int privileged, struct c2_pd *pd) > > +{ > > + int err = 0; > > + > > + might_sleep(); > > + > > + atomic_set(&pd->sqp_count, 0); > > + pd->pd_id = c2_alloc(&dev->pd_table.alloc); > > + if (pd->pd_id == -1) > > + return -ENOMEM; > > + > > + return err; > > +} > > + > > +void c2_pd_free(struct c2_dev *dev, struct c2_pd *pd) > > +{ > > + might_sleep(); > > + c2_free(&dev->pd_table.alloc, pd->pd_id); > > +} > > + > > +int __devinit c2_init_pd_table(struct c2_dev *dev) > > +{ > > + return c2_alloc_init(&dev->pd_table.alloc, > > + dev->max_pd, > > + 0); > > +} > > + > > +void __devexit c2_cleanup_pd_table(struct c2_dev *dev) > > +{ > > + /* XXX check if any PDs are still allocated? */ > > + c2_alloc_cleanup(&dev->pd_table.alloc); > > +} > > Index: hw/amso1100/c2_cq.c > > =================================================================== > > --- hw/amso1100/c2_cq.c (revision 0) > > +++ hw/amso1100/c2_cq.c (revision 0) > > @@ -0,0 +1,401 @@ > > +/* > > + * Copyright (c) 2004, 2005 Topspin Communications. All rights > reserved. 
> > + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > > + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. > > + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. > > + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + * > > + */ > > +#include "c2.h" > > +#include "c2_vq.h" > > +#include "cc_status.h" > > + > > +#define C2_CQ_MSG_SIZE ((sizeof(ccwr_ce_t) + 32-1) & ~(32-1)) > > + > > +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + struct c2_cq *cq; > > + > > + cq = c2dev->qptr_array[mq_index]; > > + > > + if (!cq) { > > + dprintk("Completion event for bogus CQ %08x\n", mq_index); > > + return; > > + } > > + > > + assert(cq->ibcq.comp_handler); > > + (*cq->ibcq.comp_handler)(&cq->ibcq, cq->ibcq.cq_context); > > +} > > + > > +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) > > +{ > > + struct c2_cq *cq; > > + struct c2_mq *q; > > + > > + cq = c2dev->qptr_array[mq_index]; > > + if (!cq) > > + return; > > + > > + spin_lock_irq(&cq->lock); > > + > > + q = &cq->mq; > > + if (q && !c2_mq_empty(q)) { > > + u16 priv = q->priv; > > + ccwr_ce_t *msg; > > + > > + while (priv != cpu_to_be16(*q->shared)) { > > + msg = (ccwr_ce_t *)(q->msg_pool + priv * q->msg_size); > > + if (msg->qp_user_context == (u64)(unsigned long)qp) { > > + msg->qp_user_context = (u64)0; > > + } > > + BUMP(q, priv); > > + } > > + } > > + > > + spin_unlock_irq(&cq->lock); > > +} > > + > > +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) > > +{ > > + switch (status) { > > + case CC_OK: return IB_WC_SUCCESS; > > + case CCERR_FLUSHED: return IB_WC_WR_FLUSH_ERR; > > + case CCERR_BASE_AND_BOUNDS_VIOLATION: return IB_WC_LOC_PROT_ERR; > > + case CCERR_ACCESS_VIOLATION: return IB_WC_LOC_ACCESS_ERR; > > + case CCERR_TOTAL_LENGTH_TOO_BIG: return IB_WC_LOC_LEN_ERR; > > + case CCERR_INVALID_WINDOW: return IB_WC_MW_BIND_ERR; > > + default: return IB_WC_GENERAL_ERR; > > + } > > +} > > + > > + > > +static inline int c2_poll_one(struct c2_dev *c2dev, > > + struct c2_cq *cq, > > + struct ib_wc *entry) > > +{ > > + ccwr_ce_t *ce; > > + struct c2_qp *qp; > > + int is_recv = 0; > > + > > + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); > > + if (!ce) { > > + return -EAGAIN; > > + 
} > > + > > + /* > > + * if the qp returned is null then this qp has already > > + * been freed and we are unable to process the completion. > > + * try pulling the next message > > + */ > > + while ( (qp = (struct c2_qp *)(unsigned long)ce->qp_user_context) == > NULL) { > > + c2_mq_free(&cq->mq); > > + ce = (ccwr_ce_t *)c2_mq_consume(&cq->mq); > > + if (!ce) > > + return -EAGAIN; > > + } > > + > > + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); > > + entry->wr_id = ce->hdr.context; > > + entry->qp_num = ce->handle; > > + entry->wc_flags = 0; > > + entry->slid = 0; > > + entry->sl = 0; > > + entry->src_qp = 0; > > + entry->dlid_path_bits = 0; > > + entry->pkey_index = 0; > > + > > + switch (c2_wr_get_id(ce)) { > > + case CC_WR_TYPE_SEND: > > + entry->opcode = IB_WC_SEND; > > + break; > > + case CC_WR_TYPE_RDMA_WRITE: > > + entry->opcode = IB_WC_RDMA_WRITE; > > + break; > > + case CC_WR_TYPE_RDMA_READ: > > + entry->opcode = IB_WC_RDMA_READ; > > + break; > > + case CC_WR_TYPE_BIND_MW: > > + entry->opcode = IB_WC_BIND_MW; > > + break; > > + case CC_WR_TYPE_RECV: > > + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); > > + entry->opcode = IB_WC_RECV; > > + is_recv = 1; > > + break; > > + default: > > + break; > > + } > > + > > + /* consume the WQEs */ > > + if (is_recv) > > + c2_mq_lconsume(&qp->rq_mq, 1); > > + else > > + c2_mq_lconsume(&qp->sq_mq, > be32_to_cpu(c2_wr_get_wqe_count(ce))+1); > > + > > + /* free the message */ > > + c2_mq_free(&cq->mq); > > + > > + return 0; > > +} > > + > > +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, > > + struct ib_wc *entry) > > +{ > > + struct c2_dev *c2dev = to_c2dev(ibcq->device); > > + struct c2_cq *cq = to_c2cq(ibcq); > > + unsigned long flags; > > + int npolled, err; > > + > > + spin_lock_irqsave(&cq->lock, flags); > > + > > + for (npolled = 0; npolled < num_entries; ++npolled) { > > + > > + err = c2_poll_one(c2dev, cq, entry + npolled); > > + if (err) > > + break; > > + } > > + > > + 
spin_unlock_irqrestore(&cq->lock, flags); > > + > > + return npolled; > > +} > > + > > +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > > +{ > > + struct c2_mq_shared volatile *shared; > > + struct c2_cq *cq; > > + > > + cq = to_c2cq(ibcq); > > + shared = cq->mq.peer; > > + > > + if (notify == IB_CQ_NEXT_COMP) > > + shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT; > > + else if (notify == IB_CQ_SOLICITED) > > + shared->notification_type = CC_CQ_NOTIFICATION_TYPE_NEXT_SE; > > + else > > + return -EINVAL; > > + > > + shared->armed = CQ_WAIT_FOR_DMA|CQ_ARMED; > > + > > + /* > > + * Now read back shared->armed to make the PCI > > + * write synchronous. This is necessary for > > + * correct cq notification semantics. > > + */ > > + { > > + volatile char c; > > + c = shared->armed; > > + } > > + > > + return 0; > > +} > > + > > +static void c2_free_cq_buf(struct c2_mq *mq) > > +{ > > + int npages; > > + > > + npages = ((mq->q_size * mq->msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; > > + free_pages((unsigned long)mq->msg_pool, npages); > > +} > > + > > +static int c2_alloc_cq_buf(struct c2_mq *mq, int q_size, int msg_size) > > +{ > > + unsigned long pool_start; > > + int npages; > > + > > + npages = ( (q_size * msg_size) + PAGE_SIZE - 1) / PAGE_SIZE; > > + > > + pool_start = __get_free_pages(GFP_KERNEL, npages); > > + if (!pool_start) > > + return -ENOMEM; > > + > > + c2_mq_init(mq, > > + 0, /* index (currently unknown) */ > > + q_size, > > + msg_size, > > + (u8 *)pool_start, > > + 0, /* peer (currently unknown) */ > > + C2_MQ_HOST_TARGET); > > + > > + return 0; > > +} > > + > > +int c2_init_cq(struct c2_dev *c2dev, int entries, > > + struct c2_ucontext *ctx, struct c2_cq *cq) > > +{ > > + ccwr_cq_create_req_t wr; > > + ccwr_cq_create_rep_t* reply; > > + unsigned long peer_pa; > > + struct c2_vq_req *vq_req; > > + int err; > > + > > + might_sleep(); > > + > > + cq->ibcq.cqe = entries - 1; > > + cq->is_kernel = !ctx; > > + > > + /* Allocate a shared 
pointer */ > > + cq->mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); > > + if (!cq->mq.shared) > > + return -ENOMEM; > > + > > + /* Allocate pages for the message pool */ > > + err = c2_alloc_cq_buf(&cq->mq, entries+1, C2_CQ_MSG_SIZE); > > + if (err) > > + goto bail0; > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_CQ_CREATE); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.msg_size = cpu_to_be32(cq->mq.msg_size); > > + wr.depth = cpu_to_be32(cq->mq.q_size); > > + wr.shared_ht = cpu_to_be64(__pa(cq->mq.shared)); > > + wr.msg_pool = cpu_to_be64(__pa(cq->mq.msg_pool)); > > + wr.user_context = (u64)(unsigned long)(cq); > > + > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail2; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) > > + goto bail2; > > + > > + reply = (ccwr_cq_create_rep_t*)(unsigned long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail2; > > + } > > + > > + if ( (err = c2_errno(reply)) != 0) > > + goto bail3; > > + > > + cq->adapter_handle = reply->cq_handle; > > + cq->mq.index = be32_to_cpu(reply->mq_index); > > + > > + peer_pa = (unsigned long)(c2dev->pa + > be32_to_cpu(reply->adapter_shared)); > > + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); > > + if (!cq->mq.peer) { > > + err = -ENOMEM; > > + goto bail3; > > + } > > + > > + vq_repbuf_free(c2dev, reply); > > + vq_req_free(c2dev, vq_req); > > + > > + spin_lock_init(&cq->lock); > > + atomic_set(&cq->refcount, 1); > > + init_waitqueue_head(&cq->wait); > > + > > + /* > > + * Use the MQ index allocated by the adapter to > > + * store the CQ in the qptr_array > > + */ > > + /* XXX qptr_array lock? 
*/ > > + cq->cqn = cq->mq.index; > > + c2dev->qptr_array[cq->cqn] = cq; > > + > > + return 0; > > + > > +bail3: > > + vq_repbuf_free(c2dev, reply); > > +bail2: > > + vq_req_free(c2dev, vq_req); > > +bail1: > > + c2_free_cq_buf(&cq->mq); > > +bail0: > > + c2_free_mqsp(cq->mq.shared); > > + > > + return err; > > +} > > + > > +void c2_free_cq(struct c2_dev *c2dev, > > + struct c2_cq *cq) > > +{ > > + int err; > > + struct c2_vq_req *vq_req; > > + ccwr_cq_destroy_req_t wr; > > + ccwr_cq_destroy_rep_t *reply; > > + > > + might_sleep(); > > + > > + atomic_dec(&cq->refcount); > > + wait_event(cq->wait, !atomic_read(&cq->refcount)); > > + > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + goto bail0; > > + } > > + > > + memset(&wr, 0, sizeof(wr)); > > + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); > > + wr.hdr.context = (unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.cq_handle = cq->adapter_handle; > > + > > + vq_req_get(c2dev, vq_req); > > + > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail1; > > + } > > + > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) > > + goto bail1; > > + > > + reply = (ccwr_cq_destroy_rep_t*)(unsigned long)(vq_req->reply_msg); > > + > > +//bail2: > > + vq_repbuf_free(c2dev, reply); > > +bail1: > > + vq_req_free(c2dev, vq_req); > > +bail0: > > + if (cq->is_kernel) { > > + c2_free_cq_buf(&cq->mq); > > + } > > + > > + return; > > +} > > + > > Index: hw/amso1100/Makefile > > =================================================================== > > --- hw/amso1100/Makefile (revision 0) > > +++ hw/amso1100/Makefile (revision 0) > > @@ -0,0 +1,22 @@ > > +EXTRA_CFLAGS += -Idrivers/infiniband/include > > + > > +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG > > +EXTRA_CFLAGS += -DC2_DEBUG > > +endif > > + > > +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o > > + > > +iw_c2-y := \ > > + c2.o \ > > + c2_provider.o \ > > + c2_rnic.o \ > > + c2_alloc.o \ > > + c2_mq.o 
\ > > + c2_ae.o \ > > + c2_vq.o \ > > + c2_intr.o \ > > + c2_cq.o \ > > + c2_qp.o \ > > + c2_cm.o \ > > + c2_mm.o \ > > + c2_pd.o > > Index: hw/amso1100/c2_mm.c > > =================================================================== > > --- hw/amso1100/c2_mm.c (revision 0) > > +++ hw/amso1100/c2_mm.c (revision 0) > > @@ -0,0 +1,376 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + */ > > +#include "c2.h" > > +#include "c2_vq.h" > > + > > +#define PBL_VIRT 1 > > +#define PBL_PHYS 2 > > + > > +/* > > + * Send all the PBL messages to convey the remainder of the PBL > > + * Wait for the adapter's reply on the last one. > > + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. > > + * > > + * NOTE: vq_req is _not_ freed by this function. The VQ Host > > + * Reply buffer _is_ freed by this function. > > + */ > > +static int > > +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, > > + unsigned long va, u32 pbl_depth, > > + struct c2_vq_req *vq_req, int pbl_type) > > +{ > > + u32 pbe_count; /* amt that fits in a PBL msg */ > > + u32 count; /* amt in this PBL MSG. */ > > + ccwr_nsmr_pbl_req_t *wr; /* PBL WR ptr */ > > + ccwr_nsmr_pbl_rep_t *reply; /* reply ptr */ > > + int err, pbl_virt, i; > > + > > + switch (pbl_type) { > > + case PBL_VIRT: > > + pbl_virt = 1; > > + break; > > + case PBL_PHYS: > > + pbl_virt = 0; > > + break; > > + default: > > + return -EINVAL; > > + break; > > + } > > + > > + pbe_count = (c2dev->req_vq.msg_size - > > + sizeof(ccwr_nsmr_pbl_req_t)) / sizeof(u64); > > + wr = (ccwr_nsmr_pbl_req_t*)kmalloc(c2dev->req_vq.msg_size, > GFP_KERNEL); > > + if (!wr) { > > + return -ENOMEM; > > + } > > + c2_wr_set_id(wr, CCWR_NSMR_PBL); > > + > > + /* > > + * Only the last PBL message will generate a reply from the verbs, > > + * so we set the context to 0 indicating there is no kernel verbs > > + * handler blocked awaiting this reply. > > + */ > > + wr->hdr.context = 0; > > + wr->rnic_handle = c2dev->adapter_handle; > > + wr->stag_index = stag_index; /* already swapped */ > > + wr->flags = 0; > > + while (pbl_depth) { > > + count = min(pbe_count, pbl_depth); > > + wr->addrs_length = cpu_to_be32(count); > > + > > + /* > > + * If this is the last message, then reference the > > + * vq request struct cuz we're gonna wait for a reply. > > + * also make this PBL msg as the last one. 
> > + */ > > + if (count == pbl_depth) { > > + /* > > + * reference the request struct. dereferenced in the > > + * int handler. > > + */ > > + vq_req_get(c2dev, vq_req); > > + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); > > + > > + /* > > + * This is the last PBL message. > > + * Set the context to our VQ Request Object so we can > > + * wait for the reply. > > + */ > > + wr->hdr.context = (unsigned long)vq_req; > > + } > > + > > + /* > > + * if pbl_virt is set then va is a virtual address that describes > a > > + * virtually contiguous memory allocation. the wr needs the start > of > > + * each virtual page to be converted to the corresponding > physical > > + * address of the page. > > + * > > + * if pbl_virt is not set then va is an array of physical > addresses and > > + * there is no conversion to do. just fill in the wr with what > is in > > + * the array. > > + */ > > + for (i=0; i < count; i++) { > > + if (pbl_virt) { > > + /* XXX */ //wr->paddrs[i] = > cpu_to_be64(user_virt_to_phys(va)); > > + va += PAGE_SIZE; > > + } else { > > + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)((void > **)va)[i]); > > + } > > + } > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)wr); > > + if (err) { > > + if (count <= pbe_count) { > > + vq_req_put(c2dev, vq_req); > > + } > > + goto bail0; > > + } > > + pbl_depth -= count; > > + } > > + > > + /* > > + * Now wait for the reply... 
> > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_nsmr_pbl_rep_t*)(unsigned long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + err = c2_errno(reply); > > + > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + kfree(wr); > > + return err; > > +} > > + > > +#define CC_PBL_MAX_DEPTH 131072 > > +int > > +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, > > + int pbl_depth, u32 length, u64 *va, > > + cc_acf_t acf, struct c2_mr *mr) > > +{ > > + struct c2_vq_req *vq_req; > > + ccwr_nsmr_register_req_t *wr; > > + ccwr_nsmr_register_rep_t *reply; > > + u16 flags; > > + int i, pbe_count, count; > > + int err; > > + > > + if (!va || !length || !addr_list || !pbl_depth) > > + return -EINTR; > > + > > + /* > > + * Verify PBL depth is within rnic max > > + */ > > + if (pbl_depth > CC_PBL_MAX_DEPTH) { > > + return -EINTR; > > + } > > + > > + /* > > + * allocate verbs request object > > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) > > + return -ENOMEM; > > + > > + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); > > + if (!wr) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + /* > > + * build the WR > > + */ > > + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); > > + wr->hdr.context = (unsigned long)vq_req; > > + wr->rnic_handle = c2dev->adapter_handle; > > + > > + flags = (acf | MEM_VA_BASED | MEM_REMOTE); > > + > > + /* > > + * compute how many pbes can fit in the message > > + */ > > + pbe_count = (c2dev->req_vq.msg_size - > > + sizeof(ccwr_nsmr_register_req_t)) / > > + sizeof(u64); > > + > > + if (pbl_depth <= pbe_count) { > > + flags |= MEM_PBL_COMPLETE; > > + } > > + wr->flags = cpu_to_be16(flags); > > + wr->stag_key = 0; //stag_key; > > + wr->va = cpu_to_be64(*va); > > + wr->pd_id = mr->pd->pd_id; > > + wr->pbe_size = cpu_to_be32(PAGE_SIZE); > > + wr->length = 
cpu_to_be32(length); > > + wr->pbl_depth = cpu_to_be32(pbl_depth); > > + wr->fbo = cpu_to_be32(0); > > + count = min(pbl_depth, pbe_count); > > + wr->addrs_length = cpu_to_be32(count); > > + > > + /* > > + * fill out the PBL for this message > > + */ > > + for (i = 0; i < count; i++) { > > + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)addr_list[i]); > > + } > > + > > + /* > > + * regerence the request struct > > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * send the WR to the adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t *)wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail1; > > + } > > + > > + /* > > + * wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail1; > > + } > > + > > + /* > > + * process reply > > + */ > > + reply = (ccwr_nsmr_register_rep_t *)(unsigned > long)(vq_req->reply_msg); > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail1; > > + } > > + if ( (err = c2_errno(reply))) { > > + goto bail2; > > + } > > + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); > > + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); > > + vq_repbuf_free(c2dev, reply); > > + > > + /* > > + * if there are still more PBEs we need to send them to > > + * the adapter and wait for a reply on the final one. > > + * reuse vq_req for this purpose. 
> > + */ > > + pbl_depth -= count; > > + if (pbl_depth) { > > + > > + vq_req->reply_msg = (unsigned long)NULL; > > + atomic_set(&vq_req->reply_ready, 0); > > + err = send_pbl_messages(c2dev, > > + cpu_to_be32(mr->ibmr.lkey), > > + (unsigned long)&addr_list[i], > > + pbl_depth, vq_req, PBL_PHYS); > > + if (err) { > > + goto bail1; > > + } > > + } > > + > > + vq_req_free(c2dev, vq_req); > > + kfree(wr); > > + > > + return err; > > + > > +bail2: > > + vq_repbuf_free(c2dev, reply); > > +bail1: > > + kfree(wr); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > +int > > +c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) > > +{ > > + struct c2_vq_req *vq_req; /* verbs request object */ > > + ccwr_stag_dealloc_req_t wr; /* work request */ > > + ccwr_stag_dealloc_rep_t *reply; /* WR reply */ > > + int err; > > + > > + > > + /* > > + * allocate verbs request object > > + */ > > + vq_req = vq_req_alloc(c2dev); > > + if (!vq_req) { > > + return -ENOMEM; > > + } > > + > > + /* > > + * Build the WR > > + */ > > + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); > > + wr.hdr.context = (u64)(unsigned long)vq_req; > > + wr.rnic_handle = c2dev->adapter_handle; > > + wr.stag_index = cpu_to_be32(stag_index); > > + > > + /* > > + * reference the request struct. dereferenced in the int handler. 
> > + */ > > + vq_req_get(c2dev, vq_req); > > + > > + /* > > + * Send WR to adapter > > + */ > > + err = vq_send_wr(c2dev, (ccwr_t*)&wr); > > + if (err) { > > + vq_req_put(c2dev, vq_req); > > + goto bail0; > > + } > > + > > + /* > > + * Wait for reply from adapter > > + */ > > + err = vq_wait_for_reply(c2dev, vq_req); > > + if (err) { > > + goto bail0; > > + } > > + > > + /* > > + * Process reply > > + */ > > + reply = (ccwr_stag_dealloc_rep_t*)(unsigned long)vq_req->reply_msg; > > + if (!reply) { > > + err = -ENOMEM; > > + goto bail0; > > + } > > + > > + err = c2_errno(reply); > > + > > + vq_repbuf_free(c2dev, reply); > > +bail0: > > + vq_req_free(c2dev, vq_req); > > + return err; > > +} > > + > > + > > Index: hw/amso1100/cc_status.h > > =================================================================== > > --- hw/amso1100/cc_status.h (revision 0) > > +++ hw/amso1100/cc_status.h (revision 0) > > @@ -0,0 +1,163 @@ > > +/* > > + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. 
> > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > +#ifndef _CC_STATUS_H_ > > +#define _CC_STATUS_H_ > > + > > +/* > > + * Verbs Status Codes > > + */ > > +typedef enum { > > + CC_OK = 0, /* This must be zero */ > > + CCERR_INSUFFICIENT_RESOURCES = 1, > > + CCERR_INVALID_MODIFIER = 2, > > + CCERR_INVALID_MODE = 3, > > + CCERR_IN_USE = 4, > > + CCERR_INVALID_RNIC = 5, > > + CCERR_INTERRUPTED_OPERATION = 6, > > + CCERR_INVALID_EH = 7, > > + CCERR_INVALID_CQ = 8, > > + CCERR_CQ_EMPTY = 9, > > + CCERR_NOT_IMPLEMENTED = 10, > > + CCERR_CQ_DEPTH_TOO_SMALL = 11, > > + CCERR_PD_IN_USE = 12, > > + CCERR_INVALID_PD = 13, > > + CCERR_INVALID_SRQ = 14, > > + CCERR_INVALID_ADDRESS = 15, > > + CCERR_INVALID_NETMASK = 16, > > + CCERR_INVALID_QP = 17, > > + CCERR_INVALID_QP_STATE = 18, > > + CCERR_TOO_MANY_WRS_POSTED = 19, > > + CCERR_INVALID_WR_TYPE = 20, > > + CCERR_INVALID_SGL_LENGTH = 21, > > + CCERR_INVALID_SQ_DEPTH = 22, > > + CCERR_INVALID_RQ_DEPTH = 23, > > + CCERR_INVALID_ORD = 24, > > + CCERR_INVALID_IRD = 25, > > + CCERR_QP_ATTR_CANNOT_CHANGE = 26, > > + CCERR_INVALID_STAG = 27, > > + CCERR_QP_IN_USE = 28, > > + CCERR_OUTSTANDING_WRS = 29, > > + CCERR_STAG_IN_USE = 30, > > + CCERR_INVALID_STAG_INDEX = 31, > > + CCERR_INVALID_SGL_FORMAT = 32, > > + CCERR_ADAPTER_TIMEOUT = 33, > > + CCERR_INVALID_CQ_DEPTH = 34, > > + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, > > + CCERR_INVALID_EP = 36, > > + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, > > + CCERR_FLUSHED = 38, > > + CCERR_INVALID_WQE = 39, > > 
+ CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, > > + CCERR_REMOTE_TERMINATION_ERROR = 41, > > + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, > > + CCERR_ACCESS_VIOLATION = 43, > > + CCERR_INVALID_PD_ID = 44, > > + CCERR_WRAP_ERROR = 45, > > + CCERR_INV_STAG_ACCESS_ERROR = 46, > > + CCERR_ZERO_RDMA_READ_RESOURCES = 47, > > + CCERR_QP_NOT_PRIVILEGED = 48, > > + CCERR_STAG_STATE_NOT_INVALID = 49, > > + CCERR_INVALID_PAGE_SIZE = 50, > > + CCERR_INVALID_BUFFER_SIZE = 51, > > + CCERR_INVALID_PBE = 52, > > + CCERR_INVALID_FBO = 53, > > + CCERR_INVALID_LENGTH = 54, > > + CCERR_INVALID_ACCESS_RIGHTS = 55, > > + CCERR_PBL_TOO_BIG = 56, > > + CCERR_INVALID_VA = 57, > > + CCERR_INVALID_REGION = 58, > > + CCERR_INVALID_WINDOW = 59, > > + CCERR_TOTAL_LENGTH_TOO_BIG = 60, > > + CCERR_INVALID_QP_ID = 61, > > + CCERR_ADDR_IN_USE = 62, > > + CCERR_ADDR_NOT_AVAIL = 63, > > + CCERR_NET_DOWN = 64, > > + CCERR_NET_UNREACHABLE = 65, > > + CCERR_CONN_ABORTED = 66, > > + CCERR_CONN_RESET = 67, > > + CCERR_NO_BUFS = 68, > > + CCERR_CONN_TIMEDOUT = 69, > > + CCERR_CONN_REFUSED = 70, > > + CCERR_HOST_UNREACHABLE = 71, > > + CCERR_INVALID_SEND_SGL_DEPTH = 72, > > + CCERR_INVALID_RECV_SGL_DEPTH = 73, > > + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, > > + CCERR_INSUFFICIENT_PRIVILEGES = 75, > > + CCERR_STACK_ERROR = 76, > > + CCERR_INVALID_VERSION = 77, > > + CCERR_INVALID_MTU = 78, > > + CCERR_INVALID_IMAGE = 79, > > + CCERR_PENDING = 98, /* not an error; user internally by adapter */ > > + CCERR_DEFER = 99, /* not an error; used internally by adapter */ > > + CCERR_FAILED_WRITE = 100, > > + CCERR_FAILED_ERASE = 101, > > + CCERR_FAILED_VERIFICATION = 102, > > + CCERR_NOT_FOUND = 103, > > + > > +} cc_status_t; > > + > > +/* > > + * Verbs and Completion Status Code types... > > + */ > > +typedef cc_status_t cc_verbs_status_t; > > +typedef cc_status_t cc_wc_status_t; > > + > > +/* > > + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. 
> > + */ > > +typedef enum { > > + CC_CONN_STATUS_SUCCESS = CC_OK, > > + CC_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, > > + CC_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, > > + CC_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, > > + CC_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, > > + CC_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, > > + CC_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, > > + CC_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, > > + CC_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, > > + CC_CONN_STATUS_REJECTED = CCERR_CONN_RESET, > > +} cc_connect_status_t; > > + > > +/* > > + * Flash programming status codes. > > + */ > > +typedef enum { > > + CC_FLASH_STATUS_SUCCESS = 0x0000, > > + CC_FLASH_STATUS_VERIFY_ERR = 0x0002, > > + CC_FLASH_STATUS_IMAGE_ERR = 0x0004, > > + CC_FLASH_STATUS_ECLBS = 0x0400, > > + CC_FLASH_STATUS_PSLBS = 0x0800, > > + CC_FLASH_STATUS_VPENS = 0x1000, > > +} cc_flash_status_t; > > + > > +#endif /* _CC_STATUS_H_ */ > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From gshipman at lanl.gov Tue Jan 24 07:41:24 2006 From: gshipman at lanl.gov (Galen Shipman) Date: Tue, 24 Jan 2006 08:41:24 -0700 Subject: [openib-general] Reregister Memory Region Verb Message-ID: Hello all, Does OpenIB currently support the Reregister Memory Region Verb? I wasn't able to find it in the header files.. Thanks, Galen From caitlinb at broadcom.com Tue Jan 24 07:46:22 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 24 Jan 2006 07:46:22 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> Grant Grundler wrote: > On Mon, Jan 23, 2006 at 03:53:19PM -0800, Roland Dreier wrote: >> > Yes, but we need to start somewhere. 
Until someone submits > a >> driver that does all the things you mention, it makes > sense to >> move forward with what has been proposed to date. >> >> I agree with this, and overall I am very much in favor of getting >> iWARP support all the way upstream. > > *nod* > > BTW, this is a message that needs to be repeated regularly > until iWARP support *is* upstream. The opposite perception is > still lingering in some places because of discussions from 1 and 2 > years ago. > >> The reason I want to take time to make sure that we have the right >> code before we merge it is that I get the feeling that there may be >> elements of a) using the IB tree to get changes upstream that would >> be vetoed on netdev > > Yeah, that has happened before. And I expect netdev folks > might strongly object (if they haven't already) to some > "sideband" method of managing TCP/IP config when TCP/IP is > exclusively running on an RNIC (TOE with RDMA front-end). > IMHO, that's seems like the "hardest to fix" issue so > everyone is happy. Most of the other details can be negotiated. > It is important to separate two issues here: L2-L3 coordination and L4 coordination. The patches that Tom recently posted address the L2-L3 coordination. They ensure that the TCP/IP stack used by the RNIC is consistent in its L2-L3 configuration with the host stack. For example, it will route packets to the same destination IP address the same way as the host stack, use the same MAC addresses for first hops, etc. This logic should really apply to *any* transport service that claims to use IP Addresses. The step beyond that, which should also apply to anything that claims to support SOCK_STREAM or SOCK_DGRAM with IP Addresses, is achieving similar consistency or even collaboration at L4. This step will take time, because the opinion within netdev is against sharing L4 state information with a parallel TCP stack. 
There are also challenges like enabling common iSCSI login phase code
that is independent of the ultimate transport, and ensuring that
sockets cannot be made to behave in ways that are contrary to system
policy as stated via netfilter.

Those will be complicated give and takes. There is no reason to hold
up the L2-L3 work because of it. If it were, then SDP/IB should have
been deferred until complete consistency was achieved. Updating one
step at a time makes more sense.

From Thomas.Talpey at netapp.com Tue Jan 24 07:54:45 2006
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Tue, 24 Jan 2006 10:54:45 -0500
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> <20060123215516.GG29214@esmail.cup.hp.com>
Message-ID: <7.0.1.0.2.20060124105336.072ea480@netapp.com>

At 06:53 PM 1/23/2006, Roland Dreier wrote:
>vetoed on netdev and b) trying to get openib and the kernel community
>to accept code just so a vendor can meet a product marketing deadline.
>
>BTW, upon reflection, the best idea for moving this forward might be
>to push the Ammasso driver along with the rest of the iWARP patches,
>so that there's some more context for review. Just because a vendor
>is out of business is no reason for Linux not to have a driver for a
>piece of hardware.

In fact, there are a bunch of Ammasso cards out there, and also,
what better proof could you have that there isn't a hidden hardware
agenda in the submission!

Tom.

From rdreier at cisco.com Tue Jan 24 07:55:04 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 07:55:04 -0800
Subject: [openib-general] Reregister Memory Region Verb
In-Reply-To: (Galen Shipman's message of "Tue, 24 Jan 2006 08:41:24 -0700")
References:
Message-ID:

Galen> Does OpenIB currently support the Reregister Memory Region
Galen> Verb? I wasn't able to find it in the header files..

No, it doesn't. What is your use case?
I'm wondering how hard it would be to implement support at least sufficient for what you want to do. - R. From tom at opengridcomputing.com Tue Jan 24 07:57:37 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 24 Jan 2006 09:57:37 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <20060124015708.GQ29214@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> <20060123215516.GG29214@esmail.cup.hp.com> <20060124015708.GQ29214@esmail.cup.hp.com> Message-ID: <1138118257.22009.13.camel@trinity.ogc.int> BTW, along these lines, does anyone have commentary on the netevent registration patch? To avoid the impression that this patch falls under the "using IB to get stuff into netdev that would be vetoed otherwise", I am happy to submit a patch to netdev directly to avoid sullying OpenIB's good name. Nonetheless, I'd like to get some feedback, especially if it is "what are you nuts?!?" On Mon, 2006-01-23 at 17:57 -0800, Grant Grundler wrote: > On Mon, Jan 23, 2006 at 03:53:19PM -0800, Roland Dreier wrote: > > > Yes, but we need to start somewhere. Until someone submits > > > a driver that does all the things you mention, it makes > > > sense to move forward with what has been proposed to date. > > > > I agree with this, and overall I am very much in favor of getting > > iWARP support all the way upstream. > > *nod* > > BTW, this is a message that needs to be repeated regularly until > iWARP support *is* upstream. The opposite perception is still lingering > in some places because of discussions from 1 and 2 years ago. > > > The reason I want to take time to make sure that we have the right > > code before we merge it is that I get the feeling that there may be > > elements of a) using the IB tree to get changes upstream that would be > > vetoed on netdev > > Yeah, that has happened before. 
And I expect netdev folks might strongly > object (if they haven't already) to some "sideband" method of managing > TCP/IP config when TCP/IP is exclusively running on an RNIC (TOE with > RDMA front-end). IMHO, that's seems like the "hardest to fix" issue > so everyone is happy. Most of the other details can be negotiated. > > > and b) trying to get openib and the kernel community > > to accept code just so a vendor can meet a product marketing deadline. > > TTM via kernel.org? BWHAHAHA! :^) > > Sorry, I can't take that serious. :) > > > BTW, upon reflection, the best idea for moving this forward might be > > to push the Ammasso driver along with the rest of the iWARP patches, > > so that there's some more context for review. Just because a vendor > > is out of business is no reason for Linux not to have a driver for a > > piece of hardware. > > "Exactly." says the co-maintainer of the parisc-linux port. :) > > thanks, > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Jan 24 08:02:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Jan 2006 08:02:40 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 24 Jan 2006 07:46:22 -0800") References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Caitlin> It is important to separate two issues here: L2-L3 Caitlin> coordination and L4 coordination. Caitlin> The patches that Tom recently posted address the L2-L3 Caitlin> coordination. They ensure that the TCP/IP stack used by Caitlin> the RNIC is consistent in its L2-L3 configuration with Caitlin> the host stack. 
For example, it will route packets to the
Caitlin> same destination IP address the same way as the host
Caitlin> stack, use the same MAC addresses for first hops, etc.

But the patches don't provide full L2 coordination. For example if I
install a netfilter rule that discards packets from a particular
source MAC (a completely, 100% L2 notion), the RNIC will still happily
accept connections from that MAC, right?

Caitlin> Those will be complicated give and takes. There is no
Caitlin> reason to hold up the L2-L3 work because of it. If it
Caitlin> were, then SDP/IB should have been deferred until
Caitlin> complete consistency was achieved. Updating one step at a
Caitlin> time makes more sense.

SDP/IB is not in the upstream kernel. And I agree that there are many
issues with respect to network stack integration to work out before
SDP is suitable for merging.

 - R.

From gshipman at lanl.gov Tue Jan 24 08:06:55 2006
From: gshipman at lanl.gov (Galen Shipman)
Date: Tue, 24 Jan 2006 09:06:55 -0700
Subject: [openib-general] Reregister Memory Region Verb
In-Reply-To:
References:
Message-ID: <83f77db82a517a173a113e2a66eda541@lanl.gov>

On Jan 24, 2006, at 8:55 AM, Roland Dreier wrote:

> Galen> Does OpenIB currently support the Reregister Memory Region
> Galen> Verb? I wasn't able to find it in the header files..
>
> Roland> No, it doesn't. What is your use case? I'm wondering how hard it
> Roland> would be to implement support at least sufficient for what you want to
> Roland> do.
>

I would like to be able to extend an existing registration such that
the driver would take advantage of the fact that part of the extended
registration is already registered, i.e. only the "new" memory would
be pinned and made resident. Also, it would be necessary that this
operation wouldn't invalidate operations in progress on the original
registration or the call would block until pending operations had
completed, although we would prefer that it wouldn't block if
possible ;-) .
- Galen

From rdreier at cisco.com Tue Jan 24 08:08:13 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 08:08:13 -0800
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To: <1138118257.22009.13.camel@trinity.ogc.int> (Tom Tucker's message of "Tue, 24 Jan 2006 09:57:37 -0600")
References: <54AD0F12E08D1541B826BE97C98F99F11C32DF@NT-SJCA-0751.brcm.ad.broadcom.com> <20060123215516.GG29214@esmail.cup.hp.com> <20060124015708.GQ29214@esmail.cup.hp.com> <1138118257.22009.13.camel@trinity.ogc.int>
Message-ID:

Tom> BTW, along these lines, does anyone have commentary on the
Tom> netevent registration patch?

It looks reasonable to me, although I'm not sure what the netdev
people will think about scattering notifier calls around the tree.
There's certainly a maintainability cost in terms of requiring those
calls to be kept in sync with how neighbours change.

Also changes to net/ipv6 were conspicuous in their absence. Does your
patch only handle ipv4?

Tom> To avoid the impression that this patch falls under the
Tom> "using IB to get stuff into netdev that would be vetoed
Tom> otherwise", I am happy to submit a patch to netdev directly
Tom> to avoid sullying OpenIB's good name.

It's not just sullying a good name -- it's respect for maintainers.
We can't merge new functionality into net/ without buy-in from netdev
and Dave Miller.

 - R.

From xma at us.ibm.com Tue Jan 24 08:26:25 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 24 Jan 2006 08:26:25 -0800
Subject: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060124095713.GP26724@mellanox.co.il>
Message-ID:

Why not removing the platform dependency and using gettimeofday() instead
to measure the performance?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mst at mellanox.co.il Tue Jan 24 08:32:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jan 2006 18:32:47 +0200
Subject: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To:
References:
Message-ID: <20060124163247.GI26526@mellanox.co.il>

Quoting r. Shirley Ma:
> Subject: Re: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees)
>
>
> Why not removing the platform dependency and using gettimeofday() instead to
> measure the performance?

I'm trying to get sub-microsecond precision, and I don't want the
system call overhead.

--
MST

From mst at mellanox.co.il Tue Jan 24 08:36:30 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jan 2006 18:36:30 +0200
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To:
References:
Message-ID: <20060124163630.GK26526@mellanox.co.il>

Quoting r. Roland Dreier:
> Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
>
> Michael> Could the high/low bits be swapped? What happens if you
> Michael> change cycles_t from long long to long? Could you try
> Michael> running the clock_test utility?
>
> What seems to be happening is that mftb is giving the low 32 bits of
> the timebase (as expected on ppc32). Since your get_cycles() is
> returning a long long, those 32 bits get put in the most significant
> 32 bits of the return value, and the low 32 bits are garbage (ppc is
> big endian).
> > If I compile clock_test for ppc32, I see that get_cycles() compiles to: > > 1000064c : > 1000064c: 7c 6c 42 e6 mftb r3 > 10000650: 4e 80 00 20 blr > > For comparison, a function like > > unsigned long long blah(void) { return 0x100000002ull; } > > compiles to > > 00000000 : > 0: 38 60 00 01 li r3,1 > 4: 38 80 00 02 li r4,2 > 8: 4e 80 00 20 blr > > In other words the convention on ppc32 is that unsigned long long > return values have the high 32 bits in r3 and the low 32 bits in r4. > > I think you want to use something like > > typedef unsigned long long cycles_t; > static inline cycles_t get_cycles() > { > unsigned long low, hi, hi2; > > do { > asm volatile ("mftbu %0" : "=r" (hi)); > asm volatile ("mftb %0" : "=r" (low)); > asm volatile ("mftbu %0" : "=r" (hi2)); > } while (hi != hi2); > > return ((unsigned long long) hi << 32) | low; > } > > for ppc32. I'm convinced, I moved it back to 32 bit. > However, this is not quite enough to make things work on > all powerpc systems, because the timebase does not necessarily run at > the same speed as the CPU. For example, on an IBM JS20 blade, > clock_test prints > > 1 sec = 6536.8 usec > 1 sec = 6537.05 usec > > (both as a 32-bit and 64-bit executable) because, as /proc/cpuinfo shows: > > processor : 0 > cpu : PPC970FX, altivec supported > clock : 2194.624509MHz > revision : 3.0 > > processor : 1 > cpu : PPC970FX, altivec supported > clock : 2194.624509MHz > revision : 3.0 > > timebase : 14318000 > machine : CHRP IBM,8842-P2C > > the timebase runs at about 14.3 MHz, or approx 153 times slower than > the CPU clock. > > I'm not sure how you want to fix this in perftest. I just added some cycle calibration code to get_cpu_mhz(). Check it out (you can just run clock_test). 
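
A minimal sketch of the kind of calibration described above: count ticks of some counter across an interval timed with gettimeofday() and derive the counter's real frequency from the two. This is an illustration only, not the actual perftest get_cpu_mhz() code. The names tick_fn, now_usec, measure_tick_mhz and fake_timebase are hypothetical, and the stand-in counter is derived from the wall clock itself so that the example runs on any platform; on real hardware the tick source would wrap get_cycles().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical tick-source type; stands in for get_cycles(). */
typedef uint64_t (*tick_fn)(void);

/* Wall-clock time in microseconds, taken from gettimeofday(). */
static double now_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (double)tv.tv_sec * 1e6 + (double)tv.tv_usec;
}

/*
 * Count ticks across a wall-clock interval and return ticks per
 * microsecond, i.e. the counter frequency in MHz.
 */
static double measure_tick_mhz(tick_fn read_ticks, double interval_usec)
{
	double t0, t1;
	uint64_t c0, c1;

	assert(read_ticks != NULL && interval_usec > 0.0);

	t0 = now_usec();
	c0 = read_ticks();
	do {
		t1 = now_usec();
	} while (t1 - t0 < interval_usec);
	c1 = read_ticks();

	return (double)(c1 - c0) / (t1 - t0);
}

/*
 * Stand-in counter for demonstration only (an assumption, not a real
 * timebase read): advances at 14.318 MHz, the JS20 timebase rate
 * quoted above, computed from the wall clock.
 */
static uint64_t fake_timebase(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 14318000ULL +
	       (uint64_t)tv.tv_usec * 14318ULL / 1000ULL;
}
```

Calling measure_tick_mhz(fake_timebase, 20000.0) spins for about 20 ms and returns the measured rate in MHz. A result far from the clock value in /proc/cpuinfo (about 14.3 versus 2194.6 with the JS20 figures quoted above) would indicate that the counter does not run at CPU speed.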
--
MST

From tom at opengridcomputing.com Tue Jan 24 08:41:47 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 24 Jan 2006 10:41:47 -0600
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <1138120907.22009.25.camel@trinity.ogc.int>

On Tue, 2006-01-24 at 08:02 -0800, Roland Dreier wrote:
> Caitlin> It is important to separate two issues here: L2-L3
> Caitlin> coordination and L4 coordination.
>
> Caitlin> The patches that Tom recently posted address the L2-L3
> Caitlin> coordination. They ensure that the TCP/IP stack used by
> Caitlin> the RNIC is consistent in its L2-L3 configuration with
> Caitlin> the host stack. For example, it will route packets to the
> Caitlin> same destination IP address the same way as the host
> Caitlin> stack, use the same MAC addresses for first hops, etc.
>
> But the patches don't provide full L2 coordination. For example if I
> install a netfilter rule that discards packets from a particular
> source MAC (a completely, 100% L2 notion), the RNIC will still happily
> accept connections from that MAC, right?

The intended behavior is to provide "full coordination". For the
example you give, I would expect that rdma_resolve_addr would fail due
to a timeout waiting for an ARP reply.

>
> Caitlin> Those will be complicated give and takes. There is no
> Caitlin> reason to hold up the L2-L3 work because of it. If it
> Caitlin> were, then SDP/IB should have been deferred until
> Caitlin> complete consistency was achieved. Updating one step at a
> Caitlin> time makes more sense.
>
> SDP/IB is not in the upstream kernel. And I agree that there are many
> issues with respect to network stack integration to work out before
> SDP is suitable for merging.
>
> - R.
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Jan 24 08:43:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Jan 2006 08:43:09 -0800 Subject: [openib-general] Reregister Memory Region Verb In-Reply-To: <83f77db82a517a173a113e2a66eda541@lanl.gov> (Galen Shipman's message of "Tue, 24 Jan 2006 09:06:55 -0700") References: <83f77db82a517a173a113e2a66eda541@lanl.gov> Message-ID: Galen> I would like to be able to extend an existing registration Galen> such that the driver would take advantage of the fact that Galen> part of the extended registration is already registered, Galen> i.e. only the "new" memory would be pinned and made Galen> resident. Also, it would be necessary that this operation Galen> wouldn't invalidate operations in progress on the original Galen> registration or the call would block until pending Galen> operations had completed, although we would prefer that it Galen> wouldn't block if possible ;-) . Hmm, this is a problem. The IB spec specifically requires that the reregister MR operation behave like a deregister followed by register. In C11-21, the spec says that operations in progress during a reregister must get an error because of the deregistration. - R. From iod00d at hp.com Tue Jan 24 08:47:21 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 24 Jan 2006 08:47:21 -0800 Subject: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: References: <20060124095713.GP26724@mellanox.co.il> Message-ID: <20060124164721.GC1528@esmail.cup.hp.com> On Tue, Jan 24, 2006 at 08:26:25AM -0800, Shirley Ma wrote: > Why not removing the platform dependency and using gettimeofday() instead > to measure the performance? The cycle counters can give us transaction level timing. 
We can see how expensive the first/last transactions are
and what the variation is during the run.
gettimeofday() can't do that on most platforms.

AFAICT, only PPC has real weirdness with get_cycles().
And if that can't be fixed, then we will have to use
gettimeofday() on PPC or make it a compile time option.

thanks,
grant

From rdreier at cisco.com Tue Jan 24 09:07:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 09:07:27 -0800
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060124163630.GK26526@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 24 Jan 2006 18:36:30 +0200")
References: <20060124163630.GK26526@mellanox.co.il>
Message-ID:

Michael> I just added some cycle calibration code to
Michael> get_cpu_mhz(). Check it out (you can just run
Michael> clock_test).

Seems to work well. On the same JS20 system:

Warning: measured CPU frequency value14.317 differs from nominal 2194.62
1 sec = 1.00254e+06 usec
1 sec = 1.00203e+06 usec

The only minor issue is:

get_clock.c: In function get_clock.c:97: warning: long long int format, long unsigned int arg (arg 4)

 - R.

From mst at mellanox.co.il Tue Jan 24 09:10:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jan 2006 19:10:11 +0200
Subject: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060124164721.GC1528@esmail.cup.hp.com>
References: <20060124164721.GC1528@esmail.cup.hp.com>
Message-ID: <20060124171011.GA30064@mellanox.co.il>

Quoting r. Grant Grundler:
> Subject: Re: [openib-general] Re: Re: Re: Userspace testing results (many kernels, many svn trees)
>
> On Tue, Jan 24, 2006 at 08:26:25AM -0800, Shirley Ma wrote:
> > Why not removing the platform dependency and using gettimeofday() instead
> > to measure the performance?
>
> The cycle counters can give us transaction level timing.
> We can see how expensive the first/last transactions are > and what the variation is during the run. > gettimeofday() can't do that on most platforms. > > AFAICT, only PPC has real wierdness with get_cycles(). > And if that can't be fixed, then we will have to use > gettimeofday() on PPC or make it a compile time option. > > thanks, > grant > No, I think I have a sane work-around: run gettimeofday several times and time it, compare the result with /proc/cpuinfo. Also good for stuff like power management clocking the CPU down, etc. -- MST From rdreier at cisco.com Tue Jan 24 09:13:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Jan 2006 09:13:32 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138120907.22009.25.camel@trinity.ogc.int> (Tom Tucker's message of "Tue, 24 Jan 2006 10:41:47 -0600") References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> Message-ID: Tom> The intended behavior is to provide "full coordination". For Tom> the example you give, I would expect that rdma_resolve_addr Tom> would fail due to to a timeout waiting for an ARP reply. OK, now I'm going off into crazy-land, but I could have a rule that filters on source MAC and ethertype, and lets ARPs but no other packets through. - R. From mst at mellanox.co.il Tue Jan 24 09:17:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 19:17:08 +0200 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: References: Message-ID: <20060124171707.GA30219@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > Michael> I just added some cycle calibration code to > Michael> get_cpu_mhz(). Check it out (you can just run > Michael> clock_test). > > Seems to work well. 
On the same JS20 system:
>
> Warning: measured CPU frequency value14.317 differs from nominal 2194.62
> 1 sec = 1.00254e+06 usec
> 1 sec = 1.00203e+06 usec
>
> The only minor issue is:
>
> get_clock.c: In function
> get_clock.c:97: warning: long long int format, long unsigned int arg (arg 4)
>
> - R.

I fixed this, thanks!

--
MST

From swise at opengridcomputing.com Tue Jan 24 09:25:54 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 24 Jan 2006 11:25:54 -0600
Subject: [openib-general] possible bug in rdma_connect()
Message-ID: <1138123554.7613.7.camel@stevo-desktop>

Sean,

I think I found a bug in rdma_connect(). Shouldn't it bump the
rdma_cm_id refcnt before calling down into the transport-specific CM to
initiate a connect? Then deref it in the callback function.

The case I'm thinking about is if an ib_client does an rdma_connect(),
then does an rdma_destroy_id() before the connect reply comes in.

From tom at opengridcomputing.com Tue Jan 24 09:29:23 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 24 Jan 2006 11:29:23 -0600
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int>
Message-ID: <1138123763.22009.31.camel@trinity.ogc.int>

Ok, you got me ;-) Busted. Steve pointed out another one -- the
connection is already established, then the filter rule is added, the
connection continues (bypassing the rule) until the ARP cache entry
times out...ga!

BTW, the same holds true for RSO (a netdev sanctioned non-offload,
offload). Albeit, the period of surprise is somewhat shorter, i.e. a
dozen or so frames.

The netdev people need to be involved. I think it will be briefly
painful, but ultimately worthwhile to do this "right".

On Tue, 2006-01-24 at 09:13 -0800, Roland Dreier wrote:
> Tom> The intended behavior is to provide "full coordination".
For > Tom> the example you give, I would expect that rdma_resolve_addr > Tom> would fail due to to a timeout waiting for an ARP reply. > > OK, now I'm going off into crazy-land, but I could have a rule that > filters on source MAC and ethertype, and lets ARPs but no other > packets through. > > - R. From mshefty at ichips.intel.com Tue Jan 24 09:50:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 09:50:38 -0800 Subject: [openib-general] possible bug in rdma_connect() In-Reply-To: <1138123554.7613.7.camel@stevo-desktop> References: <1138123554.7613.7.camel@stevo-desktop> Message-ID: <43D668EE.1090400@ichips.intel.com> Steve Wise wrote: > I think I found a bug in rdma_connect(). Shouldn't it bump the > rdma_cm_id refcnt before calling down into the transport-specific CM to > inititiate a connect? Then deref it in the callback function. > > The case I'm thinking about is if an ib_client does and rdma_connect(), > then does an rdma_destroy_id() before the connect reply comes in. The reference isn't needed. The rdma_cm must create a transport specific ID before it can call connect. That ID is destroyed by the rdma_cm when the rdma_cm_id is destroyed. The destruction of the transport specific ID must block until all callbacks associated with it complete. - Sean From nacc at us.ibm.com Tue Jan 24 10:23:09 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 10:23:09 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060123234442.GD29917@mellanox.co.il> References: <20060123234003.GR5074@us.ibm.com> <20060123234442.GD29917@mellanox.co.il> Message-ID: <20060124182309.GU5074@us.ibm.com> On 24.01.2006 [01:44:42 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > > I have just uploaded a simple utility which I called clock_test which > > > measures a clock once a second: this way you'll know whether mtfb > > > is measuring time properly. 
> > > > Will it get built by running make in the perftest directory? > Yes. > > > Any special usage I should know about? > > Look at its source, you'll see. > > You just run it for a while and it will print out the time > tkaen from mtfb each second. > Kill it with CRTL-C. With 5169, I get: 1 sec = 5.37731e+14 usec 1 sec = 5.37451e+14 usec 1 sec = 5.3748e+14 usec 1 sec = 5.37495e+14 usec 1 sec = 5.37483e+14 usec 1 sec = 5.37493e+14 usec 1 sec = 5.37495e+14 usec 1 sec = 5.37495e+14 usec 1 sec = 5.37493e+14 usec 1 sec = 5.37492e+14 usec 1 sec = 5.37495e+14 usec 1 sec = 5.37494e+14 usec Thanks, Nish From sean.hefty at intel.com Tue Jan 24 10:46:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 10:46:45 -0800 Subject: [openib-general] OpenSM GET_TABLE path record question Message-ID: Does anyone know if OpenSM (or any SM) includes path records where the DGID=SGID when responding to a GET_TABLE request? The fields being set in the request are SGID, NumbPath, and PKey. - Sean From mst at mellanox.co.il Tue Jan 24 10:54:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 20:54:38 +0200 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060124182309.GU5074@us.ibm.com> References: <20060124182309.GU5074@us.ibm.com> Message-ID: <20060124185438.GA30693@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels,many svn trees) > > On 24.01.2006 [01:44:42 +0200], Michael S. Tsirkin wrote: > > Quoting r. Nishanth Aravamudan : > > > > I have just uploaded a simple utility which I called clock_test which > > > > measures a clock once a second: this way you'll know whether mtfb > > > > is measuring time properly. > > > > > > Will it get built by running make in the perftest directory? > > Yes. > > > > > Any special usage I should know about? > > > > Look at its source, you'll see. 
> >
> > You just run it for a while and it will print out the time
> > taken from mftb each second.
> > Kill it with CTRL-C.
>
> With 5169, I get:
>
> 1 sec = 5.37731e+14 usec
> 1 sec = 5.37451e+14 usec
> 1 sec = 5.3748e+14 usec
> 1 sec = 5.37495e+14 usec
> 1 sec = 5.37483e+14 usec
> 1 sec = 5.37493e+14 usec
> 1 sec = 5.37495e+14 usec
> 1 sec = 5.37495e+14 usec
> 1 sec = 5.37493e+14 usec
> 1 sec = 5.37492e+14 usec
> 1 sec = 5.37495e+14 usec
> 1 sec = 5.37494e+14 usec
>
> Thanks,
> Nish
>

Hmm. First, try updating to 5174 and run clock_test.
Second, what about this patch:

Index: get_clock.h
===================================================================
--- get_clock.h	(revision 5171)
+++ get_clock.h	(working copy)
@@ -47,8 +47,7 @@ static inline cycles_t get_cycles()
 	val = (val << 32) | low;
 	return val;
 }
-#elif defined(__PPC__) || defined(__PPC64__)
-/* Note: only PPC CPUs which have mftb instruction are supported. */
+#elif defined(__PPC64__) /* PPC64 has mftb */
 typedef unsigned long cycles_t;
 static inline cycles_t get_cycles()
@@ -58,6 +57,16 @@ static inline cycles_t get_cycles()
 	asm volatile ("mftb %0" : "=r" (ret) : );
 	return ret;
 }
+#elif defined(__PPC__)
+/* Note: only PPC CPUs which have mftb instruction are supported. */
+typedef unsigned long cycles_t;
+static inline cycles_t get_cycles()
+{
+	cycles_t ret;
+
+	asm volatile ("mftb %0, 268" : "=r" (ret) : );
+	return ret;
+}
 #elif defined(__ia64__)
 /* Itanium2 and up has ar.itc (Itanium1 has errata) */
 typedef unsigned long cycles_t;

--
MST

From mst at mellanox.co.il Tue Jan 24 10:57:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jan 2006 20:57:27 +0200
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To:
References:
Message-ID: <20060124185727.GB30693@mellanox.co.il>

Quoting r.
Roland Dreier :
> Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
>
> Michael> I just added some cycle calibration code to
> Michael> get_cpu_mhz(). Check it out (you can just run
> Michael> clock_test).
>
> Seems to work well. On the same JS20 system:
>
> Warning: measured CPU frequency value14.317 differs from nominal 2194.62
> 1 sec = 1.00254e+06 usec
> 1 sec = 1.00203e+06 usec
>
> The only minor issue is:
>
> get_clock.c: In function
> get_clock.c:97: warning: long long int format, long unsigned int arg (arg 4)
>
> - R.

What about 32 bit? I see this in timex.h:

	asm volatile (
		"98:	mftb %0\n"
		"99:\n"
		".section __ftr_fixup,\"a\"\n"
		"	.long %1\n"
		"	.long 0\n"
		"	.long 98b\n"
		"	.long 99b\n"
		".previous"
		: "=r" (ret) : "i" (CPU_FTR_601));

Could some PPC gurus here chime in on what this does?
Can't I just "mftb %0, 268"?

--
MST

From rdreier at cisco.com Tue Jan 24 11:02:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 11:02:18 -0800
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060124185727.GB30693@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 24 Jan 2006 20:57:27 +0200")
References: <20060124185727.GB30693@mellanox.co.il>
Message-ID:

    Michael> What about 32 bit? I see this in timex.h:

Yes, 32 bit is fine. That stuff in timex.h is there to fix up some
sort of crazy PPC 601 stuff (and 601 is so old I don't think anyone
cares or even has a box with both PPC601 and PCI).

    Michael> Could some PPC gurus here chime in on what this does?
    Michael> Can't I just "mftb %0, 268"?

You can just do "mftb %0" to get the low 32 bits of the timebase, and
"mftbu %0" to get the high 32 bits. I'm not sure what the ", 268"
would do.

 - R.
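Roland's description of "mftb" (low 32 bits) and "mftbu" (high 32 bits) suggests the usual way to build a full 64-bit timebase value on a 32-bit PowerPC: read the upper half, then the lower half, then the upper half again, and retry if the upper half changed in between. The following is a minimal portable sketch of that retry loop, not code from perftest or the kernel. The callbacks read_tbu and read_tbl are hypothetical stand-ins for the two instructions, and the fake_* helpers are a simulated counter added only so the logic can be exercised on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reader callbacks: on a real 32-bit PPC these would wrap
 * "mftbu %0" (upper half) and "mftb %0" (lower half). */
typedef uint32_t (*tb_reader_t)(void);

/* Combine two 32-bit halves into one 64-bit timebase value.
 * The upper half is read twice; if it changed, the lower half wrapped
 * between the reads and the sample is discarded and retried. */
static uint64_t read_timebase64(tb_reader_t read_tbu, tb_reader_t read_tbl)
{
	uint32_t hi, lo, hi2;

	assert(read_tbu != 0 && read_tbl != 0);

	do {
		hi  = read_tbu();
		lo  = read_tbl();
		hi2 = read_tbu();
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter (illustration only): starts one tick before the
 * lower half wraps, and advances each time the lower half is read. */
static uint64_t fake_tb = 0x00000001FFFFFFFFULL;

static uint32_t fake_tbu(void)
{
	return (uint32_t)(fake_tb >> 32);
}

static uint32_t fake_tbl(void)
{
	uint32_t lo = (uint32_t)fake_tb;

	fake_tb += 1;	/* the counter keeps running between reads */
	return lo;
}
```

With the simulated counter, the first pass sees the upper half change from 1 to 2 and retries, so a torn value such as 0x1FFFFFFFF paired with the wrong upper half is never returned. On real hardware the two callbacks would presumably be thin inline-asm wrappers; that part is left out here because it only builds on PowerPC.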
From sean.hefty at intel.com Tue Jan 24 10:42:36 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 24 Jan 2006 10:42:36 -0800
Subject: [openib-general] [PATCH] [MAD] set RMPP version even if RMPP is not active
Message-ID:

This sets the RMPP version number in the RMPP header if RMPP is present, but
not active. The current code does not set the version if RMPP is inactive.

Signed-off-by: Sean Hefty

---

Index: mad.c
===================================================================
--- mad.c	(revision 5098)
+++ mad.c	(working copy)
@@ -826,14 +826,16 @@ struct ib_mad_send_buf * ib_create_send_
 	mad_send_wr->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY;
 	mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index;

-	if (rmpp_active) {
+	if (mad_agent->rmpp_version) {
 		struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad;
-		rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len -
-						   IB_MGMT_RMPP_HDR + data_len);
 		rmpp_mad->rmpp_hdr.rmpp_version = mad_agent->rmpp_version;
-		rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA;
-		ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr,
-				  IB_MGMT_RMPP_FLAG_ACTIVE);
+		if (rmpp_active) {
+			rmpp_mad->rmpp_hdr.paylen_newwin =
+				cpu_to_be32(hdr_len - IB_MGMT_RMPP_HDR + data_len);
+			rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA;
+			ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr,
+					  IB_MGMT_RMPP_FLAG_ACTIVE);
+		}
 	}

 	mad_send_wr->send_buf.mad_agent = mad_agent;

From rdreier at cisco.com Tue Jan 24 11:16:44 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 11:16:44 -0800
Subject: [openib-general] [PATCH] [MAD] set RMPP version even if RMPP is not active
In-Reply-To: (Sean Hefty's message of "Tue, 24 Jan 2006 10:42:36 -0800")
References:
Message-ID:

How important is this? Should I queue it for 2.6.16, or can it wait
for 2.6.17 to open?

 - R.

From mst at mellanox.co.il Tue Jan 24 11:20:20 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 24 Jan 2006 21:20:20 +0200
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To:
Message-ID: <20060124192020.GA31001@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
> Yes, 32 bit is fine. That stuff in timex.h is there to fix up some
> sort of crazy PPC 601 stuff (and 601 is so old I don't think anyone
> cares or even has a box with both PPC601 and PCI).

I guessed that much.

> Michael> Can't I just "mftb %0, 268"?
>
> You can just do "mftb %0" to get the low 32 bits of the timebase, and
> "mftbu %0" to get the high 32 bits. I'm not sure what the ", 268"
> would do.

That's what trunk head does, but Nishanth Aravamudan here sees

1 sec = 5.37731e+14 usec

which seems to indicate something's still wrong.

Thanks anyway!

--
MST

From rdreier at cisco.com Tue Jan 24 11:24:02 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 11:24:02 -0800
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To: <20060124192020.GA31001@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 24 Jan 2006 21:20:20 +0200")
References: <20060124192020.GA31001@mellanox.co.il>
Message-ID:

    Michael> That's what trunk head does, but Nishanth Aravamudan here
    Michael> sees

    Michael> 1 sec = 5.37731e+14 usec

    Michael> which seems to indicate something's still wrong.

Nish, did everything get rebuilt? Since the makefile doesn't have
dependencies, I can do things like:

    $ make clock_test
    cc -Wall -g -D_GNU_SOURCE -O2 clock_test.c get_clock.c -o clock_test
    $ touch get_clock.h
    $ make clock_test
    make: `clock_test' is up to date.

so a simple "svn up" followed by "make" may not be enough to pick up
all changes.

 - R.
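The "1 sec = ... usec" lines quoted in this thread come from clock_test, whose source is not reproduced in these messages. As a rough sketch of the kind of check being discussed, assuming the tool divides an elapsed cycle count by the CPU frequency in MHz (cycles per microsecond), the helpers below do that conversion and flag a nominal one-second interval that lands far from 1e6 usec. The function names and the tolerance parameter are illustrative assumptions, not taken from perftest.

```c
#include <assert.h>

/* Convert an elapsed cycle count to microseconds.
 * mhz is the counter frequency in MHz, i.e. cycles per microsecond. */
static double cycles_to_usec(unsigned long long cycles, double mhz)
{
	assert(mhz > 0.0);
	return (double)cycles / mhz;
}

/* Return 1 if a nominal one-second measurement is within the given
 * relative tolerance of 1e6 usec, 0 otherwise. */
static int one_second_is_sane(double usec, double tol)
{
	double diff = usec - 1e6;

	if (diff < 0.0)
		diff = -diff;
	return diff <= tol * 1e6;
}
```

Under a 1% tolerance, the JS20 reading of 1.00254e+06 usec reported earlier in the thread would pass such a check, while 5.37731e+14 usec would not.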
From rdreier at cisco.com Tue Jan 24 11:25:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 24 Jan 2006 11:25:45 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060124192020.GA31001@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 24 Jan 2006 21:20:20 +0200") References: <20060124192020.GA31001@mellanox.co.il> Message-ID: Michael> 1 sec = 5.37731e+14 usec Michael> which seems to indicate something's still wrong. BTW this number is pretty close to 2^32 times bigger than 1e6, so the problem is probably still using long long to return the result of mftb (which will result in shifting the result by 32 bits, ie multiplying by 2^32). - R. From sean.hefty at intel.com Tue Jan 24 11:30:04 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 11:30:04 -0800 Subject: [openib-general] [PATCH] [MAD] set RMPP version even if RMPP is not active In-Reply-To: Message-ID: >How important is this? Should I queue it for 2.6.16, or can it wait >for 2.6.17 to open? It's a compliance issue, but the fact that the code has gone this long without hitting an issue probably means that it can wait. The RMPP code actually checks the active flag before checking the version, which is wrong, but why this has gone undetected. - Sean From halr at voltaire.com Tue Jan 24 11:26:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Jan 2006 14:26:50 -0500 Subject: [openib-general] OpenSM GET_TABLE path record question In-Reply-To: References: Message-ID: <1138130806.4338.42704.camel@hal.voltaire.com> Hi Sean, On Tue, 2006-01-24 at 13:46, Sean Hefty wrote: > Does anyone know if OpenSM (or any SM) includes path records where the DGID=SGID > when responding to a GET_TABLE request? > The fields being set in the request are SGID, NumbPath, and PKey. Yes, loopback paths should be returned. I did a totally wildcard query and they are returned. Are you just asking or is there some issue with this ? 
-- Hal

> - Sean

From mst at mellanox.co.il Tue Jan 24 11:39:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jan 2006 21:39:23 +0200
Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
In-Reply-To:
References:
Message-ID: <20060124193923.GB31001@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees)
>
> Michael> 1 sec = 5.37731e+14 usec
>
> Michael> which seems to indicate something's still wrong.
>
> BTW this number is pretty close to 2^32 times bigger than 1e6, so the
> problem is probably still using long long to return the result of
> mftb (which will result in shifting the result by 32 bits, ie
> multiplying by 2^32).

Hmm.
Maybe make clean wasn't run after updating?
Could it be run on rev 5174?

--
MST

From mshefty at ichips.intel.com Tue Jan 24 11:39:40 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jan 2006 11:39:40 -0800
Subject: [openib-general] OpenSM GET_TABLE path record question
In-Reply-To: <1138130806.4338.42704.camel@hal.voltaire.com>
References: <1138130806.4338.42704.camel@hal.voltaire.com>
Message-ID: <43D6827C.10003@ichips.intel.com>

Hal Rosenstock wrote:
> Yes, loopback paths should be returned. I did a totally wildcard query
> and they are returned. Are you just asking or is there some issue with
> this ?

My testing of the caching code wasn't failing in the location that I expected
when attempting loopback connections. (I was failing route resolution, which
this would explain.) So I just wanted to verify that this worked before
debugging the issue too far.
Thanks,
Sean

From xma at us.ibm.com Tue Jan 24 11:42:18 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 24 Jan 2006 11:42:18 -0800
Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees)
In-Reply-To:
Message-ID:

in linux-2.6.16-test/include/asm-powerpc/reg.h

#define mftb()		({unsigned long rval;	\
			asm volatile("mftb %0" : "=r" (rval)); rval;})

mftb() returns unsigned long. If we want to use the 64 bit register, then
mftb() can't be used.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From rdreier at cisco.com Tue Jan 24 11:46:19 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 24 Jan 2006 11:46:19 -0800
Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees)
In-Reply-To: (Shirley Ma's message of "Tue, 24 Jan 2006 11:42:18 -0800")
References:
Message-ID:

    Shirley> mftb() returns unsigned long. If we want to use the 64
    Shirley> bit register, then mftb() can't be used.

Not exactly. If the CPU has 64 bit registers, then mftb will give a
full 64 bit value. If the CPU only has 32 bit registers (or is
running in 32 mode), then since mftb only touches one register, mftb
can only return 32 bits. In the 32 bit case, mftb returns the low 32
bits of the timebase, and mftbu can be used to return the high 32
bits.

 - R.

From halr at voltaire.com Tue Jan 24 11:44:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 Jan 2006 14:44:47 -0500
Subject: [openib-general] OpenSM GET_TABLE path record question
In-Reply-To: <43D6827C.10003@ichips.intel.com>
References: <1138130806.4338.42704.camel@hal.voltaire.com> <43D6827C.10003@ichips.intel.com>
Message-ID: <1138131882.4338.42802.camel@hal.voltaire.com>

On Tue, 2006-01-24 at 14:39, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > Yes, loopback paths should be returned.
I did a totally wildcard query > > and they are returned. Are you just asking or is there some issue with > > this ? > > My testing of the caching code wasn't failing in the location that I expected > when attempting loopback connections. (I was failing route resolution, which > this would explain.) So I just wanted to verify that this worked before > debugging the issue too far. I didn't do the exact query you mentioned but a get all PathRecords returned a loopback path in the very first record. Wouldn't madeye at least print out some headers and some hex ? I could verify from that if you want. -- Hal From mshefty at ichips.intel.com Tue Jan 24 11:55:15 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 11:55:15 -0800 Subject: [openib-general] OpenSM GET_TABLE path record question In-Reply-To: <1138131882.4338.42802.camel@hal.voltaire.com> References: <1138130806.4338.42704.camel@hal.voltaire.com> <43D6827C.10003@ichips.intel.com> <1138131882.4338.42802.camel@hal.voltaire.com> Message-ID: <43D68623.3060305@ichips.intel.com> Hal Rosenstock wrote: > I didn't do the exact query you mentioned but a get all PathRecords > returned a loopback path in the very first record. Wouldn't madeye at > least print out some headers and some hex ? I could verify from that if > you want. I'll spend some time debugging this first. I figured that I'd at least ask if it was supposed to work before spending too much time debugging. Thanks. 
- Sean From nacc at us.ibm.com Tue Jan 24 11:58:43 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 11:58:43 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: References: <20060124192020.GA31001@mellanox.co.il> Message-ID: <20060124195843.GC27746@us.ibm.com> On 24.01.2006 [11:24:02 -0800], Roland Dreier wrote: > Michael> Thats what trunk head does, but Nishanth Aravamudan here > Michael> sees > > Michael> 1 sec = 5.37731e+14 usec > > Michael> which seems to indicate something's still wrong. > > Nish, did everything get rebuilt? Since the makefile doesn't have > dependencies, I can do things like: > > $ make clock_test > cc -Wall -g -D_GNU_SOURCE -O2 clock_test.c get_clock.c -o clock_test > $ touch get_clock.h > $ make clock_test > make: `clock_test' is up to date. > > so a simple "svn up" followed by "make" make not be enough to pick up > all changes. I'm *almost* certain it got rebuilt, as this is the current flow: I submit a job to the testing grid: The grid sends the job to the two machines: Each machine does: A fresh kernel build, and reboots into it, then Grabs a tarball of the svn tree and untars, then runs make inside, first rm -rf the old tree. So, in essence, it starts from scratch *every* time. I'll run the test again and see what happens. Thanks, Nish From nacc at us.ibm.com Tue Jan 24 12:00:19 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 12:00:19 -0800 Subject: [openib-general] Re: Re: Userspace testing results (many kernels, many svn trees) In-Reply-To: <20060124185438.GA30693@mellanox.co.il> References: <20060124182309.GU5074@us.ibm.com> <20060124185438.GA30693@mellanox.co.il> Message-ID: <20060124200019.GE27746@us.ibm.com> On 24.01.2006 [20:54:38 +0200], Michael S. Tsirkin wrote: > Quoting r. 
Nishanth Aravamudan : > > Subject: Re: [openib-general] Re: Re: Userspace testing results (many kernels,many svn trees) > > > > On 24.01.2006 [01:44:42 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Nishanth Aravamudan : > > > > > I have just uploaded a simple utility which I called clock_test which > > > > > measures a clock once a second: this way you'll know whether mtfb > > > > > is measuring time properly. > > > > > > > > Will it get built by running make in the perftest directory? > > > Yes. > > > > > > > Any special usage I should know about? > > > > > > Look at its source, you'll see. > > > > > > You just run it for a while and it will print out the time > > > tkaen from mtfb each second. > > > Kill it with CRTL-C. > > > > With 5169, I get: > > > > 1 sec = 5.37731e+14 usec > > 1 sec = 5.37451e+14 usec > > 1 sec = 5.3748e+14 usec > > 1 sec = 5.37495e+14 usec > > 1 sec = 5.37483e+14 usec > > 1 sec = 5.37493e+14 usec > > 1 sec = 5.37495e+14 usec > > 1 sec = 5.37495e+14 usec > > 1 sec = 5.37493e+14 usec > > 1 sec = 5.37492e+14 usec > > 1 sec = 5.37495e+14 usec > > 1 sec = 5.37494e+14 usec > > > > Thanks, > > Nish > > > > Hmm. First, try updating to 5174 and run clock_test. > Second, what about this patch: Will do... Thanks, Nish From nacc at us.ibm.com Tue Jan 24 13:02:32 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 13:02:32 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060124193923.GB31001@mellanox.co.il> References: <20060124193923.GB31001@mellanox.co.il> Message-ID: <20060124210232.GG27746@us.ibm.com> On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > Michael> 1 sec = 5.37731e+14 usec > > > > Michael> which seems to indicate something's still wrong. 
> > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the > > problem is probably still using long long to return the result of > > mftb (which will result in shifting the result by 32 bits, ie > > multiplying by 2^32). > > Hmm. > Maybe make clean wasnt run after updating? > Could it be un on rev 5174? Heh, here's what happens with 5174: Correlation coefficient r^2: 0.773428 < 0.9 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec 1 sec = inf usec And so forth... Thanks, Nish P.S. Is there any way to specify how long to run clock_test from the command line? It's a bit of a pain the grid to kill a process... Thanks, Nish From sean.hefty at intel.com Tue Jan 24 13:02:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 13:02:57 -0800 Subject: [openib-general] OpenSM GET_TABLE path record question In-Reply-To: <1138131882.4338.42802.camel@hal.voltaire.com> Message-ID: >I didn't do the exact query you mentioned but a get all PathRecords >returned a loopback path in the very first record. Wouldn't madeye at >least print out some headers and some hex ? I could verify from that if >you want. Here's a dump from madeye. I see where the second path record is, and it looks correct. I'm not sure where the 5a 5a 5a ... at the start of each MAD is coming from. Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x0 (Unknown) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x0 (Inactive) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: Data 2.........0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x3 (Active - First) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0001 Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x027c Jan 24 13:29:32 mshefty-linux2 kernel: Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 e9 7f 5d 9c Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 3 0 0 0 0 1 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 2 7c 0 0 0 0 0 0 0 0 0 8 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 30 8 0 0 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 7 fc 5e 11 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 7 fc 5b e1 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 7 0 8 0 0 0 0 0 80 ff ff 0 0 84 83 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: 
Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 9 75 c4 11 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x2 (Ack) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0001 Jan 24 13:29:32 mshefty-linux2 kernel: New window.....0x0041 Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0002 Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 e9 7f 5d 9c Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 1 0 0 0 0 2 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 30 8 fe 80 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 9 75 c1 71 fe 80 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 7 fc 5b e1 0 e 0 8 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 80 ff ff 0 0 84 83 92 0 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 0 0 fe 80 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 
2 c9 1 a d2 5b 91 fe 80 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 7 fc 5b e1 Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x5 (Active - Last) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0003 Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x00c4 Jan 24 13:29:32 mshefty-linux2 kernel: Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 e9 7f 5d 9c Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 5 0 0 0 0 3 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 c4 0 0 0 0 0 0 0 0 0 8 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 30 8 0 2 c9 1 9 75 c5 d1 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 7 fc 5b e1 Jan 24 13:29:32 
mshefty-linux2 kernel: Data...........0 1 0 8 0 0 0 0 0 80 ff ff 0 0 84 83 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 7 fc 5b e1 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 7 fc 5b e1 Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 8 0 8 0 0 0 0 Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x2 (Ack) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0003 Jan 24 13:29:32 mshefty-linux2 kernel: New window.....0x0041 Jan 24 13:46:02 mshefty-linux2 su: (to root) mshefty on /dev/pts/3 From mst at mellanox.co.il Tue Jan 24 13:19:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jan 2006 23:19:52 +0200 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060124210232.GG27746@us.ibm.com> References: <20060124210232.GG27746@us.ibm.com> Message-ID: <20060124211952.GA31938@mellanox.co.il> Quoting r. 
Nishanth Aravamudan : > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > > Quoting r. Roland Dreier : > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > Michael> 1 sec = 5.37731e+14 usec > > > > > > Michael> which seems to indicate something's still wrong. > > > > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the > > > problem is probably still using long long to return the result of > > > mftb (which will result in shifting the result by 32 bits, ie > > > multiplying by 2^32). > > > > Hmm. > > Maybe make clean wasnt run after updating? > > Could it be un on rev 5174? > > Heh, here's what happens with 5174: > > Correlation coefficient r^2: 0.773428 < 0.9 > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > 1 sec = inf usec > > And so forth... > > Thanks, > Nish Hmm. Looks like mftb is returning wrong data. Could you uncomment lines setting DEBUG and DEBUG_DATA at the top? This will print all mftb values out. > > P.S. Is there any way to specify how long to run clock_test from the > command line? It's a bit of a pain the grid to kill a process... 
>
> Thanks,
> Nish

I'll make something up; for now I guess you can just add a hard-coded
counter to the for(;;) loop in clock_test.c

--
MST

From nacc at us.ibm.com Tue Jan 24 13:20:44 2006
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Tue, 24 Jan 2006 13:20:44 -0800
Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees)
In-Reply-To: <20060124211952.GA31938@mellanox.co.il>
References: <20060124210232.GG27746@us.ibm.com> <20060124211952.GA31938@mellanox.co.il>
Message-ID: <20060124212044.GH27746@us.ibm.com>

On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan :
> > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees)
> >
> > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote:
> > > Quoting r. Roland Dreier :
> > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees)
> > > >
> > > > Michael> 1 sec = 5.37731e+14 usec
> > > >
> > > > Michael> which seems to indicate something's still wrong.
> > > >
> > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the
> > > > problem is probably still using long long to return the result of
> > > > mftb (which will result in shifting the result by 32 bits, ie
> > > > multiplying by 2^32).
> > >
> > > Hmm.
> > > Maybe make clean wasn't run after updating?
> > > Could it be run on rev 5174?
> >
> > Heh, here's what happens with 5174:
> >
> > Correlation coefficient r^2: 0.773428 < 0.9
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> > 1 sec = inf usec
> >
> > And so forth...
> >
> > Thanks,
> > Nish
>
> Hmm. Looks like mftb is returning wrong data.
> Could you uncomment lines setting DEBUG and DEBUG_DATA at the top? > This will print all mftb values out. Sure -- but I won't be able to for a few hours, I have to run to class. Sorry! Thanks, Nish From halr at voltaire.com Tue Jan 24 13:27:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Jan 2006 16:27:31 -0500 Subject: [openib-general] OpenSM GET_TABLE path record question In-Reply-To: References: Message-ID: <1138138048.4338.43312.camel@hal.voltaire.com> On Tue, 2006-01-24 at 16:02, Sean Hefty wrote: > >I didn't do the exact query you mentioned but a get all PathRecords > >returned a loopback path in the very first record. Wouldn't madeye at > >least print out some headers and some hex ? I could verify from that if > >you want. > > Here's a dump from madeye. I see where the second path record is, and it looks > correct. I'm not sure where the 5a 5a 5a ... at the start of each MAD is coming > from. I'm not sure I trust the data output from madeye here. I've seen it on the analyzer and the MAD coming from OpenSM is good. > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
> Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x0 (Unknown) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x0 (Inactive) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: Data 2.........0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
> Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x3 (Active - First) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0001 > Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x027c > Jan 24 13:29:32 mshefty-linux2 kernel: > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a > 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 > e9 7f 5d 9c This looks like the start of the MAD to me... I don't know what the previous 48 bytes were... 
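For readers following the dump, here is a minimal sketch of how the bytes
starting at "1 3 2 92" line up with a MAD, assuming the usual 24-byte common
MAD header followed by the 12-byte RMPP header. The names mad_hdr_view,
get_be, decode_mad_hdr and sample_mad are hypothetical and are not taken from
madeye or the kernel; the sample bytes are copied from the first data segment
quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-order view of the common MAD header plus RMPP header. */
struct mad_hdr_view {
    uint8_t  base_version;
    uint8_t  mgmt_class;
    uint8_t  class_version;
    uint8_t  method;
    uint16_t status;
    uint16_t class_specific;
    uint64_t tid;
    uint16_t attr_id;
    uint32_t attr_mod;
    uint8_t  rmpp_version;
    uint8_t  rmpp_type;
    uint8_t  rmpp_resptime;   /* upper 5 bits of the third RMPP byte */
    uint8_t  rmpp_flags;      /* lower 3 bits of the third RMPP byte */
    uint8_t  rmpp_status;
    uint32_t seg_num;
    uint32_t paylen_newwin;
};

/* 24-byte common header + 12-byte RMPP header (assumed layout). */
#define MAD_RMPP_HDR_LEN 36u

/* Read an n-byte big-endian (network order) integer. */
static uint64_t get_be(const uint8_t *p, size_t n)
{
    uint64_t v = 0;

    while (n--)
        v = (v << 8) | *p++;
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short to hold both headers. */
int decode_mad_hdr(const uint8_t *buf, size_t len, struct mad_hdr_view *h)
{
    if (len < MAD_RMPP_HDR_LEN)
        return -1;

    h->base_version   = buf[0];
    h->mgmt_class     = buf[1];
    h->class_version  = buf[2];
    h->method         = buf[3];
    h->status         = (uint16_t)get_be(buf + 4, 2);
    h->class_specific = (uint16_t)get_be(buf + 6, 2);
    h->tid            = get_be(buf + 8, 8);
    h->attr_id        = (uint16_t)get_be(buf + 16, 2);
    /* bytes 18..19 are reserved */
    h->attr_mod       = (uint32_t)get_be(buf + 20, 4);
    h->rmpp_version   = buf[24];
    h->rmpp_type      = buf[25];
    h->rmpp_resptime  = (uint8_t)(buf[26] >> 3);
    h->rmpp_flags     = (uint8_t)(buf[26] & 0x7);
    h->rmpp_status    = buf[27];
    h->seg_num        = (uint32_t)get_be(buf + 28, 4);
    h->paylen_newwin  = (uint32_t)get_be(buf + 32, 4);
    return 0;
}

/* Bytes copied from the dump, starting at "1 3 2 92". */
const uint8_t sample_mad[36] = {
    0x01, 0x03, 0x02, 0x92, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x23, 0xe9, 0x7f, 0x5d, 0x9c,
    0x00, 0x35, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x01, 0x01, 0x03, 0x00, 0x00, 0x00, 0x00, 0x01,
    0x00, 0x00, 0x02, 0x7c,
};
```

Decoded this way, the sample gives method 0x92, attribute 0x35, RMPP flags
0x3, segment 1 and payload length 0x27c, which agrees with the fields madeye
printed. The transaction ID read in network order is 0x00000023e97f5d9c, while
the log shows 0x9c5d7fe923000000; that would be consistent with the value
being printed without a byte swap on a little-endian host, but that is only a
guess.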
> Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 3 0 0 > 0 0 1 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 2 7c 0 0 0 0 0 0 0 0 0 > 8 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 30 8 0 0 0 0 0 > 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 7 fc 5e 11 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 7 fc 5b e1 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 7 0 8 0 0 0 0 0 80 ff ff > 0 0 84 83 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........92 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 9 75 c4 11 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
> Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x2 (Ack) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0001 > Jan 24 13:29:32 mshefty-linux2 kernel: New window.....0x0041 > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) 
> Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0002 > Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a > 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 > e9 7f 5d 9c > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 1 0 0 > 0 0 2 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 0 0 0 0 0 0 0 > 8 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 30 8 fe 80 0 0 > 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 9 75 c1 71 fe 80 > 0 0 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 7 fc 5b e1 0 e 0 > 8 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 80 ff ff 0 0 84 83 92 0 > 0 0 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 0 0 0 0 0 
fe 80 0 0 > 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 a d2 5b 91 fe 80 > 0 0 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 2 c9 1 7 fc 5b e1 > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:recv GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) > Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x92 (Get table response) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x1 (Data) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x5 (Active - Last) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0003 > Jan 24 13:29:32 mshefty-linux2 kernel: Payload len....0x00c4 > Jan 24 13:29:32 mshefty-linux2 kernel: > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 1 0 0 0 0 0 5a 5a 5a > 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........5a 5a 5a 5a 5a 5a 5a 5a 5a > 5a 5a 5a 5a 5a 5a 5a > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........1 3 2 92 0 0 0 0 0 0 0 23 > e9 7f 5d 9c > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 35 0 0 0 0 0 0 1 1 5 0 0 > 0 0 3 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 0 c4 0 0 0 0 0 0 0 0 0 > 8 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 0 
0 0 0 0 30 8 0 2 c9 1 > 9 75 c5 d1 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 7 fc 5b e1 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 1 0 8 0 0 0 0 0 80 ff ff > 0 0 84 83 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........92 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 7 fc 5b e1 > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........fe 80 0 0 0 0 0 0 0 2 c9 1 > 7 fc 5b e1 Here's one loopback path (DGID = SGID). > Jan 24 13:29:32 mshefty-linux2 kernel: Data...........0 8 0 8 0 0 0 0 > Jan 24 13:29:32 mshefty-linux2 kernel: Madeye:sent GMP > Jan 24 13:29:32 mshefty-linux2 kernel: MAD version....0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: Class..........0x3 (Subnet admin.) > Jan 24 13:29:32 mshefty-linux2 kernel: Class version..0x2 > Jan 24 13:29:32 mshefty-linux2 kernel: Method.........0x12 (Get table) > Jan 24 13:29:32 mshefty-linux2 kernel: Status.........0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Class specific.0x00 > Jan 24 13:29:32 mshefty-linux2 kernel: Trans ID.......0x9c5d7fe923000000 > Jan 24 13:29:32 mshefty-linux2 kernel: Attr ID........0x35 (Path Record) > Jan 24 13:29:32 mshefty-linux2 kernel: Attr modifier..0x0000 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP version...0x1 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP type......0x2 (Ack) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP RRespTime.0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP flags.....0x1 (Active) > Jan 24 13:29:32 mshefty-linux2 kernel: RMPP status....0x0 > Jan 24 13:29:32 mshefty-linux2 kernel: Seg number.....0x0003 > Jan 24 13:29:32 mshefty-linux2 kernel: New window.....0x0041 > Jan 24 13:46:02 mshefty-linux2 su: (to root) mshefty on /dev/pts/3 > > From iod00d at hp.com Tue Jan 24 13:54:31 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 24 Jan 2006 13:54:31 -0800 Subject: [openib-general] OpenSM GET_TABLE path record question In-Reply-To: 
References: <1138131882.4338.42802.camel@hal.voltaire.com>
Message-ID: <20060124215431.GA2938@esmail.cup.hp.com>

On Tue, Jan 24, 2006 at 01:02:57PM -0800, Sean Hefty wrote:
> I'm not sure where the 5a 5a 5a ... at the start of each MAD is coming
> from.

Maybe this?

mm/slab.c:#define POISON_INUSE 0x5a /* for use-uninitialised poisoning */

grant

From mshefty at ichips.intel.com Tue Jan 24 14:05:41 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jan 2006 14:05:41 -0800
Subject: [openib-general] [PATCH] [MAD] set RMPP version even if RMPP is not active
In-Reply-To:
References:
Message-ID: <43D6A4B5.3070507@ichips.intel.com>

Sean Hefty wrote:
> The RMPP code actually checks the active flag before checking the version, which
> is wrong, but is why this has gone undetected.

On second thought, and checking the spec in more detail...

o13-21.1.3: States that if Active=0, all other fields shall be reserved.

and

o13-21.1.4: Adds that version shall have a value of 1 in all MADs where Active=1.

Given this, I think that the current code _is_ correct. It's the spec that
seems broken...

- Sean

From ardavis at ichips.intel.com Tue Jan 24 14:42:13 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 24 Jan 2006 14:42:13 -0800
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C337E@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F11C337E@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <43D6AD45.1030409@ichips.intel.com>

ok, maybe we should back up and start over....

This is exactly why immediate data was initially proposed as an
extension instead of a general API. We start to penalize native IB
features based on the requirements of other RDMA interfaces that have to
emulate the feature anyway. What prevents the next RDMA interface that
comes along from requiring other variations of the interface due to
implementation implications?
This is an IB-specific feature that does not map well on iWARP, so let's
just call it what it is and let IB providers supply immediate data
capabilities via the extension interface.

-arlin

Caitlin Bestler wrote:

>>
>>Maybe we need to just go back to one model and always deliver
>>via the event? With the post_recv_immed requirements, other
>>transports have a mechanism to emulate and create the
>>necessary resources on the recv side to place idata and copy
>>to event when operation is completed. Would this work for iWARP?
>>
>>Two different models for receiving idata should be avoided if
>>at all possible.
>>
>
>Always delivering by the event is not feasible for an iWARP vendor.
>If you are working over RDMAC verbs then the work completion is no
>longer accessible by the time the Work Completion is reaped. So copying
>from the receive buffer to the event does not work since the location
>of the receive buffer is now known only to the application.
>
>The same problem exists in the opposite direction for InfiniBand HCAs
>using standard verbs. They cannot copy from the CQE to the receive
>buffer.
>
>So the user is stuck checking a flag or the event type to know where
>their data is. This is not terribly user friendly, but it is the best
>that can be offered if we want to enable this optimization. The need
>to check the flag does reduce the value of the optimization though.
>
>>
>>6. Does dto_completion_data xfer_length include immediate_data
>>size or not?
>>
>>no
>>
>
>Then how does the receiver know how much data there is?
>
>Even if an iWARP Provider attempts to optimize immediate
>placement into the CQ, it will end up setting the xfer_length
>whenever the packet is received out of order.
>
>So it is far simpler for the application to simply know that
>the data will be in the buffer, and that the xfer_length will
>be set. It doesn't need to worry about whether they were set
>by the cq_poll verb or by the hardware.
>
>>
>>11. Need to clean up the operation description to make it clear
>>that the Send|RDMA_write and the immediate data part
>>are a single atomic operation. The current "followed by"
>>language is misleading.
>>
>>Make it explicit that there is a single local DTO completion
>>and a single remote DTO completion.
>>
>>Ok, I will clean that up
>>
>
>The best mapping available over RDMAC-compliant firmware for
>an iWARP NIC would be to post two operations (RDMA Write followed
>by a short Send). That would require additional space in the send
>and completion queues since a completion for the write can only
>be suppressed for a successful completion.
>
>Whether these extra slots were required would be an IA attribute.
>
>And the requirement is that nothing for that QP can come between
>the iWARP Write and the Send. How the provider does that is up
>to it. Options include locking over both posts and a composite
>work request. Anyone working over existing RDMAC-compliant
>verbs will have to use the first approach.
>
>>12. Is your intention that post_recv_immed can ONLY accept
>>immediate data and is not
>>capable of receiving any message?
>>
>>No, the intention is to extend the post_recv to handle 32-bit
>>idata which may arrive with or without other send or rdma_write data.
>>
>>Does it make more sense to add a dto_flags to the existing post_recv?
>>
>
>How does this map to iWARP?
>
>When the data can be sent as an immediate OR as data, then when received
>it can be placed into the receive buffer or even potentially directly
>into the CQ when everything aligns just right.
>
>But an iWARP sender has to place the immediate value as the first
>four bytes of a Send message. There is no other mapping that makes
>sense. Shoving the rest of the message up is complex, as is using
>the last four bytes of the message since the last four bytes *could*
>cross a DDP Segment boundary, and would require the user to provide
>a buffer that was 4 bytes larger.
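The "first four bytes of a Send message" mapping described in the quoted
text can be sketched roughly as below. This is only an illustration under
stated assumptions: pack_send_with_idata and unpack_idata are hypothetical
helper names, not DAT or verbs calls, and the choice of network byte order
for the 32-bit value is an assumption rather than something the thread
specifies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IDATA_LEN 4u   /* 32-bit immediate value carried in-band */

/*
 * Hypothetical sender-side helper: lay out [idata | payload] in dst.
 * Returns the total message length, or 0 if dst is too small.
 */
size_t pack_send_with_idata(uint8_t *dst, size_t dst_len, uint32_t idata,
                            const void *payload, size_t payload_len)
{
    if (dst_len < IDATA_LEN || payload_len > dst_len - IDATA_LEN)
        return 0;

    /* Assumption: the immediate value travels in network byte order. */
    dst[0] = (uint8_t)(idata >> 24);
    dst[1] = (uint8_t)(idata >> 16);
    dst[2] = (uint8_t)(idata >> 8);
    dst[3] = (uint8_t)idata;

    if (payload_len)
        memcpy(dst + IDATA_LEN, payload, payload_len);
    return IDATA_LEN + payload_len;
}

/*
 * Hypothetical receiver-side helper: split a received message back into
 * the immediate value and the remaining payload.
 * Returns 0 on success, -1 if the message is shorter than 4 bytes.
 */
int unpack_idata(const uint8_t *msg, size_t msg_len, uint32_t *idata,
                 const uint8_t **payload, size_t *payload_len)
{
    if (msg_len < IDATA_LEN)
        return -1;

    *idata = ((uint32_t)msg[0] << 24) | ((uint32_t)msg[1] << 16) |
             ((uint32_t)msg[2] << 8)  |  (uint32_t)msg[3];
    *payload = msg + IDATA_LEN;
    *payload_len = msg_len - IDATA_LEN;
    return 0;
}
```

Note that the receive buffer has to be 4 bytes larger than the payload in
this layout, which is the per-buffer cost discussed in the thread.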
From mshefty at ichips.intel.com Tue Jan 24 14:48:33 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jan 2006 14:48:33 -0800
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
In-Reply-To: <43D6AD45.1030409@ichips.intel.com>
References: <54AD0F12E08D1541B826BE97C98F99F11C337E@NT-SJCA-0751.brcm.ad.broadcom.com> <43D6AD45.1030409@ichips.intel.com>
Message-ID: <43D6AEC1.1040503@ichips.intel.com>

Arlin Davis wrote:
> This is exactly why immediate data was initially proposed as an
> extension instead of a general API. We start to penalize native IB
> features based on the requirements of other RDMA interfaces that have to
> emulate the feature anyway. What prevents the next RDMA interface that
> comes along from requiring other variations of the interface due to
> implementation implications? This is an IB-specific feature that does
> not map well on iWARP, so let's just call it what it is and let IB
> providers supply immediate data capabilities via the extension interface.

I completely agree.

- Sean

From halr at voltaire.com Tue Jan 24 15:19:38 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 Jan 2006 18:19:38 -0500
Subject: [openib-general] [PATCH] [MAD] set RMPP version even if RMPP is not active
In-Reply-To: <43D6A4B5.3070507@ichips.intel.com>
References: <43D6A4B5.3070507@ichips.intel.com>
Message-ID: <1138144488.4338.43955.camel@hal.voltaire.com>

On Tue, 2006-01-24 at 17:05, Sean Hefty wrote:
> Sean Hefty wrote:
> > The RMPP code actually checks the active flag before checking the version, which
> > is wrong, but is why this has gone undetected.
>
> On second thought, and checking the spec in more detail...
>
> o13-21.1.3: States that if Active=0, all other fields shall be reserved.
>
> and
>
> o13-21.1.4: Adds that version shall have a value of 1 in all MADs where Active=1.
>
> Given this, I think that the current code _is_ correct. It's the spec that
> seems broken...

Guess the active bit needs to always stay the same regardless of the
RMPP version.

-- Hal

From Arkady.Kanevsky at netapp.com Tue Jan 24 15:46:18 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 24 Jan 2006 18:46:18 -0500
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
Message-ID:

But this penalizes the user, who needs to deal with two ways of handling
post calls and completions.

I do not think we are too far from consensus.
A transport-independent app will allocate 4 extra bytes for buffers
that can match immediate data.
The completion data will indicate where the immediate data was returned
(the Consumer cannot request this when posting), and the completion event
will include 4 bytes for immediate data.

The rest is ironing out details for a complete specification.
This is no different than for any other newly proposed functionality.

And except for wasting 4 bytes per buffer or completion, I do not see
how it penalizes IB. Moreover, if the app knows that the Provider returns
immediate data in the completion event, it can avoid any penalty.

Arkady

Arkady Kanevsky                    email: arkady at netapp.com
Network Appliance Inc.             phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.       Fax: 781-895-1195
Waltham, MA 02451                  central phone: 781-768-5300

> -----Original Message-----
> From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> Sent: Tuesday, January 24, 2006 5:42 PM
> To: Caitlin Bestler
> Cc: Davis, Arlin R; Kanevsky, Arkady; Lentini, James;
> dat-discussions at yahoogroups.com; openib-general at openib.org
> Subject: Re: [openib-general] RE: [RFC] DAT 2.0 immediate
> data proposal
>
> ok, maybe we should back up and start over....
>
> This is exactly why immediate data was initially proposed as
> an extension instead of a general API. We start to penalize
> native IB features based on the requirements of other RDMA
> interfaces that have to emulate the feature anyway. What
> prevents the next RDMA interface that comes along from
> requiring other variations of the interface due to
> implementation implications? This is an IB-specific feature
> that does not map well on iWARP, so let's just call it what it
> is and let IB providers supply immediate data capabilities
> via the extension interface.
>
> -arlin
>
> Caitlin Bestler wrote:
>
> >>
> >>Maybe we need to just go back to one model and always deliver
> >>via the event? With the post_recv_immed requirements, other
> >>transports have a mechanism to emulate and create the
> >>necessary resources on the recv side to place idata and copy
> >>to event when operation is completed. Would this work for iWARP?
> >>
> >>Two different models for receiving idata should be avoided if
> >>at all possible.
> >>
> >
> >Always delivering by the event is not feasible for an iWARP vendor.
> >If you are working over RDMAC verbs then the work completion is no
> >longer accessible by the time the Work Completion is reaped. So copying
> >from the receive buffer to the event does not work since the location
> >of the receive buffer is now known only to the application.
> >
> >The same problem exists in the opposite direction for InfiniBand HCAs
> >using standard verbs. They cannot copy from the CQE to the receive
> >buffer.
> >
> >So the user is stuck checking a flag or the event type to know where
> >their data is. This is not terribly user friendly, but it is the best
> >that can be offered if we want to enable this optimization. The need
> >to check the flag does reduce the value of the optimization though.
> >
> >>
> >>6. Does dto_completion_data xfer_length include immediate_data
> >>size or not?
> >> > >> > >> > >>no > >> > >> > >> > >> > >> > > > >Then how does the receiver know how much data there is? > > > >Even if an iWarp Provider attempts to optimize immediate > >placement into the CQ, it will end up setting the xfer_length > >whenever the packet is received out of order. > > > >So it is far simpler for the application to simply know that > >the data will be in the buffer, and that the xfer_length will > >be set. It doesn't need to worry about whether they were set > >by the cq_poll verb or by the hardware. > > > > > > > >> > >>11. Need to cleanup operation description to make it clear > >>that Send|RDMA_write and immediate data part > >> > >>is a single atomic operation. The current "followed by" > >>language is misleading. > >> > >>Make it explicit that there is a single local DTO completion > >>and single remote DTO completion. > >> > >> > >> > >>Ok, I will clean that up > >> > >> > >> > >> > > > >The best mapping available over RDMAC-compliant firmware for > >an iWARP NIC would be to post two operations (RDMA Write followed > >by a short Send). That would require additional spacein the send > >and completion queues since a completion for the write can only > >be suppressed for a successful completion. > > > >Whether these extra slots were required would be an IA attribute. > > > >And the requirement is that nothing for that QP can come between > >the iWARP Write and the Send. How the provider does that is up > >to it. Options include locking over both posts and a composite > >work request. Anyone working over existing RDMAC-compliant > >verbs will have to use the first approach. > > > > > > > > > >>12. Is your intension that post_recv_immed can ONLY except > >>immediate data and is not > >> > >>capable to recv any message? > >> > >> > >> > >>No, the intention is to extend the post_recv to handle 32bit > >>idata which may arrive with or without other send or > rdma_write data. 
> >> > >> > >> > >>Does it make more sense to add a dto_flags to the existing > post_recv? > >> > >> > >> > >> > > > >How does this map to iWARP? > > > >When the data can be sent as an immediate OR as data, then > when received > >it can be placed into the receive buffer or even potentially directly > >into the CQ when everything aligns just right. > > > >But an iWARP sender has to place the immediate value as the first > >four bytes of a Send message. There is no other mapping than makes > >sense. Shoving the rest of the message up is complex, as is using > >the last four bytes of the message since the last four bytes *could* > >cross a DDP Segment boundary, and would require the user to provide > >a buffer that was 4 bytes larger. > > > > > From mshefty at ichips.intel.com Tue Jan 24 16:16:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jan 2006 16:16:41 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal In-Reply-To: References: Message-ID: <43D6C369.3070101@ichips.intel.com> Kanevsky, Arkady wrote: > But this penalizes users, who need to deal with two ways of handling > post calls and completions. Yes, any app that wants to take advantage of transport-specific features, which immediate data is, is no longer transport neutral. How do you plan to handle the next RDMA transport that comes along with 64 bytes of immediate data?
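The receive-side check debated in this thread (a flag in the completion says whether the 32-bit immediate value arrived in the completion event or in the first four bytes of the receive buffer) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `struct completion`, its fields and `get_idata` are hypothetical names, not part of DAT or of any verbs interface, and the value in the buffer is assumed to be in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical completion record; not a DAT or verbs structure. */
struct completion {
    int      idata_in_event; /* non-zero: provider put idata in the event */
    uint32_t event_idata;    /* valid only when idata_in_event is non-zero */
    size_t   xfer_length;    /* bytes placed in the receive buffer */
};

/*
 * Fetch the 32-bit immediate value for one receive completion.
 * When the value is not in the event, this sketch assumes it occupies
 * the first four bytes of the receive buffer (host byte order).
 * Returns 0 on success, -1 if the buffer holds fewer than four bytes.
 */
static int get_idata(const struct completion *c, const void *recv_buf,
                     uint32_t *idata)
{
    assert(c != NULL && idata != NULL);

    if (c->idata_in_event) {
        *idata = c->event_idata;
        return 0;
    }
    if (recv_buf == NULL || c->xfer_length < sizeof(*idata))
        return -1;
    memcpy(idata, recv_buf, sizeof(*idata));
    return 0;
}
```

A transport-neutral consumer would call `get_idata` on every receive completion and never look at the flag itself, which is the extra branch the thread is arguing about.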
- Sean From nacc at us.ibm.com Tue Jan 24 16:25:06 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 16:25:06 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060124211952.GA31938@mellanox.co.il> References: <20060124210232.GG27746@us.ibm.com> <20060124211952.GA31938@mellanox.co.il> Message-ID: <20060125002506.GI27746@us.ibm.com> On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Roland Dreier : > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > > > Michael> 1 sec = 5.37731e+14 usec > > > > > > > > Michael> which seems to indicate something's still wrong. > > > > > > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the > > > > problem is probably still using long long to return the result of > > > > mftb (which will result in shifting the result by 32 bits, ie > > > > multiplying by 2^32). > > > > > > Hmm. > > > Maybe make clean wasnt run after updating? > > > Could it be un on rev 5174? > > > > Heh, here's what happens with 5174: > > > > Correlation coefficient r^2: 0.773428 < 0.9 > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > 1 sec = inf usec > > > > And so forth... > > > > Thanks, > > Nish > > Hmm. Looks like mftb is returning wrong data. > Could you uncomment lines setting DEBUG and DEBUG_DATA at the top? > This will print all mftb values out. 
Here you go: x=100 y=21172 x=110 y=21985 x=120 y=23709 x=130 y=26136 x=140 y=27815 x=150 y=29919 x=160 y=32063 x=170 y=33921 x=180 y=35829 x=190 y=37941 x=200 y=40042 x=210 y=42064 x=220 y=43935 x=230 y=45957 x=240 y=47818 x=250 y=50004 x=260 y=51942 x=270 y=54173 x=280 y=56043 x=290 y=57951 x=300 y=59798 x=310 y=62034 x=320 y=63812 x=330 y=65928 x=340 y=67835 x=350 y=69929 x=360 y=71869 x=370 y=73876 x=380 y=75872 x=390 y=77927 x=400 y=79865 x=410 y=81923 x=420 y=84076 x=430 y=85937 x=440 y=87819 x=450 y=89999 x=460 y=91831 x=470 y=93975 x=480 y=95994 x=490 y=97743 x=500 y=99868 x=510 y=101917 x=520 y=103806 x=530 y=105724 x=540 y=107811 x=550 y=110012 x=560 y=111876 x=570 y=113934 x=580 y=115857 x=590 y=117735 x=600 y=119811 x=610 y=121907 x=620 y=124025 x=630 y=125983 x=640 y=127945 x=650 y=129729 x=660 y=131898 x=670 y=133723 x=680 y=135868 x=690 y=137742 x=700 y=139815 x=710 y=141800 x=720 y=143812 x=730 y=145768 x=740 y=147790 x=750 y=149935 x=760 y=151826 x=770 y=154003 x=780 y=155618 x=790 y=157746 x=800 y=159846 x=810 y=161874 x=820 y=163914 x=830 y=165901 x=840 y=167634 x=850 y=169793 x=860 y=171833 x=870 y=173895 x=880 y=175657 x=890 y=177807 x=900 y=179827 x=910 y=181875 x=920 y=183834 x=930 y=185884 x=940 y=187860 x=950 y=189922 x=960 y=191883 x=974 y=194497 x=980 y=195672 x=990 y=197907 x=1000 y=199626 x=1010 y=201735 x=1020 y=203874 x=1030 y=205728 x=1040 y=207861 x=1050 y=209737 x=1060 y=211836 x=1070 y=213749 x=1080 y=215880 x=1090 y=217817 x=1100 y=219729 x=1110 y=221755 x=1120 y=223704 x=1130 y=225783 x=1140 y=227729 x=1150 y=229786 x=1160 y=232030 x=1170 y=233770 x=1180 y=235803 x=1190 y=237791 x=1200 y=239841 x=1210 y=241768 x=1220 y=243862 x=1230 y=245735 x=1240 y=247678 x=1250 y=249795 x=1260 y=251850 x=1270 y=253771 x=1280 y=255840 x=1290 y=257928 x=1300 y=259880 x=1310 y=261752 x=1320 y=263702 x=1330 y=265601 x=1340 y=267671 x=1350 y=269570 x=1360 y=271756 x=1370 y=273750 x=1380 y=275702 x=1390 y=277765 x=1400 y=279695 x=1410 y=281620 
x=1420 y=283692 x=1430 y=285587 x=1440 y=287872 x=1450 y=289621 x=1460 y=291676 x=1470 y=293590 x=1480 y=295702 x=1490 y=297790 x=1500 y=299529 x=1510 y=301597 x=1520 y=303741 x=1530 y=305806 x=1540 y=307870 x=1550 y=309780 x=1560 y=311539 x=1570 y=313756 x=1580 y=315725 x=1590 y=317589 x=1600 y=319559 x=1610 y=321651 x=1620 y=323878 x=1630 y=325848 x=1640 y=327742 x=1650 y=329615 x=1660 y=331551 x=1670 y=333784 x=1680 y=335505 x=1690 y=337610 x=1700 y=339742 x=1710 y=341609 x=1720 y=343717 x=1730 y=345641 x=1740 y=347780 x=1750 y=349626 x=1760 y=351749 x=1770 y=353782 x=1780 y=355740 x=1790 y=357413 x=1800 y=359755 x=1810 y=361621 x=1820 y=363584 x=1830 y=365768 x=1840 y=367582 x=1850 y=369810 x=1860 y=371521 x=1870 y=373702 x=1880 y=375905 x=1890 y=377659 x=1900 y=379704 x=1910 y=381626 x=1920 y=383601 x=1930 y=385635 x=1940 y=387715 x=1950 y=389671 x=1960 y=391704 x=1970 y=393599 x=1980 y=395572 x=1990 y=397692 x=2000 y=399776 x=2010 y=401853 x=2020 y=403711 x=2030 y=405478 x=2040 y=407577 x=2050 y=409618 x=2060 y=411603 x=2070 y=413642 x=2080 y=415601 x=2090 y=417823 a = -8.02523 b = 199.818 a / b = -0.0401626 r^2 = 0.999999 Warning: measured timestamp frequency 199.818 differs from nominal 1600 MHz 1 sec = 1.00195e+06 usec 1 sec = 1.00198e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec 1 sec = 1.00207e+06 usec Just an FYI, when I tried redirecting stdout and stderr to a file, the program never would print the "1 sec = ..." lines. If I just redirect stderr (which I had to grab all the numbers above), it works fine. Dunno what's up with that or if it's just a problem on my end. > > P.S. Is there any way to specify how long to run clock_test from the > > command line? It's a bit of a pain the grid to kill a process... 
> > > > Thanks, > > Nish > > I'll make something up, for now I guess you can just add a hard-coded > counter in for(;;) in clock_test.c Great, thanks! That will make debugging with future revs a lot easier for me. Thanks, Nish From mst at mellanox.co.il Tue Jan 24 22:17:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jan 2006 08:17:29 +0200 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060125002506.GI27746@us.ibm.com> References: <20060125002506.GI27746@us.ibm.com> Message-ID: <20060125061729.GB24479@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote: > > Quoting r. Nishanth Aravamudan : > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > > > > Quoting r. Roland Dreier : > > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > > > > > Michael> 1 sec = 5.37731e+14 usec > > > > > > > > > > Michael> which seems to indicate something's still wrong. > > > > > > > > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the > > > > > problem is probably still using long long to return the result of > > > > > mftb (which will result in shifting the result by 32 bits, ie > > > > > multiplying by 2^32). > > > > > > > > Hmm. > > > > Maybe make clean wasnt run after updating? > > > > Could it be un on rev 5174? 
> > > [snip...] > > Hmm. Looks like mftb is returning wrong data. > > Could you uncomment lines setting DEBUG and DEBUG_DATA at the top? > > This will print all mftb values out. > > Here you go: > [snip...] > a = -8.02523 > b = 199.818 > a / b = -0.0401626 > r^2 = 0.999999 > Warning: measured timestamp frequency 199.818 differs from nominal 1600 MHz > 1 sec = 1.00195e+06 usec > 1 sec = 1.00198e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec > 1 sec = 1.00207e+06 usec Seems to work fine now ... what changed? Time to try rdma_lat/rdma_bw I guess. > Just an FYI, when I tried redirecting stdout and stderr to a file, the > program never would print the "1 sec = ..." lines. If I just redirect > stderr (which I had to grab all the numbers above), it works fine. Dunno > what's up with that or if it's just a problem on my end. Probably, I'm just using stderr/stdout. -- MST From nacc at us.ibm.com Tue Jan 24 23:14:19 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 24 Jan 2006 23:14:19 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060125061729.GB24479@mellanox.co.il> References: <20060125002506.GI27746@us.ibm.com> <20060125061729.GB24479@mellanox.co.il> Message-ID: <20060125071419.GL27746@us.ibm.com> On 25.01.2006 [08:17:29 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Nishanth Aravamudan : > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > > > > > Quoting r.
Roland Dreier : > > > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > [snip...] > > Seems to work fine now ... what changed? > Time to try rdma_lat/rdma_bw I guess. Yes, I saw the same thing and was confused. Admittedly, this was a few times after running clock_test. I'll try again after the next set of jobs and see if maybe at some point it fails transiently. > > Just an FYI, when I tried redirecting stdout and stderr to a file, the > > program never would print the "1 sec = ..." lines. If I just redirect > > stderr (which I had to grab all the numbers above), it works fine. Dunno > > what's up with that or if it's just a problem on my end. > > Probably, I'm just using stderr/stdout. Not a huge deal, I'll try and debug here. Thanks, Nish From ogerlitz at voltaire.com Wed Jan 25 00:32:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jan 2006 10:32:19 +0200 Subject: [openib-general] Reregister Memory Region Verb In-Reply-To: <83f77db82a517a173a113e2a66eda541@lanl.gov> References: <83f77db82a517a173a113e2a66eda541@lanl.gov> Message-ID: <43D73793.1080301@voltaire.com> Galen Shipman wrote: > I would like to be able to extend an existing registration such that the > driver would take advantage of the fact that part of the extended > registration is already registered, i.e. only the "new" memory would be > pinned and made resident. Assuming you are referring to registration of virtual memory from user space, this seems like a request for the hw drivers to support overlapping memory regions, in the sense that MR A can overlap with MR B such that until both A and B are unregistered the overlapped section stays in place (pinned, resident, mapped in the HCA MMU, etc.).
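The overlap semantics described above can be modelled with a per-page pin count: a page is pinned when the first region covering it is registered and released only when the last one goes away. The sketch below is a toy model under that assumption, not how any HCA driver actually tracks registrations; `pin_count`, `mr_register` and `mr_unregister` are hypothetical names, and the "address space" is just 16 page slots.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 16 /* toy address space: 16 pages */

/* Per-page pin count; a page stays pinned while its count is non-zero. */
static unsigned pin_count[NPAGES];

/*
 * Register pages [first, first + n).
 * Returns how many pages were newly pinned (count went 0 -> 1), i.e. the
 * only pages for which real pinning work would be needed.
 */
static size_t mr_register(size_t first, size_t n)
{
    size_t newly_pinned = 0;

    assert(first + n <= NPAGES);
    for (size_t p = first; p < first + n; p++)
        if (pin_count[p]++ == 0)
            newly_pinned++;
    return newly_pinned;
}

/*
 * Unregister pages [first, first + n).
 * Returns how many pages became unpinned (count went 1 -> 0); pages still
 * covered by another region keep their pin.
 */
static size_t mr_unregister(size_t first, size_t n)
{
    size_t released = 0;

    assert(first + n <= NPAGES);
    for (size_t p = first; p < first + n; p++) {
        assert(pin_count[p] > 0);
        if (--pin_count[p] == 0)
            released++;
    }
    return released;
}
```

With MR A on pages 0..3 and MR B on pages 2..7, registering B pins only pages 4..7, and unregistering A releases only pages 0 and 1; pages 2 and 3 stay pinned until B is unregistered as well.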
> although we would prefer that it wouldn't block if possible mmm. All the current memory registration verbs, both user and kernel, are blocking; is that an issue for you? Or. From ogerlitz at voltaire.com Wed Jan 25 06:03:34 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jan 2006 16:03:34 +0200 (IST) Subject: [openib-general] [PATCH] iser: introduce struct iser_desc Message-ID: committed in r5181 introduced struct iser_desc having four types: rx, tx control/command/dataout, removed the login/headers/dto/regd kmem_caches and struct dtask with its mempool. Signed-off-by: Or Gerlitz Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 5180) +++ ulp/iser/iser_conn.c (working copy) @@ -189,51 +189,25 @@ int iser_conn_bind(struct iscsi_iser_con p_iser_conn->p_iscsi_conn = iscsi_conn; iscsi_conn->ib_conn = p_iser_conn; - /* MERGE_ADDED_CHANGE moved here from ic_establish, before LOGIN sent */ - iser_dbg("postrecv_cache = ig.login_cache\n"); - iscsi_conn->postrecv_cache = ig.login_cache; - iscsi_conn->postrecv_bsize = ISER_LOGIN_PHASE_PDU_DATA_LEN; - sprintf(iscsi_conn->name,"%d.%d.%d.%d", - NIPQUAD(iscsi_conn->ib_conn->dst_addr)); + sprintf(iscsi_conn->name,"%d.%d.%d.%d:%d", + NIPQUAD(iscsi_conn->ib_conn->dst_addr), + iscsi_conn->ib_conn->dst_port); return 0; } /** - * iser_conn_enable_rdma - iSER API. Implements - * Allocate_Connection_Resources and Enable_Datamover primitives.
- * + * iser_conn_set_full_featured_mode - (iSER API) */ int iser_conn_set_full_featured_mode(struct iscsi_iser_conn *p_iser_conn) { - int i,err = 0; + int i, err = 0; /* no need to keep it in a var, we are after login so if this should * be negotiated, by now the result should be available here */ int initial_post_recv_bufs_num = ISER_INITIAL_POST_RECV + 2; - p_iser_conn->postrecv_cache = NULL; - iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); - sprintf(p_iser_conn->postrecv_cn,"prcv_%d.%d.%d.%d:%d", - NIPQUAD(p_iser_conn->ib_conn->dst_addr),p_iser_conn->ib_conn->dst_port); - - /* Allocate recv buffers for the full-featured phase */ - - /* FIXME should be a param eg p_iser_conn->initiator_max_recv_dsl; */ - p_iser_conn->postrecv_bsize = defaultInitiatorRecvDataSegmentLength; - - p_iser_conn->postrecv_cache = - kmem_cache_create(p_iser_conn->postrecv_cn, - p_iser_conn->postrecv_bsize, - 0,SLAB_HWCACHE_ALIGN, NULL, NULL); - if (p_iser_conn->postrecv_cache == NULL) { - iser_err("Failed to allocate post recv cache\n"); - err = -ENOMEM; - goto ffeatured_mode_failure; - } - - /* Check that there is no posted recv or send buffers left - */ /* they must be consumed during the login phase */ if (atomic_read(&p_iser_conn->post_recv_buf_count) != 0) @@ -246,7 +220,7 @@ int iser_conn_set_full_featured_mode(str if (iser_post_receive_control(p_iser_conn) != 0) { iser_err("Failed to post recv bufs at:%d conn:0x%p\n", i, p_iser_conn); - err = -ENOMEM; + err = -ENOMEM; goto ffeatured_mode_failure; } } @@ -256,10 +230,6 @@ int iser_conn_set_full_featured_mode(str return 0; ffeatured_mode_failure: - if(p_iser_conn->postrecv_cache) { - kmem_cache_destroy(p_iser_conn->postrecv_cache); - p_iser_conn->postrecv_cache = NULL; - } return err; } @@ -372,9 +342,6 @@ void iser_conn_release(struct iser_conn p_iscsi_conn = p_iser_conn->p_iscsi_conn; if(p_iscsi_conn != NULL && p_iscsi_conn->ff_mode_enabled) { - if(kmem_cache_destroy(p_iscsi_conn->postrecv_cache) != 0) - 
iser_err("postrecv cache %s not empty, leak!\n", - p_iscsi_conn->postrecv_cn); p_iscsi_conn->ff_mode_enabled = 0; } /* release socket with conn descriptor */ @@ -440,70 +407,74 @@ int iser_complete_conn_termination(struc */ int iser_post_receive_control(struct iscsi_iser_conn *p_iser_conn) { - struct iser_adaptor *p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; - struct iser_dto *p_recv_dto; - struct iser_regd_buf *p_regd_buf; - int err = 0; - int i; + struct iser_desc *rx_desc; + struct iser_regd_buf *p_regd_hdr; + struct iser_regd_buf *p_regd_data; + struct iser_dto *p_recv_dto = NULL; + struct iser_adaptor *p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; + int rx_data_size, err = 0; - /* Create & init send DTO descriptor */ - iser_dbg( "Alloc post-recv DTO descriptor\n"); - p_recv_dto = kmem_cache_alloc(ig.dto_cache, + rx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL | __GFP_NOFAIL); - if (p_recv_dto == NULL) { - iser_err("Failed to alloc DTO desc for post recv buffer\n"); - err = -ENOMEM; - goto post_receive_control_exit; - } - iser_dto_init(p_recv_dto); - p_recv_dto->p_conn = p_iser_conn; - p_recv_dto->type = ISER_DTO_RCV; - - iser_dbg("Allocate iSER header buffer\n"); - p_regd_buf = iser_regd_mem_alloc(p_iser_adaptor, - ig.header_cache, - ISER_TOTAL_HEADERS_LEN); - if (p_regd_buf == NULL) { - iser_err("Failed to alloc regd buf (post-recv-buf hdr)\n"); + if(rx_desc == NULL) { + iser_err("Failed to alloc desc for post recv\n"); err = -ENOMEM; goto post_receive_control_exit; } + rx_desc->type = ISCSI_RX; + + /* for the login sequence we must support rx of upto 8K * + * FIXME need better preditace to test whether we are logged in */ + if(!p_iser_conn->ff_mode_enabled) + rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; + else /* FIXME till user space sets conn->max_recv_dlength correctly */ + rx_data_size = 1024; - /* DMA_MAP: safe to dma_map now - map and invalidate the cache */ - iser_reg_single(p_iser_adaptor,p_regd_buf, DMA_FROM_DEVICE); + /* FIXME 
need to ensure this is HW cache start/end aligned */ + rx_desc->data = kmalloc(rx_data_size, GFP_KERNEL | __GFP_NOFAIL); - i = iser_dto_add_regd_buff(p_recv_dto, p_regd_buf, - USE_NO_OFFSET, - USE_ENTIRE_SIZE); - iser_dbg("Added header buffer 0x%p to DTO as entry: %d\n", - p_regd_buf, i); - - /* Create an iSER data buffer */ - p_regd_buf = iser_regd_mem_alloc(p_iser_adaptor, - p_iser_conn->postrecv_cache, - p_iser_conn->postrecv_bsize); - if (p_regd_buf == NULL) { - iser_err("Failed to alloc regd buf (post-recv-buf data)\n"); + if(rx_desc->data == NULL) { + iser_err("Failed to alloc data buf for post recv\n"); err = -ENOMEM; goto post_receive_control_exit; + } - iser_dbg("Allocated iSER data buffer from postrecv_cache 0x%p\n", - p_regd_buf->virt_addr); - /* DMA_MAP: safe to dma_map now - map and invalidate the cache */ - iser_reg_single(p_iser_adaptor,p_regd_buf, DMA_FROM_DEVICE); + p_recv_dto = &rx_desc->dto; + p_recv_dto->p_conn = p_iser_conn; + p_recv_dto->regd_vector_len = 0; + + p_regd_hdr = &rx_desc->hdr_regd_buf; + memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); + p_regd_hdr->p_adaptor = p_iser_adaptor; + p_regd_hdr->virt_addr = rx_desc; /* == &rx_desc->iser_header */ + p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + iser_reg_single(p_iser_adaptor, p_regd_hdr, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(p_recv_dto, p_regd_hdr, USE_NO_OFFSET, + USE_ENTIRE_SIZE); + + p_regd_data = &rx_desc->data_regd_buf; + memset(p_regd_data, 0, sizeof(struct iser_regd_buf)); + p_regd_data->p_adaptor = p_iser_adaptor; + p_regd_data->virt_addr = rx_desc->data; + p_regd_data->data_size = rx_data_size; + + iser_reg_single(p_iser_adaptor, p_regd_data, DMA_FROM_DEVICE); - i = iser_dto_add_regd_buff(p_recv_dto, p_regd_buf, - USE_NO_OFFSET, USE_ENTIRE_SIZE); - iser_dbg("Added data buffer 0x%p to DTO as entry: %d\n", - p_regd_buf, i); + iser_dto_add_regd_buff(p_recv_dto, p_regd_data, + USE_NO_OFFSET, USE_ENTIRE_SIZE); atomic_inc(&p_iser_conn->post_recv_buf_count); - err = 
iser_post_recv(p_recv_dto); + err = iser_post_recv(rx_desc); post_receive_control_exit: - if (err && p_recv_dto != NULL) { + if(err && rx_desc) { iser_dto_free(p_recv_dto); + if(rx_desc->data != NULL) + kfree(rx_desc->data); + kmem_cache_free(ig.desc_cache, rx_desc); atomic_dec(&p_iser_conn->post_recv_buf_count); } return err; Index: ulp/iser/iser_mod.c =================================================================== --- ulp/iser/iser_mod.c (revision 5180) +++ ulp/iser/iser_mod.c (working copy) @@ -69,21 +69,6 @@ struct iser_global ig; static void iser_global_release(void); -static kmem_cache_t *iser_mem_cache_create(const char *cache_name, - unsigned int obj_size) -{ - kmem_cache_t *p_cache; - - p_cache = kmem_cache_create(cache_name, obj_size, - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (p_cache == NULL) { - iser_err("Failed to alloc cache: %s\n", cache_name); - iser_global_release(); - } - return p_cache; -} - /** * init_module - module initialization function */ @@ -95,24 +80,11 @@ int init_module(void) memset(&ig, 0, sizeof(struct iser_global)); - ig.header_cache = iser_mem_cache_create("iser_headers", - ISER_TOTAL_HEADERS_LEN); - if (ig.header_cache == NULL) - return -ENOMEM; - - ig.regd_buf_cache = iser_mem_cache_create("iser_regbuf", - sizeof(struct iser_regd_buf)); - if (ig.regd_buf_cache == NULL) - return -ENOMEM; - - ig.login_cache = iser_mem_cache_create("iser_login", - ISER_LOGIN_PHASE_PDU_DATA_LEN); - if (ig.login_cache == NULL) - return -ENOMEM; - - ig.dto_cache = iser_mem_cache_create("iser_dto", - sizeof(struct iser_dto)); - if (ig.dto_cache == NULL) + ig.desc_cache = kmem_cache_create("iser_descriptors", + sizeof (struct iser_desc), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ig.desc_cache == NULL) return -ENOMEM; /* adaptor init is called only after the first addr resolution */ @@ -135,6 +107,7 @@ int init_module(void) */ static void iser_global_release(void) { + int err; struct iser_adaptor *p_adaptor; iscsi_iser_exit(); @@ -148,22 +121,13 
@@ static void iser_global_release(void) ig.num_adaptors--; } - if (ig.dto_cache != NULL) { - kmem_cache_destroy(ig.dto_cache); - ig.dto_cache = NULL; - } - if (ig.login_cache != NULL) { - kmem_cache_destroy(ig.login_cache); - ig.login_cache = NULL; - } - if (ig.regd_buf_cache != NULL) { - kmem_cache_destroy(ig.regd_buf_cache); - ig.regd_buf_cache = NULL; - } - if (ig.header_cache != NULL) { - kmem_cache_destroy(ig.header_cache); - ig.header_cache = NULL; + if (ig.desc_cache != NULL) { + err = kmem_cache_destroy(ig.desc_cache); + if(err) + iser_err("kmem_cache_destory returned %d\n",err); + ig.desc_cache = NULL; } + iser_unreg_sockets(); } Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5180) +++ ulp/iser/iscsi_iser.h (working copy) @@ -173,6 +173,29 @@ struct rdma_cm_id; struct ib_qp; struct iscsi_iser_cmd_task; + +struct iser_mem_reg { + u32 lkey; + u32 rkey; + u64 va; + u64 len; + void *mem_h; +}; + +struct iser_regd_buf { + struct iser_mem_reg reg; /* memory registration info */ + kmem_cache_t *data_cache; /* data allocated from here, when set */ + void *virt_addr; + + struct iser_adaptor *p_adaptor; /* p_adaptor->device for dma_unmap */ + dma_addr_t dma_addr; /* if non zero, addr for dma_unmap */ + enum dma_data_direction direction; /* direction for dma_unmap */ + unsigned int data_size; + + /* Reference count, memory freed when decremented to 0 */ + atomic_t ref_count; +}; + #define MAX_REGD_BUF_VECTOR_LEN 2 enum iser_dto_type { @@ -186,7 +209,6 @@ enum iser_dto_type { struct iser_dto { struct iscsi_iser_cmd_task *p_task; struct iscsi_iser_conn *p_conn; - enum iser_dto_type type; int notify_enable; /* vector of registered buffers */ @@ -198,8 +220,24 @@ struct iser_dto { unsigned int used_sz[MAX_REGD_BUF_VECTOR_LEN]; }; -enum iser_op_param_default { - defaultInitiatorRecvDataSegmentLength = 128, +enum iser_desc_type { + ISCSI_RX, + ISCSI_TX_CONTROL , + ISCSI_TX_SCSI_COMMAND, 
+ ISCSI_TX_DATAOUT +}; + +struct iser_desc { + struct iser_hdr iser_header; + struct iscsi_hdr iscsi_header; + + struct iser_regd_buf hdr_regd_buf; + + void *data; /* used by RX & TX_CONTROL types */ + struct iser_regd_buf data_regd_buf; /* used by RX & TX_CONTROL types */ + + enum iser_desc_type type; + struct iser_dto dto; }; struct iser_conn @@ -232,10 +270,6 @@ struct iscsi_iser_conn struct list_head adaptor_list; /* entry in the adaptor's conn list */ - kmem_cache_t *postrecv_cache; - unsigned int postrecv_bsize; - char postrecv_cn[32]; - atomic_t post_recv_buf_count; atomic_t post_send_buf_count; @@ -280,14 +314,16 @@ struct iscsi_iser_queue { }; struct iscsi_iser_mgmt_task { - struct iscsi_hdr hdr; + struct iser_desc desc; + struct iscsi_hdr *hdr; /* points to desc->iscsi_hdr */ uint32_t itt; /* this ITT */ - char *data; /* mgmt payload */ + char *data; /* mgmt payload, points to desc->data */ int data_count; /* counts data to be sent */ }; struct iscsi_iser_cmd_task { - struct iscsi_cmd hdr; /* iSCSI PDU header */ + struct iser_desc desc; + struct iscsi_cmd *hdr; /* iSCSI PDU header points to desc->iscsi_hdr */ int itt; /* this ITT */ struct iscsi_iser_conn *conn; spinlock_t task_lock; @@ -311,26 +347,15 @@ struct iscsi_iser_cmd_task { int data_offset; struct iscsi_iser_mgmt_task *mtask; /* tmf mtask in progr */ - struct list_head dataqueue; /* Data-Out dataqueue */ - mempool_t *datapool; - - struct iscsi_iser_data_task *dtask; /* data task in progress*/ - unsigned int post_send_count; /* posted send buffers pending completion */ int dir[ISER_DIRS_NUM]; /* set if direction used */ - struct iser_regd_buf *rdma_regd[ISER_DIRS_NUM]; /* regd rdma buffer */ + struct iser_regd_buf rdma_regd[ISER_DIRS_NUM]; /* regd rdma buffer */ unsigned long data_len[ISER_DIRS_NUM]; /* total data length */ struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data descriptor */ struct iser_data_buf data_copy[ISER_DIRS_NUM]; /* contig. 
copy */ }; -struct iscsi_iser_data_task { - struct iscsi_data hdr; /* PDU */ - struct list_head item; /* data queue item */ -}; -#define ISCSI_DTASK_DEFAULT_MAX ISCSI_ISER_SG_TABLESIZE * PAGE_SIZE / 512 - struct iscsi_iser_session { /* iSCSI session-wide sequencing */ @@ -372,9 +397,6 @@ struct iscsi_iser_session int erl; }; -/* Various size limits */ -#define ISER_LOGIN_PHASE_PDU_DATA_LEN (8*1024) /* 8K */ - struct iser_page_vec { u64 *pages; int length; @@ -382,28 +404,6 @@ struct iser_page_vec { int data_size; }; -struct iser_mem_reg { - u32 lkey; - u32 rkey; - u64 va; - u64 len; - void *mem_h; -}; - -struct iser_regd_buf { - struct iser_mem_reg reg; /* memory registration info */ - kmem_cache_t *data_cache; /* data allocated from here, when set */ - void *virt_addr; - - struct iser_adaptor *p_adaptor; /* p_adaptor->device for dma_unmap */ - dma_addr_t dma_addr; /* if non zero, addr for dma_unmap */ - enum dma_data_direction direction; /* direction for dma_unmap */ - unsigned int data_size; - - /* Reference count, memory freed when decremented to 0 */ - atomic_t ref_count; -}; - struct iser_adaptor { struct list_head ig_list; /* entry in ig adaptors list */ @@ -426,11 +426,7 @@ struct iser_global { struct semaphore adaptor_list_sem; /* */ struct list_head adaptor_list; /* all iSER adaptors */ - kmem_cache_t *dto_cache; /* slab for iser_dto */ - kmem_cache_t *regd_buf_cache; /* slab iser_regd_buf */ - - kmem_cache_t *login_cache; - kmem_cache_t *header_cache; + kmem_cache_t *desc_cache; }; /* iser_global */ extern struct iser_global ig; @@ -517,8 +513,6 @@ void iser_adaptor_add_conn(struct iser_a #define USE_SIZE(size) (size) #define USE_ENTIRE_SIZE 0 -void iser_dto_init(struct iser_dto *p_dto); - int iser_dto_add_regd_buff(struct iser_dto *p_dto, struct iser_regd_buf *p_regd_buf, unsigned long use_offset, @@ -526,30 +520,23 @@ int iser_dto_add_regd_buff(struct iser_d void iser_dto_free(struct iser_dto *p_dto); -int iser_dto_completion_error(struct iser_dto 
*p_dto); - -void iser_dto_get_rx_pdu_data(struct iser_dto *p_dto, - unsigned long dto_xfer_len, - struct iscsi_hdr **p_rx_hdr, - char **rx_data, int *rx_data_size); - -struct iser_dto *iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, - struct iscsi_hdr *p_hdr, - unsigned char **p_header); +int iser_dto_completion_error(struct iser_desc *p_desc); +void iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, + struct iser_desc *tx_desc); /* iser_initiator.h */ -void iser_rcv_dto_completion(struct iser_dto *p_dto, +void iser_rcv_completion(struct iser_desc *p_desc, unsigned long dto_xfer_len); -void iser_snd_dto_completion(struct iser_dto *p_dto); +void iser_snd_completion(struct iser_desc *p_desc); /* iser_memory.h */ /* regd_buf */ -struct iser_regd_buf *iser_regd_buf_alloc(void); +//struct iser_regd_buf *iser_regd_buf_alloc(void); struct iser_regd_buf *iser_regd_mem_alloc(struct iser_adaptor *p_iser_adaptor, kmem_cache_t *cache, @@ -686,7 +673,7 @@ int iser_reg_phys_mem(struct iser_conn * void iser_unreg_mem(struct iser_mem_reg *mem_reg); -int iser_post_recv(struct iser_dto *p_dto); -int iser_start_send(struct iser_dto *p_dto); +int iser_post_recv(struct iser_desc *p_rx_desc); +int iser_start_send(struct iser_desc *p_tx_desc); #endif Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 5180) +++ ulp/iser/iser_verbs.c (working copy) @@ -567,12 +567,13 @@ static void iser_dto_to_iov(struct iser_ * * returns 0 on success, -1 on failure */ -int iser_post_recv(struct iser_dto *p_recv_dto) +int iser_post_recv(struct iser_desc *p_rx_desc) { int ib_ret, ret_val = 0; struct ib_recv_wr recv_wr, *recv_wr_failed; struct ib_sge iov[2]; struct iscsi_iser_conn *p_iser_conn; + struct iser_dto *p_recv_dto = &p_rx_desc->dto; /* Retrieve conn */ p_iser_conn = p_recv_dto->p_conn; @@ -584,7 +585,7 @@ int iser_post_recv(struct iser_dto *p_re recv_wr.next = NULL; recv_wr.sg_list = iov; 
recv_wr.num_sge = p_recv_dto->regd_vector_len; - recv_wr.wr_id = (unsigned long)p_recv_dto; + recv_wr.wr_id = (unsigned long)p_rx_desc; ib_ret = ib_post_recv (p_iser_conn->ib_conn->qp, &recv_wr, &recv_wr_failed); @@ -601,12 +602,13 @@ int iser_post_recv(struct iser_dto *p_re * * returns 0 on success, -1 on failure */ -int iser_start_send(struct iser_dto *p_dto) +int iser_start_send(struct iser_desc *p_tx_desc) { int ib_ret, ret_val = 0; struct ib_send_wr send_wr, *send_wr_failed; struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; struct iscsi_iser_conn *p_iser_conn; + struct iser_dto *p_dto = &p_tx_desc->dto; if (p_dto == NULL) iser_bug("NULL p_dto\n"); @@ -618,7 +620,7 @@ int iser_start_send(struct iser_dto *p_d iser_dto_to_iov(p_dto, iov, MAX_REGD_BUF_VECTOR_LEN); send_wr.next = NULL; - send_wr.wr_id = (unsigned long)p_dto; + send_wr.wr_id = (unsigned long)p_tx_desc; send_wr.sg_list = iov; send_wr.num_sge = p_dto->regd_vector_len; send_wr.opcode = IB_WR_SEND; @@ -640,13 +642,13 @@ int iser_start_send(struct iser_dto *p_d } static void iser_handle_comp_error(enum ib_wc_status status, - struct iser_dto *p_dto) + struct iser_desc *p_desc) { int ret_val; - struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; + struct iscsi_iser_conn *p_iser_conn = p_desc->dto.p_conn; if(p_iser_conn == NULL) - iser_bug("NULL p_dto->p_conn \n"); + iser_bug("NULL p_desc->p_conn \n"); /* Since the cma doesn't notify us on CONNECTION_EVENT_BROKEN * * we need to initiate a disconn */ @@ -664,7 +666,7 @@ static void iser_handle_comp_error(enum iser_dbg("Conn. 
0x%p is being terminated asynchronously\n", p_iser_conn); } /* Handle completion Error */ - ret_val = iser_dto_completion_error(p_dto); + ret_val = iser_dto_completion_error(p_desc); if (ret_val && ret_val != -EAGAIN) iser_err("Failed to handle ERROR DTO completion\n"); } @@ -674,24 +676,24 @@ void iser_cq_tasklet_fn(unsigned long da struct iser_adaptor *p_iser_adaptor = (struct iser_adaptor *)data; struct ib_cq *cq = p_iser_adaptor->cq; struct ib_wc wc; - struct iser_dto *p_dto; + struct iser_desc *p_desc; unsigned long xfer_len; while (ib_poll_cq(cq, 1, &wc) == 1) { - p_dto = (struct iser_dto *) (unsigned long) wc.wr_id; + p_desc = (struct iser_desc *) (unsigned long) wc.wr_id; - if (p_dto == NULL || p_dto->type >= ISER_DTO_PASSIVE) - iser_bug("NULL p_dto %p or unexpected type\n", p_dto); + if (p_desc == NULL) + iser_bug("NULL p_desc\n"); if (wc.status == IB_WC_SUCCESS) { - if (p_dto->type == ISER_DTO_RCV) { + if (p_desc->type == ISCSI_RX) { xfer_len = (unsigned long)wc.byte_len; - iser_rcv_dto_completion(p_dto, xfer_len); - } else /* p_dto->type == ISER_DTO_SEND */ - iser_snd_dto_completion(p_dto); + iser_rcv_completion(p_desc, xfer_len); + } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ + iser_snd_completion(p_desc); } else /* #warning "we better do a context jump here" */ - iser_handle_comp_error(wc.status, p_dto); + iser_handle_comp_error(wc.status, p_desc); } /* #warning "it is assumed here that arming CQ only once its empty would not" * "cause interrupts to be missed" */ Index: ulp/iser/iser_task.c =================================================================== --- ulp/iser/iser_task.c (revision 5180) +++ ulp/iser/iser_task.c (working copy) @@ -42,17 +42,20 @@ void iser_task_init_lowpart(struct iscsi spin_lock_init(&p_iser_task->task_lock); p_iser_task->status = ISER_TASK_STATUS_INIT; p_iser_task->post_send_count = 0; - + p_iser_task->dir[ISER_DIR_IN] = 0; p_iser_task->dir[ISER_DIR_OUT] = 0; - + p_iser_task->data_len[ISER_DIR_IN] = 0; 
p_iser_task->data_len[ISER_DIR_OUT] = 0; - - p_iser_task->rdma_regd[ISER_DIR_IN] = NULL; - p_iser_task->rdma_regd[ISER_DIR_OUT] = NULL; + + memset(&p_iser_task->rdma_regd[ISER_DIR_IN], 0, + sizeof(struct iser_regd_buf)); + memset(&p_iser_task->rdma_regd[ISER_DIR_OUT], 0, + sizeof(struct iser_regd_buf)); } + /** * iser_task_post_send_count_inc - Increments counter of * post-send buffers pending send completion @@ -112,13 +115,14 @@ void iser_task_finalize_lowpart(struct i spin_lock_bh(&p_iser_task->task_lock); if (p_iser_task->dir[ISER_DIR_IN]) { - deferred = iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_IN]); + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_IN]); if (deferred) iser_bug("References remain for BUF-IN rdma reg\n"); } - if (p_iser_task->dir[ISER_DIR_OUT] && - p_iser_task->rdma_regd[ISER_DIR_OUT] != NULL) { - deferred = iser_regd_buff_release(p_iser_task->rdma_regd[ISER_DIR_OUT]); + if (p_iser_task->dir[ISER_DIR_OUT]) { + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_OUT]); if (deferred) iser_bug("References remain for BUF-OUT rdma reg\n"); } Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 5180) +++ ulp/iser/iser_initiator.c (working copy) @@ -65,10 +65,8 @@ static int iser_reg_rdma_mem(struct iscs else priv_flags |= IB_ACCESS_REMOTE_READ; - p_iser_task->rdma_regd[cmd_dir] = NULL; - p_regd_buf = iser_regd_buf_alloc(); - if (p_regd_buf == NULL) - return -ENOMEM; + + p_regd_buf = &p_iser_task->rdma_regd[cmd_dir]; iser_dbg("p_mem %p p_mem->type %d\n", p_mem,p_mem->type); @@ -95,23 +93,20 @@ static int iser_reg_rdma_mem(struct iscs } page_vec = iser_page_vec_alloc(p_mem,0,cnt_to_reg); - if (page_vec == NULL) { - iser_regd_buff_release(p_regd_buf); + if (page_vec == NULL) return -ENOMEM; - } + page_vec_len = iser_page_vec_build(p_mem, page_vec, 0, cnt_to_reg); err = iser_reg_phys_mem(p_iser_conn, page_vec, 
priv_flags, &p_regd_buf->reg); iser_page_vec_free(page_vec); if (err) { iser_err("Failed to register %d page entries\n", page_vec_len); - iser_regd_buff_release(p_regd_buf); return -EINVAL; } /* take a reference on this regd buf such that it will not be released * * (eg in send dto completion) before we get the scsi response */ iser_regd_buff_ref(p_regd_buf); - p_iser_task->rdma_regd[cmd_dir] = p_regd_buf; return 0; } @@ -121,15 +116,15 @@ static int iser_reg_rdma_mem(struct iscs */ static int iser_prepare_read_cmd(struct iscsi_iser_cmd_task *p_iser_task, struct iser_data_buf *buf_in, - unsigned int edtl, - unsigned char *p_iser_header) + unsigned int edtl) + { struct iser_regd_buf *p_regd_buf; int err; dma_addr_t dma_addr; int dma_nents; struct device *dma_device; - struct iser_hdr *hdr = (struct iser_hdr *)p_iser_header; + struct iser_hdr *hdr = &p_iser_task->desc.iser_header; p_iser_task->dir[ISER_DIR_IN] = 1; dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; @@ -171,7 +166,7 @@ static int iser_prepare_read_cmd(struct iser_err("Failed to set up Data-IN RDMA\n"); return err; } - p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_IN]; + p_regd_buf = &p_iser_task->rdma_regd[ISER_DIR_IN]; hdr->flags |= ISER_RSV; hdr->read_stag = cpu_to_be32(p_regd_buf->reg.rkey); @@ -193,16 +188,15 @@ iser_prepare_write_cmd(struct iscsi_iser struct iser_data_buf *buf_out, unsigned int imm_sz, unsigned int unsol_sz, - struct iser_dto *p_send_dto, - unsigned int edtl, - unsigned char *p_iser_header) + unsigned int edtl) { struct iser_regd_buf *p_regd_buf; int err; dma_addr_t dma_addr; int dma_nents; struct device *dma_device; - struct iser_hdr *hdr = (struct iser_hdr *)p_iser_header; + struct iser_dto *p_send_dto = &p_iser_task->desc.dto; + struct iser_hdr *hdr = &p_iser_task->desc.iser_header; p_iser_task->dir[ISER_DIR_OUT] = 1; dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; @@ -247,7 +241,7 @@ iser_prepare_write_cmd(struct iscsi_iser return 
err; } - p_regd_buf = p_iser_task->rdma_regd[ISER_DIR_OUT]; + p_regd_buf = &p_iser_task->rdma_regd[ISER_DIR_OUT]; if(unsol_sz < edtl) { hdr->flags |= ISER_WSV; @@ -279,14 +273,11 @@ int iser_send_command(struct iscsi_iser_ struct iscsi_iser_cmd_task *p_ctask) { struct iser_dto *p_send_dto = NULL; - unsigned int itt; - unsigned long data_seg_len; unsigned long edtl; - unsigned char *p_iser_header; int err = 0; struct iser_data_buf data_buf; - struct iscsi_cmd *hdr = &p_ctask->hdr; + struct iscsi_cmd *hdr = p_ctask->hdr; struct scsi_cmnd *sc = p_ctask->sc; if (atomic_read(&p_iser_conn->ib_conn->state) != ISER_CONN_UP) { @@ -294,22 +285,16 @@ int iser_send_command(struct iscsi_iser_ return -EPERM; } - itt = ntohl(hdr->itt); - data_seg_len = ntoh24(hdr->dlength); edtl = ntohl(hdr->data_length); /* MERGE_CHANGE - temporal move it up */ iser_task_init_lowpart(p_ctask); - /* Allocate send DTO descriptor, headers buf and add it to the DTO */ - p_send_dto = iser_dto_send_create(p_iser_conn, (struct iscsi_hdr *)hdr, - &p_iser_header); - if (p_send_dto == NULL) { - iser_err("Failed to create send DTO, conn:0x%p\n", p_iser_conn); - err = -ENOMEM; - goto send_command_error; - } + /* build the tx desc regd header and add it to the tx desc dto */ + p_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; + p_send_dto = &p_ctask->desc.dto; p_send_dto->p_task = p_ctask; + iser_dto_send_create(p_iser_conn, &p_ctask->desc); if (sc->use_sg) { /* using a scatter list */ data_buf.p_buf = sc->request_buffer; @@ -322,8 +307,7 @@ int iser_send_command(struct iscsi_iser_ } if (hdr->flags & ISCSI_FLAG_CMD_READ) { - err = iser_prepare_read_cmd(p_ctask, &data_buf, - edtl, p_iser_header); + err = iser_prepare_read_cmd(p_ctask, &data_buf, edtl); if (err) goto send_command_error; } if (hdr->flags & ISCSI_FLAG_CMD_WRITE) { @@ -331,7 +315,7 @@ int iser_send_command(struct iscsi_iser_ p_ctask->imm_count, p_ctask->imm_count + p_ctask->unsol_count, - p_send_dto, edtl, p_iser_header); + edtl); if (err) goto 
send_command_error; } @@ -348,7 +332,7 @@ int iser_send_command(struct iscsi_iser_ iser_task_set_status(p_ctask,ISER_TASK_STATUS_STARTED); iser_task_post_send_count_inc(p_ctask); - err = iser_start_send(p_send_dto); + err = iser_start_send(&p_ctask->desc); if (err) { iser_task_post_send_count_dec_and_test(p_ctask); goto send_command_error; @@ -376,6 +360,7 @@ int iser_send_data_out(struct iscsi_iser struct iscsi_iser_cmd_task *p_ctask, struct iscsi_data *hdr) { + struct iser_desc *tx_desc = NULL; struct iser_dto *p_send_dto = NULL; unsigned long buf_offset; unsigned long data_seg_len; @@ -394,24 +379,28 @@ int iser_send_data_out(struct iscsi_iser iser_dbg("%s itt %d dseg_len %d offset %d\n", __func__,(int)itt,(int)data_seg_len,(int)buf_offset); - /* Allocate send DTO descriptor, headers buf and add it to the DTO */ - p_send_dto = iser_dto_send_create(p_iser_conn, - (struct iscsi_hdr *)hdr, NULL); - if (p_send_dto == NULL) { - iser_err("Failed to create send DTO, conn:0x%p\n", p_iser_conn); + tx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL | __GFP_NOFAIL); + if(tx_desc == NULL) { + iser_err("Failed to alloc desc for post dataout\n"); err = -ENOMEM; goto send_data_out_error; } + tx_desc->type = ISCSI_TX_DATAOUT; + memcpy(&tx_desc->iscsi_header, hdr, sizeof(struct iscsi_hdr)); + + /* build the tx desc regd header and add it to the tx desc dto */ + p_send_dto = &tx_desc->dto; + p_send_dto->p_task = p_ctask; + iser_dto_send_create(p_iser_conn, tx_desc); + /* DMA_MAP: safe to dma_map now - map and flush the cache */ iser_reg_single(p_iser_conn->ib_conn->p_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); - p_send_dto->p_task = p_ctask; - /* all data was registered for RDMA, we can use the lkey */ iser_dto_add_regd_buff(p_send_dto, - p_ctask->rdma_regd[ISER_DIR_OUT], + &p_ctask->rdma_regd[ISER_DIR_OUT], USE_OFFSET(buf_offset), USE_SIZE(data_seg_len)); @@ -428,7 +417,7 @@ int iser_send_data_out(struct iscsi_iser iser_task_post_send_count_inc(p_ctask); - err = 
iser_start_send(p_send_dto); + err = iser_start_send(tx_desc); if (err) { iser_task_post_send_count_dec_and_test(p_ctask); goto send_data_out_error; @@ -439,6 +428,9 @@ int iser_send_data_out(struct iscsi_iser send_data_out_error: if (p_send_dto != NULL) iser_dto_free(p_send_dto); + if (tx_desc != NULL) + kmem_cache_free(ig.desc_cache, tx_desc); + if (p_iser_conn != NULL) { /* drop the conn, open tasks are deleted during shutdown */ iser_err("send dout failed, drop conn:0x%p\n", p_iser_conn); @@ -463,20 +455,19 @@ int iser_send_control(struct iscsi_iser_ return -EPERM; } - /* Allocate send DTO descriptor, headers buf and add it to the DTO */ - p_send_dto = iser_dto_send_create(p_iser_conn, &p_mtask->hdr, NULL); - if (p_send_dto == NULL) { - iser_err("Failed to create send DTO, conn: 0x%p\n",p_iser_conn); - err = -ENOMEM; - goto send_control_error; - } + /* build the tx desc regd header and add it to the tx desc dto */ + p_mtask->desc.type = ISCSI_TX_CONTROL; + p_send_dto = &p_mtask->desc.dto; + p_send_dto->p_task = NULL; + iser_dto_send_create(p_iser_conn, &p_mtask->desc); + p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; /* DMA_MAP: safe to dma_map now - map and flush the cache */ iser_reg_single(p_iser_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); - itt = ntohl(p_mtask->hdr.itt); - opcode = p_mtask->hdr.opcode & ISCSI_OPCODE_MASK; + itt = ntohl(p_mtask->hdr->itt); + opcode = p_mtask->hdr->opcode & ISCSI_OPCODE_MASK; /* no need to copy when there's data b/c the mtask is not reallocated * * till the response related to this ITT is received */ @@ -488,14 +479,10 @@ int iser_send_control(struct iscsi_iser_ case ISCSI_OP_LOGIN: case ISCSI_OP_TEXT: case ISCSI_OP_LOGOUT: - data_seg_len = ntoh24(p_mtask->hdr.dlength); + data_seg_len = ntoh24(p_mtask->hdr->dlength); if (data_seg_len > 0) { - p_regd_buf = iser_regd_buf_alloc(); - if (p_regd_buf == NULL) { - iser_err("Failed to alloc regd buffer\n"); - err = -ENOMEM; - goto send_control_error; - } + p_regd_buf = 
&p_mtask->desc.data_regd_buf; + memset(p_regd_buf, 0, sizeof(struct iser_regd_buf)); p_regd_buf->p_adaptor = p_iser_adaptor; p_regd_buf->virt_addr = p_mtask->data; p_regd_buf->data_size = p_mtask->data_count; @@ -520,7 +507,7 @@ int iser_send_control(struct iscsi_iser_ goto send_control_error; } - err = iser_start_send(p_send_dto); + err = iser_start_send(&p_mtask->desc); if (err) goto send_control_error; return 0; @@ -538,21 +525,30 @@ send_control_error: /** * iser_rcv_dto_completion - recv DTO completion */ -void iser_rcv_dto_completion(struct iser_dto *p_dto, - unsigned long dto_xfer_len) +void iser_rcv_completion(struct iser_desc *p_rx_desc, + unsigned long dto_xfer_len) { struct iscsi_iser_session *p_session; + struct iser_dto *p_dto = &p_rx_desc->dto; struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; struct iscsi_iser_cmd_task *p_iser_task = NULL; struct iscsi_hdr *p_hdr; - char *rx_data; + char *rx_data = NULL; int rc, rx_data_size = 0; unsigned int itt; unsigned char opcode; int no_more_task_sends = 0; - iser_dto_get_rx_pdu_data(p_dto, dto_xfer_len, - &p_hdr, &rx_data, &rx_data_size); + p_hdr = &p_rx_desc->iscsi_header; + + iser_dbg("op 0x%x itt 0x%x\n", p_hdr->opcode,p_hdr->itt); + + if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data */ + rx_data_size = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; + rx_data = p_dto->regd[1]->virt_addr; + rx_data += p_dto->offset[1]; + } + opcode = p_hdr->opcode & ISCSI_OPCODE_MASK; /* FIXME - "task" handles for non cmds */ @@ -607,6 +603,8 @@ void iser_rcv_dto_completion(struct iser } iser_dto_free(p_dto); + kfree(p_rx_desc->data); + kmem_cache_free(ig.desc_cache, p_rx_desc); /* decrementing conn->post_recv_buf_count only --after-- freeing the * * task eliminates the need to worry on tasks which are completed in * @@ -615,22 +613,24 @@ void iser_rcv_dto_completion(struct iser atomic_dec(&p_iser_conn->post_recv_buf_count); } -void iser_snd_dto_completion(struct iser_dto *p_dto) +void iser_snd_completion(struct 
iser_desc *p_tx_desc) { + struct iser_dto *p_dto = &p_tx_desc->dto; struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; - struct iscsi_iser_cmd_task *p_iser_task = NULL; iser_dbg("Initiator, Data sent p_dto=0x%p\n", p_dto); - p_iser_task = p_dto->p_task; - iser_dto_free(p_dto); + + if(p_tx_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, p_tx_desc); + atomic_dec(&p_iser_conn->post_send_buf_count); /* if the last sent PDU of the task, task can be freed */ - if (p_iser_task != NULL && - iser_task_post_send_count_dec_and_test(p_iser_task)) - iser_task_finalize_lowpart(p_iser_task); + if (p_dto->p_task != NULL && + iser_task_post_send_count_dec_and_test(p_dto->p_task)) + iser_task_finalize_lowpart(p_dto->p_task); } static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task) Index: ulp/iser/iser_dto.c =================================================================== --- ulp/iser/iser_dto.c (revision 5180) +++ ulp/iser/iser_dto.c (working copy) @@ -39,15 +39,6 @@ #include "iscsi_iser.h" -void iser_dto_init(struct iser_dto *p_dto) -{ - p_dto->p_task = NULL; - p_dto->p_conn = NULL; - p_dto->type = ISER_DTO_PASSIVE; - p_dto->notify_enable = 0; - p_dto->regd_vector_len = 0; -} - /** * iser_dto_add_regd_buff - Increments the reference count for the registered * buffer & adds it to the DTO object @@ -94,14 +85,6 @@ void iser_dto_buffs_release(struct iser_ void iser_dto_free(struct iser_dto *p_dto) { iser_dto_buffs_release(p_dto); - - if (p_dto->type == ISER_DTO_RCV || p_dto->type == ISER_DTO_SEND) { - iser_dbg("Release %s dto desc.: 0x%p\n", - p_dto->type == ISER_DTO_RCV ? 
"RECV" : "SEND", - p_dto); - kmem_cache_free(ig.dto_cache, p_dto); - } else - iser_bug("Unexpected type:%d, dto:0x%p\n",p_dto->type,p_dto); } /** @@ -109,11 +92,11 @@ void iser_dto_free(struct iser_dto *p_dt * * returns 0 on success, -1 on failure */ -int iser_dto_completion_error(struct iser_dto *p_dto) +int iser_dto_completion_error(struct iser_desc *p_desc) { struct iscsi_iser_conn *p_iser_conn; int err; - enum iser_dto_type dto_type = p_dto->type; + struct iser_dto *p_dto = &p_desc->dto; p_iser_conn = p_dto->p_conn; if (p_iser_conn == NULL) @@ -121,12 +104,16 @@ int iser_dto_completion_error(struct ise iser_dto_free(p_dto); - if (dto_type == ISER_DTO_RCV) + if(p_desc->type == ISCSI_RX) { + kfree(p_desc->data); + kmem_cache_free(ig.desc_cache, p_desc); atomic_dec(&p_iser_conn->post_recv_buf_count); - else if (dto_type == ISER_DTO_SEND) + } + else { /* type is TX control/command/dataout */ + if(p_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, p_desc); atomic_dec(&p_iser_conn->post_send_buf_count); - else - iser_bug("Unknown DTO type:%d\n", p_dto->type); + } err = iser_complete_conn_termination(p_iser_conn); @@ -135,76 +122,30 @@ int iser_dto_completion_error(struct ise /* iser_dto_get_rx_pdu_data - gets received PDU descriptor & data from rx DTO */ -void iser_dto_get_rx_pdu_data(struct iser_dto *p_dto, unsigned long dto_xfer_len, - struct iscsi_hdr **p_hdr, - char **rx_data, int *rx_data_size) -{ - unsigned char *p_recv_buf; - - if (dto_xfer_len < ISER_TOTAL_HEADERS_LEN) - iser_bug("Recvd data size:%ld less than iSER headers\n", - dto_xfer_len); - if (p_dto->regd_vector_len != 2) - iser_bug("Recvd data IOV len:%d != 2\n", - p_dto->regd_vector_len); - /* Get the header memory */ - p_recv_buf = (unsigned char *)p_dto->regd[0]->virt_addr; - p_recv_buf += p_dto->offset[0]; - /* Skip the iSER header to get the iSCSI PDU BHS */ - *p_hdr = (struct iscsi_hdr *)(p_recv_buf + ISER_HDR_LEN); - - if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data 
*/ - *rx_data = p_dto->regd[1]->virt_addr; - *rx_data += p_dto->offset[1]; - *rx_data_size = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; - } -} - /** * Creates a new send DTO descriptor, * adds header regd buffer * */ -struct iser_dto *iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, - struct iscsi_hdr *hdr, - unsigned char **p_header) +void iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, + struct iser_desc *tx_desc) { - struct iser_regd_buf *p_regd_hdr = NULL; - struct iser_dto *p_send_dto = NULL; - unsigned char *p_iser_header = NULL; - - p_send_dto = kmem_cache_alloc(ig.dto_cache,GFP_KERNEL | __GFP_NOFAIL); - if (p_send_dto == NULL) { - iser_err("allocation of send DTO descriptor failed\n"); - goto dto_send_create_exit; - } - /* setup send dto */ - iser_dto_init(p_send_dto); - p_send_dto->p_conn = p_iser_conn; - p_send_dto->type = ISER_DTO_SEND; - p_send_dto->notify_enable = 1; - - p_regd_hdr = iser_regd_mem_alloc(p_iser_conn->ib_conn->p_adaptor, - ig.header_cache, - ISER_TOTAL_HEADERS_LEN); - if (p_regd_hdr == NULL) { - iser_err("failed to allocate regd header\n"); - kmem_cache_free(ig.dto_cache, p_send_dto); - p_send_dto = NULL; - goto dto_send_create_exit; - } + struct iser_regd_buf *p_regd_hdr = &tx_desc->hdr_regd_buf; + struct iser_dto *p_send_dto = &tx_desc->dto; + + memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); + p_regd_hdr->p_adaptor = p_iser_conn->ib_conn->p_adaptor; + p_regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ + p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + p_send_dto->p_conn = p_iser_conn; + p_send_dto->notify_enable = 1; + p_send_dto->regd_vector_len = 0; + + memset(&tx_desc->iser_header, 0, ISER_HDR_LEN); + tx_desc->iser_header.flags = ISER_VER; - /* setup iSER Header */ - p_iser_header = (unsigned char *)p_regd_hdr->virt_addr; - memset(p_iser_header, 0, ISER_HDR_LEN); - - ((struct iser_hdr *)p_iser_header)->flags = ISER_VER; - - memcpy(p_iser_header + ISER_HDR_LEN, hdr, ISER_PDU_BHS_LENGTH); - 
iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, USE_NO_OFFSET, - USE_SIZE(ISER_TOTAL_HEADERS_LEN)); - dto_send_create_exit: - if (p_header != NULL) *p_header = p_iser_header; - return p_send_dto; + iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, + USE_NO_OFFSET, USE_ENTIRE_SIZE); } Index: ulp/iser/iser_memory.c =================================================================== --- ulp/iser/iser_memory.c (revision 5180) +++ ulp/iser/iser_memory.c (working copy) @@ -51,54 +51,6 @@ iser_page_to_virt(struct page *page) } /** - * iser_regd_buf_alloc - allocates a blank registered buffer descriptor - * - * returns the registered buffer descriptor - */ -struct iser_regd_buf *iser_regd_buf_alloc(void) -{ - struct iser_regd_buf *p_regd_buf; - - p_regd_buf = (struct iser_regd_buf *)kmem_cache_alloc( - ig.regd_buf_cache, - GFP_KERNEL | __GFP_NOFAIL); - if (p_regd_buf != NULL) - memset(p_regd_buf, 0, sizeof(struct iser_regd_buf)); - - return p_regd_buf; -} - -/** - * iser_regd_mem_alloc - allocates memory and creates a registered buffer - * - * returns the registered buffer - */ -struct iser_regd_buf *iser_regd_mem_alloc(struct iser_adaptor *p_iser_adaptor, - kmem_cache_t *cache, - int data_size) -{ - struct iser_regd_buf *p_regd_buf; - void *data; - - p_regd_buf = iser_regd_buf_alloc(); - if (p_regd_buf != NULL) { - data = (void *) kmem_cache_alloc(cache, - GFP_KERNEL | __GFP_NOFAIL); - if (data == NULL) { - kmem_cache_free(ig.regd_buf_cache, p_regd_buf); - return NULL; - } - p_regd_buf->data_cache = cache; - p_regd_buf->p_adaptor = p_iser_adaptor; - p_regd_buf->virt_addr = data; - p_regd_buf->data_size = data_size; - /* not here as it is not safe (the data might be touched later */ - /* iser_reg_single(p_iser_adaptor, p_regd_buf, data, data_size, dir); */ - } - return p_regd_buf; -} - -/** * iser_regd_buff_ref - Increments the reference count of a * registered buffer * @@ -160,14 +112,6 @@ int iser_regd_buff_release(struct iser_r p_regd_buf->direction); /* else this regd buf 
is associated with task which we */ /* dma_unmap_single/sg later */ - - if (p_regd_buf->data_cache != NULL) { - iser_dbg("releasing regd_buf data=0x%p (count = 0)\n", - p_regd_buf->virt_addr); - kmem_cache_free(p_regd_buf->data_cache, - p_regd_buf->virt_addr); - } - kmem_cache_free(ig.regd_buf_cache, p_regd_buf); return 0; } else { iser_dbg("Release deferred, regd.buff: 0x%p\n", p_regd_buf); @@ -197,8 +141,6 @@ void iser_reg_single(struct iser_adaptor p_regd_buf->reg.va = dma_addr; p_regd_buf->dma_addr = dma_addr; - /* p_regd_buf->virt_addr = virt_addr; */ - /* p_regd_buf->data_size = data_size; */ p_regd_buf->direction = direction; } Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5180) +++ ulp/iser/iscsi_iser.c (working copy) @@ -78,8 +78,6 @@ static unsigned int iscsi_max_lun = 512; module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); -static kmem_cache_t *task_mem_cache; - /** * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands * @@ -92,16 +90,17 @@ static void iscsi_iser_cmd_init(struct i ctask->sc = sc; ctask->conn = conn; - ctask->hdr.opcode = ISCSI_OP_SCSI_CMD; - ctask->hdr.flags = ISCSI_ATTR_SIMPLE; - ctask->hdr.lun[1] = sc->device->lun; - ctask->hdr.itt = ctask->itt | (conn->id << CID_SHIFT) | + + ctask->hdr->opcode = ISCSI_OP_SCSI_CMD; + ctask->hdr->flags = ISCSI_ATTR_SIMPLE; + ctask->hdr->lun[1] = sc->device->lun; + ctask->hdr->itt = ctask->itt | (conn->id << CID_SHIFT) | (session->age << AGE_SHIFT); - ctask->hdr.data_length = cpu_to_be32(sc->request_bufflen); - ctask->hdr.cmdsn = cpu_to_be32(session->cmdsn); session->cmdsn++; - ctask->hdr.exp_statsn = cpu_to_be32(conn->exp_statsn); - memcpy(ctask->hdr.cdb, sc->cmnd, sc->cmd_len); - memset(&ctask->hdr.cdb[sc->cmd_len], 0, + ctask->hdr->data_length = cpu_to_be32(sc->request_bufflen); + ctask->hdr->cmdsn = cpu_to_be32(session->cmdsn); session->cmdsn++; + ctask->hdr->exp_statsn = 
cpu_to_be32(conn->exp_statsn); + memcpy(ctask->hdr->cdb, sc->cmnd, sc->cmd_len); + memset(&ctask->hdr->cdb[sc->cmd_len], 0, MAX_COMMAND_SIZE - sc->cmd_len); ctask->mtask = NULL; @@ -111,7 +110,7 @@ static void iscsi_iser_cmd_init(struct i ctask->total_length = sc->request_bufflen; if (sc->sc_data_direction == DMA_TO_DEVICE) { - ctask->hdr.flags |= ISCSI_FLAG_CMD_WRITE; + ctask->hdr->flags |= ISCSI_FLAG_CMD_WRITE; BUG_ON(ctask->total_length == 0); /* unsolicited bytes to be sent as imm. data - with cmd pdu */ @@ -127,16 +126,16 @@ static void iscsi_iser_cmd_init(struct i else ctask->imm_count = min(ctask->total_length, conn->max_xmit_dlength); - hton24(ctask->hdr.dlength, ctask->imm_count); + hton24(ctask->hdr->dlength, ctask->imm_count); } else - zero_data(ctask->hdr.dlength); + zero_data(ctask->hdr->dlength); if (!session->initial_r2t_en) ctask->unsol_count = min(session->first_burst, ctask->total_length) - ctask->imm_count; if (!ctask->unsol_count) /* No unsolicit Data-Out's */ - ctask->hdr.flags |= ISCSI_FLAG_CMD_FINAL; + ctask->hdr->flags |= ISCSI_FLAG_CMD_FINAL; /*else ctask->xmstate |= XMSTATE_UNS_HDR | XMSTATE_UNS_INIT;*/ @@ -150,11 +149,12 @@ static void iscsi_iser_cmd_init(struct i ctask->itt, ctask->total_length, ctask->imm_count, ctask->unsol_count, ctask->rdma_data_count); } else { - ctask->hdr.flags |= ISCSI_FLAG_CMD_FINAL; + ctask->hdr->flags |= ISCSI_FLAG_CMD_FINAL; if (sc->sc_data_direction == DMA_FROM_DEVICE) - ctask->hdr.flags |= ISCSI_FLAG_CMD_READ; + ctask->hdr->flags |= ISCSI_FLAG_CMD_READ; ctask->datasn = 0; - zero_data(ctask->hdr.dlength); + zero_data(ctask->hdr->dlength); + ctask->rdma_data_count = ctask->total_length; } } @@ -207,24 +207,17 @@ iscsi_iser_conn_failure(struct iscsi_ise } static void iscsi_iser_unsolicit_data_init(struct iscsi_iser_conn *conn, - struct iscsi_iser_cmd_task *ctask) + struct iscsi_iser_cmd_task *ctask, + struct iscsi_data *hdr) { - struct iscsi_data *hdr; - struct iscsi_iser_data_task *dtask; - - dtask = 
mempool_alloc(ctask->datapool, GFP_ATOMIC); - - BUG_ON(!dtask); - hdr = &dtask->hdr; - memset(hdr, 0, sizeof(struct iscsi_data)); hdr->ttt = cpu_to_be32(ISCSI_RESERVED_TAG); hdr->datasn = cpu_to_be32(ctask->unsol_datasn); ctask->unsol_datasn++; hdr->opcode = ISCSI_OP_SCSI_DATA_OUT; - memcpy(hdr->lun, ctask->hdr.lun, sizeof(hdr->lun)); + memcpy(hdr->lun, ctask->hdr->lun, sizeof(hdr->lun)); - hdr->itt = ctask->hdr.itt; + hdr->itt = ctask->hdr->itt; hdr->exp_statsn = cpu_to_be32(conn->exp_statsn); hdr->offset = cpu_to_be32(ctask->total_length - @@ -240,31 +233,25 @@ static void iscsi_iser_unsolicit_data_in ctask->data_count = ctask->unsol_count; hdr->flags = ISCSI_FLAG_CMD_FINAL; } - - list_add(&dtask->item, &ctask->dataqueue); - - ctask->dtask = dtask; } static int iscsi_iser_ctask_xmit_unsol_data(struct iscsi_iser_conn *conn, struct iscsi_iser_cmd_task *ctask) { - struct iscsi_iser_data_task *dtask = NULL; + struct iscsi_data hdr; int error = 0; debug_iser("%s: enter\n", __FUNCTION__); /* Send data-out PDUs while there's still unsolicited data to send */ while (ctask->unsol_count > 0) { - iscsi_iser_unsolicit_data_init(conn, ctask); - - dtask = ctask->dtask; + iscsi_iser_unsolicit_data_init(conn, ctask, &hdr); debug_scsi("Sending data-out: itt 0x%x, data count %d\n", - dtask->hdr.itt, ctask->data_count); + hdr.itt, ctask->data_count); /* the buffer description has been passed with the command */ /* Send the command */ - error = iser_send_data_out(conn, ctask, &dtask->hdr); + error = iser_send_data_out(conn, ctask, &hdr); if (error) { printk(KERN_ERR "send_data_out failed\n"); goto iscsi_iser_ctask_xmit_unsol_data_exit; @@ -365,7 +352,7 @@ static int iscsi_iser_data_xmit(struct i if (iscsi_iser_mtask_xmit(conn, conn->mtask)) goto iscsi_iser_data_xmit_fail; - if (conn->mtask->hdr.itt == + if (conn->mtask->hdr->itt == cpu_to_be32(ISCSI_RESERVED_TAG)) { spin_lock_bh(&session->lock); __kfifo_put(session->mgmtpool.queue, @@ -396,7 +383,7 @@ static int 
iscsi_iser_data_xmit(struct i if (iscsi_iser_mtask_xmit(conn, conn->mtask)) goto iscsi_iser_data_xmit_fail; - if (conn->mtask->hdr.itt == + if (conn->mtask->hdr->itt == cpu_to_be32(ISCSI_RESERVED_TAG)) { spin_lock_bh(&session->lock); __kfifo_put(session->mgmtpool.queue, @@ -566,7 +553,7 @@ static int iscsi_iser_conn_send_generic( nop->exp_statsn = cpu_to_be32(conn->exp_statsn); - memcpy(&mtask->hdr, hdr, sizeof(struct iscsi_hdr)); + memcpy(mtask->hdr, hdr, sizeof(struct iscsi_hdr)); spin_unlock_bh(&session->lock); @@ -642,14 +629,6 @@ static inline void iscsi_iser_ctask_clea spin_unlock(&session->lock); return; } - if (sc->sc_data_direction == DMA_TO_DEVICE) { - struct iscsi_iser_data_task *dtask, *n; - list_for_each_entry_safe(dtask, n, &ctask->dataqueue, item) { - list_del(&dtask->item); - mempool_free(dtask, ctask->datapool); - } - } - ctask->sc = NULL; __kfifo_put(session->cmdpool.queue, (void*)&ctask, sizeof(void*)); spin_unlock(&session->lock); @@ -720,9 +699,12 @@ static int iscsi_iser_eh_abort(struct sc hdr->opcode = ISCSI_OP_SCSI_TMFUNC | ISCSI_OP_IMMEDIATE; hdr->flags = ISCSI_TM_FUNC_ABORT_TASK; hdr->flags |= ISCSI_FLAG_CMD_FINAL; - memcpy(hdr->lun, ctask->hdr.lun, sizeof(hdr->lun)); - hdr->rtt = ctask->hdr.itt; - hdr->refcmdsn = ctask->hdr.cmdsn; + memcpy(hdr->lun, ctask->hdr->lun, sizeof(hdr->lun)); + hdr->rtt = ctask->hdr->itt; + hdr->refcmdsn = ctask->hdr->cmdsn; + + iser_err("op 0x%x aborting rtt 0x%x itt 0x%x dlength %d]\n", + hdr->opcode, hdr->rtt, hdr->itt, ntoh24(hdr->dlength)); debug_iser("%s: calling iscsi_iser_conn_send_generic (task mgmt)\n", __FUNCTION__); rc = iscsi_iser_conn_send_generic(iscsi_handle(conn), (struct iscsi_hdr *)hdr, @@ -953,39 +935,11 @@ static void iscsi_iser_pool_free(struct kfree(items); } -static int iscsi_iser_dout_pool_alloc(struct iscsi_iser_session *session) -{ - int i; - int cmd_i; - - for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) { - struct iscsi_iser_cmd_task *ctask = session->cmds[cmd_i]; - - 
ctask->datapool = mempool_create(ISCSI_DTASK_DEFAULT_MAX, - mempool_alloc_slab, - mempool_free_slab, - task_mem_cache); - if (ctask->datapool == NULL) { - goto dout_alloc_fail; - } - - INIT_LIST_HEAD(&ctask->dataqueue); - } - - return 0; - -dout_alloc_fail: - for (i = 0; i < cmd_i; i++) { - mempool_destroy(session->cmds[i]->datapool); - } - return -ENOMEM; -} - static iscsi_sessionh_t iscsi_iser_session_create(uint32_t initial_cmdsn, struct Scsi_Host *host) { struct iscsi_iser_session *session = NULL; - int cmd_i; + int cmd_i, mgmt_i, j; session = iscsi_hostdata(host->hostdata); memset(session, 0, sizeof(struct iscsi_iser_session)); @@ -1007,9 +961,13 @@ static iscsi_sessionh_t iscsi_iser_sessi } /* pre-format cmds pool with ITT */ - for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) + for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) { session->cmds[cmd_i]->itt = cmd_i; + session->cmds[cmd_i]->hdr = (struct iscsi_cmd *) + &session->cmds[cmd_i]->desc.iscsi_header; + } + spin_lock_init(&session->lock); INIT_LIST_HEAD(&session->connections); @@ -1022,30 +980,32 @@ static iscsi_sessionh_t iscsi_iser_sessi } /* pre-format immediate cmds pool with ITT */ - for (cmd_i = 0; cmd_i < session->mgmtpool_max; cmd_i++) { - session->mgmt_cmds[cmd_i]->itt = ISCSI_MGMT_ITT_OFFSET + cmd_i; - session->mgmt_cmds[cmd_i]->data = + for (mgmt_i = 0; mgmt_i < session->mgmtpool_max; mgmt_i++) { + session->mgmt_cmds[mgmt_i]->itt = ISCSI_MGMT_ITT_OFFSET + mgmt_i; + + session->mgmt_cmds[mgmt_i]->hdr = + &session->mgmt_cmds[mgmt_i]->desc.iscsi_header; + + /* FIXME need to ensure this is HW cache start/end aligned */ + session->mgmt_cmds[mgmt_i]->desc.data = kmalloc(DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH, GFP_KERNEL); - if (!session->mgmt_cmds[cmd_i]->data) { - int j; - for (j = 0; j < cmd_i; j++) - kfree(session->mgmt_cmds[j]->data); + + if (!session->mgmt_cmds[mgmt_i]->desc.data) { debug_iser("mgmt data allocation failed\n"); goto immdata_alloc_fail; } - } - if 
(iscsi_iser_dout_pool_alloc(session)) - goto dout_alloc_fail; + session->mgmt_cmds[mgmt_i]->data = + session->mgmt_cmds[mgmt_i]->desc.data; + } return iscsi_handle(session); -dout_alloc_fail: - for (cmd_i = 0; cmd_i < session->mgmtpool_max; cmd_i++) - kfree(session->mgmt_cmds[cmd_i]->data); - iscsi_iser_pool_free(&session->mgmtpool, (void**)session->mgmt_cmds); immdata_alloc_fail: + for (j = 0; j < mgmt_i; j++) + kfree(session->mgmt_cmds[j]->desc.data); + iscsi_iser_pool_free(&session->mgmtpool, (void**)session->mgmt_cmds); mgmtpool_alloc_fail: iscsi_iser_pool_free(&session->cmdpool, (void**)session->cmds); cmdpool_alloc_fail: @@ -1054,28 +1014,16 @@ cmdpool_alloc_fail: static void iscsi_iser_session_destroy(iscsi_sessionh_t sessionh) { - int cmd_i; - struct iscsi_iser_data_task *dtask, *n; + int mgmt_i; struct iscsi_iser_session *session = iscsi_ptr(sessionh); debug_iser("%s: enter\n", __FUNCTION__); - for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) { - struct iscsi_iser_cmd_task *ctask = session->cmds[cmd_i]; - list_for_each_entry_safe(dtask, n, &ctask->dataqueue, item) { - list_del(&dtask->item); - mempool_free(dtask, ctask->datapool); - } - } - - for (cmd_i = 0; cmd_i < session->mgmtpool_max; cmd_i++) - kfree(session->mgmt_cmds[cmd_i]->data); - - for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) { - mempool_destroy(session->cmds[cmd_i]->datapool); - } + for (mgmt_i = 0; mgmt_i < session->mgmtpool_max; mgmt_i++) + kfree(session->mgmt_cmds[mgmt_i]->desc.data); iscsi_iser_pool_free(&session->mgmtpool, (void**)session->mgmt_cmds); + iscsi_iser_pool_free(&session->cmdpool, (void**)session->cmds); debug_iser("%s: exit\n", __FUNCTION__); @@ -1583,31 +1531,6 @@ static struct iscsi_transport iscsi_iser .send_pdu = iscsi_iser_conn_send_pdu, }; -static int iscsi_iser_slabs_create(void) -{ - task_mem_cache = kmem_cache_create("iscsi_iser_task", - sizeof(struct iscsi_iser_data_task), - 0, - SLAB_HWCACHE_ALIGN | SLAB_NO_REAP, - NULL, NULL); - if (task_mem_cache == 
NULL) { - printk(KERN_ERR "Failed to create iscsi_iser_task slab\n"); - return 1; - } - return 0; -} - -static void iscsi_iser_slabs_destroy(void) -{ - if (task_mem_cache != NULL) { - if (kmem_cache_destroy(task_mem_cache) != 0) { - printk(KERN_ERR "Failed to destroy task_mem_cache\n"); - return; - } - task_mem_cache = NULL; - } -} - static inline int iscsi_iser_check_assign_cmdsn( struct iscsi_iser_session *session, struct iscsi_nopin *hdr) @@ -1881,13 +1804,9 @@ int iscsi_iser_init(void) } iscsi_iser_transport.max_lun = iscsi_max_lun; - if (iscsi_iser_slabs_create()) - return -ENOMEM; - error = iscsi_register_transport(&iscsi_iser_transport); if (error) { printk(KERN_ERR "iscsi_register_transport failed\n"); - iscsi_iser_slabs_destroy(); return error; } return 0; @@ -1896,6 +1815,5 @@ int iscsi_iser_init(void) void iscsi_iser_exit(void) { iscsi_unregister_transport(&iscsi_iser_transport); - iscsi_iser_slabs_destroy(); } From ogerlitz at voltaire.com Wed Jan 25 06:11:28 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jan 2006 16:11:28 +0200 (IST) Subject: [openib-general] [PATCH] iser: bugfix in session create Message-ID: commited to r5182 bugfix - session state is not LOGGED_IN upon creation but rather later Signed-off-by: Or Gerlitz Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5181) +++ ulp/iser/iscsi_iser.c (working copy) @@ -946,7 +946,6 @@ static iscsi_sessionh_t iscsi_iser_sessi session->host = host; session->id = host->host_no; - session->state = ISCSI_STATE_LOGGED_IN; session->mgmtpool_max = ISCSI_ISER_MGMT_CMDS_MAX; session->cmds_max = ISCSI_ISER_XMIT_CMDS_MAX; session->cmdsn = initial_cmdsn; From halr at voltaire.com Wed Jan 25 08:58:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Jan 2006 11:58:03 -0500 Subject: [openib-general] [PATCH] OpenSM/complib: Fix assertion in cl_ptr_vector.h::cl_ptr_vector_get Message-ID: 
<1138208279.4338.48859.camel@hal.voltaire.com> OpenSM/complib: Fix assertion in cl_ptr_vector.h::cl_ptr_vector_get Signed-off-by: Hal Rosenstock Index: include/complib/cl_ptr_vector.h =================================================================== --- include/complib/cl_ptr_vector.h (revision 5182) +++ include/complib/cl_ptr_vector.h (working copy) @@ -416,7 +416,7 @@ cl_ptr_vector_get( { CL_ASSERT( p_vector ); CL_ASSERT( p_vector->state == CL_INITIALIZED ); - CL_ASSERT( p_vector->size > index ); + CL_ASSERT( p_vector->size >= index ); return( (void*)p_vector->p_ptr_array[index] ); } From caitlinb at broadcom.com Wed Jan 25 09:21:31 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 25 Jan 2006 09:21:31 -0800 Subject: [openib-general] Reregister Memory Region Verb Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C3568@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Galen Shipman wrote: > >> I would like to be able to extend an existing registration such that >> the driver would take advantage of the fact that part of the extended >> registration is already registered, i.e. only the "new" memory would >> be pinned and made resident. > > Assuming you are referring to registration of virtual memory > from user space, this seems as a request for the hw drivers > to support overlapping memory regions in the sense that MR A > can overlap with MR B such that till both A and B are > unregistered the overlapped section is in place (pinned , > resident, mapped in the HCA MMU etc). > Are you asking for the semantics of a reregister for virtual memory? If so, that would imply that it is the equivalent of deregistering and reregistering, except that the provider SHOULD optimize the process and minimize risk of giving up resources that it might not get back. That is different than simply extending an existing memory region, for example having some middleware register the widest range of the user's data space that has been seen. 
That would imply that you could expand an MR even while MWs were bound to it. >> although we would prefer that it wouldn't block if possible > > mmm. All the current memory registration verbs both user and > kernel are blocking, is it an issue for you? > If you need to do memory registrations in a context where blocking is not an option then you really need FMR work requests as in RDMAC and InfiniBand 1.2 verbs. From nacc at us.ibm.com Wed Jan 25 10:00:45 2006 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 25 Jan 2006 10:00:45 -0800 Subject: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) In-Reply-To: <20060125061729.GB24479@mellanox.co.il> References: <20060125002506.GI27746@us.ibm.com> <20060125061729.GB24479@mellanox.co.il> Message-ID: <20060125180045.GC16164@us.ibm.com> On 25.01.2006 [08:17:29 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > On 24.01.2006 [23:19:52 +0200], Michael S. Tsirkin wrote: > > > Quoting r. Nishanth Aravamudan : > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > > > On 24.01.2006 [21:39:23 +0200], Michael S. Tsirkin wrote: > > > > > Quoting r. Roland Dreier : > > > > > > Subject: Re: [openib-general] Re: Re: Userspace testing results (manykernels, many svn trees) > > > > > > > > > > > > Michael> 1 sec = 5.37731e+14 usec > > > > > > > > > > > > Michael> which seems to indicate something's still wrong. > > > > > > > > > > > > BTW this number is pretty close to 2^32 times bigger than 1e6, so the > > > > > > problem is probably still using long long to return the result of > > > > > > mftb (which will result in shifting the result by 32 bits, ie > > > > > > multiplying by 2^32). > > > > > > > > > > Hmm. > > > > > Maybe make clean wasnt run after updating? > > > > > Could it be un on rev 5174? 
> > > > > > > > Heh, here's what happens with 5174: > > > > > > > > Correlation coefficient r^2: 0.773428 < 0.9 > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > 1 sec = inf usec > > > > > > > > And so forth... > > > > > > > > Thanks, > > > > Nish > > > > > > Hmm. Looks like mftb is returning wrong data. > > > Could you uncomment lines setting DEBUG and DEBUG_DATA at the top? > > > This will print all mftb values out. > > > > Here you go: > > > > x=1990 y=397692 > > x=2000 y=399776 > > x=2010 y=401853 > > x=2020 y=403711 > > x=2030 y=405478 > > x=2040 y=407577 > > x=2050 y=409618 > > x=2060 y=411603 > > x=2070 y=413642 > > x=2080 y=415601 > > x=2090 y=417823 > > a = -8.02523 > > b = 199.818 > > a / b = -0.0401626 > > r^2 = 0.999999 > > Warning: measured timestamp frequency 199.818 differs from nominal 1600 MHz > > 1 sec = 1.00195e+06 usec > > 1 sec = 1.00198e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > 1 sec = 1.00207e+06 usec > > Seems to work fine now ... what changed? > Time to try rdma_lat/rdma_bw I guess. I think rdma_lat and rdma_bw are fixed now, magically. 
The first job of the day hasn't finished, but I checked the unformatted logs and it seems to give the following: rdma_lat: Warning: measured timestamp frequency 199.838 differs from nominal 1600 MHz loading libehca local address: LID 0x0d QPN 0x140406 PSN 0xee1d06 RKey 0x2340032 VAddr 0x0000001001a001 remote address: LID 0x08 QPN 0x140406 PSN 0x790ae8 RKey 0x2340032 VAddr 0x0000001001a001 Latency typical: 6.10244 usec Latency best : 6.00736 usec Latency worst : 71.9282 usec rdma_bw: Warning: measured timestamp frequency 199.82 differs from nominal 1600 MHz loading libehca local address: LID 0x0d, QPN 0x150406, PSN 0x7cca90 RKey 0x23a0032 VAddr 0x000000f7fce000 remote address: LID 0x08, QPN 0x150406, PSN 0x35668f, RKey 0x23a0032 VAddr 0x000000f7fb8000 Bandwidth peak (#0 to #963): 233.043 MB/sec Bandwidth average: 233.041 MB/sec Service Demand peak (#0 to #963): 837 cycles/KB Service Demand Avg : 50 cycles/KB Thanks for the debugging, Nish From ftillier at silverstorm.com Wed Jan 25 10:02:21 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 25 Jan 2006 10:02:21 -0800 Subject: [openib-general] Reregister Memory Region Verb In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C3568@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000201c621d9$86a24880$6701a8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Wednesday, January 25, 2006 9:22 AM > > >> although we would prefer that it wouldn't block if possible > > > > mmm. All the current memory registration verbs both user and > > kernel are blocking, is it an issue for you? > > > > If you need to do memory registrations in a context where > blocking is not an option then you really need FMR work > requests as in RDMAC and InfiniBand 1.2 verbs. No. The blocking semantics of memory registration APIs is a deliberate design choice and not a limitation of the hardware. It is possible (though more complicated) to make the API asynchronous. 
No existing IB stack to date has ever done so, however. - Fab From caitlinb at broadcom.com Wed Jan 25 10:05:34 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 25 Jan 2006 10:05:34 -0800 Subject: [openib-general] Reregister Memory Region Verb Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C357E@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >> Sent: Wednesday, January 25, 2006 9:22 AM >> >>>> although we would prefer that it wouldn't block if possible >>> >>> mmm. All the current memory registration verbs both user and kernel >>> are blocking, is it an issue for you? >>> >> >> If you need to do memory registrations in a context where blocking is >> not an option then you really need FMR work requests as in RDMAC and >> InfiniBand 1.2 verbs. > > No. The blocking semantics of memory registration APIs is a > deliberate design choice and not a limitation of the > hardware. It is possible (though more > complicated) to make the API asynchronous. No existing IB > stack to date has ever done so, however. > > - Fab If asynchronous memory registration (via work request) is such a bad idea then why is it part of both the RDMAC iWARP and InfiniBand 1.2 verb specifications? From ftillier at silverstorm.com Wed Jan 25 10:12:18 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 25 Jan 2006 10:12:18 -0800 Subject: [openib-general] Reregister Memory Region Verb In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C357E@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000301c621da$e550c220$6701a8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Wednesday, January 25, 2006 10:06 AM > > Fab Tillier wrote: > >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > >> Sent: Wednesday, January 25, 2006 9:22 AM > >> > >>>> although we would prefer that it wouldn't block if possible > >>> > >>> mmm. 
All the current memory registration verbs both user and kernel > >>> are blocking, is it an issue for you? > >>> > >> > >> If you need to do memory registrations in a context where blocking is > >> not an option then you really need FMR work requests as in RDMAC and > >> InfiniBand 1.2 verbs. > > > > No. The blocking semantics of memory registration APIs is a > > deliberate design choice and not a limitation of the > > hardware. It is possible (though more > > complicated) to make the API asynchronous. No existing IB > > stack to date has ever done so, however. > > > > - Fab > > If asynchronous memory registration (via work request) is > such a bad idea then why is it part of both the RDMAC iWARP > and InfiniBand 1.2 verb specifications? You misunderstood me. I didn't say anything about FMR being a bad idea, just that regular MRs could be made to work in a non-blocking manner. Non-blocking calls don't require FMR, it could be done without. - Fab From caitlinb at broadcom.com Wed Jan 25 10:48:35 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 25 Jan 2006 10:48:35 -0800 Subject: [openib-general] Reregister Memory Region Verb Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C35A1@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >> Sent: Wednesday, January 25, 2006 10:06 AM >> >> Fab Tillier wrote: >>>> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >>>> Sent: Wednesday, January 25, 2006 9:22 AM >>>> >>>>>> although we would prefer that it wouldn't block if possible >>>>> >>>>> mmm. All the current memory registration verbs both user and >>>>> kernel are blocking, is it an issue for you? >>>>> >>>> >>>> If you need to do memory registrations in a context where blocking >>>> is not an option then you really need FMR work requests as in RDMAC >>>> and InfiniBand 1.2 verbs. >>> >>> No. 
The blocking semantics of memory registration APIs is a >>> deliberate design choice and not a limitation of the hardware. It >>> is possible (though more complicated) to make the API asynchronous. >>> No existing IB stack to date has ever done so, however. >>> >>> - Fab >> >> If asynchronous memory registration (via work request) is such a bad >> idea then why is it part of both the RDMAC iWARP and InfiniBand 1.2 >> verb specifications? > > You misunderstood me. I didn't say anything about FMR being > a bad idea, just that regular MRs could be made to work in a > non-blocking manner. Non-blocking calls don't require FMR, it could > be done without. > > - Fab Yes, it is possible to specify a set of conditions where a memory registration verb would not have to block. But is it worthwhile to specify that under those conditions that it MUST NOT block? For verification purposes it is much simpler if a given verb is either guaranteed to never block, or is considered subject to blocking. It is much easier to check whether a routine that is supposed to be non-blocking NEVER makes a call to a routine that could block than it is to check that it never makes a call to a routine with the set of conditions that might cause it to block. So if applications have need to do registration where they are *guaranteed* that they will not block then I believe an asynch API (i.e., work requests) is a much better solution than adding lots of asterisks explaining when a call that normally "can block" will in fact be guaranteed not to block. The list of non-blocking scenarios is real easy to generate as a SHOULD NOT list, but it gets very tricky if you convert it to a MUST NOT list. 
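
[Illustration of the point argued above.] The difference between a call that "can block" and registration posted as a work request can be sketched in a few lines of C. This is a toy model under stated assumptions, not the OpenIB verbs API and not the RDMAC or InfiniBand 1.2 FMR interface: the names reg_wr, work_queue, post_reg_wr, provider_process_one and reg_wr_done are all hypothetical. It only shows the shape of the guarantee: the posting call either queues the request or fails at once, and the slow part (pinning, mapping) is done later by the provider.

```c
/* Toy model only: every type and function here is hypothetical and is
 * not part of any real verbs API. */
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 8	/* power of two, so unsigned wraparound is harmless */

enum wr_status { WR_PENDING, WR_DONE };

/* A "register memory" request expressed as a work request. */
struct reg_wr {
	void *addr;
	size_t len;
	unsigned int key;	/* filled in by the provider on completion */
	enum wr_status status;
};

struct work_queue {
	struct reg_wr *slot[QUEUE_DEPTH];
	unsigned int head;	/* next request the provider will process */
	unsigned int tail;	/* next free slot */
	unsigned int next_key;
};

/* Never blocks: queues the request, or fails immediately when full. */
static int post_reg_wr(struct work_queue *q, struct reg_wr *wr)
{
	assert(q != NULL && wr != NULL);
	if (q->tail - q->head == QUEUE_DEPTH)
		return -1;	/* caller retries later instead of sleeping */
	wr->status = WR_PENDING;
	q->slot[q->tail % QUEUE_DEPTH] = wr;
	q->tail++;
	return 0;
}

/* Stands in for the provider doing the slow work outside the caller's
 * context. Processes at most one request; returns 1 if it did, else 0. */
static int provider_process_one(struct work_queue *q)
{
	struct reg_wr *wr;

	if (q->head == q->tail)
		return 0;
	wr = q->slot[q->head % QUEUE_DEPTH];
	q->head++;
	wr->key = ++q->next_key;
	wr->status = WR_DONE;
	return 1;
}

/* Non-blocking completion check, in the spirit of polling a CQ. */
static int reg_wr_done(const struct reg_wr *wr)
{
	return wr->status == WR_DONE;
}
```

In this model a caller that must not sleep only ever calls post_reg_wr and reg_wr_done, so checking that it never blocks reduces to checking those two routines, which is the verification argument made above.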
From ftillier at silverstorm.com Wed Jan 25 11:09:27 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 25 Jan 2006 11:09:27 -0800 Subject: [openib-general] Reregister Memory Region Verb In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C35A1@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000401c621e2$e128a2a0$6701a8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Wednesday, January 25, 2006 10:49 AM > > Fab Tillier wrote: > >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > >> Sent: Wednesday, January 25, 2006 10:06 AM > >> > >> Fab Tillier wrote: > >>>> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > >>>> Sent: Wednesday, January 25, 2006 9:22 AM > >>>> > >>>>>> although we would prefer that it wouldn't block if possible > >>>>> > >>>>> mmm. All the current memory registration verbs both user and > >>>>> kernel are blocking, is it an issue for you? > >>>>> > >>>> > >>>> If you need to do memory registrations in a context where blocking > >>>> is not an option then you really need FMR work requests as in RDMAC > >>>> and InfiniBand 1.2 verbs. > >>> > >>> No. The blocking semantics of memory registration APIs is a > >>> deliberate design choice and not a limitation of the hardware. It > >>> is possible (though more complicated) to make the API asynchronous. > >>> No existing IB stack to date has ever done so, however. > >>> > >>> - Fab > >> > >> If asynchronous memory registration (via work request) is such a bad > >> idea then why is it part of both the RDMAC iWARP and InfiniBand 1.2 > >> verb specifications? > > > > You misunderstood me. I didn't say anything about FMR being > > a bad idea, just that regular MRs could be made to work in a > > non-blocking manner. Non-blocking calls don't require FMR, it could > > be done without. > > Yes, it is possible to specify a set of conditions where > a memory registration verb would not have to block. 
But > is it worthwhile to specify that under those conditions > that it MUST NOT block? > > For verification purposes it is much simpler if a given > verb is either guaranteed to never block, or is considered > subject to blocking. It is much easier to check whether a > routine that is supposed to be non-blocking NEVER makes a > call to a routine that could block than it is to check that > it never makes a call to a routine with the set of conditions > that might cause it to block. > > So if applications have need to do registration where they > are *guaranteed* that they will not block then I believe > an asynch API (i.e., work requests) is a much better > solution than adding lots of asterisks explaining > when a call that normally "can block" will in fact > be guaranteed not to block. The list of non-blocking > scenarios is real easy to generate as a SHOULD NOT > list, but it gets very tricky if you convert it to > a MUST NOT list. I wholeheartedly agree that having an API that *may* block is much worse than just treating it as always blocking from a maintenance perspective. What I was referring to was that all the verb APIs could be made asynchronous, putting the burden on the API provider to handle any blocking issues and not the end user. You don't need work requests to have asynchronous APIs. However, this is a pretty significant change that I don't see happening for Linux (however I've had a long term dream of making async verbs a reality in Windows for kernel clients). - Fab From mamidala at cse.ohio-state.edu Wed Jan 25 11:35:06 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 25 Jan 2006 14:35:06 -0500 (EST) Subject: [openib-general] Creation of multicast groups Message-ID: Hi, I am trying to run a program which creates multicast groups. I am using the libraries -losmcomp -losmvendor -lopensm for this purpose. I was facing a problem while running the program. However, the program runs if I execute it as a root. 
I am using the revision 4918 of the osm related libraries, and I am getting the following error messages:

-I- Creating Multicast Group
-I- MGID 0xff12a01cfe800000:0000000000000000
-I- Port Num:1
Jan 25 14:25:54 283113 [AB00EB00] -> osm_vendor_bind: Binding to port 0x6270510000005.
Jan 25 14:25:54 285850 [AB00EB00] -> osm_vendor_open_port: ERR 542C: umad_open_port() failed
Jan 25 14:25:54 285873 [AB00EB00] -> osm_vendor_bind: ERR 5424: Unable to Open Port 0x6270510000005.
Jan 25 14:25:54 285890 [AB00EB00] -> osmv_bind_sa: ERR 5506: Failed to bind to vendor GSI
Jan 25 14:25:54 285901 [AB00EB00] -> ibmcgrp_bind: ERR 00137: Unable to bind to SA

Thanks,
Amith

From sean.hefty at intel.com Wed Jan 25 11:37:25 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Jan 2006 11:37:25 -0800
Subject: [openib-general] [PATCH 0/4] SA path record caching
Message-ID:

The following patch series adds caching of path records with the local system. I divided the changes up into 4 patches to make the review easier. The patches are arranged as follows:

1. Add a new API to ib_sa.h to pack/unpack SA attributes.
2. Create a fast indexing service.
3. Create a local SA database.
4. Modify the CMA to use the local SA database.

There are some additional optimizations that can be added to these changes, but I would prefer to add them incrementally to these changes. This includes additional optimizations to the fast indexing service to help reduce its memory footprint, registering the local SA database to receive SA events, and failing over from the local SA database to using SA queries.

Signed-off-by: Sean Hefty

From sean.hefty at intel.com Wed Jan 25 11:42:01 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 25 Jan 2006 11:42:01 -0800
Subject: [openib-general] [PATCH 1/4] SA path record caching
In-Reply-To:
Message-ID:

Expose functions to pack/unpack SA attributes. This capability is also required by the local SA database.

Signed-off-by: Sean Hefty

---

Index: core/sa_query.c
===================================================================
--- core/sa_query.c (revision 5098)
+++ core/sa_query.c (working copy)
@@ -440,6 +440,32 @@ void ib_sa_cancel_query(int id, struct i
 }
 EXPORT_SYMBOL(ib_sa_cancel_query);
 
+int ib_sa_pack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(ib_sa_pack_attr);
+
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(ib_sa_unpack_attr);
+
 static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 {
 	unsigned long flags;
Index: include/rdma/ib_sa.h
===================================================================
--- include/rdma/ib_sa.h (revision 5098)
+++ include/rdma/ib_sa.h (working copy)
@@ -398,5 +398,22 @@ ib_sa_mcmember_rec_delete(struct ib_devi
 					context, query);
 }
 
+/**
+ * ib_sa_pack_attr - Copy an SA attribute from a host defined structure to
+ *   a network packed structure.
+ * dst: Destination buffer.
+ * src: Source buffer.
+ * attr_id: Identifier of SA attribute: IB_SA_ATTR_*.
+ */
+int ib_sa_pack_attr(void *dst, void *src, int attr_id);
+
+/**
+ * ib_sa_unpack_attr - Copy an SA attribute from a packed network structure
+ *   to a host defined structure.
+ * dst: Destination buffer.
+ * src: Source buffer.
+ * attr_id: Identifier of SA attribute: IB_SA_ATTR_*.
+ */ +int ib_sa_unpack_attr(void *dst, void *src, int attr_id); #endif /* IB_SA_H */ From sean.hefty at intel.com Wed Jan 25 11:44:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Jan 2006 11:44:32 -0800 Subject: [openib-general] [PATCH 2/4] SA path record caching In-Reply-To: Message-ID: Add a fast indexing service that permits quick insertion, removal, and retrieval of data items using a key. Signed-off-by: Sean Hefty --- Index: core/index.c =================================================================== --- core/index.c (revision 0) +++ core/index.c (revision 0) @@ -0,0 +1,230 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("Indexing service"); +MODULE_LICENSE("Dual BSD/GPL"); + +void index_init(struct index_root *root, unsigned int key_length, + gfp_t gfp_mask) +{ + memset(root, 0, sizeof *root); + root->key_length = key_length; + root->gfp_mask = gfp_mask; + root->node.ref_cnt = 1; /* do not delete root node */ +} +EXPORT_SYMBOL(index_init); + +void index_destroy(struct index_root *root) +{ + index_remove_all(root, NULL, NULL); +} +EXPORT_SYMBOL(index_destroy); + +void *index_insert(struct index_root *root, void *data, u8 *key) +{ + struct index_node *node, *new_node; + struct index_leaf *leaf; + int i, k, j; + + for (node = &root->node, k = 0; 1; node = node->child[i], k++) { + i = key[k]; + if (!node->child[i]) { + leaf = kzalloc(sizeof *leaf + root->key_length, + root->gfp_mask); + if (!leaf) + return ERR_PTR(-ENOMEM); + + leaf->data = data; + memcpy(leaf->key, key, root->key_length); + node->child[i] = leaf; + node->child_type[i] = INDEX_LEAF; + node->ref_cnt++; + return NULL; + } else if (node->child_type[i] == INDEX_LEAF) { + leaf = node->child[i]; + if (!memcmp(leaf->key + k, key + k, + root->key_length - k)) + return leaf->data; + + new_node = kzalloc(sizeof *new_node, root->gfp_mask); + if (!new_node) + return ERR_PTR(-ENOMEM); + + node->child[i] = new_node; + node->child_type[i] = INDEX_NODE; + new_node->parent = node; + new_node->ref_cnt++; + j = leaf->key[k + 1]; + new_node->child[j] = leaf; + new_node->child_type[j] = INDEX_LEAF; + } + } + return ERR_PTR(-EINVAL); +} +EXPORT_SYMBOL(index_insert); + +void *index_find(struct index_root *root, u8 *key) +{ + struct index_node *node; + struct index_leaf *leaf; + int 
i, k; + + for (node = &root->node, k = 0; node; node = node->child[i], k++) { + i = key[k]; + if (node->child_type[i] == INDEX_LEAF) { + leaf = node->child[i]; + if ((root->key_length > k) && + !memcmp(leaf->key + k, key + k, + root->key_length - k)) + return leaf->data; + else + return NULL; + } + } + return NULL; +} +EXPORT_SYMBOL(index_find); + +void *index_find_replace(struct index_root *root, void *data, u8 *key) +{ + struct index_node *node; + struct index_leaf *leaf; + void *old_data; + int i, k; + + for (node = &root->node, k = 0; node; node = node->child[i], k++) { + i = key[k]; + if (node->child_type[i] == INDEX_LEAF) { + leaf = node->child[i]; + if ((root->key_length > k) && + !memcmp(leaf->key + k, key + k, + root->key_length - k)) { + old_data = leaf->data; + leaf->data = data; + return old_data; + } else + return NULL; + } + } + return NULL; +} +EXPORT_SYMBOL(index_find_replace); + +void *index_remove(struct index_root *root, u8 *key) +{ + struct index_node *node, *temp_node; + struct index_leaf *leaf; + void *data = NULL; + int i, k; + + for (node = &root->node, k = 0; node; node = node->child[i], k++) { + i = key[k]; + if (node->child_type[i] == INDEX_LEAF) { + leaf = node->child[i]; + if (!memcmp(leaf->key + k, key + k, + root->key_length - k)) { + data = leaf->data; + kfree(leaf); + + while (1) { + node->child[i] = NULL; + node->child_type[i] = INDEX_EMPTY; + if (--node->ref_cnt) + break; + temp_node = node; + node = node->parent; + kfree(temp_node); + i = key[--k]; + } + } + return data; + } + } + return NULL; +} +EXPORT_SYMBOL(index_remove); + +void index_remove_all(struct index_root *root, + void (*callback)(void *context, void *data), + void *context) +{ + struct index_node *node = &root->node; + struct index_leaf *leaf; + int i = 0; + +clean_node: + for (; i < 256; i++) { + switch (node->child_type[i]) { + case INDEX_LEAF: + leaf = node->child[i]; + if (callback) + callback(context, leaf->data); + kfree(leaf); + break; + case INDEX_NODE: + 
node->ref_cnt = i; /* save location */ + node = node->child[i]; + i = 0; + goto clean_node; /* remove child node */ + default: + break; + } + } + + if (node != &root->node) { + /* finish cleaning parent node */ + node = node->parent; + i = node->ref_cnt; + kfree(node->child[i++]); + goto clean_node; + } + index_init(root, root->key_length, root->gfp_mask); +} +EXPORT_SYMBOL(index_remove_all); + +static int __init index_start(void) +{ + return 0; +} + +static void __exit index_exit(void) +{ +} + +module_init(index_start); +module_exit(index_exit); Index: include/rdma/index.h =================================================================== --- include/rdma/index.h (revision 0) +++ include/rdma/index.h (revision 0) @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef INDEX_H +#define INDEX_H + +#include + +enum { + INDEX_EMPTY, + INDEX_NODE, + INDEX_LEAF +}; + +struct index_leaf { + void *data; + u8 key[0]; +}; + +struct index_node { + void *parent; + void *child[256]; + char child_type[256]; + unsigned int ref_cnt; +}; + +struct index_root { + struct index_node node; + unsigned int key_length; + gfp_t gfp_mask; +}; + +/** + * index_init - Initialize the index before use. + * @root: The index root. + * @key_length: The size of the index key. + * @gfp_mask: GFP mask to use when allocating resources inserting items into the + * index. + */ +void index_init(struct index_root *root, unsigned int key_length, + gfp_t gfp_mask); + +/** + * index_destroy - Destroy the index, cleaning up any internal resources. + */ +void index_destroy(struct index_root *root); + +/** + * index_insert - Insert a data item in the index. + * @root: The index root. + * @data: Data item to insert into the index. + * @key: Index key value to associate with the specified data item. + * + * Returns NULL if the item was successfully inserted. If an item already + * exists in the index with the same key, returns that item. Otherwise, an + * error will be returned. + */ +void *index_insert(struct index_root *root, void *data, u8 *key); + +/** + * index_find - Return a data item in the index associated with the given key. + * @root: The index root. + * @key: The index key associated with the data item to retrieve. + * + * If the key is not found in the index, returns NULL. + */ +void *index_find(struct index_root *root, u8 *key); + +/** + * index_find_replace - Replace a data item in the index associated with the + * given key with the new item, and return the old item. 
+ * @root: The index root. + * @data: Data item to replace the existing item. + * @key: Index key value associated with the data item. + * + * If an existing item is not found in the index, the replacement fails, and + * the function returns NULL. + */ +void *index_find_replace(struct index_root *root, void *data, u8 *key); + +/** + * index_remove - Remove a data item from the index. + * @root: The index root. + * @key: The index key to remove from the index. + * + * Returns the data item removed from the index, or NULL if no item was found. + */ +void *index_remove(struct index_root *root, u8 *key); + +/** + * index_remove_all - Remove all index values, invoking a user-specified routine + * for any data items that remain in the index. + * @root: The index root. + * @callback: A routine invoked for all objects remaining in the index. This + * parameter may be NULL. + * @context: User specified context passed to the user's %callback. + * + * This routine removes all indexed values, calling the specified free routine + * for all objects stored in the index. + */ +void index_remove_all(struct index_root *root, + void (*callback)(void *context, void *data), + void *context); + +#endif /* INDEX_H */ From sean.hefty at intel.com Wed Jan 25 11:47:06 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Jan 2006 11:47:06 -0800 Subject: [openib-general] [PATCH 3/4] SA path record caching In-Reply-To: Message-ID: Add a local SA database for path records to eliminate queries to the SA for connection establishment. Signed-off-by: Sean Hefty --- Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 0) +++ core/local_sa.c (revision 0) @@ -0,0 +1,453 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +/* XXX : fixme when 2.6.16 released */ +#include +#include +#include + +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand subnet administration caching"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int retry_timer = 5000; /* 5 sec */ +module_param(retry_timer, int, 0444); +MODULE_PARM_DESC(retry_timer, "Time in ms between retried requests."); + +static int retries = 3; +module_param(retries, int, 0444); +MODULE_PARM_DESC(retries, "Number of times to retry a request."); + +static unsigned long cache_timeout = 15 * 60 * 1000; /* 15 min */ +module_param(cache_timeout, ulong, 0444); +MODULE_PARM_DESC(cache_timeout, "Time in ms between cache updates."); + +static unsigned long hold_time = 30 * 1000; /* 30 sec */ +module_param(hold_time, ulong, 0444); +MODULE_PARM_DESC(hold_time, "Minimal time in ms between cache updates."); + +static unsigned long update_delay = 3000; /* 3 sec */ +module_param(update_delay, ulong, 0444); +MODULE_PARM_DESC(update_delay, "Delay in ms between an event and an update."); + +static void sa_db_add_one(struct ib_device *device); +static void sa_db_remove_one(struct ib_device *device); + +static struct ib_client sa_db_client = { + .name = "local_sa", + .add = sa_db_add_one, + .remove = sa_db_remove_one +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(lock); +static unsigned long hold_time, update_delay; + +struct sa_db_port { + struct sa_db_device *dev; + struct ib_mad_agent *agent; + struct index_root index; + unsigned long update_time; + struct work_struct work; + union ib_gid gid; + int port_num; + u16 pkey; +}; + +struct sa_db_device { + struct list_head list; + struct ib_device *device; + struct ib_event_handler event_handler; + struct sa_db_port port[0]; +}; + +/* Define path record format to enable needed checks against MAD data.
*/ +struct ib_path_rec { + u8 reserved[8]; + u8 dgid[16]; + u8 sgid[16]; + __be16 dlid; + __be16 slid; + u8 reserved2[20]; +}; + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + ib_destroy_ah(mad_send_wc->send_buf->ah); + ib_free_send_mad(mad_send_wc->send_buf); +} + +/* + * Copy a path record from a received MAD and insert it into our index. + * The path record in the MAD is in network order, so must be swapped. It + * can also span multiple MADs, just to make our life hard. + */ +static void update_path_rec(struct sa_db_port *port, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *recv_buf; + struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; + struct ib_sa_path_rec *sa_path, *old_path; + struct ib_path_rec ib_path, *path = NULL; + int i, attr_size, left, offset = 0; + + attr_size = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + if (attr_size < sizeof ib_path) + return; + + list_for_each_entry(recv_buf, &mad_recv_wc->rmpp_list, list) { + for (i = 0; i < IB_MGMT_SA_DATA;) { + mad = (struct ib_sa_mad *) recv_buf->mad; + + left = IB_MGMT_SA_DATA - i; + if (left < sizeof ib_path) { + /* copy first piece of the attribute */ + memcpy(&ib_path, &mad->data[i], left); + path = &ib_path; + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy((void*) path + offset, &mad->data[i], + sizeof ib_path - offset); + i += attr_size - offset; + offset = 0; + } else { + path = (void *) &mad->data[i]; + i += attr_size; + } + + if (!path->slid) + return; + + sa_path = kmalloc(sizeof *sa_path, GFP_KERNEL); + if (!sa_path) + return; + + ib_sa_unpack_attr(sa_path, path, IB_SA_ATTR_PATH_REC); + + mutex_lock(&lock); + old_path = index_find_replace(&port->index, sa_path, + sa_path->dgid.raw); + if (old_path) + kfree(old_path); + else if (index_insert(&port->index, sa_path, + sa_path->dgid.raw)) { + mutex_unlock(&lock); + kfree(sa_path); + return; + } + mutex_unlock(&lock); 
+ } + } +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; + + if (mad->mad_hdr.status) + goto done; + + switch (be16_to_cpu(mad->mad_hdr.attr_id)) { + case IB_SA_ATTR_PATH_REC: + update_path_rec(mad_agent->context, mad_recv_wc); + break; + default: + break; + } +done: + ib_free_recv_mad(mad_recv_wc); +} + +static struct ib_mad_send_buf* get_sa_msg(struct sa_db_port *port) +{ + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + struct ib_mad_send_buf *msg; + int ret; + + ret = ib_query_port(port->dev->device, port->port_num, &port_attr); + if (ret || port_attr.state != IB_PORT_ACTIVE) + return NULL; + + msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, GFP_KERNEL); + if (IS_ERR(msg)) + return NULL; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(msg->ah)) { + ib_free_send_mad(msg); + return NULL; + } + + msg->timeout_ms = retry_timer; + msg->retries = retries; + msg->context[0] = port; + return msg; +} + +static __be64 form_tid(struct ib_mad_send_buf *msg) +{ + u64 hi_tid, low_tid; + + hi_tid = ((u64) msg->mad_agent->hi_tid) << 32; + low_tid = (u32)(unsigned long)(msg); + return cpu_to_be64(hi_tid | low_tid); +} + +static void format_path_req(struct sa_db_port *port, + struct ib_mad_send_buf *msg) +{ + struct ib_sa_mad *mad = msg->mad; + struct ib_sa_path_rec path_rec; + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = IB_SA_METHOD_GET_TABLE; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->mad_hdr.tid = form_tid(msg); + + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | +
IB_SA_PATH_REC_NUMB_PATH; + + path_rec.sgid = port->gid; + path_rec.pkey = port->pkey; + path_rec.numb_path = 1; + ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); +} + +static void update_cache(void *data) +{ + struct sa_db_port *port = data; + struct ib_mad_send_buf *msg; + + msg = get_sa_msg(port); + if (!msg) + return; + + format_path_req(port, msg); + + if (ib_post_send_mad(msg, NULL)) { + ib_destroy_ah(msg->ah); + ib_free_send_mad(msg); + return; + } + + /* + * We record the time that we requested the update, rather than use the + * time that the update occurred. This allows us to generate a new + * update if an event occurs while we're still processing this one. + */ + port->update_time = jiffies; + queue_delayed_work(rdma_wq, &port->work, cache_timeout); +} + +static void schedule_update(struct sa_db_port *port) +{ + unsigned long time, delay; + + time = jiffies; + if (time_after(time, port->update_time + hold_time)) + delay = update_delay; + else + delay = port->update_time + hold_time - time; + + cancel_delayed_work(&port->work); + queue_delayed_work(rdma_wq, &port->work, delay); +} + +static void handle_event(struct ib_event_handler *event_handler, + struct ib_event *event) +{ + struct sa_db_device *dev; + dev = container_of(event_handler, typeof(*dev), event_handler); + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) + schedule_update(&dev->port[event->element.port_num - 1]); +} + +int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, + union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + struct ib_sa_path_rec *path_rec; + int ret = 0; + + mutex_lock(&lock); + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) { + ret = -ENODEV; + goto unlock; + } + port = &dev->port[port_num - 1]; + + if 
(memcmp(&port->gid, sgid, sizeof *sgid) || port->pkey != pkey) { + ret = -ENODATA; + goto unlock; + } + + path_rec = index_find(&port->index, dgid->raw); + if (!path_rec) { + ret = -ENODATA; + goto unlock; + } + + memcpy(rec, path_rec, sizeof *path_rec); +unlock: + mutex_unlock(&lock); + return ret; +} +EXPORT_SYMBOL(ib_get_path_rec); + +static void sa_db_free_data(void *context, void *data) +{ + kfree(data); +} + +static void sa_db_add_one(struct ib_device *device) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + int i; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &dev->port[i-1]; + port->dev = dev; + port->port_num = i; + port->update_time = jiffies - hold_time; + INIT_WORK(&port->work, update_cache, port); + index_init(&port->index, sizeof (union ib_gid), GFP_KERNEL); + + if (ib_get_cached_gid(device, i, 0, &port->gid) || + ib_get_cached_pkey(device, i, 0, &port->pkey)) + goto err; + + port->agent = ib_register_mad_agent(device, i, IB_QPT_GSI, + NULL, IB_MGMT_RMPP_VERSION, + send_handler, recv_handler, + port); + if (IS_ERR(port->agent)) + goto err; + } + + dev->device = device; + ib_set_client_data(device, &sa_db_client, dev); + + mutex_lock(&lock); + list_add_tail(&dev->list, &dev_list); + mutex_unlock(&lock); + + /* Initialization must be complete before cache updates can occur. */ + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event); + ib_register_event_handler(&dev->event_handler); + + /* Force an update now. 
*/ + for (i = 1; i <= device->phys_port_cnt; i++) + schedule_update(&dev->port[i-1]); + return; +err: + while (--i) { + ib_unregister_mad_agent(dev->port[i-1].agent); + index_destroy(&dev->port[i-1].index); + } + kfree(dev); +} + +static void sa_db_remove_one(struct ib_device *device) +{ + struct sa_db_device *dev; + int i; + + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + for (i = 0; i < device->phys_port_cnt; i++) + cancel_delayed_work(&dev->port[i].work); + flush_workqueue(rdma_wq); + + for (i = 0; i < device->phys_port_cnt; i++) { + ib_unregister_mad_agent(dev->port[i].agent); + index_remove_all(&dev->port[i].index, sa_db_free_data, NULL); + index_destroy(&dev->port[i].index); + } + + mutex_lock(&lock); + list_del(&dev->list); + mutex_unlock(&lock); + kfree(dev); +} + +static int __init sa_db_init(void) +{ + cache_timeout = msecs_to_jiffies(cache_timeout); + hold_time = msecs_to_jiffies(hold_time); + update_delay = msecs_to_jiffies(update_delay); + return ib_register_client(&sa_db_client); +} + +static void __exit sa_db_cleanup(void) +{ + ib_unregister_client(&sa_db_client); +} + +module_init(sa_db_init); +module_exit(sa_db_cleanup); Index: include/rdma/ib_local_sa.h =================================================================== --- include/rdma/ib_local_sa.h (revision 0) +++ include/rdma/ib_local_sa.h (revision 0) @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_LOCAL_SA_H +#define IB_LOCAL_SA_H + +#include + +/** + * ib_get_path_rec - Query the local SA database for path information. + * @device: The local device to query. + * @port_num: The port of the local device being queried. + * @sgid: The source GID of the path record. + * @dgid: The destination GID of the path record. + * @pkey: The protection key of the path record. + * @rec: A reference to a path record structure that will receive a copy of + * the response. + * + * Returns a copy of a path record meeting the specified criteria to the + * location referenced by %rec. A return value < 0 indicates that an error + * occurred processing the request, or no path record was found. 
+ */ +int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, + union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec); + +#endif /* IB_LOCAL_SA_H */ Index: core/Makefile =================================================================== --- core/Makefile (revision 5098) +++ core/Makefile (working copy) @@ -1,10 +1,13 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ - ib_sa.o ib_at.o ib_addr.o rdma_cm.o + ib_sa.o ib_at.o ib_addr.o rdma_cm.o \ + ib_local_sa.o findex.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o rdma_ucm.o +findex-y := index.o + ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -22,6 +25,8 @@ ib_addr-y := addr.o ib_sa-y := sa_query.o +ib_local_sa-y := local_sa.o + ib_umad-y := user_mad.o ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ From halr at voltaire.com Wed Jan 25 11:48:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Jan 2006 14:48:46 -0500 Subject: [openib-general] Re: Creation of multicast groups In-Reply-To: References: Message-ID: <1138218439.4338.49977.camel@hal.voltaire.com> Hi Amith, On Wed, 2006-01-25 at 14:35, amith rajith mamidala wrote: > Hi, > > I am trying to run a program which creates multicast groups. I am using > the libraries -losmcomp -losmvendor -lopensm for this purpose. > I was facing a problem while running the program. However, > the program runs if I execute it as a root. I am using the revision > 4918 of the osm related libraries, > > I am getting the following error messages: > > -I- Creating Multicast Group > -I- MGID 0xff12a01cfe800000:0000000000000000 > -I- Port Num:1 > Jan 25 14:25:54 283113 [AB00EB00] -> osm_vendor_bind: Binding to port > 0x6270510000005. 
> Jan 25 14:25:54 285850 [AB00EB00] -> osm_vendor_open_port: ERR 542C: > umad_open_port() failed > Jan 25 14:25:54 285873 [AB00EB00] -> osm_vendor_bind: ERR 5424: Unable to > Open Port 0x6270510000005. > Jan 25 14:25:54 285890 [AB00EB00] -> osmv_bind_sa: ERR 5506: Failed to > bind to vendor GSI > Jan 25 14:25:54 285901 [AB00EB00] -> ibmcgrp_bind: ERR 00137: Unable to > bind to SA If you want an ordinary user to be able to do this, you need to change your udev rules to add permissions as follows: KERNEL="umad*", NAME="infiniband/%k", MODE="0666" -- Hal > > Thanks, > Amith > > From sean.hefty at intel.com Wed Jan 25 11:55:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Jan 2006 11:55:59 -0800 Subject: [openib-general] [PATCH 4/4] SA path record caching In-Reply-To: Message-ID: Modify the CMA to use the local SA database for path record lookups. Signed-off-by: Sean Hefty Index: core/cma.c =================================================================== --- core/cma.c (revision 5115) +++ core/cma.c (working copy) @@ -34,7 +34,7 @@ #include #include #include -#include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -115,6 +115,9 @@ struct rdma_id_private { struct cma_work { struct work_struct work; struct rdma_id_private *id; + enum cma_state old_state; + enum cma_state new_state; + struct rdma_cm_event event; }; union cma_ip_addr { @@ -548,17 +551,6 @@ static void cma_cancel_addr(struct rdma_ } } -static void cma_cancel_route(struct rdma_id_private *id_priv) -{ - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: - ib_sa_cancel_query(id_priv->query_id, id_priv->query); - break; - default: - break; - } -} - static inline int cma_internal_listen(struct rdma_id_private *id_priv) { return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && @@ -610,9 +602,6 @@ static void cma_cancel_operation(struct case CMA_ADDR_QUERY: cma_cancel_addr(id_priv); break; - case CMA_ROUTE_QUERY: - cma_cancel_route(id_priv); 
- break; case CMA_LISTEN: if (cma_any_addr(&id_priv->id.route.addr.src_addr) && !id_priv->cma_dev) @@ -1019,65 +1008,65 @@ err: }; EXPORT_SYMBOL(rdma_listen); -static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, - void *context) +static void cma_work_handler(void *data) { - struct rdma_id_private *id_priv = context; - struct rdma_route *route = &id_priv->id.route; - enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + int destroy = 0; atomic_inc(&id_priv->dev_remove); - if (!status) { - route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); - if (route->path_rec) { - route->num_paths = 1; - *route->path_rec = *path_rec; - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, - CMA_ROUTE_RESOLVED)) { - kfree(route->path_rec); - goto out; - } - } else - status = -ENOMEM; - } - - if (status) { - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) - goto out; - event = RDMA_CM_EVENT_ROUTE_ERROR; - } + if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) + goto out; - if (cma_notify_user(id_priv, event, status, NULL, 0)) { + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); - cma_deref_id(id_priv); - rdma_destroy_id(&id_priv->id); - return; + destroy = 1; } out: cma_release_remove(id_priv); cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); } static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { + struct rdma_route *route = &id_priv->id.route; struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; - struct ib_sa_path_rec path_rec; + struct cma_work *work; + int ret; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); + if (!route->path_rec) { + ret = -ENOMEM; + goto err1; + } + + ret = 
ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, + ib_addr_get_sgid(addr), ib_addr_get_dgid(addr), + ib_addr_get_pkey(addr), route->path_rec); + if (ret) + goto err2; - memset(&path_rec, 0, sizeof path_rec); - path_rec.sgid = *ib_addr_get_sgid(addr); - path_rec.dgid = *ib_addr_get_dgid(addr); - path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); - path_rec.numb_path = 1; - - id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, - id_priv->id.port_num, &path_rec, - IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, - timeout_ms, GFP_KERNEL, - cma_query_handler, id_priv, &id_priv->query); - - return (id_priv->query_id < 0) ? id_priv->query_id : 0; + route->num_paths = 1; + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + queue_work(rdma_wq, &work->work); + return 0; +err2: + kfree(route->path_rec); + route->path_rec = NULL; +err1: + kfree(work); + return ret; } int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) @@ -1179,29 +1168,6 @@ out: cma_deref_id(id_priv); } -static void loopback_addr_handler(void *data) -{ - struct cma_work *work = data; - struct rdma_id_private *id_priv = work->id; - - kfree(work); - atomic_inc(&id_priv->dev_remove); - - if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) - goto out; - - if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { - cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); - cma_deref_id(id_priv); - rdma_destroy_id(&id_priv->id); - return; - } -out: - cma_release_remove(id_priv); - cma_deref_id(id_priv); -} - static int cma_resolve_loopback(struct rdma_id_private *id_priv, struct sockaddr *src_addr, enum cma_state state) { @@ -1209,7 +1175,7 @@ static int cma_resolve_loopback(struct r struct rdma_dev_addr *dev_addr; int ret; - work = kmalloc(sizeof *work, 
GFP_KERNEL); + work = kzalloc(sizeof *work, GFP_KERNEL); if (!work) return -ENOMEM; @@ -1226,7 +1192,10 @@ static int cma_resolve_loopback(struct r } work->id = id_priv; - INIT_WORK(&work->work, loopback_addr_handler, work); + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ADDR_QUERY; + work->new_state = CMA_ADDR_RESOLVED; + work->event.event = RDMA_CM_EVENT_ADDR_RESOLVED; queue_work(rdma_wq, &work->work); return 0; err: From sean.hefty at intel.com Wed Jan 25 12:28:02 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Jan 2006 12:28:02 -0800 Subject: [openib-general] [PATCH] [RFC] group devices by type Message-ID: I'd like to get some feedback about adding the ability to group devices by some higher level type. This would permit identifying all devices that are of type "Infiniband" from devices of other RDMA transports. I've included the patch to ib_verbs.h to do this, along with changes to mad.c to show how it would be used. If this is okay, then similar changes would be needed by a dozen or so other files, which I would do before submitting a final patch. 
Signed-off-by: Sean Hefty ---- Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5098) +++ include/rdma/ib_verbs.h (working copy) @@ -57,7 +57,8 @@ union ib_gid { }; enum ib_node_type { - IB_NODE_CA = 1, + IB_NODE_IB = 0x10, /* mask for all IB node types */ + IB_NODE_CA, IB_NODE_SWITCH, IB_NODE_ROUTER }; Index: core/mad.c =================================================================== --- core/mad.c (revision 5098) +++ core/mad.c (working copy) @@ -2661,7 +2661,9 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + else if (device->node_type == IB_NODE_SWITCH) { start = 0; end = 0; } else { From rdreier at cisco.com Wed Jan 25 12:29:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Jan 2006 12:29:29 -0800 Subject: [openib-general] [PATCH 2/4] SA path record caching In-Reply-To: (Sean Hefty's message of "Wed, 25 Jan 2006 11:44:32 -0800") References: Message-ID: This probably doesn't belong under drivers/infiniband, since it's completely generic. How close are the existing rbtree, radix tree and idr libraries to what you need? It might be better to slightly extend an existing kernel library rather than creating yet another API... - R. 
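
To make the masking scheme in the "[RFC] group devices by type" patch above concrete, here is a small standalone sketch (plain C, not kernel code). The first four enumerators mirror the patch as posted; RDMA_NODE_RNIC and the node_is_ib() helper are hypothetical names added only for illustration and do not appear in the patch.

```c
/* Values mirror the RFC patch: IB_NODE_IB doubles as a mask bit. */
enum ib_node_type {
    IB_NODE_IB = 0x10,      /* mask for all IB node types */
    IB_NODE_CA,             /* becomes 0x11 */
    IB_NODE_SWITCH,         /* becomes 0x12 */
    IB_NODE_ROUTER,         /* becomes 0x13 */
    RDMA_NODE_RNIC = 0x20   /* hypothetical non-IB type, not in the patch */
};

/*
 * Hypothetical helper wrapping the test the patch adds to
 * ib_mad_init_device(): a node type counts as InfiniBand when the
 * IB_NODE_IB bit is set.
 */
static int node_is_ib(int node_type)
{
    return (node_type & IB_NODE_IB) == IB_NODE_IB;
}
```

With these values node_is_ib() returns 1 for IB_NODE_CA, IB_NODE_SWITCH and IB_NODE_ROUTER and 0 for the made-up RDMA_NODE_RNIC. Note that the three IB enumerators no longer have the values 1, 2 and 3 once the mask is introduced, so any code that relied on the old numbering would be affected.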
From jlentini at netapp.com Wed Jan 25 12:33:16 2006
From: jlentini at netapp.com (James Lentini)
Date: Wed, 25 Jan 2006 15:33:16 -0500 (EST)
Subject: [openib-general] [PATCH] [RFC] group devices by type
In-Reply-To:
References:
Message-ID:

On Wed, 25 Jan 2006, Sean Hefty wrote:

> Index: include/rdma/ib_verbs.h
> ===================================================================
> --- include/rdma/ib_verbs.h (revision 5098)
> +++ include/rdma/ib_verbs.h (working copy)
> @@ -57,7 +57,8 @@ union ib_gid {
>  };
>
>  enum ib_node_type {
> -    IB_NODE_CA = 1,
> +    IB_NODE_IB = 0x10, /* mask for all IB node types */

Is it time to update the prefix to RDMA_ (e.g. RDMA_NODE_IB)?

> +    IB_NODE_CA,
>      IB_NODE_SWITCH,
>      IB_NODE_ROUTER
>  };
> Index: core/mad.c
> ===================================================================
> --- core/mad.c (revision 5098)
> +++ core/mad.c (working copy)
> @@ -2661,7 +2661,9 @@ static void ib_mad_init_device(struct ib
>  {
>      int start, end, i;
>
> -    if (device->node_type == IB_NODE_SWITCH) {
> +    if ((device->node_type & IB_NODE_IB) != IB_NODE_IB)
> +        return;
> +    else if (device->node_type == IB_NODE_SWITCH) {
>          start = 0;
>          end = 0;
>      } else {

From mshefty at ichips.intel.com Wed Jan 25 12:37:24 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jan 2006 12:37:24 -0800
Subject: [openib-general] [PATCH 2/4] SA path record caching
In-Reply-To:
References:
Message-ID: <43D7E184.7040900@ichips.intel.com>

Roland Dreier wrote:
> This probably doesn't belong under drivers/infiniband, since it's
> completely generic. How close are the existing rbtree, radix tree and
> idr libraries to what you need? It might be better to slightly extend
> an existing kernel library rather than creating yet another API...

rbtree can work, but would be less performant. Idr doesn't work for
this purpose. I didn't realize that there was a radix tree available,
so I'll see if that will work.
- Sean

From caitlinb at broadcom.com Wed Jan 25 12:37:31 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 25 Jan 2006 12:37:31 -0800
Subject: [openib-general] [PATCH] [RFC] group devices by type
Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C35C8@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> On Wed, 25 Jan 2006, Sean Hefty wrote:
>
>> Index: include/rdma/ib_verbs.h
>> ===================================================================
>> --- include/rdma/ib_verbs.h (revision 5098)
>> +++ include/rdma/ib_verbs.h (working copy)
>> @@ -57,7 +57,8 @@ union ib_gid {
>>  };
>>
>>  enum ib_node_type {
>> -    IB_NODE_CA = 1,
>> +    IB_NODE_IB = 0x10, /* mask for all IB node types */
>
> Is it time to update the prefix to RDMA_ (e.g. RDMA_NODE_IB)?
>

I would say that symbols that are IB-specific are very low priority
for shifting to transport-neutral names. Unless the goal is to
eliminate the use of the IB_ prefix completely, I would not bother.

It's more important that the symbols that transport-neutral code
would use not have the "IB_" or "ib_" prefixes.

From mshefty at ichips.intel.com Wed Jan 25 12:39:08 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jan 2006 12:39:08 -0800
Subject: [openib-general] [PATCH] [RFC] group devices by type
In-Reply-To:
References:
Message-ID: <43D7E1EC.2070809@ichips.intel.com>

James Lentini wrote:
>>  enum ib_node_type {
>>-    IB_NODE_CA = 1,
>>+    IB_NODE_IB = 0x10, /* mask for all IB node types */
>
>
> Is it time to update the prefix to RDMA_ (e.g. RDMA_NODE_IB)?

Probably. I used IB_NODE_IB, under the assumption that it would be
renamed to RDMA_NODE_IB. I can include that change as part of this if
desired.
- Sean From rdreier at cisco.com Wed Jan 25 13:32:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Jan 2006 13:32:56 -0800 Subject: [openib-general] [PATCH] [RFC] group devices by type In-Reply-To: (Sean Hefty's message of "Wed, 25 Jan 2006 12:28:02 -0800") References: Message-ID: > enum ib_node_type { > - IB_NODE_CA = 1, > + IB_NODE_IB = 0x10, /* mask for all IB node types */ > + IB_NODE_CA, > IB_NODE_SWITCH, > IB_NODE_ROUTER > }; Is there anywhere that uses this value to compare against the values that the IB spec uses in the NodeInfo record? - R. From bos at pathscale.com Wed Jan 25 14:32:41 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 25 Jan 2006 14:32:41 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1137631411.4757.218.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> Message-ID: <1138228361.15295.55.camel@serpentine.pathscale.com> I've been flailing away at the ioctls in our driver, with a good degree of success. However, one in particular is proving tricky: > Opening the /dev/ipath special file assigns an appropriate free > unit (chip) and port (context on a chip) to a user process. > Think of it as similar to /dev/ptmx for ttys, except there isn't > a devpts-like filesystem behind it. Once a process has > opened /dev/ipath, it needs to find out which unit and port it > has opened, so that it can access other attributes in /sys. To > do this, we provide a GETPORT ioctl. I still don't see how to replace this with anything else without performing unnatural acts. We use struct file's private_data to keep a pointer to the device in use, which works fine for ioctl. However, if I'm coming into the kernel over a netlink socket, I have no obvious way of going from my table of devices to the processes that have each one open, and I see no evidence that any other device driver tries to do anything like this either. 
Short of keeping a reference to the task_struct in the device, or walking the sending process's file table if we receive a netlink message (both of which are disgusting), I see no way to make this ioctl go away. Am I missing something? References: Message-ID: <43D7FE7E.5040607@ichips.intel.com> Roland Dreier wrote: > > enum ib_node_type { > > - IB_NODE_CA = 1, > > + IB_NODE_IB = 0x10, /* mask for all IB node types */ > > + IB_NODE_CA, > > IB_NODE_SWITCH, > > IB_NODE_ROUTER > > }; > > Is there anywhere that uses this value to compare against the values > that the IB spec uses in the NodeInfo record? I didn't notice anything that assumed the values assigned to the enum members, but I can think of several reasons how I would miss that. I guess an easy sanity check is to change the value assigned to IB_NODE_CA and see if anything stops working. - Sean From mulix at mulix.org Wed Jan 25 14:43:11 2006 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Thu, 26 Jan 2006 00:43:11 +0200 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <1138228361.15295.55.camel@serpentine.pathscale.com> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1138228361.15295.55.camel@serpentine.pathscale.com> Message-ID: <20060125224311.GG27845@granada.merseine.nu> On Wed, Jan 25, 2006 at 02:32:41PM -0800, Bryan O'Sullivan wrote: > I've been flailing away at the ioctls in our driver, with a good degree > of success. However, one in particular is proving tricky: > > > Opening the /dev/ipath special file assigns an appropriate free > > unit (chip) and port (context on a chip) to a user process. > > Think of it as similar to /dev/ptmx for ttys, except there isn't > > a devpts-like filesystem behind it. Once a process has > > opened /dev/ipath, it needs to find out which unit and port it > > has opened, so that it can access other attributes in /sys. To > > do this, we provide a GETPORT ioctl. 
> > I still don't see how to replace this with anything else without > performing unnatural acts. If this is all it does, why not keep it as a device file, where open() assigns the resources, read() returns them, and close() frees them? no ioctl necessary. Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From bos at pathscale.com Wed Jan 25 14:55:55 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 25 Jan 2006 14:55:55 -0800 Subject: [openib-general] Re: RFC: ipath ioctls and their replacements In-Reply-To: <20060125224311.GG27845@granada.merseine.nu> References: <1137631411.4757.218.camel@serpentine.pathscale.com> <1138228361.15295.55.camel@serpentine.pathscale.com> <20060125224311.GG27845@granada.merseine.nu> Message-ID: <1138229756.15295.75.camel@serpentine.pathscale.com> On Thu, 2006-01-26 at 00:43 +0200, Muli Ben-Yehuda wrote: > If this is all it does, why not keep it as a device file, where open() > assigns the resources, read() returns them, and close() frees them? no > ioctl necessary. Since the char special file doesn't currently implement a read() method, I can go that way, but the result will either end up being a function that does a copy_to_user of two bytes, or (if we ever find we need another ioctl-like thing) it will become an ioctl in all but name. This is the position the current infiniband code is in. There are special files with read methods defined that are exactly and precisely ioctl and nothing else, as far as I can tell, presumably because the resistance to using ioctl was so high. I'd rather call a spade a spade. References: <43D7E184.7040900@ichips.intel.com> Message-ID: <43D80C83.70207@ichips.intel.com> Sean Hefty wrote: > rbtree can work, but would be less performant. Idr doesn't work for > this purpose. I didn't realize that there was a radix tree available, > so I'll see if that will work. The radix tree implementation that's available uses an unsigned long for its index key. 
For path record lookup, we need an index key that's the size of a GID. Changing the radix tree to support a variable length key would pretty much redo the entire API and implementation. I agree that the index code doesn't belong in the infiniband directory, which is why I gave it a generic name. Keeping it in the infiniband directory just makes it easier for people to use until it merges upstream, versus requiring a kernel patch. - Sean From mshefty at ichips.intel.com Wed Jan 25 16:39:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Jan 2006 16:39:12 -0800 Subject: [openib-general] [PATCH] [RFC] group devices by type In-Reply-To: <43D7FE7E.5040607@ichips.intel.com> References: <43D7FE7E.5040607@ichips.intel.com> Message-ID: <43D81A30.3030802@ichips.intel.com> Sean Hefty wrote: > I didn't notice anything that assumed the values assigned to the enum > members, but I can think of several reasons how I would miss that. I > guess an easy sanity check is to change the value assigned to IB_NODE_CA > and see if anything stops working. I changed the value of IB_NODE_CA from 1 to 9, and everything that I tested still worked. However... there looks like there may be some cases in userspace where the code assume that the node_type matches the values in a NodeInfo. A grep showed some areas where there were checks like "ca.node_type == 2". It looks like we might have issues, then, trying to use node_type as an RDMA device type identifier. Maybe we should just add a new device_type field to ib_device to distinguish between IB and iWarp devices... - Sean From ogerlitz at voltaire.com Wed Jan 25 22:48:10 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Jan 2006 08:48:10 +0200 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: References: Message-ID: <43D870AA.9080204@voltaire.com> Sean Hefty wrote: > The following patch series adds caching of path records with the local system. 
> I divided the changes up into 4 patches to make the review easier.

> 3. Create a local SA database.
> 4. Modify the CMA to use the local SA database.

Looking at the patch series, I am somewhat confused, for two reasons:

- the ib_get_path_rec API does not seem to allow its consumer to specify
  a callback to be called when there is no path in the cache

- ib_get_path_rec seems to only look in the index, and returns a path
  only if there is a cached one.

So is it an incomplete implementation just sent for review, or am I
missing something?

Or.

From ogerlitz at voltaire.com Thu Jan 26 04:54:04 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Jan 2006 14:54:04 +0200 (IST)
Subject: [openib-general] [PATCH] iser: remove unused fields of task structure
Message-ID:

committed in r5187

removed the unused struct iscsi_iser_cmd_task fields sg, bad_sg, sg_count
and data_offset

Signed-off-by: Or Gerlitz

Index: ulp/iser/iscsi_iser.h
===================================================================
--- ulp/iser/iscsi_iser.h (revision 5184)
+++ ulp/iser/iscsi_iser.h (revision 5187)
@@ -332,9 +332,6 @@ struct iscsi_iser_cmd_task {
     int datasn; /* DataSN */
     uint32_t unsol_datasn;
     int sent;
-    struct scatterlist *sg; /* per-cmd SG list */
-    struct scatterlist *bad_sg; /* assert statement */
-    int sg_count; /* SG's to process */
     int imm_count; /* imm-data (bytes) */
     int unsol_count; /* unsolicited (bytes)*/
@@ -344,7 +341,6 @@ struct iscsi_iser_cmd_task {
     struct scsi_cmnd *sc; /* associated SCSI cmd*/
     int total_length;
-    int data_offset;
     struct iscsi_iser_mgmt_task *mtask; /* tmf mtask in progr */
     unsigned int post_send_count; /* posted send buffers pending completion */
Index: ulp/iser/iscsi_iser.c
===================================================================
--- ulp/iser/iscsi_iser.c (revision 5184)
+++ ulp/iser/iscsi_iser.c (revision 5187)
@@ -105,7 +105,6 @@ static void iscsi_iser_cmd_init(struct i
     ctask->mtask = NULL;
     ctask->sent = 0;
-    ctask->sg_count = 0;
     ctask->total_length = sc->request_bufflen;

From halr at voltaire.com Thu Jan 26 06:31:25 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Jan 2006 09:31:25 -0500
Subject: [openib-general] [PATCH] [RFC] group devices by type
In-Reply-To: <43D81A30.3030802@ichips.intel.com>
References: <43D7FE7E.5040607@ichips.intel.com> <43D81A30.3030802@ichips.intel.com>
Message-ID: <1138285875.4338.60785.camel@hal.voltaire.com>

Hi Sean,

On Wed, 2006-01-25 at 19:39, Sean Hefty wrote:
> Sean Hefty wrote:
> > I didn't notice anything that assumed the values assigned to the enum
> > members, but I can think of several reasons how I would miss that. I
> > guess an easy sanity check is to change the value assigned to IB_NODE_CA
> > and see if anything stops working.
>
> I changed the value of IB_NODE_CA from 1 to 9, and everything that I tested
> still worked. However... it looks like there may be some cases in userspace
> where the code assumes that the node_type matches the values in a NodeInfo. A
> grep showed some areas where there were checks like "ca.node_type == 2".

I just changed the diags not to do things like that.

-- Hal

> It looks like we might have issues, then, trying to use node_type as an RDMA
> device type identifier. Maybe we should just add a new device_type field to
> ib_device to distinguish between IB and iWarp devices...
>
> - Sean
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il Thu Jan 26 07:16:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jan 2006 17:16:25 +0200
Subject: [openib-general] Re: [PATCH] [RFC] group devices by type
In-Reply-To:
References:
Message-ID: <20060126151625.GF1549@mellanox.co.il>

Quoting r.
Sean Hefty :
> Subject: [PATCH] [RFC] group devices by type
>
> I'd like to get some feedback about adding the ability to group devices
> by some higher level type. This would permit identifying all devices
> that are of type "Infiniband" from devices of other RDMA transports.
>
> I've included the patch to ib_verbs.h to do this, along with changes to
> mad.c to show how it would be used. If this is okay, then similar changes
> would be needed by a dozen or so other files, which I would do before
> submitting a final patch.
>
> Signed-off-by: Sean Hefty
>

Wouldn't a simple helper function be sufficient? Something like:

int rdma_is_ib_device(enum ib_node_type t)
{
    switch (t) {
    case IB_NODE_CA:
    case IB_NODE_SWITCH:
    case IB_NODE_ROUTER:
        return 1;
    }
    return 0;
}

> ----
>
> Index: include/rdma/ib_verbs.h
> ===================================================================
> --- include/rdma/ib_verbs.h (revision 5098)
> +++ include/rdma/ib_verbs.h (working copy)
> @@ -57,7 +57,8 @@ union ib_gid {
>  };
>
>  enum ib_node_type {

Flags as enums, hmm.

> -    IB_NODE_CA = 1,
> +    IB_NODE_IB = 0x10, /* mask for all IB node types */
> +    IB_NODE_CA,
>      IB_NODE_SWITCH,
>      IB_NODE_ROUTER
>  };

I think this changes the ABI, so it's somewhat problematic.
A way to do the same without breaking the ABI is below.

> Index: core/mad.c
> ===================================================================
> --- core/mad.c (revision 5098)
> +++ core/mad.c (working copy)
> @@ -2661,7 +2661,9 @@ static void ib_mad_init_device(struct ib
>  {
>      int start, end, i;
>
> -    if (device->node_type == IB_NODE_SWITCH) {
> +    if ((device->node_type & IB_NODE_IB) != IB_NODE_IB)

How about we have

    IB_NODE_CA = 1,
    IB_NODE_SWITCH,
    IB_NODE_ROUTER,
    IB_NODE_MAX

and then you can

    if (device->node_type >= IB_NODE_MAX)
        return;

> +        return;
> +    else if (device->node_type == IB_NODE_SWITCH) {
>          start = 0;
>          end = 0;
>      } else {

--
Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From swise at opengridcomputing.com Thu Jan 26 07:36:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 09:36:27 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> Message-ID: <1138289787.760.12.camel@stevo-desktop> On Tue, 2006-01-24 at 09:13 -0800, Roland Dreier wrote: > Tom> The intended behavior is to provide "full coordination". For > Tom> the example you give, I would expect that rdma_resolve_addr > Tom> would fail due to to a timeout waiting for an ARP reply. > > OK, now I'm going off into crazy-land, but I could have a rule that > filters on source MAC and ethertype, and lets ARPs but no other > packets through. > > - R. Perhaps the netfilter subsystem also needs similar notifier hooks? Then the iwarp CM could be notified of netfilter changes and notify providers to go re-examine the rules and kill any connections that violate the rules. Just thinking out loud... From halr at voltaire.com Thu Jan 26 07:33:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jan 2006 10:33:49 -0500 Subject: [openib-general] [PATCH 3/4] SA path record caching In-Reply-To: References: Message-ID: <1138289594.4338.61142.camel@hal.voltaire.com> Hi Sean, On Wed, 2006-01-25 at 14:47, Sean Hefty wrote: > Add a local SA database for path records to eliminate queries to the SA > for connection establishment. [snip...] 
> +static void format_path_req(struct sa_db_port *port, > + struct ib_mad_send_buf *msg) > +{ > + struct ib_sa_mad *mad = msg->mad; > + struct ib_sa_path_rec path_rec; > + > + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; > + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; > + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; > + mad->mad_hdr.method = IB_SA_METHOD_GET_TABLE; > + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); > + mad->mad_hdr.tid = form_tid(msg); > + > + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | > + IB_SA_PATH_REC_NUMB_PATH; > + > + path_rec.sgid = port->gid; > + path_rec.pkey = port->pkey; > + path_rec.numb_path = 1; > + ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); > +} This looks like a wildcarded DGID request but NumbPaths is 1. Am I missing something ? Also, the ability for one node's SA cache to handle this for another node appears to be a future possibility. -- Hal From swise at opengridcomputing.com Thu Jan 26 07:50:01 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 09:50:01 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138289787.760.12.camel@stevo-desktop> References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> <1138289787.760.12.camel@stevo-desktop> Message-ID: <1138290601.760.22.camel@stevo-desktop> Here is a comment on the specific CMA/IWARP patch: The iwarp enhancements in the this patch save the each device's node_guid in the associated cma_device. The assumption was that the iwarp device's node_guid would be the mac address for that device. Then, in cma_acquire_iw_dev(), the rdma_dev_addr pulled from the netdev device as a result of route lookup is used to find a cma_dev who's node_guid matches the rdma_dev_addr pulled from the netdev. In ethernet terms, the netdev's dev_addr is used to find an appropriate cma device with a matching node_guid. 
This is broken, however, for multi-ported devices (and for devices that
have multiple mac addrs per port), since there isn't a concept of a port
guid in IB (I assume, since the code doesn't have port guids). I
discussed this with Tom, and we think the correct solution is for the
device to promote mac addresses as gids. Then for each port, the iwarp
device will advertise its mac address(es) and populate the gid cache
with these mac addresses.

Then we can change cma_acquire_iw_dev() to find the appropriate gid from
the gid cache. In fact, cma_acquire_dev() might not need to switch out
to IB vs RNIC functions. It can probably be mostly done with common
code.

Thoughts?

I can provide a patch for this soon, but I'd rather get the current CMA
changes into the trunk, then post a delta patch from the trunk...

From swise at opengridcomputing.com Thu Jan 26 07:51:39 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Jan 2006 09:51:39 -0600
Subject: [openib-general] iwarp: whats a pkey?
Message-ID: <1138290699.760.25.camel@stevo-desktop>

iwarp/ib experts:

Should a pkey be mapped to a vlan id for iwarp openib devices?

From halr at voltaire.com Thu Jan 26 07:55:04 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Jan 2006 10:55:04 -0500
Subject: [openib-general] iwarp: whats a pkey?
In-Reply-To: <1138290699.760.25.camel@stevo-desktop>
References: <1138290699.760.25.camel@stevo-desktop>
Message-ID: <1138290899.4338.61260.camel@hal.voltaire.com>

On Thu, 2006-01-26 at 10:51, Steve Wise wrote:
> iwarp/ib experts:
>
> Should a pkey be mapped to a vlan id for iwarp openib devices?

IMO yes. Partitions are analogous to VLANs. What are the semantics of
a VLAN ID? Partitions have limited and full members. Full members can
talk with anyone; limited members can talk only to full members.

-- Hal

From rdreier at cisco.com Thu Jan 26 08:06:11 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 26 Jan 2006 08:06:11 -0800
Subject: [openib-general] iwarp: whats a pkey?
In-Reply-To: <1138290699.760.25.camel@stevo-desktop> (Steve Wise's message of "Thu, 26 Jan 2006 09:51:39 -0600")
References: <1138290699.760.25.camel@stevo-desktop>
Message-ID:

    Steve> iwarp/ib experts: Should a pkey be mapped to a vlan id for
    Steve> iwarp openib devices?

No, I think trying to create a mapping is a bad idea. The semantics
of VLANs and IB partitions are sufficiently different that it's
probably better to treat each concept natively.

 - R.

From rdreier at cisco.com Thu Jan 26 08:07:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 26 Jan 2006 08:07:03 -0800
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To: <1138290601.760.22.camel@stevo-desktop> (Steve Wise's message of "Thu, 26 Jan 2006 09:50:01 -0600")
References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> <1138289787.760.12.camel@stevo-desktop> <1138290601.760.22.camel@stevo-desktop>
Message-ID:

    Steve> This is broken, however, for multi-ported devices (and for
    Steve> devices who have multiple mac addrs per port), since there
    Steve> isn't a concept of a port guid in IB (i assume, since the
    Steve> code doesn't have port guids).

Each port has one or more port GUIDs in IB -- separate from the node
GUID, which is a single GUID for the whole node.

 - R.
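
On the P_Key side of the "iwarp: whats a pkey?" thread above, the full/limited membership rule Hal describes can be written down as a short check. This is only a sketch in plain C: it assumes the usual encoding in which the top bit of the 16-bit P_Key marks full membership (consistent with 0xFFFF being the default full-membership key and 0x7FFF the default partial-membership key), it ignores details such as invalid key values, and the macro and function names are made up for illustration rather than taken from any OpenIB header.

```c
#include <stdint.h>

/* Hypothetical names: top bit set = full member, clear = limited member. */
#define PKEY_FULL_MEMBER 0x8000u
#define PKEY_BASE_MASK   0x7fffu

static int pkey_is_full(uint16_t pkey)
{
    return (pkey & PKEY_FULL_MEMBER) != 0;
}

/*
 * Two endpoints may talk when they are in the same partition (low 15
 * bits equal) and at least one of them is a full member, i.e. two
 * limited members of the same partition cannot talk to each other.
 */
static int pkeys_can_communicate(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return 0;
    return pkey_is_full(a) || pkey_is_full(b);
}
```

Under these assumptions 0xFFFF and 0x7FFF can communicate, while two endpoints that both hold 0x7FFF cannot. A VLAN ID has no comparable membership bit, which is one way to see the semantic difference Roland points to.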
From swise at opengridcomputing.com Thu Jan 26 08:22:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 10:22:33 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> <1138289787.760.12.camel@stevo-desktop> <1138290601.760.22.camel@stevo-desktop> Message-ID: <1138292553.760.34.camel@stevo-desktop> On Thu, 2006-01-26 at 08:07 -0800, Roland Dreier wrote: > Steve> This is broken, however, for multi-ported devices (and for > Steve> devices who have multiple mac addrs per port), since there > Steve> isn't a concept of a port guid in IB (i assume, since the > Steve> code doesn't have port guids). > > Each port has one or more port GUIDs in IB -- separate from the node > GUID, which is a single GUID for the whole node. > > - R. Oh, ok. So then do you think iwarp openib devices should map port MAC addresses to port guids? From swise at opengridcomputing.com Thu Jan 26 08:23:01 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 10:23:01 -0600 Subject: [openib-general] iwarp: whats a pkey? In-Reply-To: References: <1138290699.760.25.camel@stevo-desktop> Message-ID: <1138292581.760.36.camel@stevo-desktop> On Thu, 2006-01-26 at 08:06 -0800, Roland Dreier wrote: > Steve> iwarp/ib experts: Should a pkey be mapped to a vlan id for > Steve> iwarp openib devices? > > No, I think trying to create a mapping is a bad idea. The semantics > of VLANs and IB partitions are sufficiently different that it's > probably better to treat each concept natively. > > - R. Roland, can you expand on this some? From caitlinb at broadcom.com Thu Jan 26 09:05:04 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 09:05:04 -0800 Subject: [openib-general] iwarp: whats a pkey? 
Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C36D2@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve> iwarp/ib experts: Should a pkey be mapped to a vlan id for > Steve> iwarp openib devices? > > No, I think trying to create a mapping is a bad idea. The > semantics of VLANs and IB partitions are sufficiently > different that it's probably better to treat each concept natively. > I agree. In theory iWARP is layered over L4 (TCP/SCTP) which is layered of IP, which is over Ethernet. Once you have that many theoretical layers you end up with a distinct probability that the iWARP code simply does not have access to VLAN info. Every iWARP RNIC I know of is also a general purpose Ethernet NIC. The VLAN functionality serves both purposes and is not controlled/managed through RDMA-specific interfaces. From caitlinb at broadcom.com Thu Jan 26 09:16:10 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 09:16:10 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C36D6@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Here is a comment on the specific CMA/IWARP patch: > > The iwarp enhancements in the this patch save the each > device's node_guid in the associated cma_device. The > assumption was that the iwarp device's node_guid would be the > mac address for that device. > Then, in cma_acquire_iw_dev(), the rdma_dev_addr pulled from > the netdev device as a result of route lookup is used to find > a cma_dev who's node_guid matches the rdma_dev_addr pulled > from the netdev. > > In ethernet terms, the netdev's dev_addr is used to find an > appropriate cma device with a matching node_guid. > > This is broken, however, for multi-ported devices (and for > devices who have multiple mac addrs per port), since there > isn't a concept of a port > guid in IB (i assume, since the code doesn't have port guids). 
I > discussed this with tom, and we think the correct solution is > for the device to promote mac addresses as gids. Then for > each port, the iwarp device will advertise its mac > address(es) and populate the gid cache with these mac addresses. > > Then we can change cma_acquire_iw_dev() to find the > appropriate gid from the gid cache. In fact, > cma_acquire_dev() might not need to switch out to IB vs RNIC > functions. It can probably be mostly done with common code. > > Thoughts? > > I can provide a patch for this soon, but I'd rather get the > current CMA changes into the trunk, then post a delta patch > from the trunk... > By definition iWARP is cleanly layered over IP. Therefore an iWARP port is not a physical port but a logical one. Management of physical ports is something that must be done independently of RDMA software. For example, if two physical Ethernet ports are teamed this is NOT visible to the RDMA layer. This is a major example of the need to let each transport express itself naturally, and finding the common ground that is meaningful to applications, rather than forcing one to emulate the other. By delegating physical port selection to the IP layer, iWARP inherits existig Ethernet port failover solutions and even full teaming. While not as general as InfiniBand Path Migration, it has the benefit of working without being exposed to the application layer. There is no way to make the two fabrics look identical to applications that need to be fabric aware. Fortunately most applications just want to connect to X and don't care much about the fabric as long as the connection works -- and most of the application logic is for the phase when the connection is functional. Applications that need to deal with fabric failures will probably need to have transport dependent conditionals. I don't think you can abstract the different fabric configuration paradigms into something that can actually be used to diagnose or fix a problem. 
What we can do is provide abstractions that allow applications to *use* a working post-discovery fabric in a transport neutral way. The 'port' is not terribly important for that. We can make the meaning nebulous, or we can enumerate what it means for each transport. But it needs to be clear that it is NOT a physical Ethernet port. From caitlinb at broadcom.com Thu Jan 26 09:20:48 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 09:20:48 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C36D8@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Tue, 2006-01-24 at 09:13 -0800, Roland Dreier wrote: >> Tom> The intended behavior is to provide "full coordination". For >> Tom> the example you give, I would expect that rdma_resolve_addr >> Tom> would fail due to to a timeout waiting for an ARP reply. >> >> OK, now I'm going off into crazy-land, but I could have a rule that >> filters on source MAC and ethertype, and lets ARPs but no other >> packets through. >> >> - R. > > Perhaps the netfilter subsystem also needs similar notifier > hooks? Then the iwarp CM could be notified of netfilter > changes and notify providers to go re-examine the rules and > kill any connections that violate the rules. > > Just thinking out loud... > Yes. The key point here is that netfilter will only be able to control the establishment and perhaps the existence of a connection. By the very nature of offloaded stateful connections, netfilter will NOT be able to see individual packets *within* a connection. The three fundamental questions are: 1) How does netfilter approve initiating a connection? 2) How does netfilter approve accepting a connection? 3) How does netfilter cause established connections that are now contrary to policy to be cancelled? Or does it? Once there is a preliminary consensus here, we'll have to bounce that proposal to both netdev and netfilter. 
From swise at opengridcomputing.com Thu Jan 26 09:33:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 11:33:36 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C36D6@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11C36D6@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1138296816.760.60.camel@stevo-desktop> On Thu, 2006-01-26 at 09:16 -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Here is a comment on the specific CMA/IWARP patch: > > > > The iwarp enhancements in the this patch save the each > > device's node_guid in the associated cma_device. The > > assumption was that the iwarp device's node_guid would be the > > mac address for that device. > > Then, in cma_acquire_iw_dev(), the rdma_dev_addr pulled from > > the netdev device as a result of route lookup is used to find > > a cma_dev who's node_guid matches the rdma_dev_addr pulled > > from the netdev. > > > > In ethernet terms, the netdev's dev_addr is used to find an > > appropriate cma device with a matching node_guid. > > > > This is broken, however, for multi-ported devices (and for > > devices who have multiple mac addrs per port), since there > > isn't a concept of a port > > guid in IB (i assume, since the code doesn't have port guids). I > > discussed this with tom, and we think the correct solution is > > for the device to promote mac addresses as gids. Then for > > each port, the iwarp device will advertise its mac > > address(es) and populate the gid cache with these mac addresses. > > > > Then we can change cma_acquire_iw_dev() to find the > > appropriate gid from the gid cache. In fact, > > cma_acquire_dev() might not need to switch out to IB vs RNIC > > functions. It can probably be mostly done with common code. > > > > Thoughts? 
> > > > I can provide a patch for this soon, but I'd rather get the > > current CMA changes into the trunk, then post a delta patch > > from the trunk... > > > > By definition iWARP is cleanly layered over IP. Therefore an > iWARP port is not a physical port but a logical one. > > Management of physical ports is something that must be done > independently of RDMA software. > > For example, if two physical Ethernet ports are teamed this > is NOT visible to the RDMA layer. > > This is a major example of the need to let each transport > express itself naturally, and finding the common ground that > is meaningful to applications, rather than forcing one to > emulate the other. > I'd like us to focus on phase I of iwarp. For phase I, the CMA tries to find an appropriate openib device, given the dev_addr from the associated netdev that was found during a routing lookup. For IB, the dev_addr is matched against gids. For iwarp, the dev_addr is matched against the mac addr of the openib dev. Currently the phase I iwarp cma patch assumes a single mac address for an openib iwarp device, mapped to the node_guid. I'm proposing using either the IB gid cache infrastructure to allow the iwarp openib device to advertise its set of local mac addrs -or- make the iwarp cma code search the list of port guids (roland pointed out that there are port guids), and thus map the device's mac addr list to port guids. This method parallels the current openib method for mapping IPoIB netdevs to openib devices. It is a loosely coupled model for associating netdevs with openib devs (which I'm not particularly fond of btw). But it works, and we can make iwarp openib devs work in the same manner. I'm not sure that this method prohibits any of the issues you raise wrt what a "port" is in iwarp. But maybe I'm confused... Steve. 
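For illustration only, the MAC-matching lookup proposed in this message can be sketched as a linear search over a device table. This is a minimal sketch under stated assumptions: `demo_rdma_dev`, `demo_find_dev_by_mac` and the fixed-size MAC table are hypothetical names invented here, not the cma_device list or any OpenIB API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAC_LEN   6
#define DEMO_MAX_MACS  4

/* Hypothetical stand-in for an RDMA device and the MACs it advertises. */
struct demo_rdma_dev {
	const char *name;
	size_t      num_macs;
	uint8_t     macs[DEMO_MAX_MACS][DEMO_MAC_LEN];
};

/*
 * Linear search: return the first device that advertises dev_addr,
 * or NULL if no device claims that address.
 */
static const struct demo_rdma_dev *
demo_find_dev_by_mac(const struct demo_rdma_dev *devs, size_t num_devs,
		     const uint8_t dev_addr[DEMO_MAC_LEN])
{
	for (size_t i = 0; i < num_devs; i++) {
		for (size_t j = 0; j < devs[i].num_macs; j++) {
			if (!memcmp(devs[i].macs[j], dev_addr, DEMO_MAC_LEN))
				return &devs[i];
		}
	}
	return NULL;
}
```

Because every device carries a list of addresses rather than a single node_guid, a multi-ported device (or one with several MACs per port) is found by any of its addresses, which is the case the single-MAC assumption misses.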
From mshefty at ichips.intel.com Thu Jan 26 09:55:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 09:55:27 -0800 Subject: [openib-general] Re: [PATCH] [RFC] group devices by type In-Reply-To: <20060126151625.GF1549@mellanox.co.il> References: <20060126151625.GF1549@mellanox.co.il> Message-ID: <43D90D0F.3080801@ichips.intel.com> Michael S. Tsirkin wrote: > Wouldnt a simple helper function be sufficient? I thought about this, but it seemed like there should be a better approach. >> enum ib_node_type { > > Flags as enums, hmm. We can drop the name on the enum. > I think this changes the ABI, so its somewhat problematic. > A way to do the same without breaking ABI below. Good point. argh... Another way to accomplish this is to break the node_type into 2 pieces (similar to what my patch was doing anyway): bits 7:4 - transport type (IB type would be 0) bits 3:0 - defined by transport This keeps the IB definitions exactly the same. > How about we have > > IB_NODE_CA = 1, > IB_NODE_SWITCH, > IB_NODE_ROUTER > IB_NODE_MAX So, as Roland pointed out, one problem that we have is that these values map to the IB specific values. Now I'm not so sure that extending this enum to support non-IB devices is the right approach. The node_type is even defined as a u8 to map directly to a NodeInfo. To be clear on the motivation behind this, I want to replace checks like: if (device->node_type == IB_NODE_RNIC) return; with if (device->node_type != some_IB_check) return; See Tom's CMA patch for an example. 
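As a rough illustration of the two-piece node_type layout described above (not code from any posted patch), the split can be expressed with two masks and a few helpers. All `DEMO_*` and `demo_*` names are hypothetical, and the iWARP transport value of 1 is an assumption made for the example; only the "IB is 0" convention and the 7:4 / 3:0 split come from the message.

```c
#include <stdint.h>

/* Hypothetical names; the bit layout follows the split proposed above. */
enum {
	/* bits 7:4 - transport type (IB is 0, so IB values are unchanged) */
	DEMO_TRANSPORT_IB    = 0 << 4,
	DEMO_TRANSPORT_IWARP = 1 << 4,   /* assumed value for illustration */

	/* bits 3:0 - device type, defined by the transport */
	DEMO_NODE_CA     = 1,
	DEMO_NODE_SWITCH = 2,
	DEMO_NODE_ROUTER = 3
};

#define DEMO_TRANSPORT_MASK 0xf0
#define DEMO_DEVICE_MASK    0x0f

/* Extract the transport nibble (still shifted into bits 7:4). */
static inline uint8_t demo_transport(uint8_t node_type)
{
	return node_type & DEMO_TRANSPORT_MASK;
}

/* Extract the transport-defined device type from bits 3:0. */
static inline uint8_t demo_device_type(uint8_t node_type)
{
	return node_type & DEMO_DEVICE_MASK;
}

/* One possible form of the "some_IB_check" mentioned in the message. */
static inline int demo_is_ib(uint8_t node_type)
{
	return demo_transport(node_type) == DEMO_TRANSPORT_IB;
}
```

With this layout an existing IB value such as a switch keeps its numeric value, while an RNIC would carry the iWARP transport bits ORed with the CA device type.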
- Sean From mshefty at ichips.intel.com Thu Jan 26 09:57:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 09:57:19 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138290601.760.22.camel@stevo-desktop> References: <54AD0F12E08D1541B826BE97C98F99F11C3401@NT-SJCA-0751.brcm.ad.broadcom.com> <1138120907.22009.25.camel@trinity.ogc.int> <1138289787.760.12.camel@stevo-desktop> <1138290601.760.22.camel@stevo-desktop> Message-ID: <43D90D7F.8010800@ichips.intel.com> Steve Wise wrote: > I can provide a patch for this soon, but I'd rather get the current CMA > changes into the trunk, then post a delta patch from the trunk... Regarding this, I'm trying to identify each of the individual changes to the trunk in the CMA patch, discuss the effect of the change, then merge that change in. - Sean From mshefty at ichips.intel.com Thu Jan 26 10:08:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 10:08:49 -0800 Subject: [openib-general] [PATCH 3/4] SA path record caching In-Reply-To: <1138289594.4338.61142.camel@hal.voltaire.com> References: <1138289594.4338.61142.camel@hal.voltaire.com> Message-ID: <43D91031.8050409@ichips.intel.com> Hal Rosenstock wrote: > This looks like a wildcarded DGID request but NumbPaths is 1. Am I > missing something ? This is correct. The cache is only retrieving a single path to each destination from a given port. NumbPaths is required for a GetTable request. Supporting multiple paths between an SGID and a DGID would require additional work, since the key used to lookup a path record expands beyond just the DGID. I've already thought of a couple ways to handle this. The easiest is to maintain a list of path records with matching DGIDs in the index. The index would return the list, which would then be walked to return multiple paths to the user. > Also, the ability for one node's SA cache to handle this for another > node appears to be a future possibility. 
This could be made possible, but would require separate SA GetTable queries on the part of the node with the cache. An additional query is required for each SGID. - Sean From mshefty at ichips.intel.com Thu Jan 26 10:20:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 10:20:20 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43D870AA.9080204@voltaire.com> References: <43D870AA.9080204@voltaire.com> Message-ID: <43D912E4.3020603@ichips.intel.com> Or Gerlitz wrote: > - the ib_get_path_rec api does not seem to allow its consumer to specify > a callback to be called when there is no path in the cache > > - the ib_get_path_rec seems to only look in the index and return path > only if there is cached one. > > So is it an incomplete implementation just sent for review or i am > missing something(s) ? The implementation is complete. The interface to the cache operates synchronously. If an item is found in the cache, it is returned. If no item is found, an error is returned. The caller can query the SA directly in this case. (If we wanted to be fancy, the results of that query could be copied into the cache, but the cache will update on its own.) I originally implemented an asynchronous API, but it complicated the implementation and was inefficient in the most common case, where an item is found in the cache. - Sean From caitlinb at broadcom.com Thu Jan 26 10:27:00 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 10:27:00 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C36F7@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: > On Thu, 2006-01-26 at 09:16 -0800, Caitlin Bestler wrote: >> openib-general-bounces at openib.org wrote: >>> Here is a comment on the specific CMA/IWARP patch: >>> >>> The iwarp enhancements in the this patch save the each device's >>> node_guid in the associated cma_device. 
The assumption was that the >>> iwarp device's node_guid would be the mac address for that device. >>> Then, in cma_acquire_iw_dev(), the rdma_dev_addr pulled from the >>> netdev device as a result of route lookup is used to find a cma_dev >>> who's node_guid matches the rdma_dev_addr pulled from the netdev. >>> >>> In ethernet terms, the netdev's dev_addr is used to find an >>> appropriate cma device with a matching node_guid. >>> >>> This is broken, however, for multi-ported devices (and for devices >>> who have multiple mac addrs per port), since there isn't a concept >>> of a port guid in IB (i assume, since the code doesn't have port >>> guids). I discussed this with tom, and we think the correct >>> solution is for the device to promote mac addresses as gids. Then >>> for each port, the iwarp device will advertise its mac >>> address(es) and populate the gid cache with these mac addresses. >>> >>> Then we can change cma_acquire_iw_dev() to find the appropriate gid >>> from the gid cache. In fact, cma_acquire_dev() might not need to >>> switch out to IB vs RNIC functions. It can probably be mostly done >>> with common code. >>> >>> Thoughts? >>> >>> I can provide a patch for this soon, but I'd rather get the current >>> CMA changes into the trunk, then post a delta patch from the >>> trunk... >>> >> >> By definition iWARP is cleanly layered over IP. Therefore an iWARP >> port is not a physical port but a logical one. >> >> Management of physical ports is something that must be done >> independently of RDMA software. >> >> For example, if two physical Ethernet ports are teamed this is NOT >> visible to the RDMA layer. >> >> This is a major example of the need to let each transport express >> itself naturally, and finding the common ground that is meaningful to >> applications, rather than forcing one to emulate the other. >> > > I'd like us to focus on phase I of iwarp. 
> > For phase I, the CMA tries to find an appropriate openib > device, given the dev_addr from the associated netdev that > was found during a routing lookup. For IB, the dev_addr is > matched against gids. For iwarp, the dev_addr is matched > against the mac addr of the openib dev. > Stop right there. Once you have the associated netdev you should have 0 or 1 associated iWARP RNICs. If you go any deeper you risk breaking existing solutions for IP Aliasing, Ethernet teaming, etc. From halr at voltaire.com Thu Jan 26 10:30:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jan 2006 13:30:13 -0500 Subject: [openib-general] Re: [patch] management/osm/autogen.sh: 'head -1' -> 'head -n 1' usage fix In-Reply-To: <20060125154902.GK10560@sashak.voltaire.com> References: <20060125154902.GK10560@sashak.voltaire.com> Message-ID: <1138300119.4338.62125.camel@hal.voltaire.com> On Wed, 2006-01-25 at 10:49, Sasha Khapyorsky wrote: > Hello Hal, > > There is small fix for 'head' usage. > > Sasha. > > > Coreutils-5.3 warns about obsolete 'head -1' usage. Changed to traditional > 'head -n 1'. Thanks. Applied. > Signed-off-by: Sasha Khapyorsky From mshefty at ichips.intel.com Thu Jan 26 10:37:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 10:37:27 -0800 Subject: [openib-general] [PATCH] [RFC] group devices by type In-Reply-To: References: Message-ID: <43D916E7.50506@ichips.intel.com> Here's an updated version of the patch that doesn't break the ABI. Note that for RNICs, node_type would be set to IB_NODE_IWARP | IB_NODE_CA. Comments? Signed-off-by: Sean Hefty --- Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 5098) +++ ib_verbs.h (working copy) @@ -56,7 +56,15 @@ } global; }; -enum ib_node_type { +/* + * 8-bit node type - for IB this maps to NodeInfo:NodeType. 
+ */ +enum { + /* bits 7:4 - transport type */ + IB_NODE_IB = (0<<4), + IB_NODE_IWARP = (1<<4), + + /* bits 3:0 - device type */ IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER From swise at opengridcomputing.com Thu Jan 26 10:45:35 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 12:45:35 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C36F7@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11C36F7@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1138301135.760.66.camel@stevo-desktop> > >> By definition iWARP is cleanly layered over IP. Therefore an iWARP > >> port is not a physical port but a logical one. > >> > >> Management of physical ports is something that must be done > >> independently of RDMA software. > >> > >> For example, if two physical Ethernet ports are teamed this is NOT > >> visible to the RDMA layer. > >> > >> This is a major example of the need to let each transport express > >> itself naturally, and finding the common ground that is meaningful to > >> applications, rather than forcing one to emulate the other. > >> > > > > I'd like us to focus on phase I of iwarp. > > > > For phase I, the CMA tries to find an appropriate openib > > device, given the dev_addr from the associated netdev that > > was found during a routing lookup. For IB, the dev_addr is > > matched against gids. For iwarp, the dev_addr is matched > > against the mac addr of the openib dev. > > > > Stop right there. > > Once you have the associated netdev you should have 0 or 1 > associated iWARP RNICs. > > If you go any deeper you risk breaking existing solutions > for IP Aliasing, Ethernet teaming, etc. > I agree. However, the question is how to find the associated openib device once you determine which netdev device you are using for the next hop. 
In the existing IB CMA code, this is done by a linear search through the ib devices and finding a device that has the gid associated with the IPoIB netdev device. I'm proposing we do exactly the same thing, except we compare mac addresses. And we map mac addresses into either gids or guids. This is a loosely coupled association between netdevs and open_ib devs. However, I think everyone agrees that a netdev maps to only one ib device. A tightly coupled design can be considered, but that requires more hits into the core netdev code. For instance, we could add a method to the netdev struct to return the openib device and thus let the netdev driver tell the CMA exactly which openib device it maps to. Steve. From mshefty at ichips.intel.com Thu Jan 26 10:55:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 10:55:31 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <1138301135.760.66.camel@stevo-desktop> References: <54AD0F12E08D1541B826BE97C98F99F11C36F7@NT-SJCA-0751.brcm.ad.broadcom.com> <1138301135.760.66.camel@stevo-desktop> Message-ID: <43D91B23.1070908@ichips.intel.com> Steve Wise wrote: > I agree. However, the question is how to find the associated openib > device once you determine which netdev device you are using for the next > hop. In the existing IB CMA code, this is done by a linear search > through the ib devices and finding a device that has the gid associated > with the IPoIB netdev device. I'm proposing we do exactly the same > thing, except we compare mac addresses. And we map mac addresses into > either gids or guids. To clarify slightly, in IB a netdev device maps to a specific port on an IB device. The linear search is required in order to handle device removal between identifying what the mapping is and acquiring the reference on the ib_device. Even if the ib_device could be returned directly, the search would still be needed to guarantee that the CMA currently has access to the device. 
- Sean From caitlinb at broadcom.com Thu Jan 26 10:59:46 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 10:59:46 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C3707@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: > > >>>> By definition iWARP is cleanly layered over IP. Therefore an iWARP >>>> port is not a physical port but a logical one. >>>> >>>> Management of physical ports is something that must be done >>>> independently of RDMA software. >>>> >>>> For example, if two physical Ethernet ports are teamed this is NOT >>>> visible to the RDMA layer. >>>> >>>> This is a major example of the need to let each transport express >>>> itself naturally, and finding the common ground that is meaningful >>>> to applications, rather than forcing one to emulate the other. >>>> >>> >>> I'd like us to focus on phase I of iwarp. >>> >>> For phase I, the CMA tries to find an appropriate openib device, >>> given the dev_addr from the associated netdev that was found during >>> a routing lookup. For IB, the dev_addr is matched against gids. >>> For iwarp, the dev_addr is matched against the mac addr of the >>> openib dev. >>> >> >> Stop right there. >> >> Once you have the associated netdev you should have 0 or 1 >> associated iWARP RNICs. >> >> If you go any deeper you risk breaking existing solutions for IP >> Aliasing, Ethernet teaming, etc. >> > > I agree. However, the question is how to find the associated > openib device once you determine which netdev device you are > using for the next hop. In the existing IB CMA code, this is > done by a linear search through the ib devices and finding a > device that has the gid associated with the IPoIB netdev > device. I'm proposing we do exactly the same thing, except > we compare mac addresses. And we map mac addresses into either gids > or guids. > > This is a loosely coupled association between netdevs and open_ib > devs. 
However, I think everyone agrees that a netdev maps to only one > ib device. > > A tightly coupled design can be considered, but that requires > more hits into the core netdev code. For instance, we could > add a method to the netdev struct to return the openib device > and thus let the netdev driver tell the CMA exactly which openib > device it maps to. > > Steve. A direct link from the net device to 0 or 1 rdma device would be better, but if doing it by search then the "mac address" for iWARP should be the IP address -- not the Ethernet address. From swise at opengridcomputing.com Thu Jan 26 11:04:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 13:04:28 -0600 Subject: [openib-general] possible cma bug Message-ID: <1138302268.760.69.camel@stevo-desktop> Sean, While debugging some iwarp connection setup problems, I _might_ have stumbled onto a cma bug. I'm running the kernel cmatose. The server side gets a connect request, but the init_node() returns an error because the qp create fails. The cmatose module then rejects the connect request on that connect request upcall. Concurrently (on the main work thread running run_server()), cmatose calls rdma_destroy_id() on the listening id. The destroy happens before the connect request upcall thread finishes (SMP :). Then as the other thread doing the connection request upcall unwinds the stack and finishes processing in iw_conn_req_handler(), the system Oopses in cma_release_remove() at line 1048 (with the iwarp cma patch). I think the oops is because the listen_id was already destroyed, and iw_conn_req_handler() didn't have a reference to it. So the cma_release_remove() code is touching freed memory. I _think_ the solution is to bump the listen_id refcnt at the top of cma_req_handler() and iw_conn_req_handler(), and do a cma_deref_id() on the listen_id at the end of the functions. I added this logic to the iwarp side and it appears to have fixed the problem. Steve.
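The pattern proposed in this message (take a reference on the listen id at the top of the request handler, drop it at the end) can be sketched in isolation as follows. This is a single-threaded toy under stated assumptions: `demo_id` and every `demo_*` function are hypothetical, a plain int stands in for what would have to be an atomic counter, and "released" stands in for freeing memory. It only shows why the id survives a destroy issued during the upcall; it does not address a destroy that completes before the handler takes its reference.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a listening id; not the real rdma_cm_id. */
struct demo_id {
	int  refcnt;    /* starts at 1, owned by the creator */
	bool released;  /* set when the last reference is dropped */
};

static void demo_id_init(struct demo_id *id)
{
	id->refcnt = 1;
	id->released = false;
}

static void demo_get(struct demo_id *id)
{
	id->refcnt++;   /* real code would use an atomic increment */
}

/* Drop one reference; the id is released only when the count hits zero. */
static void demo_put(struct demo_id *id)
{
	if (--id->refcnt == 0)
		id->released = true;   /* real code would free the id here */
}

/* The creator's destroy simply drops the initial reference. */
static void demo_destroy(struct demo_id *id)
{
	demo_put(id);
}

/* Example upcall that destroys the listen id while the handler runs. */
static void demo_upcall_that_destroys(struct demo_id *id)
{
	demo_destroy(id);
}

/*
 * Request handler following the proposal: pin the listen id for the
 * duration of the upcall. Returns true if the id was still alive when
 * the handler touched it after the upcall returned.
 */
static bool demo_req_handler(struct demo_id *listen_id,
			     void (*upcall)(struct demo_id *))
{
	bool alive_after_upcall;

	demo_get(listen_id);
	upcall(listen_id);                    /* may destroy the id */
	alive_after_upcall = !listen_id->released;
	demo_put(listen_id);                  /* last put releases the id */
	return alive_after_upcall;
}
```

Calling `demo_req_handler()` with `demo_upcall_that_destroys` leaves the id alive until the handler's own put, at which point it is released.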
From mshefty at ichips.intel.com Thu Jan 26 11:15:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 11:15:41 -0800 Subject: [openib-general] possible cma bug In-Reply-To: <1138302268.760.69.camel@stevo-desktop> References: <1138302268.760.69.camel@stevo-desktop> Message-ID: <43D91FDD.7040502@ichips.intel.com> Steve Wise wrote: > I'm running the kernel cmatose. The server side gets a connect request, > but the init_node() returns an error because the qp create fails. The > cmatose module then rejects the connect request on that connect request > upcall. Concurrently (on the main work thread running run_server()), > cmatose calls rdma_destroy_id() on the listening id. The destroy > happens before the connect request upcall thread finishes (SMP :). rdma_destroy_id() must block while there is a callback outstanding against the id being destroyed. This is true when pushed down to the lower level IB/iWarp code. The reference counting needs to be handled by the module invoking the callback, not the module being called. Imagine if the callback is about to be invoked when the user calls rdma_destroy_id(). rdma_destroy_id() completes, the user releases their memory, then the callback hits their code. The user's context is already invalid, so a reference doesn't help. > I _think_ the solution is to bump the listen_id refcnt at the top of > cma_req_handler() and iw_conn_req_handler(), and do a cma_deref_id() on > the listen_id at the end of the functions. See above. If the listen_id can be destroyed while the user is in cma_req_handler(), then it can be destroyed before the reference can be taken. Does the iWarp CM take a reference on the corresponding listen_id before invoking a connect request callback?
- Sean From rolandd at cisco.com Thu Jan 26 11:23:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:32 -0800 Subject: [openib-general] [PATCH 1/5] [RFC] core kernel changes for resize CQ In-Reply-To: <20061261123.DfEriXNd4dgM9rIS@cisco.com> Message-ID: <20061261123.IcTK8Ewv0LFOTjuP@cisco.com> Core kernel changes to add support for the resize CQ operation for userspace CQs. --- --- infiniband/include/rdma/ib_user_verbs.h (revision 5179) +++ infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -265,6 +265,17 @@ struct ib_uverbs_create_cq_resp { __u32 cqe; }; +struct ib_uverbs_resize_cq { + __u64 response; + __u32 cq_handle; + __u32 cqe; + __u64 driver_data[0]; +}; + +struct ib_uverbs_resize_cq_resp { + __u32 cqe; +}; + struct ib_uverbs_poll_cq { __u64 response; __u32 cq_handle; --- infiniband/include/rdma/ib_verbs.h (revision 5179) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -880,7 +880,8 @@ struct ib_device { struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); - int (*resize_cq)(struct ib_cq *cq, int cqe); + int (*resize_cq)(struct ib_cq *cq, int cqe, + struct ib_udata *udata); int (*poll_cq)(struct ib_cq *cq, int num_entries, struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); --- infiniband/core/uverbs_main.c (revision 5179) +++ infiniband/core/uverbs_main.c (working copy) @@ -91,6 +91,7 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr, [IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL] = ib_uverbs_create_comp_channel, [IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq, + [IB_USER_VERBS_CMD_RESIZE_CQ] = ib_uverbs_resize_cq, [IB_USER_VERBS_CMD_POLL_CQ] = ib_uverbs_poll_cq, [IB_USER_VERBS_CMD_REQ_NOTIFY_CQ] = ib_uverbs_req_notify_cq, [IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq, --- infiniband/core/verbs.c (revision 5179) +++ infiniband/core/verbs.c (working copy) @@ -325,7 +325,7 @@ int ib_resize_cq(struct ib_cq *cq, int 
cqe) { return cq->device->resize_cq ? - cq->device->resize_cq(cq, cqe) : -ENOSYS; + cq->device->resize_cq(cq, cqe, NULL) : -ENOSYS; } EXPORT_SYMBOL(ib_resize_cq); --- infiniband/core/uverbs.h (revision 5179) +++ infiniband/core/uverbs.h (working copy) @@ -185,6 +185,7 @@ IB_UVERBS_DECLARE_CMD(reg_mr); IB_UVERBS_DECLARE_CMD(dereg_mr); IB_UVERBS_DECLARE_CMD(create_comp_channel); IB_UVERBS_DECLARE_CMD(create_cq); +IB_UVERBS_DECLARE_CMD(resize_cq); IB_UVERBS_DECLARE_CMD(poll_cq); IB_UVERBS_DECLARE_CMD(req_notify_cq); IB_UVERBS_DECLARE_CMD(destroy_cq); --- infiniband/core/uverbs_cmd.c (revision 5179) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -675,6 +675,46 @@ err: return ret; } +ssize_t ib_uverbs_resize_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_resize_cq cmd; + struct ib_uverbs_resize_cq_resp resp; + struct ib_udata udata; + struct ib_cq *cq; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + down(&ib_uverbs_idr_mutex); + + cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); + if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) + goto out; + + ret = cq->device->resize_cq(cq, cmd.cqe, &udata); + if (ret) + goto out; + + memset(&resp, 0, sizeof resp); + resp.cqe = cq->cqe; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? 
ret : in_len; +} + ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) From rolandd at cisco.com Thu Jan 26 11:23:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:32 -0800 Subject: [openib-general] [PATCH 0/5] [RFC] Resize CQ support Message-ID: <20061261123.DfEriXNd4dgM9rIS@cisco.com> Here is a series of patches that adds support for resize CQ operations to both the core uverbs/libibverbs code as well as implementing it in the device-specific mthca/libmthca code. This is a provider ABI breaking change to libibverbs, because it changes the layout of struct ibv_context_ops. Source and binary compatibility with applications that link to libibverbs is preserved, but provider libraries will have to be recompiled. libibverbs remains source compatible with unchanged provider libraries. I believe the core changes are ready to commit, pending review. The mthca support is not quite done, since I only implemented support for userspace CQs. Implementing mthca support for resizing kernel CQs should not take more than a day or two (with a lot of the time going into coding a test module), so I am planning on holding off on committing the mthca/libmthca support until that is ready. Also, review of the mthca code, especially from Mellanox and other people familiar with the hardware, would be great. I am also including a simple test program that I used to try out the support. I would very much appreciate test reports for these patches -- if someone ambitious wants to extend my program to try to find other corner cases or races that I missed, that would be even better.
Thanks, Roland From rolandd at cisco.com Thu Jan 26 11:23:33 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:33 -0800 Subject: [openib-general] [PATCH 3/5] [RFC] libibverbs changes for resize CQ In-Reply-To: <20061261123.xib5y5vqz4iiT1C5@cisco.com> Message-ID: <20061261123.1OF7s4QkekUpNAH3@cisco.com> libibverbs changes to handle resizing CQs. Essentially just adding API and support for passing the call through to provider plug-ins. --- --- libibverbs/include/infiniband/driver.h (revision 5192) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -95,6 +95,8 @@ extern int ibv_cmd_create_cq(struct ibv_ struct ibv_create_cq_resp *resp, size_t resp_size); extern int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); extern int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); +extern int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size); extern int ibv_cmd_destroy_cq(struct ibv_cq *cq); extern int ibv_cmd_create_srq(struct ibv_pd *pd, --- libibverbs/include/infiniband/verbs.h (revision 5193) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -549,6 +549,7 @@ struct ibv_context_ops { int (*poll_cq)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc); int (*req_notify_cq)(struct ibv_cq *cq, int solicited_only); void (*cq_event)(struct ibv_cq *cq); + int (*resize_cq)(struct ibv_cq *cq, int cqe); int (*destroy_cq)(struct ibv_cq *cq); struct ibv_srq * (*create_srq)(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr); @@ -717,6 +718,15 @@ extern struct ibv_cq *ibv_create_cq(stru int comp_vector); /** + * ibv_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. 
+ */ +extern int ibv_resize_cq(struct ibv_cq *cq, int cqe); + +/** * ibv_destroy_cq - Destroy a completion queue */ extern int ibv_destroy_cq(struct ibv_cq *cq); --- libibverbs/include/infiniband/kern-abi.h (revision 5192) +++ libibverbs/include/infiniband/kern-abi.h (working copy) @@ -343,6 +343,20 @@ struct ibv_req_notify_cq { __u32 solicited; }; +struct ibv_resize_cq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 cq_handle; + __u32 cqe; + __u64 driver_data[0]; +}; + +struct ibv_resize_cq_resp { + __u32 cqe; +}; + struct ibv_destroy_cq { __u32 command; __u16 in_words; --- libibverbs/ChangeLog (revision 5192) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,13 @@ +2006-01-26 Roland Dreier + + * include/infiniband/driver.h, src/cmd.c (ibv_cmd_resize_cq): Add + driver interface for calling resize CQ kernel command. + + * include/infiniband/kern-abi.h: Add resize CQ kernel ABI. + + * include/infiniband/verbs.h, src/verbs.c (ibv_resize_cq): Add + resize CQ library API. 
+ 2006-01-25 Roland Dreier * examples/pingpong.c, examples/pingpong.h, --- libibverbs/src/libibverbs.map (revision 5192) +++ libibverbs/src/libibverbs.map (working copy) @@ -19,6 +19,7 @@ IBVERBS_1.0 { ibv_create_comp_channel; ibv_destroy_comp_channel; ibv_create_cq; + ibv_resize_cq; ibv_destroy_cq; ibv_get_cq_event; ibv_ack_cq_events; @@ -44,6 +45,7 @@ IBVERBS_1.0 { ibv_cmd_create_cq; ibv_cmd_poll_cq; ibv_cmd_req_notify_cq; + ibv_cmd_resize_cq; ibv_cmd_destroy_cq; ibv_cmd_create_srq; ibv_cmd_modify_srq; --- libibverbs/src/verbs.c (revision 5192) +++ libibverbs/src/verbs.c (working copy) @@ -212,6 +212,14 @@ struct ibv_cq *ibv_create_cq(struct ibv_ return cq; } +int ibv_resize_cq(struct ibv_cq *cq, int cqe) +{ + if (!cq->context->ops.resize_cq) + return ENOSYS; + + return cq->context->ops.resize_cq(cq, cqe); +} + int ibv_destroy_cq(struct ibv_cq *cq) { return cq->context->ops.destroy_cq(cq); --- libibverbs/src/cmd.c (revision 5192) +++ libibverbs/src/cmd.c (working copy) @@ -364,6 +364,23 @@ int ibv_cmd_req_notify_cq(struct ibv_cq return 0; } +int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size) +{ + struct ibv_resize_cq_resp resp; + + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + cmd->cq_handle = cq->handle; + cmd->cqe = cqe; + + if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + cq->cqe = resp.cqe; + + return 0; +} + static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) { struct ibv_destroy_cq_v1 cmd; --- libibverbs/README (revision 5192) +++ libibverbs/README (working copy) @@ -98,6 +98,5 @@ necessary permissions to release your wo TODO ==== - * Completion queue (CQ) resizing need to be implemented. * Memory windows (MWs) need to be implemented. * Query QP, query SRQ and other query verbs need to be implemented. 
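A note on the "actual CQ size" wording in the ibv_resize_cq() comment above: with the mthca patches elsewhere in this series, the size that ends up in cq->cqe is generally larger than the requested minimum. The following is a minimal sketch of that arithmetic, not code from the patches. It assumes that libmthca rounds the request with align_cq_size(), that the kernel applies roundup_pow_of_two(entries + 1) and stores entries - 1, and that the uverbs layer hands the kernel's value back unchanged. The names roundup_pow_of_two_sketch(), expected_cqe_after_resize() and check_thread_examples() are made up for illustration and are not part of libibverbs or libmthca.

```c
#include <assert.h>

/* Same loop as align_cq_size() in the libmthca patch:
 * returns the smallest power of two strictly greater than cqe. */
static int align_cq_size(int cqe)
{
	int nent;

	for (nent = 1; nent <= cqe; nent <<= 1)
		; /* nothing */

	return nent;
}

/* Userspace stand-in (assumption) for the kernel's roundup_pow_of_two():
 * returns the smallest power of two greater than or equal to n. */
static int roundup_pow_of_two_sketch(int n)
{
	int p = 1;

	while (p < n)
		p <<= 1;

	return p;
}

/* Hypothetical helper: the value expected in cq->cqe after asking
 * for 'requested' entries on an mthca device. */
static int expected_cqe_after_resize(int requested)
{
	int nent = align_cq_size(requested);   /* libmthca: entries in the new buffer */
	int entries = nent - 1;                /* value passed down to the kernel */

	entries = roundup_pow_of_two_sketch(entries + 1);  /* kernel-side rounding */

	return entries - 1;                    /* kernel stores cq->cqe = entries - 1 */
}

/* Checks the request/result pairs shown in the test program output
 * posted in this thread. */
static void check_thread_examples(void)
{
	assert(expected_cqe_after_resize(500) == 511);
	assert(expected_cqe_after_resize(600) == 1023);
	assert(expected_cqe_after_resize(100) == 127);
	assert(expected_cqe_after_resize(1100) == 2047);
}
```

Calling check_thread_examples() from a small main() is enough to exercise the sketch. It only models the size arithmetic and does not touch any hardware or verbs call.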
From rolandd at cisco.com Thu Jan 26 11:23:33 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:33 -0800 Subject: [openib-general] [PATCH 5/5] [RFC] stupid test program for resize CQ In-Reply-To: <20061261123.ccmJRI6mfjlIXwNR@cisco.com> Message-ID: <20061261123.ZOXEzVVaCFx5bI6K@cisco.com> Here's a simple test program I wrote that just creates a loopback QP on the first port of the first HCA it finds, and tries to keep 100 RDMA writes queued all the time. In another thread, it resizes the CQ at random intervals. The first thread makes sure that all the completions expected do arrive in the expected order. To build it, just compile and link with "-libverbs". It doesn't take any command line parameters to run. If it's working correctly, the output should just look like an endless stream of Resizing to 500... resized to 511 500000 writes done Resizing to 600... resized to 1023 Resizing to 500... resized to 511 Resizing to 800... resized to 1023 1000000 writes done 1500000 writes done Resizing to 1100... resized to 2047 Resizing to 100... resized to 127 Resizing to 1100... resized to 2047 2000000 writes done If either the "Resizing" or "xxx writes done" lines stop appearing, then something went wrong. --- /* * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. 
* * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id: rc_pingpong.c 5046 2006-01-17 17:20:37Z roland $ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include static int depth = 100; static int page_size; static uint16_t get_local_lid(struct ibv_context *context, int port) { struct ibv_port_attr attr; if (ibv_query_port(context, port, &attr)) return 0; return attr.lid; } static void *resize_task(void *cq_ptr) { struct ibv_cq *cq = cq_ptr; int new_size; while (1) { usleep(drand48() * 1000000); new_size = (lrand48() % 11 + 1) * depth; printf("Resizing to %4d... 
", new_size); fflush(stdout); if (ibv_resize_cq(cq, new_size)) { fprintf(stderr, "\nResize failed\n"); exit(1); } printf("resized to %4d\n", cq->cqe); fflush(stdout); } } static int post_write(uint64_t wrid, void *buf, struct ibv_qp *qp, struct ibv_mr *mr) { struct ibv_sge list = { .addr = (uintptr_t) buf, .length = 1, .lkey = mr->lkey }; struct ibv_send_wr wr = { .wr_id = wrid, .sg_list = &list, .num_sge = 1, .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED, .wr.rdma = { .remote_addr = (uintptr_t) buf + 8, .rkey = mr->rkey } }; struct ibv_send_wr *bad_wr; return ibv_post_send(qp, &wr, &bad_wr); } int main(int argc, char *argv[]) { struct ibv_device **dev_list; struct ibv_device *ib_dev; struct ibv_context *context; struct ibv_pd *pd; struct ibv_mr *mr; struct ibv_cq *cq; struct ibv_qp *qp; struct ibv_wc wc; void *buf; uint16_t lid; uint64_t wrid, exp_wrid; int i; pthread_t resize_thread; srand48(getpid() * time(NULL)); page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } buf = memalign(page_size, page_size); if (!buf) { fprintf(stderr, "Couldn't allocate work buf.\n"); return 1; } context = ibv_open_device(ib_dev); if (!context) { fprintf(stderr, "Couldn't get context for %s\n", ibv_get_device_name(ib_dev)); return 1; } lid = get_local_lid(context, 1); pd = ibv_alloc_pd(context); if (!pd) { fprintf(stderr, "Couldn't allocate PD\n"); return 1; } mr = ibv_reg_mr(pd, buf, page_size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); if (!mr) { fprintf(stderr, "Couldn't allocate MR\n"); return 1; } cq = ibv_create_cq(context, depth, NULL, NULL, 0); if (!cq) { fprintf(stderr, "Couldn't create CQ\n"); return 1; } { struct ibv_qp_init_attr attr = { .send_cq = cq, .recv_cq = cq, .cap = { .max_send_wr = depth, .max_recv_wr = 0, .max_send_sge = 1, .max_recv_sge = 1 }, .qp_type = 
IBV_QPT_RC }; qp = ibv_create_qp(pd, &attr); if (!qp) { fprintf(stderr, "Couldn't create QP\n"); return 1; } } { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = 1; attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { fprintf(stderr, "Failed to modify QP to INIT\n"); return 1; } attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_1024; attr.dest_qp_num = qp->qp_num; attr.rq_psn = 1; attr.max_dest_rd_atomic = 4; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = lid; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.ah_attr.port_num = 1; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER)) { fprintf(stderr, "Failed to modify QP to RTR\n"); return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = 1; attr.max_rd_atomic = 4; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { fprintf(stderr, "Failed to modify QP to RTS\n"); return 1; } } wrid = exp_wrid = 0; if (pthread_create(&resize_thread, NULL, resize_task, cq)) { fprintf(stderr, "Couldn't start resize_task\n"); return 1; } for (i = 0; i < depth; ++i) { if (post_write(wrid, buf, qp, mr)) { fprintf(stderr, "Couldn't post work request %lld\n", (long long) wrid); return 1; } ++wrid; } while (1) { i = ibv_poll_cq(cq, 1, &wc); if (i < 0) { fprintf(stderr, "poll CQ failed %d\n", i); return 1; } if (i) { if (wc.status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %lld\n", wc.status, (long long) wc.wr_id); return 1; } if (wc.wr_id != exp_wrid) { fprintf(stderr, "wr_id mismatch %lld != %lld\n", (long long) wc.wr_id, (long long) exp_wrid); return 1; } ++exp_wrid; if (!(exp_wrid % 500000)) 
printf("%12lld writes done\n", (long long) exp_wrid); if (post_write(wrid, buf, qp, mr)) { fprintf(stderr, "Couldn't post work request %lld\n", (long long) wrid); return 1; } ++wrid; } } return 0; } From rolandd at cisco.com Thu Jan 26 11:23:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:32 -0800 Subject: [openib-general] [PATCH 2/5] [RFC] mthca kernel changes for resize CQ In-Reply-To: <20061261123.IcTK8Ewv0LFOTjuP@cisco.com> Message-ID: <20061261123.xib5y5vqz4iiT1C5@cisco.com> mthca kernel changes to handle resizing userspace CQs. --- --- infiniband/hw/mthca/mthca_user.h (revision 5179) +++ infiniband/hw/mthca/mthca_user.h (working copy) @@ -75,6 +75,11 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_resize_cq { + __u32 lkey; + __u32 reserved; +}; + struct mthca_create_srq { __u32 lkey; __u32 db_index; --- infiniband/hw/mthca/mthca_provider.c (revision 5179) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -707,6 +707,35 @@ err_unmap_set: return ERR_PTR(err); } +static int mthca_resize_cq(struct ib_cq *cq, int entries, struct ib_udata *udata) +{ + struct mthca_resize_cq ucmd; + u8 status; + int ret; + + if (!udata) + return -ENOSYS; + + if (entries < 1 || entries > to_mdev(cq->device)->limits.max_cqes) + return -EINVAL; + + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + + entries = roundup_pow_of_two(entries + 1); + + ret = mthca_RESIZE_CQ(to_mdev(cq->device), to_mcq(cq)->cqn, + ucmd.lkey, long_log2(entries), &status); + if (ret) + return ret; + if (status) + return -EINVAL; + + cq->cqe = entries - 1; + + return 0; +} + static int mthca_destroy_cq(struct ib_cq *cq) { if (cq->uobject) { @@ -1113,6 +1142,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_DEREG_MR) | (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_RESIZE_CQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | (1ull << 
IB_USER_VERBS_CMD_CREATE_QP) | (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | @@ -1154,6 +1184,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.resize_cq = mthca_resize_cq; dev->ib_dev.destroy_cq = mthca_destroy_cq; dev->ib_dev.poll_cq = mthca_poll_cq; dev->ib_dev.get_dma_mr = mthca_get_dma_mr; --- infiniband/hw/mthca/mthca_cmd.c (revision 5179) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1517,6 +1517,37 @@ int mthca_HW2SW_CQ(struct mthca_dev *dev CMD_TIME_CLASS_A, status); } +int mthca_RESIZE_CQ(struct mthca_dev *dev, int cq_num, u32 lkey, u8 log_size, + u8 *status) +{ + struct mthca_mailbox *mailbox; + __be32 *inbox; + int err; + +#define RESIZE_CQ_IN_SIZE 0x40 +#define RESIZE_CQ_LOG_SIZE_OFFSET 0x0c +#define RESIZE_CQ_LKEY_OFFSET 0x1c + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, RESIZE_CQ_IN_SIZE); + /* + * Leave start address fields zeroed out -- mthca assumes that + * MRs for CQs always start at virtual address 0. 
+ */ + MTHCA_PUT(inbox, log_size, RESIZE_CQ_LOG_SIZE_OFFSET); + MTHCA_PUT(inbox, lkey, RESIZE_CQ_LKEY_OFFSET); + + err = mthca_cmd(dev, mailbox->dma, cq_num, 1, CMD_RESIZE_CQ, + CMD_TIME_CLASS_B, status); + + mthca_free_mailbox(dev, mailbox); + return err; +} + int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status) { --- infiniband/hw/mthca/mthca_cmd.h (revision 5179) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -298,6 +298,8 @@ int mthca_SW2HW_CQ(struct mthca_dev *dev int cq_num, u8 *status); int mthca_HW2SW_CQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int cq_num, u8 *status); +int mthca_RESIZE_CQ(struct mthca_dev *dev, int cq_num, u32 lkey, u8 log_size, + u8 *status); int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, From rolandd at cisco.com Thu Jan 26 11:23:33 2006 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 11:23:33 -0800 Subject: [openib-general] [PATCH 4/5] [RFC] libmthca changes for resize CQ In-Reply-To: <20061261123.1OF7s4QkekUpNAH3@cisco.com> Message-ID: <20061261123.ccmJRI6mfjlIXwNR@cisco.com> libmthca implementation of resizing CQs. This is the real guts of it -- there is some slightly tricky code to handle the transition from the old CQ buffer to the new one. 
--- --- libmthca/src/verbs.c (revision 5192) +++ libmthca/src/verbs.c (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include #include "mthca.h" @@ -154,6 +155,16 @@ int mthca_dereg_mr(struct ibv_mr *mr) return 0; } +static int align_cq_size(int cqe) +{ + int nent; + + for (nent = 1; nent <= cqe; nent <<= 1) + ; /* nothing */ + + return nent; +} + struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector) @@ -161,27 +172,24 @@ struct ibv_cq *mthca_create_cq(struct ib struct mthca_create_cq cmd; struct mthca_create_cq_resp resp; struct mthca_cq *cq; - int nent; int ret; cq = malloc(sizeof *cq); if (!cq) return NULL; + cq->cons_index = 0; + if (pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE)) goto err; - for (nent = 1; nent <= cqe; nent <<= 1) - ; /* nothing */ - - if (posix_memalign(&cq->buf, to_mdev(context->device)->page_size, - align(nent * MTHCA_CQ_ENTRY_SIZE, to_mdev(context->device)->page_size))) + cqe = align_cq_size(cqe); + cq->buf = mthca_alloc_cq_buf(to_mdev(context->device), cqe); + if (!cq->buf) goto err; - mthca_init_cq_buf(cq, nent); - cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf, - nent * MTHCA_CQ_ENTRY_SIZE, + cqe * MTHCA_CQ_ENTRY_SIZE, 0, IBV_ACCESS_LOCAL_WRITE); if (!cq->mr) goto err_buf; @@ -210,7 +218,7 @@ struct ibv_cq *mthca_create_cq(struct ib cmd.lkey = cq->mr->lkey; cmd.pdn = to_mpd(to_mctx(context)->pd)->pdn; - ret = ibv_cmd_create_cq(context, nent - 1, channel, comp_vector, + ret = ibv_cmd_create_cq(context, cqe - 1, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) @@ -247,6 +255,63 @@ err: return NULL; } +int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) +{ + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_resize_cq cmd; + struct ibv_mr *mr; + void *buf; + int old_cqe; + int ret; + + pthread_spin_lock(&cq->lock); + + cqe = align_cq_size(cqe); + if (cqe == ibcq->cqe + 1) { + ret = 0; + 
goto out; + } + + buf = mthca_alloc_cq_buf(to_mdev(ibcq->context->device), cqe); + if (!buf) { + ret = ENOMEM; + goto out; + } + + mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf, + cqe * MTHCA_CQ_ENTRY_SIZE, + 0, IBV_ACCESS_LOCAL_WRITE); + if (!mr) { + free(buf); + ret = ENOMEM; + goto out; + } + + mr->context = ibcq->context; + + old_cqe = ibcq->cqe; + + cmd.lkey = mr->lkey; + ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); + if (ret) { + mthca_dereg_mr(mr); + free(buf); + goto out; + } + + mthca_cq_resize_copy_cqes(cq, buf, old_cqe); + + mthca_dereg_mr(cq->mr); + free(cq->buf); + + cq->buf = buf; + cq->mr = mr; + +out: + pthread_spin_unlock(&cq->lock); + return ret; +} + int mthca_destroy_cq(struct ibv_cq *cq) { int ret; --- libmthca/src/mthca.h (revision 5192) +++ libmthca/src/mthca.h (working copy) @@ -281,6 +281,7 @@ extern int mthca_dereg_mr(struct ibv_mr struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); +extern int mthca_resize_cq(struct ibv_cq *cq, int cqe); extern int mthca_destroy_cq(struct ibv_cq *cq); extern int mthca_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); extern int mthca_tavor_arm_cq(struct ibv_cq *cq, int solicited); @@ -288,7 +289,8 @@ extern int mthca_arbel_arm_cq(struct ibv extern void mthca_arbel_cq_event(struct ibv_cq *cq); extern void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq); -extern void mthca_init_cq_buf(struct mthca_cq *cq, int nent); +extern void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int new_cqe); +extern void *mthca_alloc_cq_buf(struct mthca_device *dev, int cqe); extern struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr); --- libmthca/src/cq.c (revision 5192) +++ libmthca/src/cq.c (working copy) @@ -38,8 +38,9 @@ #endif /* HAVE_CONFIG_H */ #include -#include +#include #include +#include #include @@ -578,12 +579,38 @@ void mthca_cq_clean(struct 
mthca_cq *cq, pthread_spin_unlock(&cq->lock); } -void mthca_init_cq_buf(struct mthca_cq *cq, int nent) +void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int old_cqe) +{ + int i; + + /* + * In Tavor mode, the hardware keeps the consumer and producer + * indices mod the CQ size. Since we might be making the CQ + * bigger, we need to deal with the case where the producer + * index wrapped around before the CQ was resized. + */ + if (!mthca_is_memfree(cq->ibv_cq.context) && old_cqe < cq->ibv_cq.cqe) { + cq->cons_index &= old_cqe; + if (cqe_sw(cq, old_cqe)) + cq->cons_index -= old_cqe + 1; + } + + for (i = cq->cons_index; cqe_sw(cq, i & old_cqe); ++i) + memcpy(buf + (i & cq->ibv_cq.cqe) * MTHCA_CQ_ENTRY_SIZE, + get_cqe(cq, i & old_cqe), MTHCA_CQ_ENTRY_SIZE); +} + +void *mthca_alloc_cq_buf(struct mthca_device *dev, int nent) { + void *buf; int i; + if (posix_memalign(&buf, dev->page_size, + align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size))) + return NULL; + for (i = 0; i < nent; ++i) - set_cqe_hw(get_cqe(cq, i)); + ((struct mthca_cqe *) buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; - cq->cons_index = 0; + return buf; } --- libmthca/src/mthca-abi.h (revision 5192) +++ libmthca/src/mthca-abi.h (working copy) @@ -65,6 +65,12 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_resize_cq { + struct ibv_resize_cq ibv_cmd; + __u32 lkey; + __u32 reserved; +}; + struct mthca_create_srq { struct ibv_create_srq ibv_cmd; __u32 lkey; --- libmthca/src/mthca.c (revision 5192) +++ libmthca/src/mthca.c (working copy) @@ -105,6 +105,7 @@ static struct ibv_context_ops mthca_ctx_ .dereg_mr = mthca_dereg_mr, .create_cq = mthca_create_cq, .poll_cq = mthca_poll_cq, + .resize_cq = mthca_resize_cq, .destroy_cq = mthca_destroy_cq, .create_srq = mthca_create_srq, .modify_srq = mthca_modify_srq, --- libmthca/ChangeLog (revision 5192) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,10 @@ +2006-01-26 Roland Dreier + + * src/mthca.h, src/verbs.c, src/cq.c, src/mthca.c: Add + 
implementation of resize CQ operation.
+
+	* src/mthca-abi.h: Add mthca-specific resize CQ ABI.
+
 2006-01-22 Roland Dreier

	* Release version 1.0-rc5.

From mshefty at ichips.intel.com Thu Jan 26 11:24:46 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jan 2006 11:24:46 -0800
Subject: [openib-general] Re: [PATCH] CMA and iWARP
In-Reply-To: <1136578777.14108.6.camel@trinity.austin.ammasso.com>
References: <1136578777.14108.6.camel@trinity.austin.ammasso.com>
Message-ID: <43D921FE.2020809@ichips.intel.com>

Tom Tucker wrote:
> +/* Handles an inbound connect request. The function creates a new
> + * iw_cm_id to represent the new connection and inherits the client
> + * callback function and other attributes from the listening parent.
> + *
> + * The work item contains a pointer to the listen_cm_id and the event. The
> + * listen_cm_id contains the client cm_handler, context and device. These are
> + * copied when the device is cloned. The event contains the new four tuple.

Does the code take a reference on the listen_cm_id before scheduling the work item?

> + */
> +static int cm_conn_req_handler(struct iwcm_work* work)
> +{
> +	struct iw_cm_id* cm_id;
> +	struct iwcm_id_private* cm_id_priv;
> +	int rc;
> +
> +	/* If the status was not successful, ignore request */
> +	if (work->event.status) {
> +		printk(KERN_ERR "%s:%d Bad status=%d for connection request ... "
> +		       "should be filtered by provider\n",
> +		       __FUNCTION__, __LINE__,
> +		       work->event.status);
> +		return work->event.status;
> +	}
> +	cm_id = iw_create_cm_id(work->cm_id->id.device, work->cm_id->id.cm_handler,
> +				work->cm_id->id.context);
> +	if (IS_ERR(cm_id))
> +		return PTR_ERR(cm_id);
> +
> +	cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
> +	cm_id_priv->id.local_addr = work->event.local_addr;
> +	cm_id_priv->id.remote_addr = work->event.remote_addr;
> +	cm_id_priv->id.provider_id = work->event.provider_id;
> +	cm_id_priv->id.state = IW_CM_STATE_CONN_RECV;
> +
> +	/* Call the client CM handler */
> +	rc = cm_id->cm_handler(cm_id, &work->event);
> +	if (rc) {
> +		cm_id->state = IW_CM_STATE_IDLE;
> +		iw_destroy_cm_id(cm_id);
> +	}
> +	kfree(work);
> +	return 0;
> +}
> +
> +/*
> + * Handles the transition to established state on the passive side.
> + */
> +static int cm_conn_est_handler(struct iwcm_work* work)
> +{

{snip}

> +	/* Call the client CM handler */
> +	ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event);

A reference needs to be taken on the cm_id_priv before invoking the callback to block destruction. (I didn't see that a reference was released...)

> +static int cm_conn_rep_handler(struct iwcm_work* work)
> +{
> +	struct iwcm_id_private* cm_id_priv;
> +	unsigned long flags;
> +	int ret = 0;

{snip}

> +
> +	/* Call the client CM handler */
> +	ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event);
> +	if (ret) {
> +		cm_id_priv->id.state = IW_CM_STATE_IDLE;
> +		iw_destroy_cm_id(&cm_id_priv->id);
> +	}

Same here - a reference is needed to block destruction before invoking the callback.
> +static int cm_disconnect_handler(struct iwcm_work* work)
> +{
> +	struct iwcm_id_private* cm_id_priv;
> +	int ret = 0;
> +
> +	cm_id_priv = work->cm_id;
> +
> +	cm_id_priv->id.state = IW_CM_STATE_IDLE;
> +
> +	/* Call the client CM handler */
> +	ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event);
> +	if (ret)
> +		iw_destroy_cm_id(&cm_id_priv->id);

And here...

> +static void cm_event_handler(struct iw_cm_id* cm_id,
> +			     struct iw_cm_event* event)
> +{
> +	struct iwcm_work *work;
> +	struct iwcm_id_private* cm_id_priv;
> +
> +	work = kmalloc(sizeof *work, GFP_ATOMIC);
> +	if (!work)
> +		return;
> +
> +	cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
> +	INIT_WORK(&work->work, cm_work_handler, work);
> +	work->cm_id = cm_id_priv;

Reference the cm_id before queuing the work item. It needs to be released after processing any callbacks.

- Sean

From hozer at hozed.org Thu Jan 26 11:37:36 2006
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 26 Jan 2006 13:37:36 -0600
Subject: [openib-general] [PATCH] OpenSM: include OpenIB svn version when OpenIB build
In-Reply-To: <1136310211.4331.47477.camel@hal.voltaire.com>
References: <1136310211.4331.47477.camel@hal.voltaire.com>
Message-ID: <20060126193736.GG17445@narn.hozed.org>

Is there a good reason that this patch hasn't been applied yet?

If you want me to provide useful debugging reports, I need to be able
to tell from the log which SVN version opensm was built from.
On Tue, Jan 03, 2006 at 12:43:33PM -0500, Hal Rosenstock wrote: > OpenSM: include OpenIB svn version when OpenIB build > > Signed-off-by: Hal Rosenstock > > Index: osm_opensm.c > =================================================================== > --- osm_opensm.c (revision 4716) > +++ osm_opensm.c (working copy) > @@ -59,6 +59,9 @@ > #include > #include > #include > +#ifdef OSM_VENDOR_INTF_OPENIB > +#include > +#endif > #include > #include > #include > @@ -206,12 +209,33 @@ osm_opensm_init( > if( status != IB_SUCCESS ) > return ( status ); > > +#ifndef OSM_VENDOR_INTF_OPENIB > /* If there is a log level defined - add the OSM_VERSION to it. */ > osm_log( &p_osm->log, > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > OSM_VERSION ); > /* Write the OSM_VERSION to the SYS_LOG */ > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ > +#else > + if (strlen(OSM_SVN_REVISION)) > + { > + /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */ > + osm_log( &p_osm->log, > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n", > + OSM_VERSION, OSM_SVN_REVISION ); > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > + } > + else > + { > + /* If there is a log level defined - add the OSM_VERSION to it. 
*/ > + osm_log( &p_osm->log, > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > + OSM_VERSION ); > + /* Write the OSM_VERSION to the SYS_LOG */ > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ > + } > +#endif > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */ > > Index: main.c > =================================================================== > --- main.c (revision 4716) > +++ main.c (working copy) > @@ -57,6 +57,9 @@ > #include > #include > #include > +#ifdef OSM_VENDOR_INTF_OPENIB > +#include > +#endif > #include > #include > #include > @@ -522,6 +525,10 @@ main( > > printf("-------------------------------------------------\n"); > printf("%s\n", OSM_VERSION); > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > + if (strlen(OSM_SVN_REVISION)) > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > +#endif > > osm_subn_set_default_opt(&opt); > osm_subn_parse_conf_file(&opt); > Index: Makefile.am > =================================================================== > --- Makefile.am (revision 4716) > +++ Makefile.am (working copy) > @@ -9,6 +9,22 @@ else > DBGFLAGS = -g -O2 > endif > > +if OSMV_OPENIB > +$(srcdir)/../include/opensm/osm_svn_revision.h: > + if test -f $(srcdir)/../.svn/entries; then \ > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > + else \ > + echo "#define OSM_SVN_REVISION \"\"" >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > + fi > + > +main.c: $(srcdir)/../include/opensm/osm_svn_revision.h > + if test -f $(srcdir)/../include/opensm/osm_svn_revision.h; then \ > + if test -f $(srcdir)/../.svn/entries; then \ > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > + fi \ > + fi > +endif > + > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) 
-DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
>
> if HAVE_LD_VERSION_SCRIPT
>
>

From arlin.r.davis at intel.com Thu Jan 26 12:01:31 2006
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 26 Jan 2006 12:01:31 -0800
Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
In-Reply-To:
Message-ID:

>But this penalizes user which need to deal with 2 way to deal
>with post calls and completions.
>
>I do not think we are not to far from consensus.
>Transport independent App will allocate 4 bytes extra
>for buffers that can match immediate data.
>Completion data will return where the immediate data is return
>(Consumer can not request it on posting), and 4 bytes for immediate
>data in completion event.
>The rest are ironing details for complete specification.
>This is no different than for any other new functionality proposed.
>And except for wasting 4 bytes per buffer or completion I do
>not see how it penalizes IB. Moreover if Apps knows that Provider
>returns immediate data in completion event it can avoid any penalty.

There is no penalty to the user if you just provide native features via extensions. Your extension will provide the best possible interface for your native capabilities.

I think we are further from consensus than we first thought: Right now we have a new post recv, different delivery mechanisms, and a requirement to allocate an extra 4 bytes of user data.

The only requirement to support immediate data on IB is a new post send and write immediate data calls and a new event data construct. The normal post_recv can be used unchanged and can already process normal and immediate data. No requirement on the user to allocate and manage an extra 4 bytes in the receive buffer. In fact, you can post receive with no buffer.

In order to support immediate data via iWARP, you now have a requirement to use a special new receive post, new user buffer constructs to place the data, and a new delivery method that has to be checked via provider attributes or at event time.

Is there any way to get this closer? If not, I would recommend going back to an extension interface for immediate data.

-arlin

From swise at opengridcomputing.com Thu Jan 26 12:25:42 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Jan 2006 14:25:42 -0600
Subject: [openib-general] possible cma bug
In-Reply-To: <43D91FDD.7040502@ichips.intel.com>
References: <1138302268.760.69.camel@stevo-desktop> <43D91FDD.7040502@ichips.intel.com>
Message-ID: <1138307142.760.85.camel@stevo-desktop>

OK. I'll try taking a reference on the iw_cm_id before upcalling into the cma. That should have the same effect.

On Thu, 2006-01-26 at 11:15 -0800, Sean Hefty wrote:
> Steve Wise wrote:
> > I'm running the kernel cmatose. The server side gets a connect request,
> > but the init_node() returns an error because the qp create fails. The
> > cmatose module then rejects the connect request on that connect request
> > upcall. Concurrently (on the main work thread running run_server()),
> > cmatose calls rdma_destroy_id() on the listening id. The destroy
> > happens before the connect request upcall thread finishes (SMP :).
>
> rdma_destroy_id() must block while there is a callback outstanding against the
> id being destroyed. This is true when pushed down to the lower level IB/iWarp
> code. The reference counting needs to be handled by the module invoking the
> callback, not the module being called.
>
> Imagine if the callback is about to be invoked when the user calls
> rdma_destroy_id(). rdma_destroy_id() completes, the user releases their memory,
> then the callback hits there code. The user's context is already invalid, so a
> reference doesn't help.
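The rule in the quoted reply above (the module that invokes a callback holds the reference, and destruction cannot complete while a callback is outstanding) can be sketched in plain C. This is a single-threaded illustration only, not the CMA or iWARP CM code. Every demo_* name is hypothetical, the refcount is a plain int rather than an atomic, and destruction is modeled as a deferred free instead of a blocking wait; those are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_id;
typedef int (*demo_handler)(struct demo_id *id);

struct demo_id {
	int refcount;          /* plain int: single-threaded sketch only */
	demo_handler handler;  /* client callback */
};

static int demo_frees;         /* counts how many ids have actually been freed */

static struct demo_id *demo_create(demo_handler handler)
{
	struct demo_id *id = malloc(sizeof *id);

	if (!id)
		return NULL;

	id->refcount = 1;      /* reference owned by the creator */
	id->handler = handler;
	return id;
}

/* Drop one reference; the memory goes away only with the last one. */
static void demo_put(struct demo_id *id)
{
	if (--id->refcount == 0) {
		free(id);
		++demo_frees;
	}
}

/* "Destroy" gives up the creator's reference. In this sketch the free is
 * deferred until any in-flight callback has dropped its reference too. */
static void demo_destroy(struct demo_id *id)
{
	demo_put(id);
}

/* The invoking module pins the id across the callback. */
static int demo_dispatch(struct demo_id *id)
{
	int ret;

	++id->refcount;                /* take the reference before the upcall */
	ret = id->handler(id);
	if (ret)
		demo_destroy(id);      /* same shape as "if (ret) destroy" in the quoted handlers */
	demo_put(id);                  /* release after the callback has returned */

	return ret;
}

/* Example client callback that asks for the id to be torn down. */
static int failing_handler(struct demo_id *id)
{
	(void) id;
	return -1;
}
```

With failing_handler, demo_dispatch() frees the id exactly once, and only after the callback has returned. A real implementation would need atomic counting and a way for the destroy path to wait for outstanding callbacks.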
>
> > I _think_ the solution is to bump the listen_id refcnt at the top of
> > cma_req_handler() and iw_conn_req_handler(), and do a cma_deref_id() on
> > the listen_id at the end of the functions.
>
> See above. If the listen_id can be destroyed while the user is in
> cma_req_handler(), then it can be destroyed before the reference can be taken.
>
> Does the iWarp CM take a reference on the corresponding listen_id before
> invoking a connect request callback?
>
> - Sean

From eitan at mellanox.co.il Thu Jan 26 12:38:21 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jan 2006 22:38:21 +0200
Subject: [openib-general] [PATCH] OpenSM: include OpenIB svn version when OpenIB build
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B625@mtlexch01.mtl.com>

Hi Troy,

I think it got forgotten as I proposed using a standard svn command to extract the version. I do not recall exactly where we left it. I remember someone proposed a standard svn command to extract that.

Sorry about that. We all would like to get that information too.

Eitan

> -----Original Message-----
> From: Troy Benjegerdes [mailto:hozer at hozed.org]
> Sent: Thursday, January 26, 2006 9:38 PM
> To: Eitan Zahavi; openib-general at openib.org
> Cc: Hal Rosenstock
> Subject: Re: [openib-general] [PATCH] OpenSM: include OpenIB svn version when
> OpenIB build
>
> Is there a good reason that this patche hasn't been applied yet??
>
> If you want me to provide usefull debugging reports, I need to be able
> to tell from the log which SVN version opensm was built from.
> > On Tue, Jan 03, 2006 at 12:43:33PM -0500, Hal Rosenstock wrote: > > OpenSM: include OpenIB svn version when OpenIB build > > > > Signed-off-by: Hal Rosenstock > > > > Index: osm_opensm.c > > =================================================================== > > --- osm_opensm.c (revision 4716) > > +++ osm_opensm.c (working copy) > > @@ -59,6 +59,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -206,12 +209,33 @@ osm_opensm_init( > > if( status != IB_SUCCESS ) > > return ( status ); > > > > +#ifndef OSM_VENDOR_INTF_OPENIB > > /* If there is a log level defined - add the OSM_VERSION to it. */ > > osm_log( &p_osm->log, > > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > > OSM_VERSION ); > > /* Write the OSM_VERSION to the SYS_LOG */ > > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format > Waived */ > > +#else > > + if (strlen(OSM_SVN_REVISION)) > > + { > > + /* If there is a log level defined - add OSM_VERSION and > OSM_SVN_REVISION to it. */ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s > OpenIB svn %s\n", > > + OSM_VERSION, OSM_SVN_REVISION ); > > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", > OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > > + } > > + else > > + { > > + /* If there is a log level defined - add the OSM_VERSION to it. 
*/ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > > + OSM_VERSION ); > > + /* Write the OSM_VERSION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format > Waived */ > > + } > > +#endif > > > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format > Waived */ > > > > Index: main.c > > =================================================================== > > --- main.c (revision 4716) > > +++ main.c (working copy) > > @@ -57,6 +57,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -522,6 +525,10 @@ main( > > > > printf("-------------------------------------------------\n"); > > printf("%s\n", OSM_VERSION); > > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > > + if (strlen(OSM_SVN_REVISION)) > > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > > +#endif > > > > osm_subn_set_default_opt(&opt); > > osm_subn_parse_conf_file(&opt); > > Index: Makefile.am > > =================================================================== > > --- Makefile.am (revision 4716) > > +++ Makefile.am (working copy) > > @@ -9,6 +9,22 @@ else > > DBGFLAGS = -g -O2 > > endif > > > > +if OSMV_OPENIB > > +$(srcdir)/../include/opensm/osm_svn_revision.h: > > + if test -f $(srcdir)/../.svn/entries; then \ > > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define > OSM_SVN_REVISION /' | sed 's/\/>//' > >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + else \ > > + echo "#define OSM_SVN_REVISION \"\"" > >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + fi > > + > > +main.c: $(srcdir)/../include/opensm/osm_svn_revision.h > > + if test -f $(srcdir)/../include/opensm/osm_svn_revision.h; then \ > > + if test -f $(srcdir)/../.svn/entries; then \ > > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define > OSM_SVN_REVISION /' | sed 's/\/>//' > 
>$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + fi \ > > + fi > > +endif > > + > > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > > > if HAVE_LD_VERSION_SCRIPT > > > > From hozer at hozed.org Thu Jan 26 12:42:32 2006 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 26 Jan 2006 14:42:32 -0600 Subject: [openib-general] [PATCH] OpenSM: include OpenIB svn version when OpenIB build In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B625@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B625@mtlexch01.mtl.com> Message-ID: <20060126204231.GH17445@narn.hozed.org> svnversion $SRC_DIR is what you want.. troy at opteron1:/usr/src/openib-src/userspace/management$ svnversion . 5188:5193 On Thu, Jan 26, 2006 at 10:38:21PM +0200, Eitan Zahavi wrote: > Hi Troy, > > I think it got forgotten as I proposed using standard svn command to > extract the version. > I do not recall exactly where we have left it. I remember someone > proposed a standard svn command to extract that. > Sorry about that. We all would like to get that information too. > > Eitan > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:hozer at hozed.org] > > Sent: Thursday, January 26, 2006 9:38 PM > > To: Eitan Zahavi; openib-general at openib.org > > Cc: Hal Rosenstock > > Subject: Re: [openib-general] [PATCH] OpenSM: include OpenIB svn > version when > > OpenIB build > > > > Is there a good reason that this patche hasn't been applied yet?? > > > > If you want me to provide usefull debugging reports, I need to be able > > to tell from the log which SVN version opensm was built from. 
> > > > On Tue, Jan 03, 2006 at 12:43:33PM -0500, Hal Rosenstock wrote: > > > OpenSM: include OpenIB svn version when OpenIB build > > > > > > Signed-off-by: Hal Rosenstock > > > > > > Index: osm_opensm.c > > > =================================================================== > > > --- osm_opensm.c (revision 4716) > > > +++ osm_opensm.c (working copy) > > > @@ -59,6 +59,9 @@ > > > #include > > > #include > > > #include > > > +#ifdef OSM_VENDOR_INTF_OPENIB > > > +#include > > > +#endif > > > #include > > > #include > > > #include > > > @@ -206,12 +209,33 @@ osm_opensm_init( > > > if( status != IB_SUCCESS ) > > > return ( status ); > > > > > > +#ifndef OSM_VENDOR_INTF_OPENIB > > > /* If there is a log level defined - add the OSM_VERSION to it. > */ > > > osm_log( &p_osm->log, > > > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF > ), "%s\n", > > > OSM_VERSION ); > > > /* Write the OSM_VERSION to the SYS_LOG */ > > > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* > Format > > Waived */ > > > +#else > > > + if (strlen(OSM_SVN_REVISION)) > > > + { > > > + /* If there is a log level defined - add OSM_VERSION and > > OSM_SVN_REVISION to it. */ > > > + osm_log( &p_osm->log, > > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ > 0xFF ), "%s > > OpenIB svn %s\n", > > > + OSM_VERSION, OSM_SVN_REVISION ); > > > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG > */ > > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", > > OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > > > + } > > > + else > > > + { > > > + /* If there is a log level defined - add the OSM_VERSION to > it. 
*/ > > > + osm_log( &p_osm->log, > > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ > 0xFF ), "%s\n", > > > + OSM_VERSION ); > > > + /* Write the OSM_VERSION to the SYS_LOG */ > > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); > /* Format > > Waived */ > > > + } > > > +#endif > > > > > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); > /* Format > > Waived */ > > > > > > Index: main.c > > > =================================================================== > > > --- main.c (revision 4716) > > > +++ main.c (working copy) > > > @@ -57,6 +57,9 @@ > > > #include > > > #include > > > #include > > > +#ifdef OSM_VENDOR_INTF_OPENIB > > > +#include > > > +#endif > > > #include > > > #include > > > #include > > > @@ -522,6 +525,10 @@ main( > > > > > > printf("-------------------------------------------------\n"); > > > printf("%s\n", OSM_VERSION); > > > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > > > + if (strlen(OSM_SVN_REVISION)) > > > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > > > +#endif > > > > > > osm_subn_set_default_opt(&opt); > > > osm_subn_parse_conf_file(&opt); > > > Index: Makefile.am > > > =================================================================== > > > --- Makefile.am (revision 4716) > > > +++ Makefile.am (working copy) > > > @@ -9,6 +9,22 @@ else > > > DBGFLAGS = -g -O2 > > > endif > > > > > > +if OSMV_OPENIB > > > +$(srcdir)/../include/opensm/osm_svn_revision.h: > > > + if test -f $(srcdir)/../.svn/entries; then \ > > > + grep revision $(srcdir)/../.svn/entries | sed > 's/revision=/#define > > OSM_SVN_REVISION /' | sed 's/\/>//' > > >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > > + else \ > > > + echo "#define OSM_SVN_REVISION \"\"" > > >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > > + fi > > > + > > > +main.c: $(srcdir)/../include/opensm/osm_svn_revision.h > > > + if test -f $(srcdir)/../include/opensm/osm_svn_revision.h; then > \ > > > + if test -f $(srcdir)/../.svn/entries; 
then \ > > > + grep revision $(srcdir)/../.svn/entries | sed > 's/revision=/#define > > OSM_SVN_REVISION /' | sed 's/\/>//' > > >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > > + fi \ > > > + fi > > > +endif > > > + > > > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > > $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > > > > > if HAVE_LD_VERSION_SCRIPT > > > > > > -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Somone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz From swise at opengridcomputing.com Thu Jan 26 13:02:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 15:02:43 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C3707@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11C3707@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1138309363.760.89.camel@stevo-desktop> > A direct link from the net device to 0 or 1 rdma device would > be better, but if doing it by search then the "mac address" > for iWARP should be the IP address -- not the Ethernet address. Why? From Arkady.Kanevsky at netapp.com Thu Jan 26 13:24:23 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 26 Jan 2006 16:24:23 -0500 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: Sean, Immediate data can be handled in Transport independent way. API for it certainly is. I am more concern that different vendors will come up with their own extensions for the same features. The size of immediate data is no big deal. 
The real issue is that the app will need to be changed to handle more data. So DAT can just increase the size of the immed_data field in the event and in the posted buffer. No API functionality change, just an API header change and a recompile of the app. But these kinds of changes will face the same problem whether they are part of DAT or part of the DAT extension. Let's talk more about it on the DAT call tomorrow. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, January 24, 2006 7:17 PM > To: Kanevsky, Arkady > Cc: Arlin Davis; Caitlin Bestler; Lentini, James; > dat-discussions at yahoogroups.com; openib-general at openib.org; > Davis, Arlin R > Subject: Re: [openib-general] RE: [RFC] DAT 2.0 immediate > data proposal > > Kanevsky, Arkady wrote: > > But this penalizes user which need to deal with 2 way to deal with > > post calls and completions. > > Yes, any app that wants to take advantage of transport > specific features, which immediate data is, is no longer > transport neutral. > > How do you plan to handle the next RDMA transport that comes > along with 64-bytes of immediate data? 
> > - Sean > From hozer at hozed.org Thu Jan 26 13:26:59 2006 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 26 Jan 2006 15:26:59 -0600 Subject: [openib-general] 2.6.14 & ib_umad segfault Message-ID: <20060126212658.GA538@narn.hozed.org> 2.6.14, amd64, svn 5193 This happens when loading 'ib_umad' [ 282.510929] Unable to handle kernel paging request at 000000000e70010c RIP: [ 282.516469] {kref_get+1} [ 282.524371] PGD d6f9d067 PUD d6825067 PMD 0 [ 282.529521] Oops: 0000 [1] SMP [ 282.533312] CPU 0 [ 282.535745] Modules linked in: ib_umad ipv6 ide_cd cdrom ide_disk ib_mthca ib_mad ib_core sata_promise ohci_hcd i2c_amd756 hw_random i2c_amd8111 i2c_core generic amd74xx ide_core ext3 jbd mbcache forcedeth tg3 bnx2 crc32 e1000 af_packet nfs lockd sunrpc libafs sd_mod sata_nv libata arcmsr scsi_mod unix [ 282.568292] Pid: 2504, comm: modprobe Tainted: P 2.6.14 #1 [ 282.575362] RIP: 0010:[] {kref_get+1} [ 282.583283] RSP: 0000:ffff8100d6f09d08 EFLAGS: 00010206 [ 282.589912] RAX: ffff8100d7b07220 RBX: 000000000e7000f0 RCX: 000000000e7000f0 [ 282.598470] RDX: ffff8100d7b07220 RSI: ffffffff80315f0a RDI: 000000000e70010c [ 282.607040] RBP: ffffffff80315f03 R08: 0000000000000002 R09: 0000000000000000 [ 282.615584] R10: ffff8100d7a8eb40 R11: 00000000000000a0 R12: 00000000fffffff4 [ 282.624178] R13: 000000000e7000f0 R14: ffff8100d7812150 R15: ffff8100d7a8eb50 [ 282.632731] FS: 00002aaaaae00ae0(0000) GS:ffffffff80402800(0000) knlGS:0000000000000000 [ 282.642470] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 282.649379] CR2: 000000000e70010c CR3: 00000000d725e000 CR4: 00000000000006e0 [ 282.657922] Process modprobe (pid: 2504, threadinfo ffff8100d6f08000, task ffff8100d7b9b4e0) [ 282.668073] Stack: 000000000e7000f0 ffffffff801e7b42 ffff8100d7b07200 ffffffff801c1049 [ 282.677510] ffff8100d7262198 ffff8100d7a8eb40 00000000fffffff4 ffff8100d7a3f600 [ 282.687205] 000000000e700000 ffffffff801c10f3 [ 282.693303] Call Trace:{kobject_get+18} {sysfs_add_link+121} [ 282.703945] 
{sysfs_create_link+83} {class_device_add+297} [ 282.714981] {class_device_create+238} {new_inode+171} [ 282.725568] {sysfs_create+195} {dput+33} [ 282.736567] {kobj_map+102} {exact_lock+0} [ 282.745916] {exact_match+0} {:ib_umad:ib_umad_init_port+286} [ 282.757279] {:ib_umad:ib_umad_add_one+153} {:ib_core:ib_register_client+124} [ 282.770281] {:ib_umad:ib_umad_init+144} {sys_init_module+247} [ 282.781703] {system_call+126} [ 282.788294] [ 282.788295] Code: 8b 07 48 89 fb 85 c0 75 26 b9 20 00 00 00 48 c7 c2 95 56 31 [ 282.798966] RIP {kref_get+1} RSP [ 282.806739] CR2: 000000000e70010c [ 282.810701] opteron1:~# cat /proc/version Linux version 2.6.14 (troy at octeropt.scl.ameslab.gov) (gcc version 3.3.6 (Debian 1:3.3.6-7)) #1 SMP Sun Oct 30 19:36:53 CST 2005 From rdreier at cisco.com Thu Jan 26 13:28:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 13:28:59 -0800 Subject: [openib-general] 2.6.14 & ib_umad segfault In-Reply-To: <20060126212658.GA538@narn.hozed.org> (Troy Benjegerdes's message of "Thu, 26 Jan 2006 15:26:59 -0600") References: <20060126212658.GA538@narn.hozed.org> Message-ID: http://openib.org/pipermail/openib-general/2006-January/015216.html From hozer at hozed.org Thu Jan 26 13:38:19 2006 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 26 Jan 2006 15:38:19 -0600 Subject: [openib-general] 2.6.14 & ib_umad segfault In-Reply-To: References: <20060126212658.GA538@narn.hozed.org> Message-ID: <20060126213819.GB538@narn.hozed.org> On Thu, Jan 26, 2006 at 01:28:59PM -0800, Roland Dreier wrote: > http://openib.org/pipermail/openib-general/2006-January/015216.html Blah. 
I guess that's what I deserve for running something more than 2 weeks old ;) From caitlinb at broadcom.com Thu Jan 26 13:38:52 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 13:38:52 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C374B@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: >> A direct link from the net device to 0 or 1 rdma device would be >> better, but if doing it by search then the "mac address" >> for iWARP should be the IP address -- not the Ethernet address. > > Why? Because an iWARP device can encompass multiple Ethernet ports with the same mac address. If they do not have a different IP address then there is no reason for the consumer to understand that they are distinct physical ports. The starting point here is the socket model, where the consumer is not even aware of which Ethernet NIC they are using by default. We are disturbing that model by saying that you have to know which RDMA device so that memory can be registered. We should not go more fine-grained than that. Doing so is complicating the programming model, and interferes with existing port-failover solutions already deployed in the Ethernet space. From rdreier at cisco.com Thu Jan 26 13:44:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 26 Jan 2006 13:44:49 -0800 Subject: [openib-general] [PATCH 1/5] [RFC] core kernel changes for resize CQ In-Reply-To: <20061261123.IcTK8Ewv0LFOTjuP@cisco.com> (Roland Dreier's message of "Thu, 26 Jan 2006 11:23:32 -0800") References: <20061261123.IcTK8Ewv0LFOTjuP@cisco.com> Message-ID: > ... > + down(&ib_uverbs_idr_mutex); > ... > + up(&ib_uverbs_idr_mutex); > ... This should be mutex_lock()/mutex_unlock() -- already fixed in my tree. - R. 
From Arkady.Kanevsky at netapp.com Thu Jan 26 13:45:17 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 26 Jan 2006 16:45:17 -0500 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: Arlin, I am not convinced we need a new recv for immediate data. But what is needed is a change in the normative text in many places: Recv, RDMA Write, DTO completion events, error behavior. Sure, you can define immed data in an extension, but it still affects the behavior of the normative part of the spec. This is why my preference is to put it into the main spec. The xfer_size is a minor thing. We just need to define its meaning with respect to immed_data. Defining it either way is fine. Handling extra space on the CQ can be done by the Provider. We can add a new EVD attribute for handling RDMA_write with immed data, and the Provider can automatically add extra space on the CQ. The Provider is already responsible for handing the user a single completion, so it will only be used for error handling. Error handling takes most of the new write-up anyhow, regardless of whether it is done in the spec or in an extension. The question of whether we want to support Send with immed_data has to be decided. Ditto remote RMR invalidation with new post(s) for immed_data. Just because IB supports all possible correlations under one Send post does not mean that uDAPL should follow that too. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Arlin Davis [mailto:arlin.r.davis at intel.com] > Sent: Thursday, January 26, 2006 3:02 PM > To: Kanevsky, Arkady; Arlin Davis; Caitlin Bestler > Cc: Lentini, James; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] RE: [RFC] DAT 2.0 immediate > data proposal > > > >But this penalizes user which need to deal with 2 way to > deal with post > >calls and completions. > > > >I do not think we are not to far from consensus. > >Transport independent App will allocate 4 bytes extra for > buffers that > >can match immediate data. > >Completion data will return where the immediate data is return > >(Consumer can not request it on posting), and 4 bytes for immediate > >data in completion event. > >The rest are ironing details for complete specification. > >This is no different than for any other new functionality proposed. > >And except for wasting 4 bytes per buffer or completion I do not see > >how it penalizes IB. Moreover if Apps knows that Provider returns > >immediate data in completion event it can avoid any penalty. > > There is no penalty to the user if you just provide native > features via extensions. Your extension > will provide the best possible interface for your native > capabilities. > > I think we are further from consensus then we first thought: > > Right now we have a new post recv, different delivery > mechanisms, and a requirement to allocate an extra 4 bytes of > user data. > > The only requirement to support immediate data on IB, is a > new post send and write immediate data calls and a new event > data construct. The normal post_recv can be used unchanged > and can already process normal and immediate data. No > requirement on the user to allocate and manage an extra 4 > bytes in the receive buffer. In fact, you can post receive > with no buffer. 
> > In order to support immediate data via iWARP, you now have a > requirement to use a special new receive post, new user > buffer constructs to place the data, and new delivery method > that has to be checked via provider attributes or at event time. > > Is there any way to get this closer? If not, I would recommend > going back to an extension interface for immediate data. > > -arlin > > From swise at opengridcomputing.com Thu Jan 26 13:57:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jan 2006 15:57:03 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11C374B@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11C374B@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1138312623.760.116.camel@stevo-desktop> > Because an iWARP device can encompass multiple Ethernet ports > with the same mac address. At the same time? How does a switch handle this? I'm not clear on your issue here and why my mac based approach breaks it. Let's start with a concrete example: Let's say I have an rnic device. It has 2 ports. It has 1 mac address per port. Assume the driver for this device installs 2 netdevs, one for each port, and there's one ipaddr on each netdev. Now, this rnic device will register with the openib core as _one_ openib device with 2 ports. When the CMA attempts to resolve a route and/or bind to an interface, it first consults the routing table. Let's assume it finds one of the netdev devices for this rnic. It then needs to find the associated openib device. What is wrong with using the mac address found in the netdev device and finding which openib device has that mac address? Why does that break things? I'm still fuzzy on this issue (and most things ;) Given my example above, tell me what delta to that example exposes the design flaw and helps me understand your issue. Sorry if I'm being dumb... 
:-\ > If they do not have a different IP > address then there is no reason for the consumer to understand > that they are distinct physical ports. > The starting point here is the socket model, where the consumer > is not even aware of which Ethernet NIC they are using by > default. We are disturbing that model by saying that you > have to know which RDMA device so that memory can be registered. > > We should not go more fine-grained than that. > > Doing so is complicating the programming model, and interferes > with existing port-failover solutions already deployed in the > Ethernet space. This doesn't have anything to do with the consumer/ULP programming model, which is all based on ip addresses with the new cma module. We're really talking about the interface between the openib core, the openib iwarp devices, and the linux netdev subsystem. From halr at voltaire.com Thu Jan 26 14:30:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jan 2006 17:30:47 -0500 Subject: [openib-general] [PATCH] OpenSM: include OpenIB svn version when OpenIB build In-Reply-To: <20060126193736.GG17445@narn.hozed.org> References: <1136310211.4331.47477.camel@hal.voltaire.com> <20060126193736.GG17445@narn.hozed.org> Message-ID: <1138314642.4338.63877.camel@hal.voltaire.com> On Thu, 2006-01-26 at 14:37, Troy Benjegerdes wrote: > Is there a good reason that this patche hasn't been applied yet?? I hadn't converted the original patch over to use svnversion and got caught up in other things. I will try to finish this off. -- Hal > If you want me to provide usefull debugging reports, I need to be able > to tell from the log which SVN version opensm was built from. 
> > On Tue, Jan 03, 2006 at 12:43:33PM -0500, Hal Rosenstock wrote: > > OpenSM: include OpenIB svn version when OpenIB build > > > > Signed-off-by: Hal Rosenstock > > > > Index: osm_opensm.c > > =================================================================== > > --- osm_opensm.c (revision 4716) > > +++ osm_opensm.c (working copy) > > @@ -59,6 +59,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -206,12 +209,33 @@ osm_opensm_init( > > if( status != IB_SUCCESS ) > > return ( status ); > > > > +#ifndef OSM_VENDOR_INTF_OPENIB > > /* If there is a log level defined - add the OSM_VERSION to it. */ > > osm_log( &p_osm->log, > > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > > OSM_VERSION ); > > /* Write the OSM_VERSION to the SYS_LOG */ > > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ > > +#else > > + if (strlen(OSM_SVN_REVISION)) > > + { > > + /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n", > > + OSM_VERSION, OSM_SVN_REVISION ); > > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > > + } > > + else > > + { > > + /* If there is a log level defined - add the OSM_VERSION to it. 
*/ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > > + OSM_VERSION ); > > + /* Write the OSM_VERSION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ > > + } > > +#endif > > > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */ > > > > Index: main.c > > =================================================================== > > --- main.c (revision 4716) > > +++ main.c (working copy) > > @@ -57,6 +57,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -522,6 +525,10 @@ main( > > > > printf("-------------------------------------------------\n"); > > printf("%s\n", OSM_VERSION); > > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > > + if (strlen(OSM_SVN_REVISION)) > > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > > +#endif > > > > osm_subn_set_default_opt(&opt); > > osm_subn_parse_conf_file(&opt); > > Index: Makefile.am > > =================================================================== > > --- Makefile.am (revision 4716) > > +++ Makefile.am (working copy) > > @@ -9,6 +9,22 @@ else > > DBGFLAGS = -g -O2 > > endif > > > > +if OSMV_OPENIB > > +$(srcdir)/../include/opensm/osm_svn_revision.h: > > + if test -f $(srcdir)/../.svn/entries; then \ > > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + else \ > > + echo "#define OSM_SVN_REVISION \"\"" >$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + fi > > + > > +main.c: $(srcdir)/../include/opensm/osm_svn_revision.h > > + if test -f $(srcdir)/../include/opensm/osm_svn_revision.h; then \ > > + if test -f $(srcdir)/../.svn/entries; then \ > > + grep revision $(srcdir)/../.svn/entries | sed 's/revision=/#define OSM_SVN_REVISION /' | sed 's/\/>//' 
>$(srcdir)/../include/opensm/osm_svn_revision.h; \ > > + fi \ > > + fi > > +endif > > + > > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > > > if HAVE_LD_VERSION_SCRIPT > > > > From caitlinb at broadcom.com Thu Jan 26 15:48:19 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 15:48:19 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C378A@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: >> Because an iWARP device can encompass multiple Ethernet ports with >> the same mac address. > > At the same time? How does a switch handle this? > A multi-port host can re-ARP an IP address to a different port (with a different MAC address) without disrupting established TCP connections. Therefore it can be handled by existing NIC logic, without having to involve any RNIC firmware/verbs/drivers/middleware. From caitlinb at broadcom.com Thu Jan 26 15:55:39 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 26 Jan 2006 15:55:39 -0800 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F11C378E@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: >> Because an iWARP device can encompass multiple Ethernet ports with >> the same mac address. > > At the same time? How does a switch handle this? > > I'm not clear on your issue here and why my mac based > approach breaks it. > > Lets start with a concrete example: > > Lets say I have an rnic device. It has 2 ports. It has 1 > mac address per port. Assume the driver for this device installs 2 > netdevs, one for each port, and there's one ipaddr on each netdev. > > Now, this rnic device will register with the openib core as > _one_ openib device with 2 ports. > > When the CMA attempts to resolve a route and/or bind to an > interface, it first consults the routing table. 
Lets assume > it finds one of the netdev devices for this rnic. It then > needs to find the associated openib device. What is wrong > with using the mac address found in the netdev device and > finding which openib device has that mac address? Why does that > break things? > > I'm still fuzzy on this issue (and most things ;) > > Given my example above, tell me what delta to that example > exposes the design flow and helps me understand your issue. > > Sorry if i'm being dumb... :-\ > I have two ports, two MAC addresses and two IP addresses. The cable for the 2nd port breaks, so I move the IP address to the other MAC address. If someone does their queries at this point all connections will appear to be going through the same "port". When the cable is restored, the IP address will be moved back to its original MAC address. Because this process occurs *below* the TCP layer, it is not visible to the RDMA layer. Therefore the RDMA layer can only identify the IP address. The RDMA layer does not need, and should not have knowledge of the L2 address on an IP network. From mshefty at ichips.intel.com Thu Jan 26 16:24:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 16:24:12 -0800 Subject: [openib-general] [PATCH 1/4] SA path record caching In-Reply-To: References: Message-ID: <43D9682C.8000309@ichips.intel.com> Sean Hefty wrote: > Expose functions to pack/unpack SA attributes. This capability is > also required by the local SA database. Roland, any objection to committing this change to sa_query? I can also move index.h into include/rdma/linux subdirectory to clarify that it is not IB specific code. 
- Sean From mshefty at ichips.intel.com Thu Jan 26 16:30:08 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jan 2006 16:30:08 -0800 Subject: [openib-general] [PATCH 1/4] SA path record caching In-Reply-To: <43D9682C.8000309@ichips.intel.com> References: <43D9682C.8000309@ichips.intel.com> Message-ID: <43D96990.7090708@ichips.intel.com> Sean Hefty wrote: > I can also move index.h into include/rdma/linux subdirectory to clarify > that it is not IB specific code. Er... I meant include/linux. From jeff.walls at hp.com Fri Jan 27 07:31:41 2006 From: jeff.walls at hp.com (Walls, Jeffrey Joel) Date: Fri, 27 Jan 2006 10:31:41 -0500 Subject: [openib-general] Debugging Infiniband? Message-ID: Hi Dotan, Thank you for your suggestions! I was finally able to figure out my problem. I was Using the WinIB package from Mellanox, and there were some fields in the data Structures that were in big-endian format, even though it was running on Win32. 
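The kind of fix Jeff describes can be sketched in C as follows. The message does not say which WinIB fields were big-endian, so the helper names and the sample value below are purely illustrative assumptions; the point is only that decoding byte by byte gives the same result on little-endian and big-endian hosts.

```c
#include <stdint.h>

/*
 * Decode a 32-bit field stored in big-endian (network) byte order.
 * Works regardless of the host's own endianness because it never
 * reinterprets the bytes as a native integer.
 */
uint32_t demo_be32_to_host(const uint8_t bytes[4])
{
    return ((uint32_t)bytes[0] << 24) |
           ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |
            (uint32_t)bytes[3];
}

/* Encode a host-order value into a big-endian byte sequence. */
void demo_host_to_be32(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}
```

On a little-endian Win32 host, reading such a field directly as a native integer would yield the bytes reversed, which matches the symptom of values that look valid on one side and never match on the other.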
Once I got these figured out and swapped correctly, things worked again. Thanks again! Jeff -----Original Message----- From: Dotan Barak [mailto:dotanb at mellanox.co.il] Sent: Wednesday, January 18, 2006 8:53 AM To: Walls, Jeffrey Joel; openib-general Subject: RE: [openib-general] Debugging Infiniband? Hi Jeff. There are some issues you need to check:

- WRs were posted to the remote QP's RQ before the WR was posted to the SQ on the local side
- both of the QPs are alive and in valid states (at least RTR for the responder and RTS for the requestor)
- the QP parameters are in sync (for example: the PSN)
- the route that you are using is valid (port, remote QP number, remote LID)
- if you are using UD/UC QPs, the packets may have been dropped
- if you have an IB analyzer, check that the packet was sent to the expected QP number
- check the port counters to see how much data was sent / received on each IB port

I hope I gave you some useful information.
[Dotan Barak] -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org]On Behalf Of Walls, Jeffrey Joel Sent: Wednesday, January 18, 2006 5:39 PM To: openib-general Subject: [openib-general] Debugging Infiniband? Hi, I first must admit that I'm new to Infiniband and Infiniband programming. I began writing my first commercial application using IB late last year. I'm very familiar with socket programming (TCP, Multicast, etc), though. I'm wondering what techniques experienced IB programmers use to debug IB applications. My situation is that I'm running a data producer on Windows XP and a set of data consumers on Linux. So for Windows, I'm using WinIB (gen1) and for Linux I'm using OpenIB (gen2). I have both sides implemented according to some of the example code I've seen and also according to the documents I've been able to find.
The connections all seem to be set up properly and my producer successfully posts all of its sends (at least according to my CQE's returned). The problem is that my receiver never sees any of the IB packets. I post the receive and then wait forever polling the CQ. I've run out of ideas on what to even look at and am now looking for suggestions on how to best figure out this problem. If you have any ideas or need more clarification, I'd love to hear from you. Also, if this isn't the proper forum for such discussions, if you could please guide me in the right direction, I would greatly appreciate that as well. Best Regards, Jeff From caitlinb at broadcom.com Fri Jan 27 09:22:33 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 27 Jan 2006 09:22:33 -0800 Subject: [dat-discussions] RE: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122CFD2@NT-SJCA-0751.brcm.ad.broadcom.com> dat-discussions at yahoogroups.com wrote: >> But this penalizes user which need to deal with 2 way to deal with >> post calls and completions. >> >> I do not think we are not to far from consensus. >> Transport independent App will allocate 4 bytes extra for buffers >> that can match immediate data. Completion data will return where the >> immediate data is return (Consumer can not request it on posting), >> and 4 bytes for immediate data in completion event. The rest are >> ironing details for complete specification. >> This is no different than for any other new functionality proposed. >> And except for wasting 4 bytes per buffer or completion I do not see >> how it penalizes IB. Moreover if Apps knows that Provider returns >> immediate data in completion event it can avoid any penalty. > > There is no penalty to the user if you just provide native > features via extensions. Your extension > will provide the best possible interface for your native capabilities. 
> > I think we are further from consensus then we first thought: > > Right now we have a new post recv, different delivery > mechanisms, and a requirement to allocate an extra 4 bytes of user > data. > > The only requirement to support immediate data on IB, is a > new post send and write immediate data calls and a new event > data construct. The normal post_recv can be used unchanged > and can already process normal and immediate data. No > requirement on the user to allocate and manage an extra 4 > bytes in the receive buffer. In fact, you can post receive > with no buffer. > > In order to support immediate data via iWARP, you now have a > requirement to use a special new receive post, new user > buffer constructs to place the data, and new delivery method > that has to be checked via provider attributes or at event time. > > Is there anyway to get this closer? If not, I would recommend > going back to an extension interface for immediate data. > I think the trick to finding out if there is something useful that can be made transport neutral is to work in the opposite direction. Start with the message sequence that the application would use *without* immediates, and then ask if there is a way to allow an InfiniBand Provider to compress that message sequence. That is possible for RDMA Write with Immediate. With careful definition of a composite message it can be viewed as a transport specific replacement for an RDMA Write followed by a 4-byte RDMA Send. There are only two special considerations required: 1) A single post has to submit the combination (otherwise it is too difficult for the Provider to detect the optimization). 2) The receive completion may report the received data in the user supplied buffer OR in an "immediate data" field in the completion. I do not think it is feasible to define a transport neutral equivalent of a RDMA Send with Immediate. How is the extra data transmitted via iWARP? An extra send? Pre-pend the four bytes? Or 4 bytes at the end? 
Delivery of the immediate data is transport dependent? Adding an immediate data field to the completion doesn't cost much, and it would allow IB DAT Provider to interact with IB-specific fields. But I can't see adding a send with immediate method in any way that would create an expectation in developers that it would work in a transport neutral fashion. Write with immediate is possible. It carries the complexity that a single DTO request might result in two flushed work completions. The current consensus is that this was too complex relative to the benefit. But that's really a call for application developers. It isn't that hard for an iWARP Provider to implement. The question is whether given the complexity of properly sizing the QP that it will actually be used. From Arkady.Kanevsky at netapp.com Fri Jan 27 10:42:54 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 27 Jan 2006 13:42:54 -0500 Subject: [openib-general] IP addressing service annex v2 Message-ID: Enclosed, is an updated version of the annex incorporating feedback from last SWG meeting and emails. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ip_address_annex.pdf Type: application/octet-stream Size: 66838 bytes Desc: ip_address_annex.pdf URL: From ralphc at pathscale.com Fri Jan 27 12:10:48 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 27 Jan 2006 12:10:48 -0800 Subject: [openib-general] [PATCH] SDP doesn't flush path cache if device removed Message-ID: <1138392648.23076.13.camel@brick.internal.keyresearch.com> SDP doesn't remove cached SA path records if the device is removed. Minor nit: flush_workqueue() doesn't need to be called before destroy_workqueue(). 
Signed-off-by: Ralph Campbell Index: sdp_proto.h =================================================================== --- sdp_proto.h (revision 5193) +++ sdp_proto.h (working copy) @@ -405,6 +405,8 @@ void sdp_link_addr_cleanup(void); +void sdp_link_device_remove_one(struct ib_device *device); + /* * Function types */ Index: sdp_conn.c =================================================================== --- sdp_conn.c (revision 5193) +++ sdp_conn.c (working copy) @@ -178,8 +178,8 @@ } sdp_conn_inet_error(conn, error); - return; } + /* * sdp_inet_accept_q_put - put a conn into a listen conn's accept Q. */ @@ -1861,6 +1861,8 @@ ib_destroy_cm_id(hca->listen_id); + sdp_link_device_remove_one(device); + list_for_each_entry_safe(port, tmp, &hca->port_list, list) { list_del(&port->list); kfree(port); Index: sdp_link.c =================================================================== --- sdp_link.c (revision 5193) +++ sdp_link.c (working copy) @@ -234,9 +234,9 @@ static void sdp_path_info_destroy(struct sdp_path_info *info, int status) { struct sdp_path_wait *wait, *tmp; - /* TODO: replace by list_del once we have proper locking */ - list_del_init(&info->info_list); + list_del(&info->info_list); + list_for_each_entry_safe(wait, tmp, &info->wait_list, list) sdp_path_wait_complete(wait, info, status); @@ -747,6 +747,24 @@ } /* + * Remove any cached path information for the given device. 
+ */ +void sdp_link_device_remove_one(struct ib_device *device) +{ + struct sdp_path_info *info; + + down(&sdp_link_mutex); + + list_for_each_entry(info, &info_list, info_list) + if (info->ca == device) { + sdp_path_info_destroy(info, -ENODEV); + break; + } + + up(&sdp_link_mutex); +} + +/* * primary initialization/cleanup functions */ static struct packet_type sdp_arp_type = { @@ -829,7 +847,6 @@ * destroy work queue */ cancel_delayed_work(&link_timer); - flush_workqueue(link_wq); destroy_workqueue(link_wq); /* * clear objects -- Ralph Campbell From Arkady.Kanevsky at netapp.com Fri Jan 27 12:11:27 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 27 Jan 2006 15:11:27 -0500 Subject: [dat-discussions] RE: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: Caitlin, I agree that Send with immed is too hard to handle. I have not heard from any ULP that needs it, so we can take an informal vote and close that issue. Sizing the EVD to handle 2 completions when a posted RDMA_write_with_immed fails can be handled by the Provider adding extra entries if the EVD will be used for posting RDMA_write_with_immed. This does not allow the Consumer to "optimize" the queue size based on the "exact" number of outstanding RDMA_write_with_immed ops, but it is simpler to program to. Of course, a ULP can be "adaptive" and choose its code path based on a Provider attr, if we add an attr indicating whether extra queue size is needed. This is separate from how immed data is returned. We could combine the two under one Provider attr, but conceptually it is wrong to combine them. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, January 27, 2006 12:23 PM > To: dat-discussions at yahoogroups.com; Kanevsky, Arkady; Arlin Davis > Cc: Lentini, James; openib-general at openib.org > Subject: RE: [dat-discussions] RE: [openib-general] RE: [RFC] > DAT 2.0 immediate data proposal > > dat-discussions at yahoogroups.com wrote: > >> But this penalizes user which need to deal with 2 way to deal with > >> post calls and completions. > >> > >> I do not think we are not to far from consensus. > >> Transport independent App will allocate 4 bytes extra for buffers > >> that can match immediate data. Completion data will return > where the > >> immediate data is return (Consumer can not request it on posting), > >> and 4 bytes for immediate data in completion event. The rest are > >> ironing details for complete specification. > >> This is no different than for any other new functionality proposed. > >> And except for wasting 4 bytes per buffer or completion I > do not see > >> how it penalizes IB. Moreover if Apps knows that Provider returns > >> immediate data in completion event it can avoid any penalty. > > > > There is no penalty to the user if you just provide native features > > via extensions. Your extension will provide the best possible > > interface for your native capabilities. > > > > I think we are further from consensus then we first thought: > > > > Right now we have a new post recv, different delivery > mechanisms, and > > a requirement to allocate an extra 4 bytes of user data. > > > > The only requirement to support immediate data on IB, is a new post > > send and write immediate data calls and a new event data construct. > > The normal post_recv can be used unchanged and can already process > > normal and immediate data. No requirement on the user to > allocate and > > manage an extra 4 bytes in the receive buffer. 
In fact, you > can post > > receive with no buffer. > > > > In order to support immediate data via iWARP, you now have a > > requirement to use a special new receive post, new user buffer > > constructs to place the data, and new delivery method that > has to be > > checked via provider attributes or at event time. > > > > Is there anyway to get this closer? If not, I would recommend going > > back to an extension interface for immediate data. > > > > I think the trick to finding out if there is something useful > that can be made transport neutral is to work in the opposite > direction. > > Start with the message sequence that the application would use > *without* immediates, and then ask if there is a way to allow > an InfiniBand Provider to compress that message sequence. > > That is possible for RDMA Write with Immediate. With careful > definition of a composite message it can be viewed as a > transport specific replacement for an RDMA Write followed by > a 4-byte RDMA Send. There are only two special considerations > required: > > 1) A single post has to submit the combination (otherwise it > is too difficult for the Provider to detect the optimization). > 2) The receive completion may report the received data in the > user supplied buffer OR in an "immediate data" field in the > completion. > > I do not think it is feasible to define a transport neutral > equivalent of a RDMA Send with Immediate. How is the extra > data transmitted via iWARP? An extra send? Pre-pend the four > bytes? Or 4 bytes at the end? Delivery of the immediate data > is transport dependent? > > Adding an immediate data field to the completion doesn't cost > much, and it would allow IB DAT Provider to interact with > IB-specific fields. But I can't see adding a send with > immediate method in any way that would create an expectation > in developers that it would work in a transport neutral fashion. > > Write with immediate is possible. 
It carries the complexity > that a single DTO request might result in two flushed work > completions. > The current consensus is that this was too complex relative > to the benefit. But that's really a call for application developers. > It isn't that hard for an iWARP Provider to implement. The > question is whether given the complexity of properly sizing > the QP that it will actually be used. > From rdreier at cisco.com Fri Jan 27 12:30:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Jan 2006 12:30:49 -0800 Subject: [openib-general] [PATCH 1/4] SA path record caching In-Reply-To: <43D9682C.8000309@ichips.intel.com> (Sean Hefty's message of "Thu, 26 Jan 2006 16:24:12 -0800") References: <43D9682C.8000309@ichips.intel.com> Message-ID: No objection, this seems OK to commit. - R. From rdreier at cisco.com Fri Jan 27 12:32:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Jan 2006 12:32:43 -0800 Subject: [openib-general] iwarp: whats a pkey? In-Reply-To: <1138292581.760.36.camel@stevo-desktop> (Steve Wise's message of "Thu, 26 Jan 2006 10:23:01 -0600") References: <1138290699.760.25.camel@stevo-desktop> <1138292581.760.36.camel@stevo-desktop> Message-ID: Roland> No, I think trying to create a mapping is a bad idea. The Roland> semantics of VLANs and IB partitions are sufficiently Roland> different that it's probably better to treat each concept Roland> natively. Steve> Roland, can you expand on this some? I don't think transport neutral code should be dealing with either P_Keys or VLANs. The Linux model for handling VLANs is that each VLAN has a separate network interface. So an iWARP consumer should never deal with VLANs, just with a routing choice of interfaces. Similarly if a consumer is using the iWARP-emulation CM for IB, then the P_Key will come from the IPoIB interface. Only native IB consumers that understand partitions ever have to deal with P_Keys. - R. 
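The point that a consumer only chooses an interface, while the P_Key follows from the IPoIB interface, can be sketched roughly as follows. This is an illustration only. The helper name `ipoib_name_to_pkey` is hypothetical, and it assumes the naming convention in which an IPoIB child interface is called `<parent>.<P_Key in hex>` (for example `ib0.8001`). It also assumes the parent interface uses the default P_Key 0xffff, although in practice the parent uses whatever sits at index 0 of the port's P_Key table. Real code would ask the interface for its P_Key rather than parse its name.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: derive the P_Key from an IPoIB interface name.
 * ASSUMPTION: child interfaces are named "<parent>.<pkey in hex>"
 * (e.g. "ib0.8001"); a parent such as "ib0" is assumed to use the
 * default P_Key 0xffff.
 * Returns 0 on success, -1 if the name does not look like an IPoIB
 * interface or the suffix is malformed.
 */
static int ipoib_name_to_pkey(const char *ifname, uint16_t *pkey)
{
    const char *dot;
    char *end;
    unsigned long val;

    if (strncmp(ifname, "ib", 2) != 0)
        return -1;              /* not an IPoIB interface name */

    dot = strchr(ifname, '.');
    if (dot == NULL) {
        *pkey = 0xffff;         /* assumed default partition */
        return 0;
    }

    val = strtoul(dot + 1, &end, 16);
    if (end == dot + 1 || *end != '\0' || val > 0xffff)
        return -1;              /* empty, trailing junk, or out of range */

    *pkey = (uint16_t)val;
    return 0;
}
```

A transport-neutral consumer would never call something like this itself; only the IB-specific layer underneath would need the P_Key.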
From mshefty at ichips.intel.com Fri Jan 27 12:46:13 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jan 2006 12:46:13 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: References: Message-ID: <43DA8695.8000907@ichips.intel.com> Sean Hefty wrote: > The following patch series adds caching of path records with the local system. These changes have been committed in svn revision 5194. - Sean From halr at voltaire.com Fri Jan 27 12:55:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jan 2006 15:55:45 -0500 Subject: [openib-general] OpenSM: Add support for optional SA GUIDInfoRecord Message-ID: <1138395329.4338.71898.camel@hal.voltaire.com> OpenSM: Add support for optional SA GUIDInfoRecord Signed-off-by: Hal Rosenstock Index: osm/include/opensm/osm_sa_guidinfo_record.h =================================================================== --- osm/include/opensm/osm_sa_guidinfo_record.h (revision 0) +++ osm/include/opensm/osm_sa_guidinfo_record.h (revision 0) @@ -0,0 +1,283 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Declaration of osm_gir_rcv_t. + * This object represents the GUIDInfo Record Receiver object. + * attribute from a node. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + */ + + +#ifndef _OSM_GIR_RCV_H_ +#define _OSM_GIR_RCV_H_ + + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/GUIDInfo Record Receiver +* NAME +* GUIDInfo Record Receiver +* +* DESCRIPTION +* The GUIDInfo Record Receiver object encapsulates the information +* needed to receive the GUIDInfoRecord attribute from a node. +* +* The GUIDInfo Record Receiver object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ +/****s* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_t +* NAME +* osm_gir_rcv_t +* +* DESCRIPTION +* GUIDInfo Record Receiver structure. 
+* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_gir_rcv +{ + const osm_subn_t *p_subn; + osm_sa_resp_t *p_resp; + osm_mad_pool_t *p_mad_pool; + osm_log_t *p_log; + cl_plock_t *p_lock; + cl_qlock_pool_t pool; + +} osm_gir_rcv_t; +/* +* FIELDS +* p_subn +* Pointer to the Subnet object for this subnet. +* +* p_resp +* Pointer to the SA reponder. +* +* p_mad_pool +* Pointer to the mad pool. +* +* p_log +* Pointer to the log object. +* +* p_lock +* Pointer to the serializing lock. +* +* pool +* Pool of linkable GUIDInfo Record objects used to generate +* the query response. +* +* SEE ALSO +* +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_construct +* NAME +* osm_gir_rcv_construct +* +* DESCRIPTION +* This function constructs a GUIDInfo Record Receiver object. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_construct( + IN osm_gir_rcv_t* const p_rcv ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to a GUIDInfo Record Receiver object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_gir_rcv_init, osm_gir_rcv_destroy +* +* Calling osm_gir_rcv_construct is a prerequisite to calling any other +* method except osm_gir_rcv_init. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_init, +* osm_gir_rcv_destroy +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_destroy +* NAME +* osm_gir_rcv_destroy +* +* DESCRIPTION +* The osm_gir_rcv_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_destroy( + IN osm_gir_rcv_t* const p_rcv ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* GUIDInfo Record Receiver object. +* Further operations should not be attempted on the destroyed object. 
+* This function should only be called after a call to +* osm_gir_rcv_construct or osm_gir_rcv_init. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_construct, +* osm_gir_rcv_init +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_init +* NAME +* osm_gir_rcv_init +* +* DESCRIPTION +* The osm_gir_rcv_init function initializes a +* GUIDInfo Record Receiver object for use. +* +* SYNOPSIS +*/ +ib_api_status_t +osm_gir_rcv_init( + IN osm_gir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN const osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object to initialize. +* +* p_req +* [in] Pointer to an osm_req_t object. +* +* p_subn +* [in] Pointer to the Subnet object for this subnet. +* +* p_log +* [in] Pointer to the log object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* +* RETURN VALUES +* CL_SUCCESS if the GUIDInfo Record Receiver object was initialized +* successfully. +* +* NOTES +* Allows calling other GUIDInfo Record Receiver methods. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_construct, +* osm_gir_rcv_destroy +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_process +* NAME +* osm_gir_rcv_process +* +* DESCRIPTION +* Process the GUIDInfoRecord attribute. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_process( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object. +* +* p_madw +* [in] Pointer to the MAD Wrapper containing the MAD +* that contains the node's GUIDInfoRecord attribute. +* +* RETURN VALUES +* CL_SUCCESS if the GUIDInfoRecord processing was successful. +* +* NOTES +* This function processes a GUIDInfoRecord attribute. 
+* +* SEE ALSO +* GUIDInfo Record Receiver, GUIDInfo Record Response Controller +*********/ + +END_C_DECLS + +#endif /* _OSM_GIR_RCV_H_ */ Property changes on: osm/include/opensm/osm_sa_guidinfo_record.h ___________________________________________________________________ Name: svn:keywords + Id Index: osm/include/opensm/osm_helper.h =================================================================== --- osm/include/opensm/osm_helper.h (revision 5193) +++ osm/include/opensm/osm_helper.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -247,6 +247,12 @@ osm_dump_portinfo_record( IN const osm_log_level_t log_level ); void +osm_dump_guidinfo_record( + IN osm_log_t* const p_log, + IN const ib_guidinfo_record_t* const p_gir, + IN const osm_log_level_t log_level ); + +void osm_dump_inform_info( IN osm_log_t* const p_log, IN const ib_inform_info_t* const p_ii, Index: osm/include/opensm/osm_sa.h =================================================================== --- osm/include/opensm/osm_sa.h (revision 5193) +++ osm/include/opensm/osm_sa.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -64,6 +64,7 @@ #include #include #include +#include #include #include #include @@ -154,6 +155,8 @@ typedef struct _osm_sa osm_nr_rcv_ctrl_t nr_rcv_ctrl; osm_pir_rcv_t pir_rcv; osm_pir_rcv_ctrl_t pir_rcv_ctrl; + osm_gir_rcv_t gir_rcv; + osm_gir_rcv_ctrl_t gir_rcv_ctrl; osm_lr_rcv_t lr_rcv; osm_lr_rcv_ctrl_t lr_rcv_ctrl; osm_pr_rcv_t pr_rcv; Index: osm/include/opensm/osm_msgdef.h =================================================================== --- osm/include/opensm/osm_msgdef.h (revision 5193) +++ osm/include/opensm/osm_msgdef.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -191,6 +191,7 @@ enum OSM_MSG_MAD_PKEY, OSM_MSG_MAD_VL_ARB, OSM_MSG_MAD_SLVL, + OSM_MSG_MAD_GUIDINFO_RECORD, OSM_MSG_MAX }; Index: osm/include/opensm/osm_sa_guidinfo_record_ctrl.h =================================================================== --- osm/include/opensm/osm_sa_guidinfo_record_ctrl.h (revision 0) +++ osm/include/opensm/osm_sa_guidinfo_record_ctrl.h (revision 0) @@ -0,0 +1,233 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Declaration of osm_sa_gir_rec_rcv_ctrl_t. + * This object represents a controller that receives the IBA GUID Info + * record query from SA client. + * This object is part of the OpenSM family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + + +#ifndef _OSM_GIR_CTRL_H_ +#define _OSM_GIR_CTRL_H_ + + +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/GUID Info Record Receive Controller +* NAME +* GUID Info Record Receive Controller +* +* DESCRIPTION +* The GUID Info Record Receive Controller object encapsulates +* the information needed to handle GUID Info record query from SA client. +* +* The GUID Info Record Receive Controller object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ +/****s* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_t +* NAME +* osm_gir_rcv_ctrl_t +* +* DESCRIPTION +* GUID Info Record Receive Controller structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_gir_rcv_ctrl +{ + osm_gir_rcv_t *p_rcv; + osm_log_t *p_log; + cl_dispatcher_t *p_disp; + cl_disp_reg_handle_t h_disp; + +} osm_gir_rcv_ctrl_t; +/* +* FIELDS +* p_rcv +* Pointer to the GUID Info Record Receiver object. +* +* p_log +* Pointer to the log object. +* +* p_disp +* Pointer to the Dispatcher. +* +* h_disp +* Handle returned from dispatcher registration. +* +* SEE ALSO +* GUID Info Record Receive Controller object +* GUID Info Record Receiver object +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rec_rcv_ctrl_construct +* NAME +* osm_gir_rcv_ctrl_construct +* +* DESCRIPTION +* This function constructs a GUID Info Record Receive Controller object. 
+* +* SYNOPSIS +*/ +void osm_gir_rcv_ctrl_construct( + IN osm_gir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a GUID Info Record Receive Controller +* object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_gir_rcv_ctrl_init, osm_gir_rcv_ctrl_destroy +* +* Calling osm_gir_rcv_ctrl_construct is a prerequisite to calling any other +* method except osm_gir_rcv_ctrl_init. +* +* SEE ALSO +* GUID Info Record Receive Controller object, osm_gir_rcv_ctrl_init, +* osm_gir_rcv_ctrl_destroy +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_destroy +* NAME +* osm_gir_rcv_ctrl_destroy +* +* DESCRIPTION +* The osm_gir_rcv_ctrl_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_gir_rcv_ctrl_destroy( + IN osm_gir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* GUIDInfo Record Receive Controller object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_gir_rcv_ctrl_construct or osm_gir_rcv_ctrl_init. +* +* SEE ALSO +* GUIDInfo Record Receive Controller object, osm_gir_rcv_ctrl_construct, +* osm_gir_rcv_ctrl_init +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_init +* NAME +* osm_gir_rcv_ctrl_init +* +* DESCRIPTION +* The osm_gir_rcv_ctrl_init function initializes a +* GUID Info Record Receive Controller object for use. +* +* SYNOPSIS +*/ +ib_api_status_t osm_gir_rcv_ctrl_init( + IN osm_gir_rcv_ctrl_t* const p_ctrl, + IN osm_gir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_gir_rcv_ctrl_t object to initialize. 
+* +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object. +* +* p_log +* [in] Pointer to the log object. +* +* p_disp +* [in] Pointer to the OpenSM central Dispatcher. +* +* RETURN VALUES +* CL_SUCCESS if the GUID Info Record Receive Controller object was initialized +* successfully. +* +* NOTES +* Allows calling other GUID Info Record Receive Controller methods. +* +* SEE ALSO +* GUID Info Record Receive Controller object, osm_gir_rcv_ctrl_construct, +* osm_gir_rcv_ctrl_destroy +*********/ + +END_C_DECLS + +#endif /* _OSM_GIR_CTRL_H_ */ Property changes on: osm/include/opensm/osm_sa_guidinfo_record_ctrl.h ___________________________________________________________________ Name: svn:keywords + Id Index: osm/include/Makefile.am =================================================================== --- osm/include/Makefile.am (revision 5193) +++ osm/include/Makefile.am (working copy) @@ -5,6 +5,7 @@ nobase_pkginclude_HEADERS = iba/ib_types EXTRA_DIST = \ $(srcdir)/opensm/osm_version.h \ $(srcdir)/opensm/osm_sa_portinfo_record_ctrl.h \ + $(srcdir)/opensm/osm_sa_guidinfo_record_ctrl.h \ $(srcdir)/opensm/osm_sa_path_record.h \ $(srcdir)/opensm/osm_lid_mgr.h \ $(srcdir)/opensm/osm_vl_arb_rcv.h \ @@ -33,6 +34,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sa_pkey_record_ctrl.h \ $(srcdir)/opensm/osm_helper.h \ $(srcdir)/opensm/osm_sa_portinfo_record.h \ + $(srcdir)/opensm/osm_sa_guidinfo_record.h \ $(srcdir)/opensm/osm_sa_service_record.h \ $(srcdir)/opensm/osm_sa_response.h \ $(srcdir)/opensm/osm_node.h \ Index: osm/include/iba/ib_types.h =================================================================== --- osm/include/iba/ib_types.h (revision 5193) +++ osm/include/iba/ib_types.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -1109,6 +1109,18 @@ ib_class_is_vendor_specific( #define IB_MAD_ATTR_SMINFO_RECORD (CL_NTOH16(0x0018)) /**********/ +/****d* IBA Base: Constants/IB_MAD_ATTR_GUIDINFO_RECORD +* NAME +* IB_MAD_ATTR_GUIDINFO_RECORD +* +* DESCRIPTION +* GuidInfoRecord attribute (15.2.5) +* +* SOURCE +*/ +#define IB_MAD_ATTR_GUIDINFO_RECORD (CL_NTOH16(0x0030)) +/**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_VENDOR_DIAG * NAME * IB_MAD_ATTR_VENDOR_DIAG @@ -1120,6 +1132,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_VENDOR_DIAG (CL_NTOH16(0x0030)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_LED_INFO * NAME * IB_MAD_ATTR_LED_INFO @@ -1131,6 +1144,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_LED_INFO (CL_NTOH16(0x0031)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_RECORD * NAME * IB_MAD_ATTR_SERVICE_RECORD @@ -1142,6 +1156,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SERVICE_RECORD (CL_NTOH16(0x0031)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_LFT_RECORD * NAME * IB_MAD_ATTR_LFT_RECORD @@ -1153,6 +1168,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_LFT_RECORD (CL_NTOH16(0x0015)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD * NAME * IB_MAD_ATTR_PKEYTBL_RECORD @@ -1164,6 +1180,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PKEY_TBL_RECORD (CL_NTOH16(0x0033)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PATH_RECORD * NAME * IB_MAD_ATTR_PATH_RECORD @@ -1175,6 +1192,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PATH_RECORD (CL_NTOH16(0x0035)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_VLARB_RECORD * NAME * IB_MAD_ATTR_VLARB_RECORD @@ -1186,6 +1204,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_VLARB_RECORD (CL_NTOH16(0x0036)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_RECORD * NAME * IB_MAD_ATTR_SLVL_RECORD @@ -1197,6 +1216,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SLVL_RECORD 
(CL_NTOH16(0x0013)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_MCMEMBER_RECORD * NAME * IB_MAD_ATTR_MCMEMBER_RECORD @@ -1208,6 +1228,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_MCMEMBER_RECORD (CL_NTOH16(0x0038)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TRACE_RECORD * NAME * IB_MAD_ATTR_MTRACE_RECORD @@ -1219,6 +1240,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TRACE_RECORD (CL_NTOH16(0x0039)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_MULTIPATH_RECORD * NAME * IB_MAD_ATTR_MULTIPATH_RECORD @@ -1230,6 +1252,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_MULTIPATH_RECORD (CL_NTOH16(0x003A)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD * NAME * IB_MAD_ATTR_SVC_ASSOCIATION_RECORD @@ -1241,6 +1264,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_NTOH16(0x003B)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_IO_UNIT_INFO * NAME * IB_MAD_ATTR_IO_UNIT_INFO @@ -1252,6 +1276,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_IO_UNIT_INFO (CL_NTOH16(0x0010)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_IO_CONTROLLER_PROFILE * NAME * IB_MAD_ATTR_IO_CONTROLLER_PROFILE @@ -1263,6 +1288,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_IO_CONTROLLER_PROFILE (CL_NTOH16(0x0011)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_ENTRIES * NAME * IB_MAD_ATTR_SERVICE_ENTRIES @@ -1274,6 +1300,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SERVICE_ENTRIES (CL_NTOH16(0x0012)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT * NAME * IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT @@ -1285,6 +1312,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT (CL_NTOH16(0x0020)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PREPARE_TO_TEST * NAME * IB_MAD_ATTR_PREPARE_TO_TEST @@ -1296,6 +1324,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PREPARE_TO_TEST 
(CL_NTOH16(0x0021)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_ONCE * NAME * IB_MAD_ATTR_TEST_DEVICE_ONCE @@ -1307,6 +1336,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TEST_DEVICE_ONCE (CL_NTOH16(0x0022)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_LOOP * NAME * IB_MAD_ATTR_TEST_DEVICE_LOOP @@ -1318,6 +1348,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TEST_DEVICE_LOOP (CL_NTOH16(0x0023)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_DIAG_CODE * NAME * IB_MAD_ATTR_DIAG_CODE @@ -1341,6 +1372,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_NTOH16(0x003B)) /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_CA * NAME * IB_NODE_TYPE_CA @@ -1352,6 +1384,7 @@ ib_class_is_vendor_specific( */ #define IB_NODE_TYPE_CA 0x01 /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_SWITCH * NAME * IB_NODE_TYPE_SWITCH @@ -1363,6 +1396,7 @@ ib_class_is_vendor_specific( */ #define IB_NODE_TYPE_SWITCH 0x02 /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_ROUTER * NAME * IB_NODE_TYPE_ROUTER @@ -1386,6 +1420,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_CA (CL_NTOH32(0x000001)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SWITCH * NAME * IB_NOTICE_NODE_TYPE_SWITCH @@ -1397,6 +1432,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_SWITCH (CL_NTOH32(0x000002)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_ROUTER * NAME * IB_NOTICE_NODE_TYPE_ROUTER @@ -1408,6 +1444,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_ROUTER (CL_NTOH32(0x000003)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SUBN_MGMT * NAME * IB_NOTICE_NODE_TYPE_SUBN_MGMT @@ -2148,18 +2185,22 @@ typedef struct _ib_path_rec #define IB_VLA_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_VLA_COMPMASK_OUT_PORT (CL_HTON64(((uint64_t)1)<<1)) #define IB_VLA_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<2)) 
+ /* SLtoVL Mapping Record Masks */ #define IB_SLVL_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_SLVL_COMPMASK_IN_PORT (CL_HTON64(((uint64_t)1)<<1)) #define IB_SLVL_COMPMASK_OUT_PORT (CL_HTON64(((uint64_t)1)<<2)) + /* P_Key Table Record Masks */ #define IB_PKEY_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_PKEY_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) #define IB_PKEY_COMPMASK_PORT (CL_HTON64(((uint64_t)1)<<2)) -/* LFT Record MASKS */ + +/* LFT Record Masks */ #define IB_LFTR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_LFTR_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) -/* ModeInfo Record MASKS */ + +/* NodeInfo Record Masks */ #define IB_NR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_NR_COMPMASK_RESERVED1 (CL_HTON64(((uint64_t)1)<<1)) #define IB_NR_COMPMASK_BASEVERSION (CL_HTON64(((uint64_t)1)<<2)) @@ -2175,6 +2216,7 @@ typedef struct _ib_path_rec #define IB_NR_COMPMASK_PORTNUM (CL_HTON64(((uint64_t)1)<<12)) #define IB_NR_COMPMASK_VENDID (CL_HTON64(((uint64_t)1)<<13)) #define IB_NR_COMPMASK_NODEDESC (CL_HTON64(((uint64_t)1)<<14)) + /* Service Record Component Mask Sec 15.2.5.14 Ver 1.1*/ #define IB_SR_COMPMASK_SID (CL_HTON64(((uint64_t)1)<<0)) #define IB_SR_COMPMASK_SGID (CL_HTON64(((uint64_t)1)<<1)) @@ -2213,6 +2255,7 @@ typedef struct _ib_path_rec #define IB_SR_COMPMASK_SDATA32_3 (CL_HTON64(((uint64_t)1)<<34)) #define IB_SR_COMPMASK_SDATA64_0 (CL_HTON64(((uint64_t)1)<<35)) #define IB_SR_COMPMASK_SDATA64_1 (CL_HTON64(((uint64_t)1)<<36)) + /* Port Info Record Component Masks */ #define IB_PIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_PIR_COMPMASK_PORTNUM (CL_HTON64(((uint64_t)1)<<1)) @@ -2263,6 +2306,7 @@ typedef struct _ib_path_rec #define IB_PIR_COMPMASK_RESPTIME (CL_HTON64(((uint64_t)1)<<46)) #define IB_PIR_COMPMASK_LOCALPHYERR (CL_HTON64(((uint64_t)1)<<47)) #define IB_PIR_COMPMASK_OVERRUNERR (CL_HTON64(((uint64_t)1)<<48)) + /* Multicast Member Record Component Masks */ #define IB_MCR_COMPMASK_GID 
(CL_HTON64(((uint64_t)1)<<0)) #define IB_MCR_COMPMASK_MGID (CL_HTON64(((uint64_t)1)<<0)) @@ -2284,6 +2328,20 @@ typedef struct _ib_path_rec #define IB_MCR_COMPMASK_JOIN_STATE (CL_HTON64(((uint64_t)1)<<16)) #define IB_MCR_COMPMASK_PROXY (CL_HTON64(((uint64_t)1)<<17)) +/* GUID Info Record Component Masks */ +#define IB_GIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) +#define IB_GIR_COMPMASK_BLOCKNUM (CL_HTON64(((uint64_t)1)<<1)) +#define IB_GIR_COMPMASK_RESV1 (CL_HTON64(((uint64_t)1)<<2)) +#define IB_GIR_COMPMASK_RESV2 (CL_HTON64(((uint64_t)1)<<3)) +#define IB_GIR_COMPMASK_GID0 (CL_HTON64(((uint64_t)1)<<4)) +#define IB_GIR_COMPMASK_GID1 (CL_HTON64(((uint64_t)1)<<5)) +#define IB_GIR_COMPMASK_GID2 (CL_HTON64(((uint64_t)1)<<6)) +#define IB_GIR_COMPMASK_GID3 (CL_HTON64(((uint64_t)1)<<7)) +#define IB_GIR_COMPMASK_GID4 (CL_HTON64(((uint64_t)1)<<8)) +#define IB_GIR_COMPMASK_GID5 (CL_HTON64(((uint64_t)1)<<9)) +#define IB_GIR_COMPMASK_GID6 (CL_HTON64(((uint64_t)1)<<10)) +#define IB_GIR_COMPMASK_GID7 (CL_HTON64(((uint64_t)1)<<11)) + /****f* IBA Base: Types/ib_path_rec_init_local * NAME * ib_path_rec_init_local @@ -5383,6 +5441,17 @@ typedef struct _ib_guid_info #include /************/ +#include +typedef struct _ib_guidinfo_record +{ + ib_net16_t lid; + uint8_t block_num; + uint8_t resv; + uint32_t reserved; + ib_guid_info_t guid_info; +} PACK_SUFFIX ib_guidinfo_record_t; +#include + #define IB_NUM_PKEY_ELEMENTS_IN_BLOCK 32 /****s* IBA Base: Types/ib_pkey_table_t * NAME Index: osm/opensm/osm_sa_guidinfo_record.c =================================================================== --- osm/opensm/osm_sa_guidinfo_record.c (revision 0) +++ osm/opensm/osm_sa_guidinfo_record.c (revision 0) @@ -0,0 +1,624 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Implementation of osm_gir_rcv_t. + * This object represents the GUIDInfoRecord Receiver object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +/* + Next available error code: 0x403 +*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OSM_GIR_RCV_POOL_MIN_SIZE 32 +#define OSM_GIR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_gir_item +{ + cl_pool_item_t pool_item; + ib_guidinfo_record_t rec; + +} osm_gir_item_t; + +typedef struct _osm_gir_search_ctxt +{ + const ib_guidinfo_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + osm_gir_rcv_t* p_rcv; + const osm_physp_t* p_req_physp; + +} osm_gir_search_ctxt_t; + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_construct( + IN osm_gir_rcv_t* const p_rcv ) +{ + cl_memclr( p_rcv, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_destroy( + IN osm_gir_rcv_t* const p_rcv ) +{ + OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_destroy ); + cl_qlock_pool_destroy( &p_rcv->pool ); + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_gir_rcv_init( + IN osm_gir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN const osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ) +{ + ib_api_status_t status; + + OSM_LOG_ENTER( p_log, osm_gir_rcv_init ); + + osm_gir_rcv_construct( p_rcv ); + + p_rcv->p_log = p_log; + p_rcv->p_subn = p_subn; + p_rcv->p_lock = p_lock; + p_rcv->p_resp = p_resp; + p_rcv->p_mad_pool = p_mad_pool; + + status = cl_qlock_pool_init( 
&p_rcv->pool, + OSM_GIR_RCV_POOL_MIN_SIZE, + 0, + OSM_GIR_RCV_POOL_GROW_SIZE, + sizeof(osm_gir_item_t), + NULL, NULL, NULL ); + + OSM_LOG_EXIT( p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static ib_api_status_t +__osm_gir_rcv_new_gir( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_node_t* const p_node, + IN cl_qlist_t* const p_list, + IN ib_net64_t const match_port_guid, + IN ib_net16_t const match_lid, + IN const osm_physp_t* const p_req_physp, + IN uint8_t const block_num ) +{ + osm_gir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_gir_rcv_new_gir ); + + p_rec_item = (osm_gir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_gir_rcv_new_gir: ERR 5102: " + "cl_qlock_pool_get failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_gir_rcv_new_gir: " + "New GUIDInfoRecord: lid 0x%X, block num %d\n", + cl_ntoh16( match_lid ), block_num ); + } + + cl_memclr( &p_rec_item->rec, sizeof( p_rec_item->rec ) ); + + p_rec_item->rec.lid = match_lid; + p_rec_item->rec.block_num = block_num; + if (!block_num) + p_rec_item->rec.guid_info.guid[0] = osm_physp_get_port_guid( p_req_physp ); + + cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +void +__osm_sa_gir_create_gir( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_node_t* const p_node, + IN cl_qlist_t* const p_list, + IN ib_net64_t const match_port_guid, + IN ib_net16_t const match_lid, + IN const osm_physp_t*
const p_req_physp, + IN uint8_t const match_block_num ) +{ + const osm_physp_t* p_physp; + uint8_t port_num; + uint8_t num_ports; + uint16_t match_lid_ho; + uint16_t lid_ho; + uint16_t base_lid_ho; + uint16_t max_lid_ho; + uint8_t lmc; + ib_net64_t port_guid; + ib_api_status_t status; + const ib_port_info_t* p_pi; + uint8_t block_num, start_block_num, end_block_num, num_blocks; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_gir_create_gir ); + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_gir_create_gir: " + "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", + cl_ntoh16( match_lid ), + cl_ntoh64( match_port_guid ) + ); + } + + /* + For switches, do not return the GUIDInfo record(s) + for each port on the switch, just for port 0. + */ + if( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) + num_ports = 1; + else + num_ports = osm_node_get_num_physp( p_node ); + + for( port_num = 0; port_num < num_ports; port_num++ ) + { + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + + if( !osm_physp_is_valid( p_physp ) ) + continue; + + /* Check to see if the found p_physp and the requestor physp + share a pkey.
If not - continue */ + if (!osm_physp_share_pkey( p_rcv->p_log, p_physp, p_req_physp ) ) + continue; + + port_guid = osm_physp_get_port_guid( p_physp ); + + if( match_port_guid && ( port_guid != match_port_guid ) ) + continue; + + p_pi = osm_physp_get_port_info_ptr( p_physp ); + num_blocks = p_pi->guid_cap / 8; + if ( p_pi->guid_cap % 8 ) + num_blocks++; + if (match_block_num == 255) + { + start_block_num = 0; + end_block_num = num_blocks - 1; + } + else + { + if (match_block_num >= num_blocks) + continue; + end_block_num = start_block_num = match_block_num; + } + + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); + lmc = osm_physp_get_lmc( p_physp ); + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); + match_lid_ho = cl_ntoh16( match_lid ); + + if( match_lid_ho ) + { + /* + We validate that the lid belongs to this node. + */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_gir_create_gir: " + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", + base_lid_ho, + match_lid_ho, + max_lid_ho + ); + } + + if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) + { + /* + Ignore return code for now. + */ + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, match_lid, + p_physp, block_num ); + } + } + else + { + /* + For every lid value create the GUIDInfo record(s).
+ */ + for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) + { + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + { + status = __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, cl_hton16( lid_ho ), + p_physp, block_num ); + if( status != IB_SUCCESS ) + break; + } + } + } + } + + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +void +__osm_sa_gir_by_comp_mask_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + const osm_gir_search_ctxt_t* const p_ctxt = (osm_gir_search_ctxt_t *)context; + const osm_node_t* const p_node = (osm_node_t*)p_map_item; + const ib_guidinfo_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec; + const osm_physp_t* const p_req_physp = p_ctxt->p_req_physp; + osm_gir_rcv_t* const p_rcv = p_ctxt->p_rcv; + const ib_guid_info_t* p_comp_gi; + ib_net64_t const comp_mask = p_ctxt->comp_mask; + ib_net64_t match_port_guid = 0; + ib_net16_t match_lid = 0; + uint8_t match_block_num = 255; + + OSM_LOG_ENTER( p_ctxt->p_rcv->p_log, __osm_sa_gir_by_comp_mask_cb); + + if( comp_mask & IB_GIR_COMPMASK_LID ) + match_lid = p_rcvd_rec->lid; + + if( comp_mask & IB_GIR_COMPMASK_BLOCKNUM ) + match_block_num = p_rcvd_rec->block_num; + + p_comp_gi = &p_rcvd_rec->guid_info; + /* Different rule for block 0 v. 
other blocks */ + if( comp_mask & IB_GIR_COMPMASK_GID0 ) + { + if ( !p_rcvd_rec->block_num ) + match_port_guid = osm_physp_get_port_guid( p_req_physp ); + if ( p_comp_gi->guid[0] != match_port_guid ) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID1 ) + { + if ( p_comp_gi->guid[1] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID2 ) + { + if ( p_comp_gi->guid[2] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID3 ) + { + if ( p_comp_gi->guid[3] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID4 ) + { + if ( p_comp_gi->guid[4] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID5 ) + { + if ( p_comp_gi->guid[5] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID6 ) + { + if ( p_comp_gi->guid[6] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID7 ) + { + if ( p_comp_gi->guid[7] != 0) + goto Exit; + } + + __osm_sa_gir_create_gir( p_rcv, p_node, p_ctxt->p_list, + match_port_guid, match_lid, p_req_physp, + match_block_num ); + + Exit: + OSM_LOG_EXIT( p_ctxt->p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_process( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + const ib_sa_mad_t* p_rcvd_mad; + const ib_guidinfo_record_t* p_rcvd_rec; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + ib_guidinfo_record_t* p_resp_rec; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t trim_num_rec; +#endif + uint32_t i; + osm_gir_search_ctxt_t context; + osm_gir_item_t* p_rec_item; + ib_api_status_t status; + osm_physp_t* p_req_physp; + + CL_ASSERT( p_rcv ); + + OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); + + CL_ASSERT( p_madw ); + + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = (ib_guidinfo_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad ); + + CL_ASSERT( 
p_rcvd_mad->attr_id == IB_MAD_ATTR_GUIDINFO_RECORD ); + + /* update the requestor physical port. */ + p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + osm_madw_get_mad_addr_ptr(p_madw) ); + if (p_req_physp == NULL) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5104: " + "Cannot find requestor physical port\n" ); + goto Exit; + } + + if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) && + (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5105: " + "Unsupported Method (%s)\n", + ib_get_sa_method_str( p_rcvd_mad->method ) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } + + if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_guidinfo_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + cl_plock_acquire( p_rcv->p_lock ); + + cl_qmap_apply_func( &p_rcv->p_subn->node_guid_tbl, + __osm_sa_gir_by_comp_mask_cb, + &context ); + + cl_plock_release( p_rcv->p_lock ); + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! + */ + if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && + (num_rec > 1)) { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); + + /* need to set the mem free ... 
*/ + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_guidinfo_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_gir_rcv_process: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_gir_rcv_process: " + "Returning %u records\n", num_rec ); + + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + + /* + * Get a MAD to reply. Address of Mad is in the received mad_wrapper + */ + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_guidinfo_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + + if( !p_resp_madw ) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5106: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + + goto Exit; + } + + p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); + + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. 
+ */ + + cl_memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method = (uint8_t)(p_resp_sa_mad->method | 0x80); + /* C15-0.1.5 - always return SM_Key = 0 (table 151 p 782) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_guidinfo_record_t) ); + + p_resp_rec = (ib_guidinfo_record_t*) + ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + { + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; + } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif + + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec = p_rec_item->rec; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } + + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); + + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + if(status != IB_SUCCESS) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5107: " + "osm_vendor_send. 
status = %s\n", + ib_get_err_str(status)); + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} Property changes on: osm/opensm/osm_sa_guidinfo_record.c ___________________________________________________________________ Name: svn:keywords + Id Index: osm/opensm/osm_helper.c =================================================================== --- osm/opensm/osm_helper.c (revision 5193) +++ osm/opensm/osm_helper.c (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -763,6 +763,50 @@ osm_dump_portinfo_record( /********************************************************************** **********************************************************************/ void +osm_dump_guidinfo_record( + IN osm_log_t* const p_log, + IN const ib_guidinfo_record_t* const p_gir, + IN const osm_log_level_t log_level ) +{ + const ib_guid_info_t * const p_gi = &p_gir->guid_info; + + if( osm_log_is_active( p_log, log_level ) ) + { + osm_log( p_log, log_level, + "GUIDInfo Record dump:\n" + "\t\t\t\tRID\n" + "\t\t\t\tLid.....................0x%X\n" + "\t\t\t\tBlockNum................0x%X\n" + "\t\t\t\tReserved................0x%X\n" + "\t\t\t\tGUIDInfo dump\n" + "\t\t\t\tReserved................0x%X\n" + "\t\t\t\tGUID 0..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 1..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 2..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 3..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 4..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 5..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 6..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 7..................0x%016" PRIx64 "\n", + cl_ntoh16(p_gir->lid), + p_gir->block_num, + p_gir->resv, + cl_ntoh32(p_gir->reserved), + 
cl_ntoh64(p_gi->guid[0]), + cl_ntoh64(p_gi->guid[1]), + cl_ntoh64(p_gi->guid[2]), + cl_ntoh64(p_gi->guid[3]), + cl_ntoh64(p_gi->guid[4]), + cl_ntoh64(p_gi->guid[5]), + cl_ntoh64(p_gi->guid[6]), + cl_ntoh64(p_gi->guid[7]) + ); + } +} + +/********************************************************************** + **********************************************************************/ +void osm_dump_node_info( IN osm_log_t* const p_log, IN const ib_node_info_t* const p_ni, Index: osm/opensm/osm_sa.c =================================================================== --- osm/opensm/osm_sa.c (revision 5193) +++ osm/opensm/osm_sa.c (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -89,6 +89,9 @@ osm_sa_construct( osm_pir_rcv_construct( &p_sa->pir_rcv ); osm_pir_rcv_ctrl_construct( &p_sa->pir_rcv_ctrl ); + osm_gir_rcv_construct( &p_sa->gir_rcv ); + osm_gir_rcv_ctrl_construct( &p_sa->gir_rcv_ctrl ); + osm_lr_rcv_construct( &p_sa->lr_rcv ); osm_lr_rcv_ctrl_construct( &p_sa->lr_rcv_ctrl ); @@ -135,6 +138,7 @@ osm_sa_shutdown( /* remove any registered dispatcher message */ osm_nr_rcv_ctrl_destroy( &p_sa->nr_rcv_ctrl ); osm_pir_rcv_ctrl_destroy( &p_sa->pir_rcv_ctrl ); + osm_gir_rcv_ctrl_destroy( &p_sa->gir_rcv_ctrl ); osm_lr_rcv_ctrl_destroy( &p_sa->lr_rcv_ctrl ); osm_pr_rcv_ctrl_destroy( &p_sa->pr_rcv_ctrl ); osm_smir_ctrl_destroy( &p_sa->smir_ctrl ); @@ -162,6 +166,7 @@ osm_sa_destroy( osm_nr_rcv_destroy( &p_sa->nr_rcv ); osm_pir_rcv_destroy( &p_sa->pir_rcv ); + osm_gir_rcv_destroy( &p_sa->gir_rcv ); osm_lr_rcv_destroy( &p_sa->lr_rcv ); osm_pr_rcv_destroy( &p_sa->pr_rcv ); osm_smir_rcv_destroy( &p_sa->smir_rcv ); @@ -276,6 +281,24 @@ osm_sa_init( if( status != IB_SUCCESS ) goto Exit; + status = osm_gir_rcv_init( + 
&p_sa->gir_rcv, + &p_sa->resp, + p_sa->p_mad_pool, + p_subn, + p_log, + p_lock ); + if( status != IB_SUCCESS ) + goto Exit; + + status = osm_gir_rcv_ctrl_init( + &p_sa->gir_rcv_ctrl, + &p_sa->gir_rcv, + p_log, + p_disp ); + if( status != IB_SUCCESS ) + goto Exit; + status = osm_lr_rcv_init( &p_sa->lr_rcv, &p_sa->resp, Index: osm/opensm/osm_sa_class_port_info.c =================================================================== --- osm/opensm/osm_sa_class_port_info.c (revision 5193) +++ osm/opensm/osm_sa_class_port_info.c (working copy) @@ -183,7 +183,6 @@ __osm_cpi_rcv_respond( SMInfoRecord, (we do support it - under the table) InformInfoRecord, LinkRecord, (we do support it - under the table) - GuidInfoRecord ServiceAssociationRecord OSM_CAP_IS_SUBN_OPT_MULTI_PATH_SUP: Index: osm/opensm/libopensm.map =================================================================== --- osm/opensm/libopensm.map (revision 5193) +++ osm/opensm/libopensm.map (working copy) @@ -17,6 +17,7 @@ OPENSM_1.0 { osm_dbg_get_capabilities_str; osm_dump_port_info; osm_dump_portinfo_record; + osm_dump_guidinfo_record; osm_dump_node_info; osm_dump_node_record; osm_dump_path_record; Index: osm/opensm/osm_sa_mad_ctrl.c =================================================================== --- osm/opensm/osm_sa_mad_ctrl.c (revision 5193) +++ osm/opensm/osm_sa_mad_ctrl.c (working copy) @@ -205,6 +205,10 @@ __osm_sa_mad_ctrl_process( msg_id = OSM_MSG_MAD_LFT_RECORD; break; + case IB_MAD_ATTR_GUIDINFO_RECORD: + msg_id = OSM_MSG_MAD_GUIDINFO_RECORD; + break; + default: osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sa_mad_ctrl_process: ERR 1A01: " Index: osm/opensm/osm_sa_guidinfo_record_ctrl.c =================================================================== --- osm/opensm/osm_sa_guidinfo_record_ctrl.c (revision 0) +++ osm/opensm/osm_sa_guidinfo_record_ctrl.c (revision 0) @@ -0,0 +1,129 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. 
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Implementation of osm_gir_rcv_ctrl_t. + * This object represents the GUIDInfoRecord request controller object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +/* + Next available error code: 0x203 +*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +/********************************************************************** + **********************************************************************/ +void +__osm_gir_rcv_ctrl_disp_callback( + IN void *context, + IN void *p_data ) +{ + /* ignore return status when invoked via the dispatcher */ + osm_gir_rcv_process( ((osm_gir_rcv_ctrl_t*)context)->p_rcv, + (osm_madw_t*)p_data ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_ctrl_construct( + IN osm_gir_rcv_ctrl_t* const p_ctrl ) +{ + cl_memclr( p_ctrl, sizeof(*p_ctrl) ); + p_ctrl->h_disp = CL_DISP_INVALID_HANDLE; +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_ctrl_destroy( + IN osm_gir_rcv_ctrl_t* const p_ctrl ) +{ + CL_ASSERT( p_ctrl ); + cl_disp_unregister( p_ctrl->h_disp ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_gir_rcv_ctrl_init( + IN osm_gir_rcv_ctrl_t* const p_ctrl, + IN osm_gir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ) +{ + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_log, osm_gir_rcv_ctrl_init ); + + osm_gir_rcv_ctrl_construct( p_ctrl ); + p_ctrl->p_log = p_log; + p_ctrl->p_rcv = p_rcv; + p_ctrl->p_disp = p_disp; + + p_ctrl->h_disp = cl_disp_register( + p_disp, + OSM_MSG_MAD_GUIDINFO_RECORD, + __osm_gir_rcv_ctrl_disp_callback, + p_ctrl ); + + if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_gir_rcv_ctrl_init: ERR 5201: " + "Dispatcher registration 
failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_log ); + return( status ); +} Property changes on: osm/opensm/osm_sa_guidinfo_record_ctrl.c ___________________________________________________________________ Name: svn:keywords + Id Index: osm/opensm/Makefile.am =================================================================== --- osm/opensm/Makefile.am (revision 5193) +++ osm/opensm/Makefile.am (working copy) @@ -46,6 +46,7 @@ opensm_SOURCES = main.c osm_console.c os osm_sa_path_record.c osm_sa_path_record_ctrl.c \ osm_sa_pkey_record.c osm_sa_pkey_record_ctrl.c \ osm_sa_portinfo_record.c osm_sa_portinfo_record_ctrl.c \ + osm_sa_guidinfo_record.c osm_sa_guidinfo_record_ctrl.c \ osm_sa_response.c osm_sa_service_record.c \ osm_sa_service_record_ctrl.c osm_sa_slvl_record.c \ osm_sa_slvl_record_ctrl.c osm_sa_sminfo_record.c \ From halr at voltaire.com Fri Jan 27 13:07:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jan 2006 16:07:43 -0500 Subject: [openib-general] iwarp: whats a pkey? In-Reply-To: References: <1138290699.760.25.camel@stevo-desktop> <1138292581.760.36.camel@stevo-desktop> Message-ID: <1138395827.4338.71948.camel@hal.voltaire.com> On Fri, 2006-01-27 at 15:32, Roland Dreier wrote: > Roland> No, I think trying to create a mapping is a bad idea. The > Roland> semantics of VLANs and IB partitions are sufficiently > Roland> different that it's probably better to treat each concept > Roland> natively. > > Steve> Roland, can you expand on this some? > > I don't think transport neutral code should be dealing with either > P_Keys or VLANs. The Linux model for handling VLANs is that each VLAN > has a separate network interface. So an iWARP consumer should never > deal with VLANs, just with a routing choice of interfaces. > > Similarly if a consumer is using the iWARP-emulation CM for IB, then > the P_Key will come from the IPoIB interface. 
Only native IB
> consumers that understand partitions ever have to deal with P_Keys.

What about a gateway between iWARP and IB ? Would it need the mappings
between VLAN and IB partition ? If so, I would presume that is at a
layer above what you are talking about.

-- Hal

> - R.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mamidala at cse.ohio-state.edu Fri Jan 27 14:47:55 2006
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Fri, 27 Jan 2006 17:47:55 -0500 (EST)
Subject: [openib-general] modify UD QP error
Message-ID:

I was trying to modify a UD QP to init state and the modify call was
failing with an error code of 22, I am including the code snippet which
does create/ modify of the QP:

memset (&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_recv_wr = 100;
qp_init_attr.cap.max_send_wr = 100;
qp_init_attr.cap.max_recv_sge = 10;
qp_init_attr.cap.max_send_sge = 10;
qp_init_attr.cap.max_inline_data = 0;
qp_init_attr.recv_cq = ud_rcq_hndl; /* Created earlier */
qp_init_attr.send_cq = ud_scq_hndl; /* Created earlier */
qp_init_attr.sq_sig_all = 0;
qp_init_attr.qp_type = IBV_QPT_UD;
ud_qp_hndl= ibv_create_qp (ptag, &qp_init_attr); /* ptag allocated earlier */

memset (&qp_attr, 0, sizeof (qp_attr));
qp_attr.qp_state = IBV_QPS_INIT;
qp_attr.pkey_index = 0;
qp_attr.port_num = 0;
qp_attr.qkey = 0;
if (ret = ibv_modify_qp(ud_qp_hndl, &qp_attr,
                        IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                        IBV_QP_PORT | IBV_QP_QKEY))

error : 22

Thanks,
Amith

From mshefty at ichips.intel.com Fri Jan 27 14:56:09 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jan 2006 14:56:09 -0800
Subject: [openib-general] modify UD QP error
In-Reply-To:
References:
Message-ID: <43DAA509.1010209@ichips.intel.com>

amith rajith mamidala wrote:
> qp_attr.port_num = 0;

Try
port_num = 1. - Sean From caitlinb at broadcom.com Fri Jan 27 15:34:48 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 27 Jan 2006 15:34:48 -0800 Subject: [openib-general] iwarp: whats a pkey? Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D07D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Fri, 2006-01-27 at 15:32, Roland Dreier wrote: >> Roland> No, I think trying to create a mapping is a bad idea. >> The Roland> semantics of VLANs and IB partitions are sufficiently >> Roland> different that it's probably better to treat each >> concept Roland> natively. >> >> Steve> Roland, can you expand on this some? >> >> I don't think transport neutral code should be dealing with either >> P_Keys or VLANs. The Linux model for handling VLANs is that each >> VLAN has a separate network interface. So an iWARP consumer should >> never deal with VLANs, just with a routing choice of interfaces. >> >> Similarly if a consumer is using the iWARP-emulation CM for IB, then >> the P_Key will come from the IPoIB interface. Only native IB >> consumers that understand partitions ever have to deal with P_Keys. > > What about a gateway between iWARP and IB ? Would it need the > mappings between VLAN and IB partition ? If so, I would > presume that is at a layer above what you are talking about. > If you are attempting to implement an iWARP/IB gateway *above* transport neutral verbs then I don't think there is anything that can be defined that will be of much help. From sean.hefty at intel.com Fri Jan 27 15:47:04 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 27 Jan 2006 15:47:04 -0800 Subject: [openib-general] RFC multicast support Message-ID: I'd like to start a discussion about the following proposal regarding userspace multicast support over IB. 1. I'd like to expand the CMA to include an asynchronous rdma_set_option() routine. 2. This routine would become the user interface to joining and leaving multicast groups. 
For example:

   rdma_set_option(struct rdma_cm_id *, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   struct in_addr *, sizeof(struct in_addr));

The rdma_cm_id must already have been bound to a device.

3. The specified IP address would be converted into an MGID as follows:

   FF:01:scope:signature:group_id

   scope - determined by IP address range
   signature - x4001 or x6001 for IPv4 or IPv6, respectively
   group_id - lower 28 or 80 bits of IP address

4. Join/leave requests would be tracked by the local SA cache. A port would not
be removed from the group while there were active members. Optionally, a port
could remain in the group without any members for some user specified duration.
(I'm not sure how useful this would be in practice.)

5. Group creation would either be controlled by some other mechanism, or by the
first join. In the latter case, values for the qkey, sl, flowlabel, and tclass
are needed. We could use the same values as IPoIB for this, with the exception
of making up a new qkey.

6. Some additional APIs may be necessary to assist users in sending data (i.e.
address handle attributes) over the QP. (rdma_get_option()?)

Comments?

- Sean

From rdreier at cisco.com Fri Jan 27 15:59:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 27 Jan 2006 15:59:38 -0800
Subject: [openib-general] RFC multicast support
In-Reply-To: (Sean Hefty's message of "Fri, 27 Jan 2006 15:47:04 -0800")
References:
Message-ID:

> 1. I'd like to expand the CMA to include an asynchronous rdma_set_option()
> routine.
>
> 2. This routine would become the user interface to joining and leaving multicast
> groups. For example:
>
> rdma_set_option(struct rdma_cm_id *, IPPROTO_IP, IP_ADD_MEMBERSHIP,
> struct in_addr *, sizeof(struct in_addr));

I'm not sure this makes sense. Only IB has the notion of multicast on
QPs and only UD QPs can be used for multicast. So why would we want a
transport-neutral API for an IB-specific feature?

> 4. Join/leave requests would be tracked by the local SA cache.
A port would not > be removed from the group while there were active members. Optionally, a port > could remain in the group without any members for some user specified duration. > (I'm not sure how useful this would be in practice.) Reference counting multicast group joins is definitely something that IB needs. Otherwise there's no way for userspace to use IB multicast in a sane way. - R. From rdreier at cisco.com Fri Jan 27 16:43:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Jan 2006 16:43:09 -0800 Subject: [openib-general] RFC multicast support In-Reply-To: (Roland Dreier's message of "Fri, 27 Jan 2006 15:59:38 -0800") References: Message-ID: Roland> I'm not sure this makes sense. Only IB has the notion of Roland> multicast on QPs and only UD QPs can be used for Roland> multicast. So why would we want a transport-neutral API Roland> for an IB-specific feature? Oh yeah... also why would you want a CM ID for an unconnected QP? - R. From tom at opengridcomputing.com Fri Jan 27 16:55:49 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 27 Jan 2006 18:55:49 -0600 Subject: [openib-general] RFC multicast support In-Reply-To: References: Message-ID: <1138409749.9767.11.camel@trinity.ogc.int> Well, before you decide this is crazy, a UDP endpoint has a "socket", so an unconnected cm_id is not really all that crazy. A cm_id at the layer we've defined it encapsulates primarily application layer information like callback function pointers, contexts, and addresses. These values are every bit as important for an "unconnected" QP. The fact that iWARP doesn't currently have a notion of multicast, etc... doesn't mean that a) other transports besides IB won't, b) that iWARP companies haven't ever thought about how to do it and will never do it, or ... Ok...maybe it really is crazy.... On Fri, 2006-01-27 at 16:43 -0800, Roland Dreier wrote: > Roland> I'm not sure this makes sense. 
Only IB has the notion of > Roland> multicast on QPs and only UD QPs can be used for > Roland> multicast. So why would we want a transport-neutral API > Roland> for an IB-specific feature? > > Oh yeah... also why would you want a CM ID for an unconnected QP? > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hozer at hozed.org Fri Jan 27 17:45:29 2006 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 27 Jan 2006 19:45:29 -0600 Subject: [openib-general] iwarp: whats a pkey? In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122D07D@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F122D07D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20060128014529.GC538@narn.hozed.org> On Fri, Jan 27, 2006 at 03:34:48PM -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > On Fri, 2006-01-27 at 15:32, Roland Dreier wrote: > >> Roland> No, I think trying to create a mapping is a bad idea. > >> The Roland> semantics of VLANs and IB partitions are sufficiently > >> Roland> different that it's probably better to treat each > >> concept Roland> natively. > >> > >> Steve> Roland, can you expand on this some? > >> > >> I don't think transport neutral code should be dealing with either > >> P_Keys or VLANs. The Linux model for handling VLANs is that each > >> VLAN has a separate network interface. So an iWARP consumer should > >> never deal with VLANs, just with a routing choice of interfaces. > >> > >> Similarly if a consumer is using the iWARP-emulation CM for IB, then > >> the P_Key will come from the IPoIB interface. Only native IB > >> consumers that understand partitions ever have to deal with P_Keys. > > > > What about a gateway between iWARP and IB ? Would it need the > > mappings between VLAN and IB partition ? 
If so, I would > > presume that is at a layer above what you are talking about. > > > > If you are attempting to implement an iWARP/IB gateway *above* > transport neutral verbs then I don't think there is anything > that can be defined that will be of much help. The IB side of the gateway should know about P_Keys, but I think attempting to translate that to vlans is a bad idea. Any use case I can think of that is going to want to map IB partitions to a vlan will do it by having the gateway know what P_Key goes to what ethernet interface, and there are already other tools to map ethernet interfaces to vlans. Let's not reinvent this until there's an actual need. From sean.hefty at intel.com Fri Jan 27 18:17:20 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 27 Jan 2006 18:17:20 -0800 Subject: [openib-general] RFC multicast support In-Reply-To: Message-ID: > > rdma_set_option(struct rdma_cm_id *, IPPROTO_IP, IP_ADD_MEMBERSHIP, > > struct in_addr *, sizeof(struct in_addr)); > >I'm not sure this makes sense. Only IB has the notion of multicast on >QPs and only UD QPs can be used for multicast. So why would we want a >transport-neutral API for an IB-specific feature? First, I'll plead ignorance. I didn't realize that iWarp didn't support multicast, so I'll need to rethink this some. As for some other reasons... The CMA API came as reasonably close as I could make it to providing socket-like semantics. One of the intents here is to provide an API that feels familiar to what people are used to using. I was trying to allow applications to define multicast groups using IP addresses, as opposed to GIDs directly. My understanding is that MPI will use both connected and multicast QPs for communication; I wanted to provide a somewhat consistent addressing model. To answer your other post, adding this as an extension to the rdma_cm_id lets the CMA handle device hotplug, and gives us an existing asynchronous event model (for userspace). 
The CMA might also be able to handle the QP transitions and multicast group attach/detach calls for the user. I'm just throwing out random ideas now, but taking a longer term approach, the CMA interface could be expanded to support UD QPs, to abstract SIDR and address handle management from the user. >Reference counting multicast group joins is definitely something that >IB needs. Otherwise there's no way for userspace to use IB multicast >in a sane way. This is a definite requirement regardless of the resulting API. From swise at opengridcomputing.com Sat Jan 28 07:47:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 28 Jan 2006 09:47:28 -0600 Subject: [openib-general] RFC multicast support References: Message-ID: <01d201c62422$24cb5070$020010ac@haggard> > > The CMA API came as reasonably close as I could make it to providing > socket-like > semantics. One of the intents here is to provide an API that feels > familiar to > what people are used to using. I was trying to allow applications to > define > multicast groups using IP addresses, as opposed to GIDs directly. My > understanding is that MPI will use both connected and multicast QPs > for > communication; I wanted to provide a somewhat consistent addressing > model. > IMO this is a good reason to do the work. From halr at voltaire.com Sat Jan 28 09:23:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Jan 2006 12:23:00 -0500 Subject: [openib-general] [PATCHv2] OpenSM: include svn version in build string Message-ID: <1138468979.4453.20.camel@hal.voltaire.com> OpenSM: include OpenIB svn version when OpenIB build [Note: this is implemented using svnversion as seemed to be the consensus rather than using .svn/entries. There are some downsides to this approach. 
] Signed-off-by: Hal Rosenstock Index: include/opensm/osm_svn_revision.h =================================================================== --- include/opensm/osm_svn_revision.h (revision 0) +++ include/opensm/osm_svn_revision.h (revision 0) @@ -0,0 +1 @@ +#define OSM_SVN_REVISION "" Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 5193) +++ opensm/osm_opensm.c (working copy) @@ -59,6 +59,9 @@ #include #include #include +#ifdef OSM_VENDOR_INTF_OPENIB +#include +#endif #include #include #include @@ -206,12 +209,33 @@ osm_opensm_init( if( status != IB_SUCCESS ) return ( status ); +#ifndef OSM_VENDOR_INTF_OPENIB /* If there is a log level defined - add the OSM_VERSION to it. */ osm_log( &p_osm->log, osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", OSM_VERSION ); /* Write the OSM_VERSION to the SYS_LOG */ osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ +#else + if (strlen(OSM_SVN_REVISION)) + { + /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */ + osm_log( &p_osm->log, + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n", + OSM_VERSION, OSM_SVN_REVISION ); + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ + } + else + { + /* If there is a log level defined - add the OSM_VERSION to it. 
*/ + osm_log( &p_osm->log, + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", + OSM_VERSION ); + /* Write the OSM_VERSION to the SYS_LOG */ + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ + } +#endif osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */ Index: opensm/main.c =================================================================== --- opensm/main.c (revision 5193) +++ opensm/main.c (working copy) @@ -57,6 +57,9 @@ #include #include #include +#ifdef OSM_VENDOR_INTF_OPENIB +#include +#endif #include #include #include @@ -522,6 +525,10 @@ main( printf("-------------------------------------------------\n"); printf("%s\n", OSM_VERSION); +#if defined ( OSM_VENDOR_INTF_OPENIB ) + if (strlen(OSM_SVN_REVISION)) + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); +#endif osm_subn_set_default_opt(&opt); osm_subn_parse_conf_file(&opt); Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 5193) +++ opensm/Makefile.am (working copy) @@ -9,6 +9,23 @@ else DBGFLAGS = -g endif +if OSMV_OPENIB +$(srcdir)/../include/opensm/osm_svn_revision_new.h: + echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + svnversion $(srcdir)/.. 
| tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h + +$(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ + then \ + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + else \ + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ + fi +endif + libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 if HAVE_LD_VERSION_SCRIPT From mamidala at cse.ohio-state.edu Sat Jan 28 09:46:48 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Sat, 28 Jan 2006 12:46:48 -0500 (EST) Subject: [openib-general] multicast join errors Message-ID: Hi, I was able to create multicast groups after Hal's fix. But, when I do join subsequently from the same program I am getting a port_alloc error: Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to port 0x6270510000005. -I- Created the Multicast Group: MGID....................0xff13a01cfe800000 : 0x0000000000000000 PortGid.................0xfe80000000000000 : 0x0006270510000005 qkey....................0x0 Mlid....................0xC002 ScopeState..............0x21 Rate....................0x83 Mtu.....................0x84 Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to port 0x6270510000005. ibwarn: [4057] port_alloc: umad port id 0 is already allocated for mthca0 1 Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR 542C: umad_open_port() failed Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: Unable to Open Port 0x6270510000005. 
Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: Failed to bind to vendor GSI Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: Unable to bind to SA I am trying to trace the source of this error, Thanks, Amith From halr at voltaire.com Sat Jan 28 11:34:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Jan 2006 14:34:55 -0500 Subject: [openib-general] multicast join errors In-Reply-To: References: Message-ID: <1138476894.4453.46.camel@hal.voltaire.com> Hi Amith, On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > Hi, > > I was able to create multicast groups after Hal's fix. But, when I do join > subsequently from the same program I am getting a port_alloc error: > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to port > 0x6270510000005. > -I- Created the Multicast Group: > MGID....................0xff13a01cfe800000 : 0x0000000000000000 > PortGid.................0xfe80000000000000 : 0x0006270510000005 > qkey....................0x0 > Mlid....................0xC002 > ScopeState..............0x21 > Rate....................0x83 > Mtu.....................0x84 > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to port > 0x6270510000005. > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for mthca0 > 1 > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR 542C: > umad_open_port() failed > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: Unable to > Open Port 0x6270510000005. > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: Failed to > bind to vendor GSI > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: Unable to > bind to SA > > I am trying to trace the source of this error, Is this the only IB application running or are there others (and if so, what else is running) ? 
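
For readers of the quoted dump, the ScopeState, Rate and Mtu bytes are packed fields of the multicast member record. The sketch below shows one way to unpack them, assuming the usual InfiniBand encoding: for Rate and Mtu, a selector in the top 2 bits and an enumerated value in the low 6 bits; for ScopeState, the scope in the high nibble and the join state in the low nibble. All function names here are hypothetical helpers written for illustration, not OpenSM or osmtest APIs, and only the three original link rates are listed.

```c
#include <stdint.h>
#include <stdio.h>

/* Rate and Mtu bytes: top 2 bits are a selector, low 6 bits an enumerated value. */
static unsigned field_selector(uint8_t f) { return f >> 6; }
static unsigned field_value(uint8_t f)    { return f & 0x3f; }

/* ScopeState byte: scope in the high nibble, join state in the low nibble. */
static unsigned mc_scope(uint8_t s)      { return s >> 4; }
static unsigned mc_join_state(uint8_t s) { return s & 0x0f; }

/* IB MTU enumeration 1..5 maps to 256..4096 bytes; 0 means "not a valid code". */
static unsigned mtu_bytes(unsigned code)
{
    return (code >= 1 && code <= 5) ? (128u << code) : 0;
}

/* Only the three original link rates are listed; other codes are left out. */
static const char *rate_name(unsigned code)
{
    switch (code) {
    case 2:  return "2.5 Gb/s";
    case 3:  return "10 Gb/s";
    case 4:  return "30 Gb/s";
    default: return "unlisted";
    }
}

/* Selector meaning shared by the Rate and Mtu fields. */
static const char *selector_name(unsigned sel)
{
    static const char *const names[] = {
        "greater than", "less than", "exactly", "largest available"
    };
    return names[sel & 3];
}

/* Hypothetical helper: format the three packed bytes from the dump into buf. */
static int describe_mc_fields(char *buf, size_t len,
                              uint8_t scope_state, uint8_t rate, uint8_t mtu)
{
    return snprintf(buf, len,
                    "scope=%u join_state=0x%x rate=%s %s mtu=%s %u",
                    mc_scope(scope_state), mc_join_state(scope_state),
                    selector_name(field_selector(rate)),
                    rate_name(field_value(rate)),
                    selector_name(field_selector(mtu)),
                    mtu_bytes(field_value(mtu)));
}
```

Passing the values from the dump (0x21, 0x83, 0x84) to describe_mc_fields() reports scope 2 (link-local), join state 0x1 (full member), a rate of exactly 10 Gb/s and an MTU of exactly 2048 bytes.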
-- Hal > Thanks, > Amith > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mamidala at cse.ohio-state.edu Sat Jan 28 13:18:54 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Sat, 28 Jan 2006 16:18:54 -0500 (EST) Subject: [openib-general] multicast join errors In-Reply-To: <1138476894.4453.46.camel@hal.voltaire.com> Message-ID: Hi Hal, There is only one application running on a node. I am running opensm on a different node. I am also listing the other processes I observed on doing a "ps": root 3564 11 0 Jan26 ? 00:00:00 [ib_cm/0] root 3565 11 0 Jan26 ? 00:00:00 [ib_cm/1] root 1294 11 0 Jan26 ? 00:00:00 [ib_mad1] root 1295 11 0 Jan26 ? 00:00:00 [ib_mad2] root 1298 11 0 Jan26 ? 00:00:00 [ib_mad1] root 1299 11 0 Jan26 ? 00:00:00 [ib_mad2] Thanks, Amith On 28 Jan 2006, Hal Rosenstock wrote: > Hi Amith, > > On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > > Hi, > > > > I was able to create multicast groups after Hal's fix. But, when I do join > > subsequently from the same program I am getting a port_alloc error: > > > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to port > > 0x6270510000005. > > -I- Created the Multicast Group: > > MGID....................0xff13a01cfe800000 : 0x0000000000000000 > > PortGid.................0xfe80000000000000 : 0x0006270510000005 > > qkey....................0x0 > > Mlid....................0xC002 > > ScopeState..............0x21 > > Rate....................0x83 > > Mtu.....................0x84 > > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to port > > 0x6270510000005. 
> > > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for mthca0 > > 1 > > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR 542C: > > umad_open_port() failed > > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: Unable to > > Open Port 0x6270510000005. > > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: Failed to > > bind to vendor GSI > > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: Unable to > > bind to SA > > > > I am trying to trace the source of this error, > > Is this the only IB application running or are there others (and if so, > what else is running) ? > > -- Hal > > > Thanks, > > Amith > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sat Jan 28 23:36:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 29 Jan 2006 09:36:26 +0200 Subject: [openib-general] multicast join errors Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B637@mtlexch01.mtl.com> Hi Amith, Please send the ibstat output for that node. I suspect the port 0x6270510000005 is not up. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of amith rajith mamidala > Sent: Saturday, January 28, 2006 11:19 PM > To: Hal Rosenstock > Cc: mvapich-core at cse.ohio-state.edu; openib-general at openib.org > Subject: Re: [openib-general] multicast join errors > > Hi Hal, > > There is only one application running on a node. I am running opensm on > a different node. I am also listing the other processes I observed on > doing a "ps": > > root 3564 11 0 Jan26 ? 
00:00:00 [ib_cm/0] > root 3565 11 0 Jan26 ? 00:00:00 [ib_cm/1] > root 1294 11 0 Jan26 ? 00:00:00 [ib_mad1] > root 1295 11 0 Jan26 ? 00:00:00 [ib_mad2] > root 1298 11 0 Jan26 ? 00:00:00 [ib_mad1] > root 1299 11 0 Jan26 ? 00:00:00 [ib_mad2] > > > Thanks, > Amith > > On 28 Jan 2006, Hal Rosenstock wrote: > > > Hi Amith, > > > > On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > > > Hi, > > > > > > I was able to create multicast groups after Hal's fix. But, when I do join > > > subsequently from the same program I am getting a port_alloc error: > > > > > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to port > > > 0x6270510000005. > > > -I- Created the Multicast Group: > > > MGID....................0xff13a01cfe800000 : 0x0000000000000000 > > > PortGid.................0xfe80000000000000 : 0x0006270510000005 > > > qkey....................0x0 > > > Mlid....................0xC002 > > > ScopeState..............0x21 > > > Rate....................0x83 > > > Mtu.....................0x84 > > > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to port > > > 0x6270510000005. > > > > > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for mthca0 > > > 1 > > > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR 542C: > > > umad_open_port() failed > > > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: Unable to > > > Open Port 0x6270510000005. > > > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: Failed to > > > bind to vendor GSI > > > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: Unable to > > > bind to SA > > > > > > I am trying to trace the source of this error, > > > > Is this the only IB application running or are there others (and if so, > > what else is running) ? 
> > > > -- Hal > > > > > Thanks, > > > Amith > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bill.boas at gmail.com Sat Jan 28 17:45:12 2006 From: bill.boas at gmail.com (Bill Boas) Date: Sat, 28 Jan 2006 17:45:12 -0800 Subject: [openib-general] Revised Sonoma Agenda plus a request to those attending Message-ID: <19a929370601281745n156444bdj664729dea100536f@mail.gmail.com> All attending the Workshop and others interested: Attached is an updated draft of the "Program at a Glance". It is still subject to change and your feedback will be appreciated if you have suggestions for improvement or see problems/conflicts or topics left out. There are 3 or 4 further rooms at the Lodge available to us for break out or Birds of a feather, just let me know in advance to make a reservation or come to the registration desk outside the Sonoma Ballroom and ask for one when you are there. REQUEST TO ALL ATTENDING There are over 30 people REGISTERED through the Acteva site www.acteva.com/go/rdma WHO HAVE NOT RESERVED a room at the Lodge. As a result OpenIB is BELOW THE GUARANTEED NUMBER we gave the Lodge. Those who have forgotten to make a room reservation please make it THROUGH these links, NOT by calling the Lodge directly: - General group room rate, priority code OPAOPAA - http://marriott.com/property/propertypage/sfols?groupCode=opaopaa&app=resvlink - For US Government badge holders, priority code OPAOPAG. 
- http://marriott.com/property/propertypage/sfols?groupCode=opaopag&app=resvlink

There are also 14 people who have called the Lodge directly and are not counted in the OpenIB group and therefore may not get the discounted rate. If you are one of them, please contact the hotel and get them to include you in the OpenIB group.

Thank you.

Bill.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Program third draft Jan28.doc
Type: application/msword
Size: 61952 bytes
Desc: not available
URL:

From yael at mellanox.co.il Sun Jan 29 05:39:58 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: Sun, 29 Jan 2006 15:39:58 +0200
Subject: [openib-general] RE: [PATCHv2] OpenSM: include svn version in build string
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD56@mtlexch01.mtl.com>

Hi Hal,

Looks good. Go ahead and add it.
Thanks,
Yael

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Saturday, January 28, 2006 7:23 PM
To: Yael Kalka; Eitan Zahavi
Cc: Troy Benjegerdes; Troy Benjegerdes; openib-general at openib.org
Subject: [PATCHv2] OpenSM: include svn version in build string

OpenSM: include OpenIB svn version when OpenIB build

[Note: this is implemented using svnversion as seemed to be the
consensus rather than using .svn/entries. There are some downsides to
this approach. ]

Signed-off-by: Hal Rosenstock

Index: include/opensm/osm_svn_revision.h
===================================================================
--- include/opensm/osm_svn_revision.h (revision 0)
+++ include/opensm/osm_svn_revision.h (revision 0)
@@ -0,0 +1 @@
+#define OSM_SVN_REVISION ""
Index: opensm/osm_opensm.c
===================================================================
--- opensm/osm_opensm.c (revision 5193)
+++ opensm/osm_opensm.c (working copy)
@@ -59,6 +59,9 @@
 #include
 #include
 #include
+#ifdef OSM_VENDOR_INTF_OPENIB
+#include
+#endif
 #include
 #include
 #include
@@ -206,12 +209,33 @@ osm_opensm_init(
 if( status != IB_SUCCESS )
 return ( status );

+#ifndef OSM_VENDOR_INTF_OPENIB
 /* If there is a log level defined - add the OSM_VERSION to it. */
 osm_log( &p_osm->log,
 osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n",
 OSM_VERSION );
 /* Write the OSM_VERSION to the SYS_LOG */
 osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */
+#else
+ if (strlen(OSM_SVN_REVISION))
+ {
+ /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */
+ osm_log( &p_osm->log,
+ osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n",
+ OSM_VERSION, OSM_SVN_REVISION );
+ /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */
+ osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */
+ }
+ else
+ {
+ /* If there is a log level defined - add the OSM_VERSION to it. */
+ osm_log( &p_osm->log,
+ osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n",
+ OSM_VERSION );
+ /* Write the OSM_VERSION to the SYS_LOG */
+ osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */
+ }
+#endif

 osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */

Index: opensm/main.c
===================================================================
--- opensm/main.c (revision 5193)
+++ opensm/main.c (working copy)
@@ -57,6 +57,9 @@
 #include
 #include
 #include
+#ifdef OSM_VENDOR_INTF_OPENIB
+#include
+#endif
 #include
 #include
 #include
@@ -522,6 +525,10 @@ main(

 printf("-------------------------------------------------\n");
 printf("%s\n", OSM_VERSION);
+#if defined ( OSM_VENDOR_INTF_OPENIB )
+ if (strlen(OSM_SVN_REVISION))
+ printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION);
+#endif

 osm_subn_set_default_opt(&opt);
 osm_subn_parse_conf_file(&opt);
Index: opensm/Makefile.am
===================================================================
--- opensm/Makefile.am (revision 5193)
+++ opensm/Makefile.am (working copy)
@@ -9,6 +9,23 @@ else
 DBGFLAGS = -g
 endif

+if OSMV_OPENIB
+$(srcdir)/../include/opensm/osm_svn_revision_new.h:
+ echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
+ svnversion $(srcdir)/..
| tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
+ echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h
+
+$(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h
+ if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \
+ $(srcdir)/../include/opensm/osm_svn_revision.h ; \
+ then \
+ rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
+ else \
+ mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \
+ $(srcdir)/../include/opensm/osm_svn_revision.h ; \
+ fi
+endif
+
 libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1

 if HAVE_LD_VERSION_SCRIPT

From eitan at mellanox.co.il Sun Jan 29 08:54:57 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 29 Jan 2006 18:54:57 +0200
Subject: [openib-general] RE: [PATCHv2] OpenSM: include svn version in build string
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B641@mtlexch01.mtl.com>

Hi Hal,

It seems that this patch breaks "make dist" based builds.
Probably just missing the osm_svn_revision.h file in the EXTRA_DIST.
I will be able to send a patch tomorrow but if you have a chance please
fix.

Thanks

EZ

creating libopensm.la
(cd .libs && rm -f libopensm.la && ln -s ../libopensm.la libopensm.la)
if gcc -DHAVE_CONFIG_H -I. -I. -I. -I/usr/local/ibg2/include -I./../include -I./../../libibcommon/include/infiniband -I./../../libibuma
d/include/infiniband -Wall -DOSM_VENDOR_INTF_OPENIB -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -g -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
-MT opensm-main.o -MD -MP -MF ".deps/opensm-main.Tpo" -c -o opensm-main.o `test -f 'main.c' || echo './'`main.c; \
then mv -f ".deps/opensm-main.Tpo" ".deps/opensm-main.Po"; else rm -f ".deps/opensm-main.Tpo"; exit 1; fi
main.c:61:37: error: opensm/osm_svn_revision.h: No such file or directory
main.c: In function 'main':
main.c:529: error: 'OSM_SVN_REVISION' undeclared (first use in this function)
main.c:529: error: (Each undeclared identifier is reported only once
main.c:529: error: for each function it appears in.)
make[2]: *** [opensm-main.o] Error 1 make[2]: Leaving directory `/tmp/osm-build/osm-20060129-1819/opensm' make[1]: *** [all] Error 2 make[1]: Leaving directory `/tmp/osm-build/osm-20060129-1819/opensm' make: *** [all-recursive] Error 1 Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Saturday, January 28, 2006 7:23 PM > To: Yael Kalka; Eitan Zahavi > Cc: Troy Benjegerdes; Troy Benjegerdes; openib-general at openib.org > Subject: [PATCHv2] OpenSM: include svn version in build string > > OpenSM: include OpenIB svn version when OpenIB build > > [Note: this is implemented using svnversion as seemed to be the > consensus rather than using .svn/entries. There are some downsides to > this approach. ] > > Signed-off-by: Hal Rosenstock > > Index: include/opensm/osm_svn_revision.h > =================================================================== > --- include/opensm/osm_svn_revision.h (revision 0) > +++ include/opensm/osm_svn_revision.h (revision 0) > @@ -0,0 +1 @@ > +#define OSM_SVN_REVISION "" > Index: opensm/osm_opensm.c > =================================================================== > --- opensm/osm_opensm.c (revision 5193) > +++ opensm/osm_opensm.c (working copy) > @@ -59,6 +59,9 @@ > #include > #include > #include > +#ifdef OSM_VENDOR_INTF_OPENIB > +#include > +#endif > #include > #include > #include > @@ -206,12 +209,33 @@ osm_opensm_init( > if( status != IB_SUCCESS ) > return ( status ); > > +#ifndef OSM_VENDOR_INTF_OPENIB > /* If there is a log level defined - add the OSM_VERSION to it. 
*/ > osm_log( &p_osm->log, > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > OSM_VERSION ); > /* Write the OSM_VERSION to the SYS_LOG */ > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format > Waived */ > +#else > + if (strlen(OSM_SVN_REVISION)) > + { > + /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION > to it. */ > + osm_log( &p_osm->log, > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB > svn %s\n", > + OSM_VERSION, OSM_SVN_REVISION ); > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", > OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > + } > + else > + { > + /* If there is a log level defined - add the OSM_VERSION to it. */ > + osm_log( &p_osm->log, > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", > + OSM_VERSION ); > + /* Write the OSM_VERSION to the SYS_LOG */ > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format > Waived */ > + } > +#endif > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format > Waived */ > > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 5193) > +++ opensm/main.c (working copy) > @@ -57,6 +57,9 @@ > #include > #include > #include > +#ifdef OSM_VENDOR_INTF_OPENIB > +#include > +#endif > #include > #include > #include > @@ -522,6 +525,10 @@ main( > > printf("-------------------------------------------------\n"); > printf("%s\n", OSM_VERSION); > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > + if (strlen(OSM_SVN_REVISION)) > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > +#endif > > osm_subn_set_default_opt(&opt); > osm_subn_parse_conf_file(&opt); > Index: opensm/Makefile.am > =================================================================== > --- opensm/Makefile.am (revision 5193) > +++ opensm/Makefile.am (working copy) > @@ -9,6 +9,23 
@@ else > DBGFLAGS = -g > endif > > +if OSMV_OPENIB > +$(srcdir)/../include/opensm/osm_svn_revision_new.h: > + echo -n "#define OSM_SVN_REVISION \"" > >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > + svnversion $(srcdir)/.. | tr -d '\n' >> > $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h > + > +$(srcdir)/../include/opensm/osm_svn_revision.h: > $(srcdir)/../include/opensm/osm_svn_revision_new.h > + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ > + then \ > + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > + else \ > + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ > + fi > +endif > + > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > > if HAVE_LD_VERSION_SCRIPT > > > From halr at voltaire.com Sun Jan 29 08:48:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jan 2006 11:48:21 -0500 Subject: [openib-general] RE: [PATCHv2] OpenSM: include svn version in build string In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B641@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B641@mtlexch01.mtl.com> Message-ID: <1138553299.4453.615.camel@hal.voltaire.com> On Sun, 2006-01-29 at 11:54, Eitan Zahavi wrote: > Hi Hal, > > It seems that this patch breaks "make dist" based builds. > Probably just missing the osm_svn_revision.h file in the EXTRA_DIST. > I will be able to send a patch tomorrow but if you have a chance please > fix. Thanks; I just fixed this. -- Hal > Thanks > > EZ > > creating libopensm.la > (cd .libs && rm -f libopensm.la && ln -s ../libopensm.la libopensm.la) > if gcc -DHAVE_CONFIG_H -I. -I. -I. 
-I/usr/local/ibg2/include > -I./../include -I./../../libibcommon/include/infiniband > -I./../../libibuma > d/include/infiniband -Wall -DOSM_VENDOR_INTF_OPENIB > -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -g -D_XOPEN_SOURCE=600 > -D_BSD_SOURCE=1 > -MT opensm-main.o -MD -MP -MF ".deps/opensm-main.Tpo" -c -o > opensm-main.o `test -f 'main.c' || echo './'`main.c; \ > then mv -f ".deps/opensm-main.Tpo" ".deps/opensm-main.Po"; else rm -f > ".deps/opensm-main.Tpo"; exit 1; fi > main.c:61:37: error: opensm/osm_svn_revision.h: No such file or > directory > main.c: In function 'main': > main.c:529: error: 'OSM_SVN_REVISION' undeclared (first use in this > function) > main.c:529: error: (Each undeclared identifier is reported only once > main.c:529: error: for each function it appears in.) > make[2]: *** [opensm-main.o] Error 1 > make[2]: Leaving directory `/tmp/osm-build/osm-20060129-1819/opensm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/tmp/osm-build/osm-20060129-1819/opensm' > make: *** [all-recursive] Error 1 > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Saturday, January 28, 2006 7:23 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: Troy Benjegerdes; Troy Benjegerdes; openib-general at openib.org > > Subject: [PATCHv2] OpenSM: include svn version in build string > > > > OpenSM: include OpenIB svn version when OpenIB build > > > > [Note: this is implemented using svnversion as seemed to be the > > consensus rather than using .svn/entries. There are some downsides to > > this approach. 
] > > > > Signed-off-by: Hal Rosenstock > > > > Index: include/opensm/osm_svn_revision.h > > =================================================================== > > --- include/opensm/osm_svn_revision.h (revision 0) > > +++ include/opensm/osm_svn_revision.h (revision 0) > > @@ -0,0 +1 @@ > > +#define OSM_SVN_REVISION "" > > Index: opensm/osm_opensm.c > > =================================================================== > > --- opensm/osm_opensm.c (revision 5193) > > +++ opensm/osm_opensm.c (working copy) > > @@ -59,6 +59,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -206,12 +209,33 @@ osm_opensm_init( > > if( status != IB_SUCCESS ) > > return ( status ); > > > > +#ifndef OSM_VENDOR_INTF_OPENIB > > /* If there is a log level defined - add the OSM_VERSION to it. */ > > osm_log( &p_osm->log, > > osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF > ), "%s\n", > > OSM_VERSION ); > > /* Write the OSM_VERSION to the SYS_LOG */ > > osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* > Format > > Waived */ > > +#else > > + if (strlen(OSM_SVN_REVISION)) > > + { > > + /* If there is a log level defined - add OSM_VERSION and > OSM_SVN_REVISION > > to it. */ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ > 0xFF ), "%s OpenIB > > svn %s\n", > > + OSM_VERSION, OSM_SVN_REVISION ); > > + /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", > > OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ > > + } > > + else > > + { > > + /* If there is a log level defined - add the OSM_VERSION to it. 
> */ > > + osm_log( &p_osm->log, > > + osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ > 0xFF ), "%s\n", > > + OSM_VERSION ); > > + /* Write the OSM_VERSION to the SYS_LOG */ > > + osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* > Format > > Waived */ > > + } > > +#endif > > > > osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* > Format > > Waived */ > > > > Index: opensm/main.c > > =================================================================== > > --- opensm/main.c (revision 5193) > > +++ opensm/main.c (working copy) > > @@ -57,6 +57,9 @@ > > #include > > #include > > #include > > +#ifdef OSM_VENDOR_INTF_OPENIB > > +#include > > +#endif > > #include > > #include > > #include > > @@ -522,6 +525,10 @@ main( > > > > printf("-------------------------------------------------\n"); > > printf("%s\n", OSM_VERSION); > > +#if defined ( OSM_VENDOR_INTF_OPENIB ) > > + if (strlen(OSM_SVN_REVISION)) > > + printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); > > +#endif > > > > osm_subn_set_default_opt(&opt); > > osm_subn_parse_conf_file(&opt); > > Index: opensm/Makefile.am > > =================================================================== > > --- opensm/Makefile.am (revision 5193) > > +++ opensm/Makefile.am (working copy) > > @@ -9,6 +9,23 @@ else > > DBGFLAGS = -g > > endif > > > > +if OSMV_OPENIB > > +$(srcdir)/../include/opensm/osm_svn_revision_new.h: > > + echo -n "#define OSM_SVN_REVISION \"" > > >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > > + svnversion $(srcdir)/.. 
| tr -d '\n' >>
> > $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
> > + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h
> > +
> > +$(srcdir)/../include/opensm/osm_svn_revision.h:
> > $(srcdir)/../include/opensm/osm_svn_revision_new.h
> > + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \
> > + $(srcdir)/../include/opensm/osm_svn_revision.h ; \
> > + then \
> > + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ;
> \
> > + else \
> > + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \
> > + $(srcdir)/../include/opensm/osm_svn_revision.h ; \
> > + fi
> > +endif
> > +
> > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT
> > $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
> >
> > if HAVE_LD_VERSION_SCRIPT
> >
> >
> >
>

From mamidala at cse.ohio-state.edu Sun Jan 29 10:04:57 2006
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Sun, 29 Jan 2006 13:04:57 -0500 (EST)
Subject: [openib-general] multicast join errors
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B637@mtlexch01.mtl.com>
Message-ID:

Hi Eitan,

I am sending the ibstat output:
CA 'mthca0'
    CA type: MT25208
    Number of ports: 2
    Firmware version: 5.1.0
    Hardware version: a0
    Node GUID: 0x0006270510000004
    System image GUID: 0x0000000000000000
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 3
        LMC: 0
        SM lid: 105
        Capability mask: 0x02510a68
        Port GUID: 0x0006270510000005
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510a68
        Port GUID: 0x0006270510000006

Thanks,
Amith

On Sun, 29 Jan 2006, Eitan Zahavi wrote:

> Hi Amith,
>
> Please send the ibstat output for that node.
> I suspect the port 0x6270510000005 is not up.
>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O.
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of amith rajith mamidala > > Sent: Saturday, January 28, 2006 11:19 PM > > To: Hal Rosenstock > > Cc: mvapich-core at cse.ohio-state.edu; openib-general at openib.org > > Subject: Re: [openib-general] multicast join errors > > > > Hi Hal, > > > > There is only one application running on a node. I am running opensm > on > > a different node. I am also listing the other processes I observed on > > doing a "ps": > > > > root 3564 11 0 Jan26 ? 00:00:00 [ib_cm/0] > > root 3565 11 0 Jan26 ? 00:00:00 [ib_cm/1] > > root 1294 11 0 Jan26 ? 00:00:00 [ib_mad1] > > root 1295 11 0 Jan26 ? 00:00:00 [ib_mad2] > > root 1298 11 0 Jan26 ? 00:00:00 [ib_mad1] > > root 1299 11 0 Jan26 ? 00:00:00 [ib_mad2] > > > > > > Thanks, > > Amith > > > > On 28 Jan 2006, Hal Rosenstock wrote: > > > > > Hi Amith, > > > > > > On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > > > > Hi, > > > > > > > > I was able to create multicast groups after Hal's fix. But, when I > do join > > > > subsequently from the same program I am getting a port_alloc > error: > > > > > > > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to > port > > > > 0x6270510000005. > > > > -I- Created the Multicast Group: > > > > MGID....................0xff13a01cfe800000 : > 0x0000000000000000 > > > > PortGid.................0xfe80000000000000 : > 0x0006270510000005 > > > > qkey....................0x0 > > > > Mlid....................0xC002 > > > > ScopeState..............0x21 > > > > Rate....................0x83 > > > > Mtu.....................0x84 > > > > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to > port > > > > 0x6270510000005. 
> > > > > > > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for > mthca0 > > > > 1 > > > > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR > 542C: > > > > umad_open_port() failed > > > > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: > Unable to > > > > Open Port 0x6270510000005. > > > > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: > Failed to > > > > bind to vendor GSI > > > > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: > Unable to > > > > bind to SA > > > > > > > > I am trying to trace the source of this error, > > > > > > Is this the only IB application running or are there others (and if > so, > > > what else is running) ? > > > > > > -- Hal > > > > > > > Thanks, > > > > Amith > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From dorty at home.cam.net.uk Sun Jan 29 11:45:01 2006 From: dorty at home.cam.net.uk (Dorthe Alequin) Date: Sun, 29 Jan 2006 14:45:01 -0500 Subject: [openib-general] Re: pietism Phhar amaceutical Message-ID: <000001c6250c$7e7bb6b0$2beba8c0@superrealism> C V V l A l A L A L l G l U R S M A from from from $ $ $ 3 1 3 , , , 7 2 3 5 1 3 And many other, Save up to 70% with us - http://www.wasrectedi.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chanceayaneakx at kobej.zzn.com Sun Jan 29 23:04:23 2006 From: chanceayaneakx at kobej.zzn.com (chanceayaneakx at kobej.zzn.com) Date: Sun, 29 Jan 2006 23:04:23 -0800 (PST) Subject: [openib-general] =?utf-8?b?woHCoTMwwpbCnMKKbcKOw4DCgcKhwoJnwoI=?= =?utf-8?b?wrXCgsOEwoLCrcKCwr7CgsKzwoLCosKBRcKBRcKBRQ==?= Message-ID: 20060130140002.96305mail@mail.hotyournet8548754521254_server08_221x251x99x253.ap221.topgogohopq87514.cc �@�������������������������������������������������� �@���w�ǂ�Ȓj���ł�\��Ȃ��A������g���������B�x�� �@�������������������������������������������������� �Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q�Q ���������c�̂́A����ȏ����l�֒j���l���������킹����A ����΂r�d�w���i�c�̂Ɍ�����܂��B���������������������� �P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P�P �@http://ad.deai-ciao.net/?hkcka ������������������������������������������������������ �@�w���N��E�e�p���ǂ�Ȓj���ł�\��Ȃ��B�ǂ�ȏ� �@�ł�\��Ȃ��B�ǂ�Ȓj���Ƃ̂g�ł�\��Ȃ�����g�� �@�������B�x ������������������������������������������������������ ���̓x�A���L�ɋL�ڂ���鏗���l���A���c�̂Ɍ����Ă̂��˗� �𒸂��܂����B �w�ǂ�Ȓj���ł�\��Ȃ��A������g���������I�x ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ �j���l�Ɏ���ẮA����ȏ�ɖ����l�ȁA�ł��ĕt���̂��b�B ��͂菉�Ζʂ̏����l�Ƃ����ẮA�s����������̂� �����͌�����܂���B ����ȋM���l�̂��C�������ݎ���āA�����l�͉��L���񎦂� �M���l�ɂ��񑩒����܂��B �������������������������������������������������������������� ��. ���Ζʂg��������Ē�����ȏ�A���ꑊ���̂���i����j�� �@�@���p�ӂ����Ē����܂��B�y��x�̂g�ɑ΂�300.000�i30���~�j�z ��. �M���l���ǂ̗l�Ȓj���l�i�N��E�e�p�j�ł���܂��Ă�A �@�@���߂�ꂽ�g���܂��A�K���������x���������Ē����܂��B �������������������������������������������������������������� ��L���񎦂���A�w�ǂ�Ȓj���ł�\��Ȃ��A������g���������I�x �Ƃ����������l�ȁA���^�̈��������l���͂�����̏����l�ƂȂ�܂��B ��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*��*�� ��. �����O�@���i�A�C�j�l ��. ���N��@29�� ��. �g�����z�@300.000�i30���~�j ��. 
From eitan at mellanox.co.il Sun Jan 29 23:49:42 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 30 Jan 2006 09:49:42 +0200 Subject: [openib-general] multicast join errors Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B64F@mtlexch01.mtl.com> Hi Amith Sorry but the ibstat looks good. Can you send a pointer (or attachment) for the code that does the ibumad open ? It seems like your application (exact same application) has already opened that port. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: amith rajith mamidala [mailto:mamidala at cse.ohio-state.edu] > Sent: Sunday, January 29, 2006 8:05 PM > To: Eitan Zahavi > Cc: Hal Rosenstock; mvapich-core at cse.ohio-state.edu; openib-general at openib.org > Subject: RE: [openib-general] multicast join errors > > Hi Eitan, > > I am sending the ibstat output: > CA 'mthca0' > CA type: MT25208 > Number of ports: 2 > Firmware version: 5.1.0 > Hardware version: a0 > Node GUID: 0x0006270510000004 > System image GUID: 0x0000000000000000 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 3 > LMC: 0 > SM lid: 105 > Capability mask: 0x02510a68 > Port GUID: 0x0006270510000005 > Port 2: > State: Down > Physical state: Polling > Rate: 10 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x02510a68 > Port GUID: 0x0006270510000006 > > Thanks, > Amith > > > On Sun, 29 Jan 2006, Eitan Zahavi wrote: > > > Hi Amith, > > > > Please send the ibstat output for that node. > > I suspect the port 0x6270510000005 is not up. > > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of amith rajith mamidala > > > Sent: Saturday, January 28, 2006 11:19 PM > > > To: Hal Rosenstock > > > Cc: mvapich-core at cse.ohio-state.edu; openib-general at openib.org > > > Subject: Re: [openib-general] multicast join errors > > > > > > Hi Hal, > > > > > > There is only one application running on a node. I am running opensm > > on > > > a different node. I am also listing the other processes I observed on > > > doing a "ps": > > > > > > root 3564 11 0 Jan26 ? 00:00:00 [ib_cm/0] > > > root 3565 11 0 Jan26 ? 00:00:00 [ib_cm/1] > > > root 1294 11 0 Jan26 ? 00:00:00 [ib_mad1] > > > root 1295 11 0 Jan26 ? 
00:00:00 [ib_mad2] > > > root 1298 11 0 Jan26 ? 00:00:00 [ib_mad1] > > > root 1299 11 0 Jan26 ? 00:00:00 [ib_mad2] > > > > > > > > > Thanks, > > > Amith > > > > > > On 28 Jan 2006, Hal Rosenstock wrote: > > > > > > > Hi Amith, > > > > > > > > On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > > > > > Hi, > > > > > > > > > > I was able to create multicast groups after Hal's fix. But, when I > > do join > > > > > subsequently from the same program I am getting a port_alloc > > error: > > > > > > > > > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to > > port > > > > > 0x6270510000005. > > > > > -I- Created the Multicast Group: > > > > > MGID....................0xff13a01cfe800000 : > > 0x0000000000000000 > > > > > PortGid.................0xfe80000000000000 : > > 0x0006270510000005 > > > > > qkey....................0x0 > > > > > Mlid....................0xC002 > > > > > ScopeState..............0x21 > > > > > Rate....................0x83 > > > > > Mtu.....................0x84 > > > > > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to > > port > > > > > 0x6270510000005. > > > > > > > > > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for > > mthca0 > > > > > 1 > > > > > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR > > 542C: > > > > > umad_open_port() failed > > > > > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: > > Unable to > > > > > Open Port 0x6270510000005. > > > > > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: > > Failed to > > > > > bind to vendor GSI > > > > > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: > > Unable to > > > > > bind to SA > > > > > > > > > > I am trying to trace the source of this error, > > > > > > > > Is this the only IB application running or are there others (and if > > so, > > > > what else is running) ? 
> > > > > > > > -- Hal > > > > > > > > > Thanks, > > > > > Amith > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > From dotanb at mellanox.co.il Mon Jan 30 00:59:27 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 30 Jan 2006 10:59:27 +0200 Subject: [openib-general] does the mthca driver support RTS->SQD event request? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> Hi. does the mthca driver support the request of an async event when changing the QP state from RTS -> SQD? (the event is actually a notification when the HCA changes the QP state from SQD:draining -> SQD:drained) if the answer is no, do you plan to add it (and when)? thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From rep.nop at aon.at Mon Jan 30 02:58:31 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Mon, 30 Jan 2006 11:58:31 +0100 Subject: [openib-general] Errors in compilation with 2.6.14.4!! 
In-Reply-To: <1135341078.4328.91384.camel@hal.voltaire.com> References: <309a667c0512230425o6869a2a2u8a7f0fd2ede720e8@mail.gmail.com> <1135341078.4328.91384.camel@hal.voltaire.com> Message-ID: <20060130105831.GA11627@aon.at> On Fri, Dec 23, 2005 at 07:31:21AM -0500, Hal Rosenstock wrote: >On Fri, 2005-12-23 at 07:25, Devesh Sharma wrote: >> Hi I have downloaded openib stack from svn with Revision 4595 and >> trying to compile it with 2.6.14.4. I am getting following errors >> >> >> [root at infini00 linux-2.6.14.4]# make >> CHK include/linux/version.h >> CHK include/linux/compile.h >> CHK usr/initramfs_list >> CC [M] drivers/infiniband/core/cm.o >> drivers/infiniband/core/cm.c: In function `cm_alloc_msg': >> drivers/infiniband/core/cm.c:180: error: `IB_MGMT_MAD_HDR' undeclared >> (first use in this function) >> drivers/infiniband/core/cm.c:180: error: (Each undeclared identifier >> is reported only once >> drivers/infiniband/core/cm.c:180: error: for each function it appears in.) >> drivers/infiniband/core/cm.c:181: error: too few arguments to function >> `ib_create_send_mad' >> drivers/infiniband/core/cm.c:188: error: structure has no member named `ah' >> drivers/infiniband/core/cm.c:189: error: structure has no member named `retries' >> drivers/infiniband/core/cm.c: In function `cm_alloc_response_msg': >> drivers/infiniband/core/cm.c:210: error: `IB_MGMT_MAD_HDR' undeclared >> (first use in this function) >> drivers/infiniband/core/cm.c:211: error: too few arguments to function >> `ib_create_send_mad' >> drivers/infiniband/core/cm.c:216: error: structure has no member named `ah' >> drivers/infiniband/core/cm.c: In function `cm_free_msg': >> drivers/infiniband/core/cm.c:223: error: structure has no member named `ah' >> drivers/infiniband/core/cm.c: In function `cm_mask_compare_data': >> drivers/infiniband/core/cm.c:363: error: >> `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this >> function) >> drivers/infiniband/core/cm.c: In function 
`cm_compare_data': >> drivers/infiniband/core/cm.c:370: error: >> `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this >> function) >> drivers/infiniband/core/cm.c:376: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:376: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:377: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:377: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:370: warning: unused variable `src' >> drivers/infiniband/core/cm.c:371: warning: unused variable `dst' >> drivers/infiniband/core/cm.c: In function `cm_compare_private_data': >> drivers/infiniband/core/cm.c:384: error: >> `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this >> function) >> drivers/infiniband/core/cm.c:389: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:390: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:384: warning: unused variable `src' >> drivers/infiniband/core/cm.c: In function `cm_insert_listen': >> drivers/infiniband/core/cm.c:410: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:410: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:414: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:414: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:416: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:416: error: structure has no member named `device' >> drivers/infiniband/core/cm.c: In function `cm_find_listen': >> drivers/infiniband/core/cm.c:446: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:449: error: structure has no member named `device' >> drivers/infiniband/core/cm.c:451: error: structure has no member named `device' >> drivers/infiniband/core/cm.c: At top level: >> 
drivers/infiniband/core/cm.c:595: error: conflicting types for 'ib_create_cm_id' >> include/rdma/ib_cm.h:306: error: previous declaration of >> 'ib_create_cm_id' was here >> drivers/infiniband/core/cm.c:595: error: conflicting types for 'ib_create_cm_id' >> include/rdma/ib_cm.h:306: error: previous declaration of >> 'ib_create_cm_id' was here >> drivers/infiniband/core/cm.c: In function `ib_create_cm_id': >> drivers/infiniband/core/cm.c:604: error: structure has no member named `device' >> drivers/infiniband/core/cm.c: In function `ib_destroy_cm_id': >> drivers/infiniband/core/cm.c:731: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:739: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:749: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:764: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: At top level: >> drivers/infiniband/core/cm.c:790: error: conflicting types for 'ib_cm_listen' >> include/rdma/ib_cm.h:334: error: previous declaration of 'ib_cm_listen' was here >> drivers/infiniband/core/cm.c:790: error: conflicting types for 'ib_cm_listen' >> include/rdma/ib_cm.h:334: error: previous declaration of 'ib_cm_listen' was here >> drivers/infiniband/core/cm.c: In function `ib_cm_listen': >> drivers/infiniband/core/cm.c:807: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:811: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:812: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:812: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:813: error: >> `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this >> function) >> 
drivers/infiniband/core/cm.c:813: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:813: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:813: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c:813: error: dereferencing pointer to >> incomplete type >> drivers/infiniband/core/cm.c: In function `ib_send_cm_req': >> drivers/infiniband/core/cm.c:1003: error: structure has no member >> named `timeout_ms' >> drivers/infiniband/core/cm.c:1012: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1012: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_issue_rej': >> drivers/infiniband/core/cm.c:1057: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1057: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_dup_req_handler': >> drivers/infiniband/core/cm.c:1265: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1265: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_match_req': >> drivers/infiniband/core/cm.c:1305: error: structure has no member named `device' >> drivers/infiniband/core/cm.c: In function `ib_send_cm_rep': >> drivers/infiniband/core/cm.c:1452: error: structure has no member >> named `timeout_ms' >> drivers/infiniband/core/cm.c:1455: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1455: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `ib_send_cm_rtu': >> drivers/infiniband/core/cm.c:1519: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> 
drivers/infiniband/core/cm.c:1519: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_dup_rep_handler': >> drivers/infiniband/core/cm.c:1591: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1591: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_rep_handler': >> drivers/infiniband/core/cm.c:1659: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `cm_establish_handler': >> drivers/infiniband/core/cm.c:1693: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `cm_rtu_handler': >> drivers/infiniband/core/cm.c:1732: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `ib_send_cm_dreq': >> drivers/infiniband/core/cm.c:1790: error: structure has no member >> named `timeout_ms' >> drivers/infiniband/core/cm.c:1793: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1793: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `ib_send_cm_drep': >> drivers/infiniband/core/cm.c:1856: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1856: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_dreq_handler': >> drivers/infiniband/core/cm.c:1891: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:1905: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:1905: error: too few arguments to >> function `ib_post_send_mad' >> 
drivers/infiniband/core/cm.c: In function `cm_drep_handler': >> drivers/infiniband/core/cm.c:1952: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `ib_send_cm_rej': >> drivers/infiniband/core/cm.c:2020: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2020: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_rej_handler': >> drivers/infiniband/core/cm.c:2096: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:2106: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `ib_send_cm_mra': >> drivers/infiniband/core/cm.c:2164: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2164: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c:2177: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2177: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c:2190: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2190: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_mra_handler': >> drivers/infiniband/core/cm.c:2252: warning: passing arg 2 of >> `ib_modify_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:2259: warning: passing arg 2 of >> `ib_modify_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c:2267: warning: passing arg 2 of >> `ib_modify_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `ib_send_cm_lap': >> 
drivers/infiniband/core/cm.c:2350: error: structure has no member >> named `timeout_ms' >> drivers/infiniband/core/cm.c:2353: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2353: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_lap_handler': >> drivers/infiniband/core/cm.c:2430: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2430: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `ib_send_cm_apr': >> drivers/infiniband/core/cm.c:2508: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2508: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_apr_handler': >> drivers/infiniband/core/cm.c:2547: warning: passing arg 2 of >> `ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_req': >> drivers/infiniband/core/cm.c:2644: error: structure has no member >> named `timeout_ms' >> drivers/infiniband/core/cm.c:2649: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2649: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_sidr_req_handler': >> drivers/infiniband/core/cm.c:2713: error: structure has no member named `device' >> drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_rep': >> drivers/infiniband/core/cm.c:2785: warning: passing arg 1 of >> `ib_post_send_mad' from incompatible pointer type >> drivers/infiniband/core/cm.c:2785: error: too few arguments to >> function `ib_post_send_mad' >> drivers/infiniband/core/cm.c: In function `cm_sidr_rep_handler': >> drivers/infiniband/core/cm.c:2838: warning: passing arg 2 of >> 
`ib_cancel_mad' makes integer from pointer without a cast >> drivers/infiniband/core/cm.c: In function `cm_send_handler': >> drivers/infiniband/core/cm.c:2906: error: structure has no member >> named `send_buf' >> make[3]: *** [drivers/infiniband/core/cm.o] Error 1 >> make[2]: *** [drivers/infiniband/core] Error 2 >> make[1]: *** [drivers/infiniband] Error 2 >> make: *** [drivers] Error 2 >> >> >> What is the issue?? > >You need to link include/rdma in your Linux tree to the OpenIB one after >moving the Linux one away as it is not up to date. > Add note about linking the rdma includes and linux/*h to the kernel sources. Signed-off-by: Bernhard Fischer Please apply if appropriate, Bernhard -------------- next part -------------- Index: README.kernel-build =================================================================== --- README.kernel-build (revision 5204) +++ README.kernel-build (working copy) @@ -3,8 +3,12 @@ 1. Check out https://openib.org/svn/gen2/trunk, say into /my/path/to/openib 2. Unpack your kernel source into /my/path/to/linux -3. Link the kernel source into your Linux tree: +3. Link the kernel source and headers into your Linux tree: ln -s /my/path/to/openib/src/linux-kernel/infiniband /my/path/to/linux/drivers + ln -s /my/path/to/openib/src/linux-kernel/infiniband/include/rdma \ + /my/path/to/linux/include/rdma + ln -s /my/path/to/openib/src/linux-kernel/infiniband/include/linux/* \ + /my/path/to/linux/include/linux/ 4. cd into /my/path/to/linux and apply the appropriate patches. The patches currently available will be in /my/path/to/openib/src/linux-kernel/patches From mst at mellanox.co.il Mon Jan 30 03:05:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 13:05:12 +0200 Subject: [openib-general] Re: Errors in compilation with 2.6.14.4!! In-Reply-To: <20060130105831.GA11627@aon.at> References: <20060130105831.GA11627@aon.at> Message-ID: <20060130110512.GW31887@mellanox.co.il> Quoting r. 
Bernhard Fischer : > Add note about linking the rdma includes and linux/*h to the kernel > sources. > > Signed-off-by: Bernhard Fischer > > Please apply if appropriate, > Bernhard This has been on wiki for a while now https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rep.nop at aon.at Mon Jan 30 03:15:12 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Mon, 30 Jan 2006 12:15:12 +0100 Subject: [openib-general] [patch] convert ipath_driver from i_sem to i_mutex Message-ID: <20060130111512.GA9663@aon.at> Hi, I don't know when exactly the i_sem to i_mutex conversion was done. In 2.6.16 this patch is needed to build the ipath driver. Signed-off-by: Bernhard Fischer Please apply. -------------- next part -------------- Index: infiniband/hw/ipath/ipath_driver.c =================================================================== --- infiniband/hw/ipath/ipath_driver.c (revision 5203) +++ infiniband/hw/ipath/ipath_driver.c (working copy) @@ -2754,7 +2754,12 @@ static loff_t ipath_llseek(struct file * loff_t ret; /* range checking is done where offset is used, not here. 
*/ +/* XXX remove this compatibility hack */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) down(&fp->f_dentry->d_inode->i_sem); +#else + mutex_lock(&fp->f_dentry->d_inode->i_mutex); +#endif if (!whence) ret = fp->f_pos = off; else if (whence == 1) { @@ -2762,7 +2767,11 @@ static loff_t ipath_llseek(struct file * ret = fp->f_pos; } else ret = -EINVAL; +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) up(&fp->f_dentry->d_inode->i_sem); +#else + mutex_unlock(&fp->f_dentry->d_inode->i_mutex); +#endif _IPATH_DBG("New offset %llx from seek %llx whence=%d\n", fp->f_pos, off, whence); From yael at mellanox.co.il Mon Jan 30 03:18:16 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 30 Jan 2006 13:18:16 +0200 Subject: [openib-general] [PATCH] Opensm - using default dir Message-ID: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds the use of the OSM_DEFAULT_CACHE_DIR instead of full name usage. Also - add the "/" at the end of the OSM_DEFAULT_CACHE_DIR definition, and refrain from adding it in the code, to avoid problems in Windows. 
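For illustration only, here is a minimal sketch of the convention described above: the directory string already carries its trailing separator, so the code never appends one. The helper name build_cache_path and the EXAMPLE_CACHE_DIR macro are hypothetical stand-ins and are not part of OpenSM. The sketch also uses snprintf for a bounded copy, which is a swapped-in choice; the patch itself keeps strcpy/strcat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed stand-in for OSM_DEFAULT_CACHE_DIR; note the trailing '/'. */
#define EXAMPLE_CACHE_DIR "/var/cache/osm/"

/*
 * Hypothetical helper (not an OpenSM function): concatenate a directory
 * that already ends in a path separator with a file name.
 * Returns 0 on success, -1 if the directory lacks a trailing separator
 * or the result does not fit in the output buffer.
 */
static int build_cache_path(char *out, size_t out_len,
                            const char *dir, const char *name)
{
    size_t dir_len;
    int written;

    assert(out != NULL && dir != NULL && name != NULL);

    /* The convention: the separator lives in the directory string. */
    dir_len = strlen(dir);
    if (dir_len == 0 || (dir[dir_len - 1] != '/' && dir[dir_len - 1] != '\\'))
        return -1;

    /* Bounded concatenation; snprintf reports the length it needed. */
    written = snprintf(out, out_len, "%s%s", dir, name);
    if (written < 0 || (size_t)written >= out_len)
        return -1;

    return 0;
}
```

A caller would pass something like a 256-byte file_name buffer together with EXAMPLE_CACHE_DIR and "opensm.opts", and treat a -1 return as "do not open the file".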
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 5203) +++ include/opensm/osm_base.h (working copy) @@ -194,7 +194,7 @@ BEGIN_C_DECLS #ifdef __WIN__ #define OSM_DEFAULT_CACHE_DIR "C:\\Windows\\Temp\\" #else -#define OSM_DEFAULT_CACHE_DIR "/var/cache/osm" +#define OSM_DEFAULT_CACHE_DIR "/var/cache/osm/" #endif /***********/ Index: include/opensm/osm_svn_revision.h =================================================================== --- include/opensm/osm_svn_revision.h (revision 5203) +++ include/opensm/osm_svn_revision.h (working copy) @@ -1 +1 @@ -#define OSM_SVN_REVISION "" +#define OSM_SVN_REVISION "5203M" Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 5203) +++ opensm/osm_subnet.c (working copy) @@ -617,7 +617,7 @@ osm_subn_parse_conf_file( if (! p_cache_dir) p_cache_dir = OSM_DEFAULT_CACHE_DIR; strcpy(file_name, p_cache_dir); - strcat(file_name,"/opensm.opts"); + strcat(file_name,"opensm.opts"); opts_file = fopen(file_name, "r"); if (!opts_file) return; @@ -789,7 +789,7 @@ osm_subn_write_conf_file( if (! 
p_cache_dir) p_cache_dir = OSM_DEFAULT_CACHE_DIR; strcpy(file_name, p_cache_dir); - strcat(file_name,"/opensm.opts"); + strcat(file_name,"opensm.opts"); opts_file = fopen(file_name, "w"); if (!opts_file) return; Index: opensm/osm_db_files.c =================================================================== --- opensm/osm_db_files.c (revision 5203) +++ opensm/osm_db_files.c (working copy) @@ -170,7 +170,7 @@ osm_db_init( p_db_imp->db_dir_name = getenv("OSM_CACHE_DIR"); if ( p_db_imp->db_dir_name == NULL ) - p_db_imp->db_dir_name = "/var/cache/osm"; + p_db_imp->db_dir_name = OSM_DEFAULT_CACHE_DIR; /* make sure the directory exists */ if (lstat(p_db_imp->db_dir_name, &dstat)) @@ -226,7 +226,6 @@ osm_db_domain_init( (char *)cl_malloc(sizeof(char)*(dir_name_len) + strlen(domain_name) + 2); CL_ASSERT(p_domain_imp->file_name != NULL); strcpy(p_domain_imp->file_name,((osm_db_imp_t*)p_db->p_db_imp)->db_dir_name); - strcat(p_domain_imp->file_name,"/"); strcat(p_domain_imp->file_name,domain_name); /* make sure the file exists - or exit if not writable */ @@ -413,6 +412,7 @@ osm_db_restore( Exit: cl_spinlock_release( &p_domain_imp->lock ); + OSM_LOG_EXIT( p_log ); return status; } From thomas.bub at thomson.net Mon Jan 30 04:30:21 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Mon, 30 Jan 2006 13:30:21 +0100 Subject: [openib-general] gen1 drivers as rpm or tgz? Message-ID: Hi, we are planning to replace a HighSpeed GSN (Gigabyte System Network) data interface with IB. We'd like to use the verbs layer with the Access Layer on top. Unfortunately Mellanox, our hardware provider, does not supply an Access Layer with their IBGold 1.8 stack. Yes, I know openIB currently supports only gen2 drivers with 2.6 kernels. Unfortunately our whole environment is still on 2.4 kernels under RedHat EL 3 Update 6. I have to wait another 2 months before my whole development and target environment is on 2.6 kernels. 
To speed things up for me as a developer, I'd like to start with gen1 openIB drivers. Is there something like RPMs or tgz archives I can use that supplies the whole IB stack for 2.4 kernels, or an Access Layer extension for the Mellanox IBGold stack? I'd like to avoid downloading via cvs, which I'm not familiar with yet. Thanks Thomas Bub ............................................................ Thomas Bub Grass Valley Germany GmbH Brunnenweg 9 64331 Weiterstadt, Germany Tel: +49 6150 104 147 Fax: +49 6150 104 656 Email: Thomas.Bub at thomson.net www.GrassValley.com ............................................................ From halr at voltaire.com Mon Jan 30 04:26:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jan 2006 07:26:55 -0500 Subject: [openib-general] Re: [PATCH] Opensm - using default dir In-Reply-To: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> Message-ID: <1138624014.4453.6092.camel@hal.voltaire.com> Hi Yael, On Mon, 2006-01-30 at 06:18, Yael Kalka wrote: > Hi Hal, > > The following patch adds the use of the OSM_DEFAULT_CACHE_DIR instead > of full name usage. Also - add the "/" at the end of the > OSM_DEFAULT_CACHE_DIR definition, and refrain from adding it in the > code, to avoid problems in Windows. Thanks. Applied. See note below. 
-- Hal > Thanks, > Yael > > > Signed-off-by: Yael Kalka > > Index: include/opensm/osm_base.h > =================================================================== > --- include/opensm/osm_base.h (revision 5203) > +++ include/opensm/osm_base.h (working copy) > @@ -194,7 +194,7 @@ BEGIN_C_DECLS > #ifdef __WIN__ > #define OSM_DEFAULT_CACHE_DIR "C:\\Windows\\Temp\\" > #else > -#define OSM_DEFAULT_CACHE_DIR "/var/cache/osm" > +#define OSM_DEFAULT_CACHE_DIR "/var/cache/osm/" > #endif > /***********/ > > Index: include/opensm/osm_svn_revision.h > =================================================================== > --- include/opensm/osm_svn_revision.h (revision 5203) > +++ include/opensm/osm_svn_revision.h (working copy) > @@ -1 +1 @@ > -#define OSM_SVN_REVISION "" > +#define OSM_SVN_REVISION "5203M" Please try not to include this in patches. Please revert before preparing the patch. > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 5203) > +++ opensm/osm_subnet.c (working copy) > @@ -617,7 +617,7 @@ osm_subn_parse_conf_file( > if (! p_cache_dir) p_cache_dir = OSM_DEFAULT_CACHE_DIR; > > strcpy(file_name, p_cache_dir); > - strcat(file_name,"/opensm.opts"); > + strcat(file_name,"opensm.opts"); > > opts_file = fopen(file_name, "r"); > if (!opts_file) return; > @@ -789,7 +789,7 @@ osm_subn_write_conf_file( > if (! 
p_cache_dir) p_cache_dir = OSM_DEFAULT_CACHE_DIR; > > strcpy(file_name, p_cache_dir); > - strcat(file_name,"/opensm.opts"); > + strcat(file_name,"opensm.opts"); > > opts_file = fopen(file_name, "w"); > if (!opts_file) return; > Index: opensm/osm_db_files.c > =================================================================== > --- opensm/osm_db_files.c (revision 5203) > +++ opensm/osm_db_files.c (working copy) > @@ -170,7 +170,7 @@ osm_db_init( > > p_db_imp->db_dir_name = getenv("OSM_CACHE_DIR"); > if ( p_db_imp->db_dir_name == NULL ) > - p_db_imp->db_dir_name = "/var/cache/osm"; > + p_db_imp->db_dir_name = OSM_DEFAULT_CACHE_DIR; > > /* make sure the directory exists */ > if (lstat(p_db_imp->db_dir_name, &dstat)) > @@ -226,7 +226,6 @@ osm_db_domain_init( > (char *)cl_malloc(sizeof(char)*(dir_name_len) + strlen(domain_name) + 2); > CL_ASSERT(p_domain_imp->file_name != NULL); > strcpy(p_domain_imp->file_name,((osm_db_imp_t*)p_db->p_db_imp)->db_dir_name); > - strcat(p_domain_imp->file_name,"/"); > strcat(p_domain_imp->file_name,domain_name); > > /* make sure the file exists - or exit if not writable */ > @@ -413,6 +412,7 @@ osm_db_restore( > > Exit: > cl_spinlock_release( &p_domain_imp->lock ); > + OSM_LOG_EXIT( p_log ); > return status; > } > > From mst at mellanox.co.il Mon Jan 30 04:41:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 14:41:22 +0200 Subject: [openib-general] Re: [PATCH] Opensm - using default dir In-Reply-To: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> Message-ID: <20060130124122.GY31887@mellanox.co.il> Quoting r. Yael Kalka : > =================================================================== > --- include/opensm/osm_svn_revision.h (revision 5203) > +++ include/opensm/osm_svn_revision.h (working copy) > @@ -1 +1 @@ > -#define OSM_SVN_REVISION "" > +#define OSM_SVN_REVISION "5203M" This looks like a mistake. 
And, I think this shows that keeping the generated file osm_svn_revision.h represents a problem. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Mon Jan 30 04:47:51 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Mon, 30 Jan 2006 14:47:51 +0200 Subject: [openib-general] OpenSM: Add support for optional SA GUIDInfoRecord Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD60@mtlexch01.mtl.com> Hi Hal, Patch seems good to me. Please go ahead and apply it. Thanks, Yael -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org]On Behalf Of Hal Rosenstock Sent: Friday, January 27, 2006 10:56 PM To: Yael Kalka; Eitan Zahavi Cc: openib-general at openib.org Subject: [openib-general] OpenSM: Add support for optional SA GUIDInfoRecord OpenSM: Add support for optional SA GUIDInfoRecord Signed-off-by: Hal Rosenstock Index: osm/include/opensm/osm_sa_guidinfo_record.h =================================================================== --- osm/include/opensm/osm_sa_guidinfo_record.h (revision 0) +++ osm/include/opensm/osm_sa_guidinfo_record.h (revision 0) @@ -0,0 +1,283 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Declaration of osm_gir_rcv_t. + * This object represents the GUIDInfo Record Receiver object. + * attribute from a node. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + */ + + +#ifndef _OSM_GIR_RCV_H_ +#define _OSM_GIR_RCV_H_ + + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/GUIDInfo Record Receiver +* NAME +* GUIDInfo Record Receiver +* +* DESCRIPTION +* The GUIDInfo Record Receiver object encapsulates the information +* needed to receive the GUIDInfoRecord attribute from a node. +* +* The GUIDInfo Record Receiver object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ +/****s* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_t +* NAME +* osm_gir_rcv_t +* +* DESCRIPTION +* GUIDInfo Record Receiver structure. 
+* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_gir_rcv +{ + const osm_subn_t *p_subn; + osm_sa_resp_t *p_resp; + osm_mad_pool_t *p_mad_pool; + osm_log_t *p_log; + cl_plock_t *p_lock; + cl_qlock_pool_t pool; + +} osm_gir_rcv_t; +/* +* FIELDS +* p_subn +* Pointer to the Subnet object for this subnet. +* +* p_resp +* Pointer to the SA responder. +* +* p_mad_pool +* Pointer to the mad pool. +* +* p_log +* Pointer to the log object. +* +* p_lock +* Pointer to the serializing lock. +* +* pool +* Pool of linkable GUIDInfo Record objects used to generate +* the query response. +* +* SEE ALSO +* +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_construct +* NAME +* osm_gir_rcv_construct +* +* DESCRIPTION +* This function constructs a GUIDInfo Record Receiver object. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_construct( + IN osm_gir_rcv_t* const p_rcv ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to a GUIDInfo Record Receiver object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_gir_rcv_init, osm_gir_rcv_destroy +* +* Calling osm_gir_rcv_construct is a prerequisite to calling any other +* method except osm_gir_rcv_init. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_init, +* osm_gir_rcv_destroy +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_destroy +* NAME +* osm_gir_rcv_destroy +* +* DESCRIPTION +* The osm_gir_rcv_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_destroy( + IN osm_gir_rcv_t* const p_rcv ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* GUIDInfo Record Receiver object. +* Further operations should not be attempted on the destroyed object.
+* This function should only be called after a call to +* osm_gir_rcv_construct or osm_gir_rcv_init. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_construct, +* osm_gir_rcv_init +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_init +* NAME +* osm_gir_rcv_init +* +* DESCRIPTION +* The osm_gir_rcv_init function initializes a +* GUIDInfo Record Receiver object for use. +* +* SYNOPSIS +*/ +ib_api_status_t +osm_gir_rcv_init( + IN osm_gir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN const osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object to initialize. +* +* p_resp +* [in] Pointer to an osm_sa_resp_t object. +* +* p_subn +* [in] Pointer to the Subnet object for this subnet. +* +* p_log +* [in] Pointer to the log object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* +* RETURN VALUES +* CL_SUCCESS if the GUIDInfo Record Receiver object was initialized +* successfully. +* +* NOTES +* Allows calling other GUIDInfo Record Receiver methods. +* +* SEE ALSO +* GUIDInfo Record Receiver object, osm_gir_rcv_construct, +* osm_gir_rcv_destroy +*********/ + +/****f* OpenSM: GUIDInfo Record Receiver/osm_gir_rcv_process +* NAME +* osm_gir_rcv_process +* +* DESCRIPTION +* Process the GUIDInfoRecord attribute. +* +* SYNOPSIS +*/ +void +osm_gir_rcv_process( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object. +* +* p_madw +* [in] Pointer to the MAD Wrapper containing the MAD +* that contains the node's GUIDInfoRecord attribute. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* This function processes a GUIDInfoRecord attribute.
+* +* SEE ALSO +* GUIDInfo Record Receiver, GUIDInfo Record Response Controller +*********/ + +END_C_DECLS + +#endif /* _OSM_GIR_RCV_H_ */ Property changes on: osm/include/opensm/osm_sa_guidinfo_record.h ___________________________________________________________________ Name: svn:keywords + Id Index: osm/include/opensm/osm_helper.h =================================================================== --- osm/include/opensm/osm_helper.h (revision 5193) +++ osm/include/opensm/osm_helper.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -247,6 +247,12 @@ osm_dump_portinfo_record( IN const osm_log_level_t log_level ); void +osm_dump_guidinfo_record( + IN osm_log_t* const p_log, + IN const ib_guidinfo_record_t* const p_gir, + IN const osm_log_level_t log_level ); + +void osm_dump_inform_info( IN osm_log_t* const p_log, IN const ib_inform_info_t* const p_ii, Index: osm/include/opensm/osm_sa.h =================================================================== --- osm/include/opensm/osm_sa.h (revision 5193) +++ osm/include/opensm/osm_sa.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -64,6 +64,7 @@ #include #include #include +#include #include #include #include @@ -154,6 +155,8 @@ typedef struct _osm_sa osm_nr_rcv_ctrl_t nr_rcv_ctrl; osm_pir_rcv_t pir_rcv; osm_pir_rcv_ctrl_t pir_rcv_ctrl; + osm_gir_rcv_t gir_rcv; + osm_gir_rcv_ctrl_t gir_rcv_ctrl; osm_lr_rcv_t lr_rcv; osm_lr_rcv_ctrl_t lr_rcv_ctrl; osm_pr_rcv_t pr_rcv; Index: osm/include/opensm/osm_msgdef.h =================================================================== --- osm/include/opensm/osm_msgdef.h (revision 5193) +++ osm/include/opensm/osm_msgdef.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -191,6 +191,7 @@ enum OSM_MSG_MAD_PKEY, OSM_MSG_MAD_VL_ARB, OSM_MSG_MAD_SLVL, + OSM_MSG_MAD_GUIDINFO_RECORD, OSM_MSG_MAX }; Index: osm/include/opensm/osm_sa_guidinfo_record_ctrl.h =================================================================== --- osm/include/opensm/osm_sa_guidinfo_record_ctrl.h (revision 0) +++ osm/include/opensm/osm_sa_guidinfo_record_ctrl.h (revision 0) @@ -0,0 +1,233 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Declaration of osm_sa_gir_rec_rcv_ctrl_t. + * This object represents a controller that receives the IBA GUID Info + * record query from SA client. + * This object is part of the OpenSM family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + + +#ifndef _OSM_GIR_CTRL_H_ +#define _OSM_GIR_CTRL_H_ + + +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/GUID Info Record Receive Controller +* NAME +* GUID Info Record Receive Controller +* +* DESCRIPTION +* The GUID Info Record Receive Controller object encapsulates +* the information needed to handle GUID Info record query from SA client. +* +* The GUID Info Record Receive Controller object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ +/****s* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_t +* NAME +* osm_gir_rcv_ctrl_t +* +* DESCRIPTION +* GUID Info Record Receive Controller structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_gir_rcv_ctrl +{ + osm_gir_rcv_t *p_rcv; + osm_log_t *p_log; + cl_dispatcher_t *p_disp; + cl_disp_reg_handle_t h_disp; + +} osm_gir_rcv_ctrl_t; +/* +* FIELDS +* p_rcv +* Pointer to the GUID Info Record Receiver object. +* +* p_log +* Pointer to the log object. +* +* p_disp +* Pointer to the Dispatcher. +* +* h_disp +* Handle returned from dispatcher registration. +* +* SEE ALSO +* GUID Info Record Receive Controller object +* GUID Info Record Receiver object +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rec_rcv_ctrl_construct +* NAME +* osm_gir_rcv_ctrl_construct +* +* DESCRIPTION +* This function constructs a GUID Info Record Receive Controller object. 
+* +* SYNOPSIS +*/ +void osm_gir_rcv_ctrl_construct( + IN osm_gir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a GUID Info Record Receive Controller +* object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_gir_rcv_ctrl_init, osm_gir_rcv_ctrl_destroy +* +* Calling osm_gir_rcv_ctrl_construct is a prerequisite to calling any other +* method except osm_gir_rcv_ctrl_init. +* +* SEE ALSO +* GUID Info Record Receive Controller object, osm_gir_rcv_ctrl_init, +* osm_gir_rcv_ctrl_destroy +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_destroy +* NAME +* osm_gir_rcv_ctrl_destroy +* +* DESCRIPTION +* The osm_gir_rcv_ctrl_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_gir_rcv_ctrl_destroy( + IN osm_gir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* GUIDInfo Record Receive Controller object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_gir_rcv_ctrl_construct or osm_gir_rcv_ctrl_init. +* +* SEE ALSO +* GUIDInfo Record Receive Controller object, osm_gir_rcv_ctrl_construct, +* osm_gir_rcv_ctrl_init +*********/ + +/****f* OpenSM: GUID Info Record Receive Controller/osm_gir_rcv_ctrl_init +* NAME +* osm_gir_rcv_ctrl_init +* +* DESCRIPTION +* The osm_gir_rcv_ctrl_init function initializes a +* GUID Info Record Receive Controller object for use. +* +* SYNOPSIS +*/ +ib_api_status_t osm_gir_rcv_ctrl_init( + IN osm_gir_rcv_ctrl_t* const p_ctrl, + IN osm_gir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_gir_rcv_ctrl_t object to initialize. 
+* +* p_rcv +* [in] Pointer to an osm_gir_rcv_t object. +* +* p_log +* [in] Pointer to the log object. +* +* p_disp +* [in] Pointer to the OpenSM central Dispatcher. +* +* RETURN VALUES +* CL_SUCCESS if the GUID Info Record Receive Controller object was initialized +* successfully. +* +* NOTES +* Allows calling other GUID Info Record Receive Controller methods. +* +* SEE ALSO +* GUID Info Record Receive Controller object, osm_gir_rcv_ctrl_construct, +* osm_gir_rcv_ctrl_destroy +*********/ + +END_C_DECLS + +#endif /* _OSM_GIR_CTRL_H_ */ Property changes on: osm/include/opensm/osm_sa_guidinfo_record_ctrl.h ___________________________________________________________________ Name: svn:keywords + Id Index: osm/include/Makefile.am =================================================================== --- osm/include/Makefile.am (revision 5193) +++ osm/include/Makefile.am (working copy) @@ -5,6 +5,7 @@ nobase_pkginclude_HEADERS = iba/ib_types EXTRA_DIST = \ $(srcdir)/opensm/osm_version.h \ $(srcdir)/opensm/osm_sa_portinfo_record_ctrl.h \ + $(srcdir)/opensm/osm_sa_guidinfo_record_ctrl.h \ $(srcdir)/opensm/osm_sa_path_record.h \ $(srcdir)/opensm/osm_lid_mgr.h \ $(srcdir)/opensm/osm_vl_arb_rcv.h \ @@ -33,6 +34,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sa_pkey_record_ctrl.h \ $(srcdir)/opensm/osm_helper.h \ $(srcdir)/opensm/osm_sa_portinfo_record.h \ + $(srcdir)/opensm/osm_sa_guidinfo_record.h \ $(srcdir)/opensm/osm_sa_service_record.h \ $(srcdir)/opensm/osm_sa_response.h \ $(srcdir)/opensm/osm_node.h \ Index: osm/include/iba/ib_types.h =================================================================== --- osm/include/iba/ib_types.h (revision 5193) +++ osm/include/iba/ib_types.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -1109,6 +1109,18 @@ ib_class_is_vendor_specific( #define IB_MAD_ATTR_SMINFO_RECORD (CL_NTOH16(0x0018)) /**********/ +/****d* IBA Base: Constants/IB_MAD_ATTR_GUIDINFO_RECORD +* NAME +* IB_MAD_ATTR_GUIDINFO_RECORD +* +* DESCRIPTION +* GuidInfoRecord attribute (15.2.5) +* +* SOURCE +*/ +#define IB_MAD_ATTR_GUIDINFO_RECORD (CL_NTOH16(0x0030)) +/**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_VENDOR_DIAG * NAME * IB_MAD_ATTR_VENDOR_DIAG @@ -1120,6 +1132,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_VENDOR_DIAG (CL_NTOH16(0x0030)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_LED_INFO * NAME * IB_MAD_ATTR_LED_INFO @@ -1131,6 +1144,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_LED_INFO (CL_NTOH16(0x0031)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_RECORD * NAME * IB_MAD_ATTR_SERVICE_RECORD @@ -1142,6 +1156,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SERVICE_RECORD (CL_NTOH16(0x0031)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_LFT_RECORD * NAME * IB_MAD_ATTR_LFT_RECORD @@ -1153,6 +1168,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_LFT_RECORD (CL_NTOH16(0x0015)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD * NAME * IB_MAD_ATTR_PKEYTBL_RECORD @@ -1164,6 +1180,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PKEY_TBL_RECORD (CL_NTOH16(0x0033)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PATH_RECORD * NAME * IB_MAD_ATTR_PATH_RECORD @@ -1175,6 +1192,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PATH_RECORD (CL_NTOH16(0x0035)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_VLARB_RECORD * NAME * IB_MAD_ATTR_VLARB_RECORD @@ -1186,6 +1204,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_VLARB_RECORD (CL_NTOH16(0x0036)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_RECORD * NAME * IB_MAD_ATTR_SLVL_RECORD @@ -1197,6 +1216,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SLVL_RECORD 
(CL_NTOH16(0x0013)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_MCMEMBER_RECORD * NAME * IB_MAD_ATTR_MCMEMBER_RECORD @@ -1208,6 +1228,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_MCMEMBER_RECORD (CL_NTOH16(0x0038)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TRACE_RECORD * NAME * IB_MAD_ATTR_MTRACE_RECORD @@ -1219,6 +1240,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TRACE_RECORD (CL_NTOH16(0x0039)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_MULTIPATH_RECORD * NAME * IB_MAD_ATTR_MULTIPATH_RECORD @@ -1230,6 +1252,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_MULTIPATH_RECORD (CL_NTOH16(0x003A)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD * NAME * IB_MAD_ATTR_SVC_ASSOCIATION_RECORD @@ -1241,6 +1264,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_NTOH16(0x003B)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_IO_UNIT_INFO * NAME * IB_MAD_ATTR_IO_UNIT_INFO @@ -1252,6 +1276,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_IO_UNIT_INFO (CL_NTOH16(0x0010)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_IO_CONTROLLER_PROFILE * NAME * IB_MAD_ATTR_IO_CONTROLLER_PROFILE @@ -1263,6 +1288,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_IO_CONTROLLER_PROFILE (CL_NTOH16(0x0011)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_ENTRIES * NAME * IB_MAD_ATTR_SERVICE_ENTRIES @@ -1274,6 +1300,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SERVICE_ENTRIES (CL_NTOH16(0x0012)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT * NAME * IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT @@ -1285,6 +1312,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT (CL_NTOH16(0x0020)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PREPARE_TO_TEST * NAME * IB_MAD_ATTR_PREPARE_TO_TEST @@ -1296,6 +1324,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_PREPARE_TO_TEST 
(CL_NTOH16(0x0021)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_ONCE * NAME * IB_MAD_ATTR_TEST_DEVICE_ONCE @@ -1307,6 +1336,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TEST_DEVICE_ONCE (CL_NTOH16(0x0022)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_LOOP * NAME * IB_MAD_ATTR_TEST_DEVICE_LOOP @@ -1318,6 +1348,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_TEST_DEVICE_LOOP (CL_NTOH16(0x0023)) /**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_DIAG_CODE * NAME * IB_MAD_ATTR_DIAG_CODE @@ -1341,6 +1372,7 @@ ib_class_is_vendor_specific( */ #define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_NTOH16(0x003B)) /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_CA * NAME * IB_NODE_TYPE_CA @@ -1352,6 +1384,7 @@ ib_class_is_vendor_specific( */ #define IB_NODE_TYPE_CA 0x01 /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_SWITCH * NAME * IB_NODE_TYPE_SWITCH @@ -1363,6 +1396,7 @@ ib_class_is_vendor_specific( */ #define IB_NODE_TYPE_SWITCH 0x02 /**********/ + /****d* IBA Base: Constants/IB_NODE_TYPE_ROUTER * NAME * IB_NODE_TYPE_ROUTER @@ -1386,6 +1420,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_CA (CL_NTOH32(0x000001)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SWITCH * NAME * IB_NOTICE_NODE_TYPE_SWITCH @@ -1397,6 +1432,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_SWITCH (CL_NTOH32(0x000002)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_ROUTER * NAME * IB_NOTICE_NODE_TYPE_ROUTER @@ -1408,6 +1444,7 @@ ib_class_is_vendor_specific( */ #define IB_NOTICE_NODE_TYPE_ROUTER (CL_NTOH32(0x000003)) /**********/ + /****d* IBA Base: Constants/IB_NOTICE_NODE_TYPE_SUBN_MGMT * NAME * IB_NOTICE_NODE_TYPE_SUBN_MGMT @@ -2148,18 +2185,22 @@ typedef struct _ib_path_rec #define IB_VLA_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_VLA_COMPMASK_OUT_PORT (CL_HTON64(((uint64_t)1)<<1)) #define IB_VLA_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<2)) 
+ /* SLtoVL Mapping Record Masks */ #define IB_SLVL_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_SLVL_COMPMASK_IN_PORT (CL_HTON64(((uint64_t)1)<<1)) #define IB_SLVL_COMPMASK_OUT_PORT (CL_HTON64(((uint64_t)1)<<2)) + /* P_Key Table Record Masks */ #define IB_PKEY_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_PKEY_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) #define IB_PKEY_COMPMASK_PORT (CL_HTON64(((uint64_t)1)<<2)) -/* LFT Record MASKS */ + +/* LFT Record Masks */ #define IB_LFTR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_LFTR_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) -/* ModeInfo Record MASKS */ + +/* NodeInfo Record Masks */ #define IB_NR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_NR_COMPMASK_RESERVED1 (CL_HTON64(((uint64_t)1)<<1)) #define IB_NR_COMPMASK_BASEVERSION (CL_HTON64(((uint64_t)1)<<2)) @@ -2175,6 +2216,7 @@ typedef struct _ib_path_rec #define IB_NR_COMPMASK_PORTNUM (CL_HTON64(((uint64_t)1)<<12)) #define IB_NR_COMPMASK_VENDID (CL_HTON64(((uint64_t)1)<<13)) #define IB_NR_COMPMASK_NODEDESC (CL_HTON64(((uint64_t)1)<<14)) + /* Service Record Component Mask Sec 15.2.5.14 Ver 1.1*/ #define IB_SR_COMPMASK_SID (CL_HTON64(((uint64_t)1)<<0)) #define IB_SR_COMPMASK_SGID (CL_HTON64(((uint64_t)1)<<1)) @@ -2213,6 +2255,7 @@ typedef struct _ib_path_rec #define IB_SR_COMPMASK_SDATA32_3 (CL_HTON64(((uint64_t)1)<<34)) #define IB_SR_COMPMASK_SDATA64_0 (CL_HTON64(((uint64_t)1)<<35)) #define IB_SR_COMPMASK_SDATA64_1 (CL_HTON64(((uint64_t)1)<<36)) + /* Port Info Record Component Masks */ #define IB_PIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_PIR_COMPMASK_PORTNUM (CL_HTON64(((uint64_t)1)<<1)) @@ -2263,6 +2306,7 @@ typedef struct _ib_path_rec #define IB_PIR_COMPMASK_RESPTIME (CL_HTON64(((uint64_t)1)<<46)) #define IB_PIR_COMPMASK_LOCALPHYERR (CL_HTON64(((uint64_t)1)<<47)) #define IB_PIR_COMPMASK_OVERRUNERR (CL_HTON64(((uint64_t)1)<<48)) + /* Multicast Member Record Component Masks */ #define IB_MCR_COMPMASK_GID 
(CL_HTON64(((uint64_t)1)<<0)) #define IB_MCR_COMPMASK_MGID (CL_HTON64(((uint64_t)1)<<0)) @@ -2284,6 +2328,20 @@ typedef struct _ib_path_rec #define IB_MCR_COMPMASK_JOIN_STATE (CL_HTON64(((uint64_t)1)<<16)) #define IB_MCR_COMPMASK_PROXY (CL_HTON64(((uint64_t)1)<<17)) +/* GUID Info Record Component Masks */ +#define IB_GIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) +#define IB_GIR_COMPMASK_BLOCKNUM (CL_HTON64(((uint64_t)1)<<1)) +#define IB_GIR_COMPMASK_RESV1 (CL_HTON64(((uint64_t)1)<<2)) +#define IB_GIR_COMPMASK_RESV2 (CL_HTON64(((uint64_t)1)<<3)) +#define IB_GIR_COMPMASK_GID0 (CL_HTON64(((uint64_t)1)<<4)) +#define IB_GIR_COMPMASK_GID1 (CL_HTON64(((uint64_t)1)<<5)) +#define IB_GIR_COMPMASK_GID2 (CL_HTON64(((uint64_t)1)<<6)) +#define IB_GIR_COMPMASK_GID3 (CL_HTON64(((uint64_t)1)<<7)) +#define IB_GIR_COMPMASK_GID4 (CL_HTON64(((uint64_t)1)<<8)) +#define IB_GIR_COMPMASK_GID5 (CL_HTON64(((uint64_t)1)<<9)) +#define IB_GIR_COMPMASK_GID6 (CL_HTON64(((uint64_t)1)<<10)) +#define IB_GIR_COMPMASK_GID7 (CL_HTON64(((uint64_t)1)<<11)) + /****f* IBA Base: Types/ib_path_rec_init_local * NAME * ib_path_rec_init_local @@ -5383,6 +5441,17 @@ typedef struct _ib_guid_info #include /************/ +#include +typedef struct _ib_guidinfo_record +{ + ib_net16_t lid; + uint8_t block_num; + uint8_t resv; + uint32_t reserved; + ib_guid_info_t guid_info; +} PACK_SUFFIX ib_guidinfo_record_t; +#include + #define IB_NUM_PKEY_ELEMENTS_IN_BLOCK 32 /****s* IBA Base: Types/ib_pkey_table_t * NAME Index: osm/opensm/osm_sa_guidinfo_record.c =================================================================== --- osm/opensm/osm_sa_guidinfo_record.c (revision 0) +++ osm/opensm/osm_sa_guidinfo_record.c (revision 0) @@ -0,0 +1,624 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Implementation of osm_gir_rcv_t. + * This object represents the GUIDInfoRecord Receiver object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +/* + Next available error code: 0x403 +*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OSM_GIR_RCV_POOL_MIN_SIZE 32 +#define OSM_GIR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_gir_item +{ + cl_pool_item_t pool_item; + ib_guidinfo_record_t rec; + +} osm_gir_item_t; + +typedef struct _osm_gir_search_ctxt +{ + const ib_guidinfo_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + osm_gir_rcv_t* p_rcv; + const osm_physp_t* p_req_physp; + +} osm_gir_search_ctxt_t; + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_construct( + IN osm_gir_rcv_t* const p_rcv ) +{ + cl_memclr( p_rcv, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_destroy( + IN osm_gir_rcv_t* const p_rcv ) +{ + OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_destroy ); + cl_qlock_pool_destroy( &p_rcv->pool ); + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_gir_rcv_init( + IN osm_gir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN const osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ) +{ + ib_api_status_t status; + + OSM_LOG_ENTER( p_log, osm_gir_rcv_init ); + + osm_gir_rcv_construct( p_rcv ); + + p_rcv->p_log = p_log; + p_rcv->p_subn = p_subn; + p_rcv->p_lock = p_lock; + p_rcv->p_resp = p_resp; + p_rcv->p_mad_pool = p_mad_pool; + + status = cl_qlock_pool_init( 
&p_rcv->pool, + OSM_GIR_RCV_POOL_MIN_SIZE, + 0, + OSM_GIR_RCV_POOL_GROW_SIZE, + sizeof(osm_gir_item_t), + NULL, NULL, NULL ); + + OSM_LOG_EXIT( p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static ib_api_status_t +__osm_gir_rcv_new_gir( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_node_t* const p_node, + IN cl_qlist_t* const p_list, + IN ib_net64_t const match_port_guid, + IN ib_net16_t const match_lid, + IN const osm_physp_t* const p_req_physp, + IN uint8_t const block_num ) +{ + osm_gir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_gir_rcv_new_gir ); + + p_rec_item = (osm_gir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_gir_rcv_new_gir: ERR 5102: " + "cl_qlock_pool_get failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_gir_rcv_new_gir: " + "New GUIDInfoRecord: lid 0x%X, block num %d\n", + cl_ntoh16( match_lid ), block_num ); + } + + cl_memclr( &p_rec_item->rec, sizeof( p_rec_item->rec ) ); + + p_rec_item->rec.lid = match_lid; + p_rec_item->rec.block_num = block_num; + if (!block_num) + p_rec_item->rec.guid_info.guid[0] = osm_physp_get_port_guid( p_req_physp ); + + cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +void +__osm_sa_gir_create_gir( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_node_t* const p_node, + IN cl_qlist_t* const p_list, + IN ib_net64_t const match_port_guid, + IN ib_net16_t const match_lid, + IN const osm_physp_t* 
const p_req_physp, + IN uint8_t const match_block_num ) +{ + const osm_physp_t* p_physp; + uint8_t port_num; + uint8_t num_ports; + uint16_t match_lid_ho; + uint16_t lid_ho; + uint16_t base_lid_ho; + uint16_t max_lid_ho; + uint8_t lmc; + ib_net64_t port_guid; + ib_api_status_t status; + const ib_port_info_t* p_pi; + uint8_t block_num, start_block_num, end_block_num, num_blocks; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_gir_create_gir ); + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_gir_create_gir: " + "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", + cl_ntoh16( match_lid ), + cl_ntoh64( match_port_guid ) + ); + } + + /* + For switches, do not return the GUIDInfo record(s) + for each port on the switch, just for port 0. + */ + if( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) + num_ports = 1; + else + num_ports = osm_node_get_num_physp( p_node ); + + for( port_num = 0; port_num < num_ports; port_num++ ) + { + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + + if( !osm_physp_is_valid( p_physp ) ) + continue; + + /* Check to see if the found p_physp and the requestor physp + share a pkey. 
If not - continue */ + if (!osm_physp_share_pkey( p_rcv->p_log, p_physp, p_req_physp ) ) + continue; + + port_guid = osm_physp_get_port_guid( p_physp ); + + if( match_port_guid && ( port_guid != match_port_guid ) ) + continue; + + p_pi = osm_physp_get_port_info_ptr( p_physp ); + num_blocks = p_pi->guid_cap / 8; + if ( p_pi->guid_cap % 8 ) + num_blocks++; + if (match_block_num == 255) + { + start_block_num = 0; + end_block_num = num_blocks - 1; + } + else + { + if (match_block_num >= num_blocks) + continue; + end_block_num = start_block_num = match_block_num; + } + + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); + lmc = osm_physp_get_lmc( p_physp ); + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); + match_lid_ho = cl_ntoh16( match_lid ); + + if( match_lid_ho ) + { + /* + We validate that the lid belongs to this node. + */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + /* the *_ho values are already in host order; do not swap them again */ + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_gir_create_gir: " + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", + base_lid_ho, + match_lid_ho, + max_lid_ho + ); + } + + if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) + { + /* + Ignore return code for now. + */ + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, match_lid, + p_physp, block_num ); + } + } + else + { + /* + For every lid value create the GUIDInfo record(s). 
+ */ + for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) + { + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + { + status = __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, cl_hton16( lid_ho ), + p_physp, block_num ); + if( status != IB_SUCCESS ) + break; + } + } + } + } + + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +void +__osm_sa_gir_by_comp_mask_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + const osm_gir_search_ctxt_t* const p_ctxt = (osm_gir_search_ctxt_t *)context; + const osm_node_t* const p_node = (osm_node_t*)p_map_item; + const ib_guidinfo_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec; + const osm_physp_t* const p_req_physp = p_ctxt->p_req_physp; + osm_gir_rcv_t* const p_rcv = p_ctxt->p_rcv; + const ib_guid_info_t* p_comp_gi; + ib_net64_t const comp_mask = p_ctxt->comp_mask; + ib_net64_t match_port_guid = 0; + ib_net16_t match_lid = 0; + uint8_t match_block_num = 255; + + OSM_LOG_ENTER( p_ctxt->p_rcv->p_log, __osm_sa_gir_by_comp_mask_cb); + + if( comp_mask & IB_GIR_COMPMASK_LID ) + match_lid = p_rcvd_rec->lid; + + if( comp_mask & IB_GIR_COMPMASK_BLOCKNUM ) + match_block_num = p_rcvd_rec->block_num; + + p_comp_gi = &p_rcvd_rec->guid_info; + /* Different rule for block 0 v. 
other blocks */ + if( comp_mask & IB_GIR_COMPMASK_GID0 ) + { + if ( !p_rcvd_rec->block_num ) + match_port_guid = osm_physp_get_port_guid( p_req_physp ); + if ( p_comp_gi->guid[0] != match_port_guid ) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID1 ) + { + if ( p_comp_gi->guid[1] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID2 ) + { + if ( p_comp_gi->guid[2] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID3 ) + { + if ( p_comp_gi->guid[3] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID4 ) + { + if ( p_comp_gi->guid[4] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID5 ) + { + if ( p_comp_gi->guid[5] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID6 ) + { + if ( p_comp_gi->guid[6] != 0) + goto Exit; + } + + if( comp_mask & IB_GIR_COMPMASK_GID7 ) + { + if ( p_comp_gi->guid[7] != 0) + goto Exit; + } + + __osm_sa_gir_create_gir( p_rcv, p_node, p_ctxt->p_list, + match_port_guid, match_lid, p_req_physp, + match_block_num ); + + Exit: + OSM_LOG_EXIT( p_ctxt->p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_process( + IN osm_gir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + const ib_sa_mad_t* p_rcvd_mad; + const ib_guidinfo_record_t* p_rcvd_rec; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + ib_guidinfo_record_t* p_resp_rec; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t trim_num_rec; +#endif + uint32_t i; + osm_gir_search_ctxt_t context; + osm_gir_item_t* p_rec_item; + ib_api_status_t status; + osm_physp_t* p_req_physp; + + CL_ASSERT( p_rcv ); + + OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); + + CL_ASSERT( p_madw ); + + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = (ib_guidinfo_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad ); + + CL_ASSERT( 
p_rcvd_mad->attr_id == IB_MAD_ATTR_GUIDINFO_RECORD ); + + /* update the requestor physical port. */ + p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + osm_madw_get_mad_addr_ptr(p_madw) ); + if (p_req_physp == NULL) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5104: " + "Cannot find requestor physical port\n" ); + goto Exit; + } + + if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) && + (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5105: " + "Unsupported Method (%s)\n", + ib_get_sa_method_str( p_rcvd_mad->method ) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } + + if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_guidinfo_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + cl_plock_acquire( p_rcv->p_lock ); + + cl_qmap_apply_func( &p_rcv->p_subn->node_guid_tbl, + __osm_sa_gir_by_comp_mask_cb, + &context ); + + cl_plock_release( p_rcv->p_lock ); + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! + */ + if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && + (num_rec > 1)) { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); + + /* need to set the mem free ... 
*/ + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_guidinfo_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_gir_rcv_process: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_gir_rcv_process: " + "Returning %u records\n", num_rec ); + + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + + /* + * Get a MAD to reply. Address of Mad is in the received mad_wrapper + */ + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_guidinfo_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + + if( !p_resp_madw ) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5106: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + + goto Exit; + } + + p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); + + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. 
+ */ + + cl_memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method = (uint8_t)(p_resp_sa_mad->method | 0x80); + /* C15-0.1.5 - always return SM_Key = 0 (table 151 p 782) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_guidinfo_record_t) ); + + p_resp_rec = (ib_guidinfo_record_t*) + ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + { + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; + } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif + + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec = p_rec_item->rec; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } + + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); + + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + if(status != IB_SUCCESS) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5107: " + "osm_vendor_send. 
status = %s\n", + ib_get_err_str(status)); + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} Property changes on: osm/opensm/osm_sa_guidinfo_record.c ___________________________________________________________________ Name: svn:keywords + Id Index: osm/opensm/osm_helper.c =================================================================== --- osm/opensm/osm_helper.c (revision 5193) +++ osm/opensm/osm_helper.c (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -763,6 +763,50 @@ osm_dump_portinfo_record( /********************************************************************** **********************************************************************/ void +osm_dump_guidinfo_record( + IN osm_log_t* const p_log, + IN const ib_guidinfo_record_t* const p_gir, + IN const osm_log_level_t log_level ) +{ + const ib_guid_info_t * const p_gi = &p_gir->guid_info; + + if( osm_log_is_active( p_log, log_level ) ) + { + osm_log( p_log, log_level, + "GUIDInfo Record dump:\n" + "\t\t\t\tRID\n" + "\t\t\t\tLid.....................0x%X\n" + "\t\t\t\tBlockNum................0x%X\n" + "\t\t\t\tReserved................0x%X\n" + "\t\t\t\tGUIDInfo dump\n" + "\t\t\t\tReserved................0x%X\n" + "\t\t\t\tGUID 0..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 1..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 2..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 3..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 4..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 5..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 6..................0x%016" PRIx64 "\n" + "\t\t\t\tGUID 7..................0x%016" PRIx64 "\n", + cl_ntoh16(p_gir->lid), + p_gir->block_num, + p_gir->resv, + cl_ntoh32(p_gir->reserved), + 
cl_ntoh64(p_gi->guid[0]), + cl_ntoh64(p_gi->guid[1]), + cl_ntoh64(p_gi->guid[2]), + cl_ntoh64(p_gi->guid[3]), + cl_ntoh64(p_gi->guid[4]), + cl_ntoh64(p_gi->guid[5]), + cl_ntoh64(p_gi->guid[6]), + cl_ntoh64(p_gi->guid[7]) + ); + } +} + +/********************************************************************** + **********************************************************************/ +void osm_dump_node_info( IN osm_log_t* const p_log, IN const ib_node_info_t* const p_ni, Index: osm/opensm/osm_sa.c =================================================================== --- osm/opensm/osm_sa.c (revision 5193) +++ osm/opensm/osm_sa.c (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -89,6 +89,9 @@ osm_sa_construct( osm_pir_rcv_construct( &p_sa->pir_rcv ); osm_pir_rcv_ctrl_construct( &p_sa->pir_rcv_ctrl ); + osm_gir_rcv_construct( &p_sa->gir_rcv ); + osm_gir_rcv_ctrl_construct( &p_sa->gir_rcv_ctrl ); + osm_lr_rcv_construct( &p_sa->lr_rcv ); osm_lr_rcv_ctrl_construct( &p_sa->lr_rcv_ctrl ); @@ -135,6 +138,7 @@ osm_sa_shutdown( /* remove any registered dispatcher message */ osm_nr_rcv_ctrl_destroy( &p_sa->nr_rcv_ctrl ); osm_pir_rcv_ctrl_destroy( &p_sa->pir_rcv_ctrl ); + osm_gir_rcv_ctrl_destroy( &p_sa->gir_rcv_ctrl ); osm_lr_rcv_ctrl_destroy( &p_sa->lr_rcv_ctrl ); osm_pr_rcv_ctrl_destroy( &p_sa->pr_rcv_ctrl ); osm_smir_ctrl_destroy( &p_sa->smir_ctrl ); @@ -162,6 +166,7 @@ osm_sa_destroy( osm_nr_rcv_destroy( &p_sa->nr_rcv ); osm_pir_rcv_destroy( &p_sa->pir_rcv ); + osm_gir_rcv_destroy( &p_sa->gir_rcv ); osm_lr_rcv_destroy( &p_sa->lr_rcv ); osm_pr_rcv_destroy( &p_sa->pr_rcv ); osm_smir_rcv_destroy( &p_sa->smir_rcv ); @@ -276,6 +281,24 @@ osm_sa_init( if( status != IB_SUCCESS ) goto Exit; + status = osm_gir_rcv_init( + 
&p_sa->gir_rcv, + &p_sa->resp, + p_sa->p_mad_pool, + p_subn, + p_log, + p_lock ); + if( status != IB_SUCCESS ) + goto Exit; + + status = osm_gir_rcv_ctrl_init( + &p_sa->gir_rcv_ctrl, + &p_sa->gir_rcv, + p_log, + p_disp ); + if( status != IB_SUCCESS ) + goto Exit; + status = osm_lr_rcv_init( &p_sa->lr_rcv, &p_sa->resp, Index: osm/opensm/osm_sa_class_port_info.c =================================================================== --- osm/opensm/osm_sa_class_port_info.c (revision 5193) +++ osm/opensm/osm_sa_class_port_info.c (working copy) @@ -183,7 +183,6 @@ __osm_cpi_rcv_respond( SMInfoRecord, (we do support it - under the table) InformInfoRecord, LinkRecord, (we do support it - under the table) - GuidInfoRecord ServiceAssociationRecord OSM_CAP_IS_SUBN_OPT_MULTI_PATH_SUP: Index: osm/opensm/libopensm.map =================================================================== --- osm/opensm/libopensm.map (revision 5193) +++ osm/opensm/libopensm.map (working copy) @@ -17,6 +17,7 @@ OPENSM_1.0 { osm_dbg_get_capabilities_str; osm_dump_port_info; osm_dump_portinfo_record; + osm_dump_guidinfo_record; osm_dump_node_info; osm_dump_node_record; osm_dump_path_record; Index: osm/opensm/osm_sa_mad_ctrl.c =================================================================== --- osm/opensm/osm_sa_mad_ctrl.c (revision 5193) +++ osm/opensm/osm_sa_mad_ctrl.c (working copy) @@ -205,6 +205,10 @@ __osm_sa_mad_ctrl_process( msg_id = OSM_MSG_MAD_LFT_RECORD; break; + case IB_MAD_ATTR_GUIDINFO_RECORD: + msg_id = OSM_MSG_MAD_GUIDINFO_RECORD; + break; + default: osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sa_mad_ctrl_process: ERR 1A01: " Index: osm/opensm/osm_sa_guidinfo_record_ctrl.c =================================================================== --- osm/opensm/osm_sa_guidinfo_record_ctrl.c (revision 0) +++ osm/opensm/osm_sa_guidinfo_record_ctrl.c (revision 0) @@ -0,0 +1,129 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. 
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: $ + */ + + +/* + * Abstract: + * Implementation of osm_gir_rcv_ctrl_t. + * This object represents the GUIDInfoRecord request controller object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +/* + Next available error code: 0x203 +*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +/********************************************************************** + **********************************************************************/ +void +__osm_gir_rcv_ctrl_disp_callback( + IN void *context, + IN void *p_data ) +{ + /* ignore return status when invoked via the dispatcher */ + osm_gir_rcv_process( ((osm_gir_rcv_ctrl_t*)context)->p_rcv, + (osm_madw_t*)p_data ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_ctrl_construct( + IN osm_gir_rcv_ctrl_t* const p_ctrl ) +{ + cl_memclr( p_ctrl, sizeof(*p_ctrl) ); + p_ctrl->h_disp = CL_DISP_INVALID_HANDLE; +} + +/********************************************************************** + **********************************************************************/ +void +osm_gir_rcv_ctrl_destroy( + IN osm_gir_rcv_ctrl_t* const p_ctrl ) +{ + CL_ASSERT( p_ctrl ); + cl_disp_unregister( p_ctrl->h_disp ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_gir_rcv_ctrl_init( + IN osm_gir_rcv_ctrl_t* const p_ctrl, + IN osm_gir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ) +{ + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_log, osm_gir_rcv_ctrl_init ); + + osm_gir_rcv_ctrl_construct( p_ctrl ); + p_ctrl->p_log = p_log; + p_ctrl->p_rcv = p_rcv; + p_ctrl->p_disp = p_disp; + + p_ctrl->h_disp = cl_disp_register( + p_disp, + OSM_MSG_MAD_GUIDINFO_RECORD, + __osm_gir_rcv_ctrl_disp_callback, + p_ctrl ); + + if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_gir_rcv_ctrl_init: ERR 5201: " + "Dispatcher registration 
failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_log ); + return( status ); +} Property changes on: osm/opensm/osm_sa_guidinfo_record_ctrl.c ___________________________________________________________________ Name: svn:keywords + Id Index: osm/opensm/Makefile.am =================================================================== --- osm/opensm/Makefile.am (revision 5193) +++ osm/opensm/Makefile.am (working copy) @@ -46,6 +46,7 @@ opensm_SOURCES = main.c osm_console.c os osm_sa_path_record.c osm_sa_path_record_ctrl.c \ osm_sa_pkey_record.c osm_sa_pkey_record_ctrl.c \ osm_sa_portinfo_record.c osm_sa_portinfo_record_ctrl.c \ + osm_sa_guidinfo_record.c osm_sa_guidinfo_record_ctrl.c \ osm_sa_response.c osm_sa_service_record.c \ osm_sa_service_record_ctrl.c osm_sa_slvl_record.c \ osm_sa_slvl_record_ctrl.c osm_sa_sminfo_record.c \ _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Mon Jan 30 05:23:43 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 30 Jan 2006 15:23:43 +0200 Subject: [openib-general] [PATCH] osm: support osm_svn_revision.h in case of SVN export Message-ID: <86vew17stc.fsf@mtl066.yok.mtl.com> Hi Hal We are using SVN export when building standalone OpenSM packages. During the SVN export we overwrite the osm_svn_revision.h with the SVN version used for the export. However the makefile override that. This patch avoids this by checking if the svnversion is "exported". 
Eitan Signed-off-by: Eitan Zahavi Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 5202) +++ opensm/Makefile.am (working copy) @@ -13,7 +13,12 @@ if OSMV_OPENIB $(srcdir)/../include/opensm/osm_svn_revision_new.h: echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ svnversion $(srcdir)/.. | tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ + if test `cat $(srcdir)/../include/opensm/osm_svn_revision_new.h | grep exported | wc -l` = 1; \ + then \ + cp $(srcdir)/../include/opensm/osm_svn_revision.h \ + $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ + fi $(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ From ogerlitz at voltaire.com Mon Jan 30 05:52:11 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 30 Jan 2006 15:52:11 +0200 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43D912E4.3020603@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> Message-ID: <43DE1A0B.6030606@voltaire.com> Sean Hefty wrote: > The implementation is complete. The interface to the cache operates > synchronously. If an item is found in the cache, it is returned. If no > item is found, an error is returned. The caller can query the SA > directly in this case. (If we wanted to be fancy, the results of that > query could be copied into the cache, but the cache will update on its > own.) I see. The CMA code (cma_resolve_ib_route) just calls ib_get_path_rec and if the latter fails returns error to its consumer. 
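The lookup-then-fallback behaviour described above (a synchronous cache lookup that returns an error on a miss, leaving the caller free to query the SA directly) could be sketched roughly as follows. This is only an illustration under assumptions: path_cache, cache_lookup_path, sa_query_path and resolve_path are hypothetical names, not the actual ib_get_path_rec or CMA interfaces, and the SA query is a stub so that the sketch builds on its own.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a path record; not the real kernel structure. */
struct path_rec {
    unsigned short dlid;
    unsigned short slid;
};

#define CACHE_SLOTS 4

/* Hypothetical cache: a small fixed array of known path records. */
struct path_cache {
    struct path_rec entries[CACHE_SLOTS];
    size_t count;
};

/* Synchronous cache lookup: returns 0 on a hit, -ENODATA on a miss. */
static int cache_lookup_path(const struct path_cache *cache,
                             unsigned short dlid, struct path_rec *out)
{
    for (size_t i = 0; i < cache->count; i++) {
        if (cache->entries[i].dlid == dlid) {
            *out = cache->entries[i];
            return 0;
        }
    }
    return -ENODATA;
}

/* Stub for a direct SA query (assumption: always succeeds here). */
static int sa_query_path(unsigned short dlid, struct path_rec *out)
{
    out->dlid = dlid;
    out->slid = 1;
    return 0;
}

/* Try the cache first; fall back to the SA only when the cache misses. */
static int resolve_path(const struct path_cache *cache, unsigned short dlid,
                        struct path_rec *out, int *used_sa)
{
    int ret = cache_lookup_path(cache, dlid, out);

    *used_sa = 0;
    if (ret == -ENODATA) {
        *used_sa = 1;
        ret = sa_query_path(dlid, out);
    }
    return ret;
}
```

In this sketch the fallback lives in the caller (resolve_path), not in the cache lookup itself, so a consumer that prefers to fail fast on a miss can call cache_lookup_path alone.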
Two questions come into my mind here: a) is it the final version or you plan to extend the CMA to directly query the SA in that case? b) the code of ib_get_path_rec seems to just search the path in the index and return -ENODATA if nothing is found, does it also trigger an SA lookup for this path? Or. From halr at voltaire.com Mon Jan 30 05:54:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jan 2006 08:54:16 -0500 Subject: [openib-general] Re: [PATCH] osm: support osm_svn_revision.h in case of SVN export In-Reply-To: <86vew17stc.fsf@mtl066.yok.mtl.com> References: <86vew17stc.fsf@mtl066.yok.mtl.com> Message-ID: <1138629255.4453.6717.camel@hal.voltaire.com> On Mon, 2006-01-30 at 08:23, Eitan Zahavi wrote: > Hi Hal > > We are using SVN export when building standalone OpenSM packages. > During the SVN export we overwrite the osm_svn_revision.h with the > SVN version used for the export. > > However the makefile override that. This patch avoids this by checking > if the svnversion is "exported". Thanks. Applied (by hand so you should check it). -- Hal > Eitan > > Signed-off-by: Eitan Zahavi > Index: opensm/Makefile.am > =================================================================== > --- opensm/Makefile.am (revision 5202) > +++ opensm/Makefile.am (working copy) > @@ -13,7 +13,12 @@ if OSMV_OPENIB > $(srcdir)/../include/opensm/osm_svn_revision_new.h: > echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > svnversion $(srcdir)/.. 
| tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h > + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ > + if test `cat $(srcdir)/../include/opensm/osm_svn_revision_new.h | grep exported | wc -l` = 1; \ > + then \ > + cp $(srcdir)/../include/opensm/osm_svn_revision.h \ > + $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ > + fi > > $(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h > if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > From tziporet at mellanox.co.il Mon Jan 30 06:19:40 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 30 Jan 2006 16:19:40 +0200 Subject: [openib-general] ANNOUNCE: mstflint update - test - please ignore Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B0A2@mtlexch01.mtl.com> -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael S. Tsirkin Sent: Monday, January 23, 2006 7:17 PM To: openib-general at openib.org Subject: [openib-general] ANNOUNCE: mstflint update Hi! I have updated mstflint tool with code from mellanox MFT 1.0.1 package. mstflint is a stand-alone firmware burning tool for Mellanox manufactured HCA cards. Some success has been reported with cards from Topspin/Cisco. See the README file under src/userspace/mstflint for more info. From tziporet at mellanox.co.il Mon Jan 30 06:24:37 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 30 Jan 2006 16:24:37 +0200 Subject: [openib-general] test please ignore Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B0A3@mtlexch01.mtl.com> Tziporet From tziporet at mellanox.co.il Mon Jan 30 06:38:25 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 30 Jan 2006 16:38:25 +0200 Subject: [openib-general] gen1 drivers as rpm or tgz? 
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B0A4@mtlexch01.mtl.com> IBGold 1.8.0 does include an access layer. Our FAE will contact you shortly to explain how to use the gen1 access layer. Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Bub Thomas Sent: Monday, January 30, 2006 2:30 PM To: openib-general at openib.org Subject: [openib-general] gen1 drivers as rpm or tgz? Hi, we are planning to replace a HighSpeed GSN (Gigabyte System Network) data interface with IB. We'd like to use the verbs layer with the Access Layer on top. Unfortunately Mellanox, our hardware provider, does not supply an Access Layer with their IBGold 1.8 stack. Yes, I know OpenIB currently supports only gen2 drivers with 2.6 kernels. Unfortunately our whole environment is still on 2.4 kernels under RedHat EL 3 Update 6. It will be another two months before my whole development and target environment is on 2.6 kernels. To speed things up for me as a developer, I'd like to start with the gen1 OpenIB drivers. Is there something like RPMs or tgz archives I can use that supplies the whole IB stack for 2.4 kernels, or an Access Layer extension for the Mellanox IBGold stack? I'd like to avoid downloading via cvs, which I'm not familiar with yet. Thanks Thomas Bub ............................................................ Thomas Bub Grass Valley Germany GmbH Brunnenweg 9 64331 Weiterstadt, Germany Tel: +49 6150 104 147 Fax: +49 6150 104 656 Email: Thomas.Bub at thomson.net www.GrassValley.com ............................................................ 
From halr at voltaire.com Mon Jan 30 06:26:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jan 2006 09:26:08 -0500 Subject: [openib-general] multicast join errors In-Reply-To: References: Message-ID: <1138631167.4453.6934.camel@hal.voltaire.com> Hi Amith, On Sat, 2006-01-28 at 16:18, amith rajith mamidala wrote: > Hi Hal, > > There is only one application running on a node. I am running opensm on > a different node. I am also listing the other processes I observed on > doing a "ps": > > root 3564 11 0 Jan26 ? 00:00:00 [ib_cm/0] > root 3565 11 0 Jan26 ? 00:00:00 [ib_cm/1] > root 1294 11 0 Jan26 ? 00:00:00 [ib_mad1] > root 1295 11 0 Jan26 ? 00:00:00 [ib_mad2] > root 1298 11 0 Jan26 ? 00:00:00 [ib_mad1] > root 1299 11 0 Jan26 ? 00:00:00 [ib_mad2] > > > Thanks, > Amith > > On 28 Jan 2006, Hal Rosenstock wrote: > > > Hi Amith, > > > > On Sat, 2006-01-28 at 12:46, amith rajith mamidala wrote: > > > Hi, > > > > > > I was able to create multicast groups after Hal's fix. But, when I do join > > > subsequently from the same program I am getting a port_alloc error: > > > > > > Jan 28 12:22:12 119632 [AB2223C0] -> osm_vendor_bind: Binding to port > > > 0x6270510000005. > > > -I- Created the Multicast Group: > > > MGID....................0xff13a01cfe800000 : 0x0000000000000000 > > > PortGid.................0xfe80000000000000 : 0x0006270510000005 > > > qkey....................0x0 > > > Mlid....................0xC002 > > > ScopeState..............0x21 > > > Rate....................0x83 > > > Mtu.....................0x84 > > > Jan 28 12:22:12 140486 [AB2223C0] -> osm_vendor_bind: Binding to port > > > 0x6270510000005. > > > > > > ibwarn: [4057] port_alloc: umad port id 0 is already allocated for mthca0 > > > 1 > > > Jan 28 12:22:12 143240 [AB2223C0] -> osm_vendor_open_port: ERR 542C: > > > umad_open_port() failed > > > Jan 28 12:22:12 143253 [AB2223C0] -> osm_vendor_bind: ERR 5424: Unable to > > > Open Port 0x6270510000005. 
> > > Jan 28 12:22:12 143262 [AB2223C0] -> osmv_bind_sa: ERR 5506: Failed to > > > bind to vendor GSI > > > Jan 28 12:22:12 143267 [AB2223C0] -> ibmcgrp_bind: ERR 00137: Unable to > > > bind to SA > > > > > > I am trying to trace the source of this error, > > > > Is this the only IB application running or are there others (and if so, > > what else is running) ? I'm able to do the following on both ports 1 and 2: ibmcgrp -c -g 0xff13a01cfe800000:0000000000000000 --port_num=1 -I- Creating Multicast Group -I- MGID 0xff13a01cfe800000:0000000000000000 -I- Port Num:1 Jan 30 09:23:54 980478 [B7F06720] -> osm_vendor_bind: Binding to port 0x8f10403960559. -I- Created the Multicast Group: MGID....................0xff13a01cfe800000 : 0x0000000000000000 PortGid.................0xfe80000000000000 : 0x0008f10403960559 qkey....................0x0 Mlid....................0xC008 ScopeState..............0x21 Rate....................0x82 Mtu.....................0x84 IBMCGRP: PASS ibmcgrp -c -g 0xff13a01cfe800000:0000000000000000 --port_num=2 -I- Creating Multicast Group -I- MGID 0xff13a01cfe800000:0000000000000000 -I- Port Num:2 Jan 30 09:24:02 804602 [B7F50720] -> osm_vendor_bind: Binding to port 0x8f10403960559. -I- Created the Multicast Group: MGID....................0xff13a01cfe800000 : 0x0000000000000000 PortGid.................0xfe80000000000000 : 0x0008f10403960559 qkey....................0x0 Mlid....................0xC008 ScopeState..............0x21 Rate....................0x82 Mtu.....................0x84 The only difference I see is in the rate but that wouldn't cause the error you are seeing. Can you describe your scenario better so I can recreate it to see what is going on ? Thanks. 
-- Hal > > -- Hal > > > > > Thanks, > > > Amith > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From mst at mellanox.co.il Mon Jan 30 06:53:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 16:53:52 +0200 Subject: [openib-general] Re: [PATCH] osm: support osm_svn_revision.h in case ofSVN export In-Reply-To: <86vew17stc.fsf@mtl066.yok.mtl.com> References: <86vew17stc.fsf@mtl066.yok.mtl.com> Message-ID: <20060130145352.GB31887@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: [PATCH] osm: support osm_svn_revision.h in case ofSVN export > > Hi Hal > > We are using SVN export when building standalone OpenSM packages. > During the SVN export we overwrite the osm_svn_revision.h with the > SVN version used for the export. > > However the makefile override that. This patch avoids this by checking > if the svnversion is "exported". > > Eitan Hi! The way osm_svn_revision_new.h is removed to trigger re-make on the next pass is IMO ugly: thats what .PHONY target is for. --- Simplify Makefile.am (remove an extra target, use -n flag to svnversion) and make it possible to build opensm on platforms without subversion installed. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/management/osm/opensm/Makefile.am =================================================================== --- openib/src/userspace/management/osm/opensm/Makefile.am (revision 5207) +++ openib/src/userspace/management/osm/opensm/Makefile.am (working copy) @@ -10,25 +10,25 @@ endif if OSMV_OPENIB -$(srcdir)/../include/opensm/osm_svn_revision_new.h: - echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - svnversion $(srcdir)/.. 
| tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - if test `cat $(srcdir)/../include/opensm/osm_svn_revision_new.h | grep exported | wc -l` = 1; \ - then \ - cp $(srcdir)/../include/opensm/osm_svn_revision.h \ - $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ +.PHONY: always +$(srcdir)/../include/opensm/osm_svn_revision.h: always + if \ + test '!' -d '$(srcdir)/.svn';\ + then\ + echo Exported svn revision;\ + else\ + echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ + svnversion -n $(srcdir)/.. >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ + $(srcdir)/../include/opensm/osm_svn_revision.h ;\ + then\ + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ + else\ + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h\ + $(srcdir)/../include/opensm/osm_svn_revision.h ;\ + fi ;\ fi - -$(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h - if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - then \ - rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - else \ - mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - fi endif libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From nvshayl at netvision.net.il Mon Jan 30 07:04:37 2006 From: nvshayl at netvision.net.il (Shay Levin) Date: Mon, 30 Jan 2006 17:04:37 +0200 Subject: [openib-general] Re: [PATCH] osm: support osm_svn_revision.h in case ofSVN export Message-ID: test
From rdreier at cisco.com Mon Jan 30 07:45:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 07:45:16 -0800 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 30 Jan 2006 10:59:27 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> Message-ID: Dotan> Hi. does the mthca driver support the request of an async Dotan> event when changing the QP state from RTS -> SQD? (the Dotan> event is actually a notification when the HCA changes the Dotan> QP state from SQD:draining -> SQD:drained) Are you talking about the IB_SQ_DRAINED event? If so the code is there but I never tested it. - R. From dotanb at mellanox.co.il Mon Jan 30 07:52:22 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 30 Jan 2006 17:52:22 +0200 Subject: [openib-general] RE: does the mthca driver support RTS->SQD event request? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC82C@mtlexch01.mtl.com> > > Are you talking about the IB_SQ_DRAINED event? If so the code is > there but I never tested it. > Yes, this is the event i'm talking about. i can't see the code that handles this event in the kernel part of the mthca. we wrote a test that tries to create this event in user level, but the modify QP from RTS to SQD with this flag (IBV_QP_EN_SQD_ASYNC_NOTIFY) enabled fails.
Dotan From halr at voltaire.com Mon Jan 30 07:56:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jan 2006 10:56:58 -0500 Subject: [openib-general] Re: [PATCH] osm: support osm_svn_revision.h in case ofSVN export In-Reply-To: <20060130145352.GB31887@mellanox.co.il> References: <86vew17stc.fsf@mtl066.yok.mtl.com> <20060130145352.GB31887@mellanox.co.il> Message-ID: <1138636616.4453.7439.camel@hal.voltaire.com> On Mon, 2006-01-30 at 09:53, Michael S. Tsirkin wrote: > Quoting r. Eitan Zahavi : > > Subject: [PATCH] osm: support osm_svn_revision.h in case ofSVN export > > > > Hi Hal > > > > We are using SVN export when building standalone OpenSM packages. > > During the SVN export we overwrite the osm_svn_revision.h with the > > SVN version used for the export. > > > > However the makefile override that. This patch avoids this by checking > > if the svnversion is "exported". > > > > Eitan > > Hi! > The way osm_svn_revision_new.h is removed to trigger re-make on the > next pass is IMO ugly: thats what .PHONY target is for. > > --- > > Simplify Makefile.am (remove an extra target, use -n flag to svnversion) > and make it possible to build opensm on platforms without subversion installed. Thanks. Applied. -- Hal > Signed-off-by: Michael S. Tsirkin > > Index: openib/src/userspace/management/osm/opensm/Makefile.am > =================================================================== > --- openib/src/userspace/management/osm/opensm/Makefile.am (revision 5207) > +++ openib/src/userspace/management/osm/opensm/Makefile.am (working copy) > @@ -10,25 +10,25 @@ > endif > > if OSMV_OPENIB > -$(srcdir)/../include/opensm/osm_svn_revision_new.h: > - echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > - svnversion $(srcdir)/.. 
| tr -d '\n' >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > - if test `cat $(srcdir)/../include/opensm/osm_svn_revision_new.h | grep exported | wc -l` = 1; \ > - then \ > - cp $(srcdir)/../include/opensm/osm_svn_revision.h \ > - $(srcdir)/../include/opensm/osm_svn_revision_new.h; \ > +.PHONY: always > +$(srcdir)/../include/opensm/osm_svn_revision.h: always > + if \ > + test '!' -d '$(srcdir)/.svn';\ > + then\ > + echo Exported svn revision;\ > + else\ > + echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ > + svnversion -n $(srcdir)/.. >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ > + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ > + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > + $(srcdir)/../include/opensm/osm_svn_revision.h ;\ > + then\ > + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ;\ > + else\ > + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h\ > + $(srcdir)/../include/opensm/osm_svn_revision.h ;\ > + fi ;\ > fi > - > -$(srcdir)/../include/opensm/osm_svn_revision.h: $(srcdir)/../include/opensm/osm_svn_revision_new.h > - if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ > - then \ > - rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > - else \ > - mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ > - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ > - fi > endif > > libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > From rdreier at cisco.com Mon Jan 30 08:26:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 08:26:28 -0800 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? 
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC82C@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 30 Jan 2006 17:52:22 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC82C@mtlexch01.mtl.com> Message-ID: Dotan> i can't see the code that handles this event in the kernel Dotan> part of the mthca. Dotan> we wrote a test that tries to create this event in user Dotan> level, but the modify QP from RTS to SQD with this flag Dotan> (IBV_QP_EN_SQD_ASYNC_NOTIFY) enabled fails. You're right. The code to dispatch the event when it occurs is there, but there's nothing to request the event in the RTS2SQD command. I can probably add this in the next week or so, or I'm happy to apply a patch if you do it first. - R. From mst at mellanox.co.il Mon Jan 30 08:51:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 18:51:21 +0200 Subject: [openib-general] [PATCH] libmthca: overflow test typo Message-ID: <20060130165121.GC31887@mellanox.co.il> Fix typo in overflow test in libmthca. Noted by Yossi Leybovich. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libmthca/src/qp.c =================================================================== --- openib/src/userspace/libmthca/src/qp.c (revision 5134) +++ openib/src/userspace/libmthca/src/qp.c (working copy) @@ -349,7 +349,7 @@ int mthca_tavor_post_recv(struct ibv_qp size0 = 0; } - if (wq_overflow(&qp->rq, nreq, to_mcq(qp->ibv_qp.send_cq))) { + if (wq_overflow(&qp->rq, nreq, to_mcq(qp->ibv_qp.recv_cq))) { ret = -1; *bad_wr = wr; goto out; @@ -690,7 +690,7 @@ int mthca_arbel_post_recv(struct ibv_qp ind = qp->rq.head & (qp->rq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (wq_overflow(&qp->rq, nreq, to_mcq(qp->ibv_qp.send_cq))) { + if (wq_overflow(&qp->rq, nreq, to_mcq(qp->ibv_qp.recv_cq))) { ret = -1; *bad_wr = wr; goto out; -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From rolandd at cisco.com Mon Jan 30 09:17:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:23 -0800 Subject: [openib-general] [PATCH 0/6] [RFC] Resize CQ support Message-ID: <2006130917.rSg5wqEMWE7vHUJK@cisco.com> Here is a series of patches that adds support for resize CQ operations to the core uverbs/libibverbs code and implements it in the device-specific mthca/libmthca code. This is a provider ABI breaking change to libibverbs, because it changes the layout of struct ibv_context_ops. Source and binary compatibility with applications that link to libibverbs is preserved, but provider libraries will have to be recompiled. libibverbs remains source compatible with unchanged provider libraries. I believe everything is ready to commit now, pending review. I have completed userspace and kernel resize CQ support for mthca/libmthca. However, a review of the mthca code, especially from Mellanox and other people familiar with the hardware, would be great, since that's the trickiest part of the implementation. I'll probably go ahead and commit everything shortly, since it shouldn't break anything major, and we can always fix my mistakes later. I am also including a simple test program and a simple test kernel module that I used to try out the support. I would very much appreciate test reports for these patches -- if someone ambitious wants to extend my code to try to find other corner cases or races that I missed, that would be even better.
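To illustrate the ABI point above, here is a minimal sketch in C of why changing the layout of a table of function pointers forces provider libraries to be rebuilt. The struct names ops_v1 and ops_v2 and their members are hypothetical, cut-down stand-ins, not the real struct ibv_context_ops, and the sketch assumes the new entry is inserted between existing members rather than appended:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "old" ops table (NOT the real struct ibv_context_ops). */
struct ops_v1 {
	int (*create_cq)(int cqe);
	int (*poll_cq)(int num_entries);
	int (*destroy_cq)(void);
};

/* Hypothetical "new" ops table with one member inserted in the middle. */
struct ops_v2 {
	int (*create_cq)(int cqe);
	int (*resize_cq)(int cqe);	/* new member: shifts everything after it */
	int (*poll_cq)(int num_entries);
	int (*destroy_cq)(void);
};

/* Returns nonzero if poll_cq sits at the same offset in both layouts. */
static int poll_cq_offset_unchanged(void)
{
	return offsetof(struct ops_v1, poll_cq) ==
	       offsetof(struct ops_v2, poll_cq);
}

/* Sanity checks on the two layouts. */
static void check_layouts(void)
{
	/* Members before the insertion point keep their offsets. */
	assert(offsetof(struct ops_v1, create_cq) ==
	       offsetof(struct ops_v2, create_cq));
	/* Members after the insertion point move. */
	assert(!poll_cq_offset_unchanged());
}
```

Under these assumptions, a provider compiled against the old layout would store its poll_cq pointer in the slot that a library compiled against the new layout reads as resize_cq, which is why a rebuild of the provider is needed even though its source does not have to change.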
Thanks, Roland From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 1/6] [RFC] core kernel changes for resize CQ In-Reply-To: <2006130917.rSg5wqEMWE7vHUJK@cisco.com> Message-ID: <2006130917.6AdsrKfVZfp3EZDv@cisco.com> Add support to uverbs to handle resizing userspace CQs (completion queues), including adding an ABI for marshalling requests and responses. The kernel midlayer already has ib_resize_cq(). --- --- infiniband/include/rdma/ib_user_verbs.h (revision 5213) +++ infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -265,6 +265,17 @@ struct ib_uverbs_create_cq_resp { __u32 cqe; }; +struct ib_uverbs_resize_cq { + __u64 response; + __u32 cq_handle; + __u32 cqe; + __u64 driver_data[0]; +}; + +struct ib_uverbs_resize_cq_resp { + __u32 cqe; +}; + struct ib_uverbs_poll_cq { __u64 response; __u32 cq_handle; --- infiniband/include/rdma/ib_verbs.h (revision 5213) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -5,7 +5,7 @@ * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -880,7 +880,8 @@ struct ib_device { struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); - int (*resize_cq)(struct ib_cq *cq, int cqe); + int (*resize_cq)(struct ib_cq *cq, int cqe, + struct ib_udata *udata); int (*poll_cq)(struct ib_cq *cq, int num_entries, struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); --- infiniband/core/uverbs_main.c (revision 5213) +++ infiniband/core/uverbs_main.c (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2005 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. @@ -91,6 +91,7 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr, [IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL] = ib_uverbs_create_comp_channel, [IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq, + [IB_USER_VERBS_CMD_RESIZE_CQ] = ib_uverbs_resize_cq, [IB_USER_VERBS_CMD_POLL_CQ] = ib_uverbs_poll_cq, [IB_USER_VERBS_CMD_REQ_NOTIFY_CQ] = ib_uverbs_req_notify_cq, [IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq, --- infiniband/core/verbs.c (revision 5213) +++ infiniband/core/verbs.c (working copy) @@ -5,7 +5,7 @@ * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -325,7 +325,7 @@ int ib_resize_cq(struct ib_cq *cq, int cqe) { return cq->device->resize_cq ? - cq->device->resize_cq(cq, cqe) : -ENOSYS; + cq->device->resize_cq(cq, cqe, NULL) : -ENOSYS; } EXPORT_SYMBOL(ib_resize_cq); --- infiniband/core/uverbs.h (revision 5213) +++ infiniband/core/uverbs.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2005 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. @@ -185,6 +185,7 @@ IB_UVERBS_DECLARE_CMD(reg_mr); IB_UVERBS_DECLARE_CMD(dereg_mr); IB_UVERBS_DECLARE_CMD(create_comp_channel); IB_UVERBS_DECLARE_CMD(create_cq); +IB_UVERBS_DECLARE_CMD(resize_cq); IB_UVERBS_DECLARE_CMD(poll_cq); IB_UVERBS_DECLARE_CMD(req_notify_cq); IB_UVERBS_DECLARE_CMD(destroy_cq); --- infiniband/core/uverbs_cmd.c (revision 5213) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -675,6 +675,46 @@ err: return ret; } +ssize_t ib_uverbs_resize_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_resize_cq cmd; + struct ib_uverbs_resize_cq_resp resp; + struct ib_udata udata; + struct ib_cq *cq; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + mutex_lock(&ib_uverbs_idr_mutex); + + cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); + if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) + goto out; + + ret = cq->device->resize_cq(cq, cmd.cqe, &udata); + if (ret) + goto out; + + memset(&resp, 0, sizeof resp); + resp.cqe = cq->cqe; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + +out: + mutex_unlock(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} + ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 2/6] [RFC] mthca kernel changes for resize CQ In-Reply-To: <2006130917.6AdsrKfVZfp3EZDv@cisco.com> Message-ID: <2006130917.xhWS7MGYQrADiDEW@cisco.com> Add low-level driver support for resizing CQs (both kernel and userspace) to mthca. --- --- infiniband/hw/mthca/mthca_user.h (revision 5213) +++ infiniband/hw/mthca/mthca_user.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -75,6 +75,11 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_resize_cq { + __u32 lkey; + __u32 reserved; +}; + struct mthca_create_srq { __u32 lkey; __u32 db_index; --- infiniband/hw/mthca/mthca_provider.c (revision 5213) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -1,7 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2004 Voltaire, Inc. All rights reserved. * @@ -669,9 +669,9 @@ static struct ib_cq *mthca_create_cq(str } if (context) { - cq->mr.ibmr.lkey = ucmd.lkey; - cq->set_ci_db_index = ucmd.set_db_index; - cq->arm_db_index = ucmd.arm_db_index; + cq->buf.mr.ibmr.lkey = ucmd.lkey; + cq->set_ci_db_index = ucmd.set_db_index; + cq->arm_db_index = ucmd.arm_db_index; } for (nent = 1; nent <= entries; nent <<= 1) @@ -689,6 +689,8 @@ static struct ib_cq *mthca_create_cq(str goto err_free; } + cq->resize_buf = NULL; + return &cq->ibcq; err_free: @@ -707,6 +709,121 @@ err_unmap_set: return ERR_PTR(err); } +static int mthca_alloc_resize_buf(struct mthca_dev *dev, struct mthca_cq *cq, + int entries) +{ + int ret; + + spin_lock_irq(&cq->lock); + if (cq->resize_buf) { + ret = -EBUSY; + goto unlock; + } + + cq->resize_buf = kmalloc(sizeof *cq->resize_buf, GFP_ATOMIC); + if (!cq->resize_buf) { + ret = -ENOMEM; + goto unlock; + } + + cq->resize_buf->state = CQ_RESIZE_ALLOC; + + ret = 0; + +unlock: + spin_unlock_irq(&cq->lock); + + if (ret) + return ret; + + ret = mthca_alloc_cq_buf(dev, &cq->resize_buf->buf, entries); + if (ret) { + spin_lock_irq(&cq->lock); + kfree(cq->resize_buf); + cq->resize_buf = NULL; + spin_unlock_irq(&cq->lock); + return ret; + } + + cq->resize_buf->cqe = entries 
- 1; + + spin_lock_irq(&cq->lock); + cq->resize_buf->state = CQ_RESIZE_READY; + spin_unlock_irq(&cq->lock); + + return 0; +} + +static int mthca_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_resize_cq ucmd; + u32 lkey; + u8 status; + int ret; + + if (entries < 1 || entries > dev->limits.max_cqes) + return -EINVAL; + + entries = roundup_pow_of_two(entries + 1); + if (entries == ibcq->cqe + 1) + return 0; + + if (cq->is_kernel) { + ret = mthca_alloc_resize_buf(dev, cq, entries); + if (ret) + return ret; + lkey = cq->resize_buf->buf.mr.ibmr.lkey; + } else { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + lkey = ucmd.lkey; + } + + ret = mthca_RESIZE_CQ(dev, cq->cqn, lkey, long_log2(entries), &status); + if (status) + ret = -EINVAL; + + if (ret) { + if (cq->resize_buf) { + mthca_free_cq_buf(dev, &cq->resize_buf->buf, + cq->resize_buf->cqe); + kfree(cq->resize_buf); + spin_lock_irq(&cq->lock); + cq->resize_buf = NULL; + spin_unlock_irq(&cq->lock); + } + return ret; + } + + if (cq->is_kernel) { + struct mthca_cq_buf tbuf; + int tcqe; + + spin_lock_irq(&cq->lock); + if (cq->resize_buf->state == CQ_RESIZE_READY) { + mthca_cq_resize_copy_cqes(cq); + tbuf = cq->buf; + tcqe = cq->ibcq.cqe; + cq->buf = cq->resize_buf->buf; + cq->ibcq.cqe = cq->resize_buf->cqe; + } else { + tbuf = cq->resize_buf->buf; + tcqe = cq->resize_buf->cqe; + } + + kfree(cq->resize_buf); + cq->resize_buf = NULL; + spin_unlock_irq(&cq->lock); + + mthca_free_cq_buf(dev, &tbuf, tcqe); + } else + ibcq->cqe = entries - 1; + + return 0; +} + static int mthca_destroy_cq(struct ib_cq *cq) { if (cq->uobject) { @@ -1113,6 +1230,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_DEREG_MR) | (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_RESIZE_CQ) | (1ull << 
IB_USER_VERBS_CMD_DESTROY_CQ) | (1ull << IB_USER_VERBS_CMD_CREATE_QP) | (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | @@ -1154,6 +1272,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.resize_cq = mthca_resize_cq; dev->ib_dev.destroy_cq = mthca_destroy_cq; dev->ib_dev.poll_cq = mthca_poll_cq; dev->ib_dev.get_dma_mr = mthca_get_dma_mr; --- infiniband/hw/mthca/mthca_provider.h (revision 5213) +++ infiniband/hw/mthca/mthca_provider.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * * This software is available to you under a choice of one of two @@ -164,9 +164,11 @@ struct mthca_ah { * - wait_event until ref count is zero * * It is the consumer's responsibilty to make sure that no QP - * operations (WQE posting or state modification) are pending when the + * operations (WQE posting or state modification) are pending when a * QP is destroyed. Also, the consumer must make sure that calls to - * qp_modify are serialized. + * qp_modify are serialized. Similarly, the consumer is responsible + * for ensuring that no CQ resize operations are pending when a CQ + * is destroyed. 
* * Possible optimizations (wait for profile data to see if/where we * have locks bouncing between CPUs): @@ -176,25 +178,40 @@ struct mthca_ah { * send queue and one for the receive queue) */ +struct mthca_cq_buf { + union mthca_buf queue; + struct mthca_mr mr; + int is_direct; +}; + +struct mthca_cq_resize { + struct mthca_cq_buf buf; + int cqe; + enum { + CQ_RESIZE_ALLOC, + CQ_RESIZE_READY, + CQ_RESIZE_SWAPPED + } state; +}; + struct mthca_cq { - struct ib_cq ibcq; - spinlock_t lock; - atomic_t refcount; - int cqn; - u32 cons_index; - int is_direct; - int is_kernel; + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + u32 cons_index; + int is_kernel; + struct mthca_cq_buf buf; /* Next fields are Arbel only */ - int set_ci_db_index; - __be32 *set_ci_db; - int arm_db_index; - __be32 *arm_db; - int arm_sn; + int set_ci_db_index; + __be32 *set_ci_db; + int arm_db_index; + __be32 *arm_db; + int arm_sn; - union mthca_buf queue; - struct mthca_mr mr; - wait_queue_head_t wait; + wait_queue_head_t wait; + struct mthca_cq_resize *resize_buf; }; struct mthca_srq { --- infiniband/hw/mthca/mthca_cmd.c (revision 5213) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1,7 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -1517,6 +1517,37 @@ int mthca_HW2SW_CQ(struct mthca_dev *dev CMD_TIME_CLASS_A, status); } +int mthca_RESIZE_CQ(struct mthca_dev *dev, int cq_num, u32 lkey, u8 log_size, + u8 *status) +{ + struct mthca_mailbox *mailbox; + __be32 *inbox; + int err; + +#define RESIZE_CQ_IN_SIZE 0x40 +#define RESIZE_CQ_LOG_SIZE_OFFSET 0x0c +#define RESIZE_CQ_LKEY_OFFSET 0x1c + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, RESIZE_CQ_IN_SIZE); + /* + * Leave start address fields zeroed out -- mthca assumes that + * MRs for CQs always start at virtual address 0. + */ + MTHCA_PUT(inbox, log_size, RESIZE_CQ_LOG_SIZE_OFFSET); + MTHCA_PUT(inbox, lkey, RESIZE_CQ_LKEY_OFFSET); + + err = mthca_cmd(dev, mailbox->dma, cq_num, 1, CMD_RESIZE_CQ, + CMD_TIME_CLASS_B, status); + + mthca_free_mailbox(dev, mailbox); + return err; +} + int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status) { --- infiniband/hw/mthca/mthca_cmd.h (revision 5213) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -298,6 +299,8 @@ int mthca_SW2HW_CQ(struct mthca_dev *dev int cq_num, u8 *status); int mthca_HW2SW_CQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int cq_num, u8 *status); +int mthca_RESIZE_CQ(struct mthca_dev *dev, int cq_num, u32 lkey, u8 log_size, + u8 *status); int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, --- infiniband/hw/mthca/mthca_dev.h (revision 5213) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -1,7 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2004 Voltaire, Inc. All rights reserved. * @@ -468,6 +468,9 @@ void mthca_cq_event(struct mthca_dev *de enum ib_event_type event_type); void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, struct mthca_srq *srq); +void mthca_cq_resize_copy_cqes(struct mthca_cq *cq); +int mthca_alloc_cq_buf(struct mthca_dev *dev, struct mthca_cq_buf *buf, int nent); +void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq_buf *buf, int cqe); int mthca_alloc_srq(struct mthca_dev *dev, struct mthca_pd *pd, struct ib_srq_attr *attr, struct mthca_srq *srq); --- infiniband/hw/mthca/mthca_cq.c (revision 5213) +++ infiniband/hw/mthca/mthca_cq.c (working copy) @@ -1,7 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems, Inc. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. 
* Copyright (c) 2004 Voltaire, Inc. All rights reserved. * @@ -150,24 +150,29 @@ struct mthca_err_cqe { #define MTHCA_ARBEL_CQ_DB_REQ_NOT (2 << 24) #define MTHCA_ARBEL_CQ_DB_REQ_NOT_MULT (3 << 24) -static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +static inline struct mthca_cqe *get_cqe_from_buf(struct mthca_cq_buf *buf, + int entry) { - if (cq->is_direct) - return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + if (buf->is_direct) + return buf->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); else - return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + return buf->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; } -static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + return get_cqe_from_buf(&cq->buf, entry); +} + +static inline struct mthca_cqe *cqe_sw(struct mthca_cqe *cqe) { - struct mthca_cqe *cqe = get_cqe(cq, i); return MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner ? NULL : cqe; } static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) { - return cqe_sw(cq, cq->cons_index & cq->ibcq.cqe); + return cqe_sw(get_cqe(cq, cq->cons_index & cq->ibcq.cqe)); } static inline void set_cqe_hw(struct mthca_cqe *cqe) @@ -289,7 +294,7 @@ void mthca_cq_clean(struct mthca_dev *de * from our QP and therefore don't need to be checked. */ for (prod_index = cq->cons_index; - cqe_sw(cq, prod_index & cq->ibcq.cqe); + cqe_sw(get_cqe(cq, prod_index & cq->ibcq.cqe)); ++prod_index) if (prod_index == cq->cons_index + cq->ibcq.cqe) break; @@ -324,6 +329,53 @@ void mthca_cq_clean(struct mthca_dev *de wake_up(&cq->wait); } +void mthca_cq_resize_copy_cqes(struct mthca_cq *cq) +{ + int i; + + /* + * In Tavor mode, the hardware keeps the consumer and producer + * indices mod the CQ size. 
Since we might be making the CQ + * bigger, we need to deal with the case where the producer + * index wrapped around before the CQ was resized. + */ + if (!mthca_is_memfree(to_mdev(cq->ibcq.device)) && + cq->ibcq.cqe < cq->resize_buf->cqe) { + cq->cons_index &= cq->ibcq.cqe; + if (cqe_sw(get_cqe(cq, cq->ibcq.cqe))) + cq->cons_index -= cq->ibcq.cqe + 1; + } + + for (i = cq->cons_index; cqe_sw(get_cqe(cq, i & cq->ibcq.cqe)); ++i) + memcpy(get_cqe_from_buf(&cq->resize_buf->buf, + i & cq->resize_buf->cqe), + get_cqe(cq, i & cq->ibcq.cqe), MTHCA_CQ_ENTRY_SIZE); +} + +int mthca_alloc_cq_buf(struct mthca_dev *dev, struct mthca_cq_buf *buf, int nent) +{ + int ret; + int i; + + ret = mthca_buf_alloc(dev, nent * MTHCA_CQ_ENTRY_SIZE, + MTHCA_MAX_DIRECT_CQ_SIZE, + &buf->queue, &buf->is_direct, + &dev->driver_pd, 1, &buf->mr); + if (ret) + return ret; + + for (i = 0; i < nent; ++i) + set_cqe_hw(get_cqe_from_buf(buf, i)); + + return 0; +} + +void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq_buf *buf, int cqe) +{ + mthca_buf_free(dev, (cqe + 1) * MTHCA_CQ_ENTRY_SIZE, &buf->queue, + buf->is_direct, &buf->mr); +} + static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, struct mthca_qp *qp, int wqe_index, int is_send, struct mthca_err_cqe *cqe, @@ -614,11 +666,14 @@ int mthca_poll_cq(struct ib_cq *ibcq, in spin_lock_irqsave(&cq->lock, flags); - for (npolled = 0; npolled < num_entries; ++npolled) { + npolled = 0; +repoll: + while (npolled < num_entries) { err = mthca_poll_one(dev, cq, &qp, &freed, entry + npolled); if (err) break; + ++npolled; } if (freed) { @@ -626,6 +681,43 @@ int mthca_poll_cq(struct ib_cq *ibcq, in update_cons_index(dev, cq, freed); } + /* + * If a CQ resize is in progress and we discovered that the + * old buffer is empty, then peek in the new buffer, and if + * it's not empty, switch to the new buffer and continue + * polling there. 
+ */ + if (unlikely(cq->resize_buf && + cq->resize_buf->state == CQ_RESIZE_READY && + err == -EAGAIN)) { + /* + * In Tavor mode, the hardware keeps the producer + * index modulo the CQ size. Since we might be making + * the CQ bigger, we need to mask our consumer index + * using the size of the old CQ buffer before looking + * in the new CQ buffer. + */ + if (!mthca_is_memfree(dev)) + cq->cons_index &= cq->ibcq.cqe; + + if (cqe_sw(get_cqe_from_buf(&cq->resize_buf->buf, + cq->cons_index & cq->resize_buf->cqe))) { + struct mthca_cq_buf tbuf; + int tcqe; + + tbuf = cq->buf; + tcqe = cq->ibcq.cqe; + cq->buf = cq->resize_buf->buf; + cq->ibcq.cqe = cq->resize_buf->cqe; + + cq->resize_buf->buf = tbuf; + cq->resize_buf->cqe = tcqe; + cq->resize_buf->state = CQ_RESIZE_SWAPPED; + + goto repoll; + } + } + spin_unlock_irqrestore(&cq->lock, flags); return err == 0 || err == -EAGAIN ? npolled : err; @@ -684,22 +776,14 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc return 0; } -static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq) -{ - mthca_buf_free(dev, (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, - &cq->queue, cq->is_direct, &cq->mr); -} - int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq) { - int size = nent * MTHCA_CQ_ENTRY_SIZE; struct mthca_mailbox *mailbox; struct mthca_cq_context *cq_context; int err = -ENOMEM; u8 status; - int i; might_sleep(); @@ -739,14 +823,9 @@ int mthca_init_cq(struct mthca_dev *dev, cq_context = mailbox->buf; if (cq->is_kernel) { - err = mthca_buf_alloc(dev, size, MTHCA_MAX_DIRECT_CQ_SIZE, - &cq->queue, &cq->is_direct, - &dev->driver_pd, 1, &cq->mr); + err = mthca_alloc_cq_buf(dev, &cq->buf, nent); if (err) goto err_out_mailbox; - - for (i = 0; i < nent; ++i) - set_cqe_hw(get_cqe(cq, i)); } spin_lock_init(&cq->lock); @@ -765,7 +844,7 @@ int mthca_init_cq(struct mthca_dev *dev, cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); cq_context->comp_eqn = 
cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); cq_context->pd = cpu_to_be32(pdn); - cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->lkey = cpu_to_be32(cq->buf.mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); if (mthca_is_memfree(dev)) { @@ -803,7 +882,7 @@ int mthca_init_cq(struct mthca_dev *dev, err_out_free_mr: if (cq->is_kernel) - mthca_free_cq_buf(dev, cq); + mthca_free_cq_buf(dev, &cq->buf, cq->ibcq.cqe); err_out_mailbox: mthca_free_mailbox(dev, mailbox); @@ -871,7 +950,7 @@ void mthca_free_cq(struct mthca_dev *dev wait_event(cq->wait, !atomic_read(&cq->refcount)); if (cq->is_kernel) { - mthca_free_cq_buf(dev, cq); + mthca_free_cq_buf(dev, &cq->buf, cq->ibcq.cqe); if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 3/6] [RFC] libibverbs changes for resize CQ In-Reply-To: <2006130917.xhWS7MGYQrADiDEW@cisco.com> Message-ID: <2006130917.AjRWTFxIQXvD1CQL@cisco.com> libibverbs changes to handle resizing CQs. Essentially just adding API and support for passing the call through to provider plug-ins. --- --- libibverbs/include/infiniband/driver.h (revision 5201) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -95,6 +95,8 @@ extern int ibv_cmd_create_cq(struct ibv_ struct ibv_create_cq_resp *resp, size_t resp_size); extern int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); extern int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); +extern int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size); extern int ibv_cmd_destroy_cq(struct ibv_cq *cq); extern int ibv_cmd_create_srq(struct ibv_pd *pd, --- libibverbs/include/infiniband/verbs.h (revision 5201) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -1,7 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2004 Intel Corporation. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -549,6 +549,7 @@ struct ibv_context_ops { int (*poll_cq)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc); int (*req_notify_cq)(struct ibv_cq *cq, int solicited_only); void (*cq_event)(struct ibv_cq *cq); + int (*resize_cq)(struct ibv_cq *cq, int cqe); int (*destroy_cq)(struct ibv_cq *cq); struct ibv_srq * (*create_srq)(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr); @@ -717,6 +718,15 @@ extern struct ibv_cq *ibv_create_cq(stru int comp_vector); /** + * ibv_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. 
+ */ +extern int ibv_resize_cq(struct ibv_cq *cq, int cqe); + +/** * ibv_destroy_cq - Destroy a completion queue */ extern int ibv_destroy_cq(struct ibv_cq *cq); --- libibverbs/include/infiniband/kern-abi.h (revision 5201) +++ libibverbs/include/infiniband/kern-abi.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -343,6 +343,20 @@ struct ibv_req_notify_cq { __u32 solicited; }; +struct ibv_resize_cq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 cq_handle; + __u32 cqe; + __u64 driver_data[0]; +}; + +struct ibv_resize_cq_resp { + __u32 cqe; +}; + struct ibv_destroy_cq { __u32 command; __u16 in_words; --- libibverbs/ChangeLog (revision 5201) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,17 @@ +2006-01-26 Roland Dreier + + * include/infiniband/driver.h, src/cmd.c (ibv_cmd_resize_cq): Add + driver interface for calling resize CQ kernel command. + + * include/infiniband/kern-abi.h: Add resize CQ kernel ABI. + + * include/infiniband/verbs.h, src/verbs.c (ibv_resize_cq): Add + resize CQ library API. This changes the provider ABI, since a new + field is added to struct ibv_context_ops; source compatibility + with provider libraries is preserved, but binaries will have to be + recompiled. Neither source nor binary compatibility with + consumers of libibverbs is affected. 
+ 2006-01-25 Roland Dreier * examples/pingpong.c, examples/pingpong.h, --- libibverbs/src/libibverbs.map (revision 5201) +++ libibverbs/src/libibverbs.map (working copy) @@ -19,6 +19,7 @@ IBVERBS_1.0 { ibv_create_comp_channel; ibv_destroy_comp_channel; ibv_create_cq; + ibv_resize_cq; ibv_destroy_cq; ibv_get_cq_event; ibv_ack_cq_events; @@ -44,6 +45,7 @@ IBVERBS_1.0 { ibv_cmd_create_cq; ibv_cmd_poll_cq; ibv_cmd_req_notify_cq; + ibv_cmd_resize_cq; ibv_cmd_destroy_cq; ibv_cmd_create_srq; ibv_cmd_modify_srq; --- libibverbs/src/verbs.c (revision 5201) +++ libibverbs/src/verbs.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -212,6 +213,14 @@ struct ibv_cq *ibv_create_cq(struct ibv_ return cq; } +int ibv_resize_cq(struct ibv_cq *cq, int cqe) +{ + if (!cq->context->ops.resize_cq) + return ENOSYS; + + return cq->context->ops.resize_cq(cq, cqe); +} + int ibv_destroy_cq(struct ibv_cq *cq) { return cq->context->ops.destroy_cq(cq); --- libibverbs/src/cmd.c (revision 5201) +++ libibverbs/src/cmd.c (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -364,6 +365,23 @@ int ibv_cmd_req_notify_cq(struct ibv_cq return 0; } +int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size) +{ + struct ibv_resize_cq_resp resp; + + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + cmd->cq_handle = cq->handle; + cmd->cqe = cqe; + + if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + cq->cqe = resp.cqe; + + return 0; +} + static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) { struct ibv_destroy_cq_v1 cmd; --- libibverbs/README (revision 5201) +++ libibverbs/README (working copy) @@ -98,6 +98,5 @@ necessary permissions to release your wo TODO ==== - * Completion queue (CQ) resizing need to be implemented. * Memory windows (MWs) need to be implemented. * Query QP, query SRQ and other query verbs need to be implemented. From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 4/6] [RFC] libmthca changes for resize CQ In-Reply-To: <2006130917.AjRWTFxIQXvD1CQL@cisco.com> Message-ID: <2006130917.eokqtUwcdWUA523N@cisco.com> libmthca implementation of resizing CQs. This is the real guts of it -- there is some slightly tricky code to handle the transition from the old CQ buffer to the new one. --- Index: libmthca/src/verbs.c =================================================================== --- libmthca/src/verbs.c (revision 5201) +++ libmthca/src/verbs.c (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -41,6 +41,7 @@ #include #include #include +#include #include #include "mthca.h" @@ -154,6 +155,16 @@ int mthca_dereg_mr(struct ibv_mr *mr) return 0; } +static int align_cq_size(int cqe) +{ + int nent; + + for (nent = 1; nent <= cqe; nent <<= 1) + ; /* nothing */ + + return nent; +} + struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector) @@ -161,27 +172,24 @@ struct ibv_cq *mthca_create_cq(struct ib struct mthca_create_cq cmd; struct mthca_create_cq_resp resp; struct mthca_cq *cq; - int nent; int ret; cq = malloc(sizeof *cq); if (!cq) return NULL; + cq->cons_index = 0; + if (pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE)) goto err; - for (nent = 1; nent <= cqe; nent <<= 1) - ; /* nothing */ - - if (posix_memalign(&cq->buf, to_mdev(context->device)->page_size, - align(nent * MTHCA_CQ_ENTRY_SIZE, to_mdev(context->device)->page_size))) + cqe = align_cq_size(cqe); + cq->buf = mthca_alloc_cq_buf(to_mdev(context->device), cqe); + if (!cq->buf) goto err; - mthca_init_cq_buf(cq, nent); - cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf, - nent * MTHCA_CQ_ENTRY_SIZE, + cqe * MTHCA_CQ_ENTRY_SIZE, 0, IBV_ACCESS_LOCAL_WRITE); if (!cq->mr) goto err_buf; @@ -210,7 +218,7 @@ struct ibv_cq *mthca_create_cq(struct ib cmd.lkey = cq->mr->lkey; cmd.pdn = to_mpd(to_mctx(context)->pd)->pdn; - ret = ibv_cmd_create_cq(context, nent - 1, channel, comp_vector, + ret = ibv_cmd_create_cq(context, cqe - 1, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) @@ -247,6 +255,63 @@ err: return NULL; } +int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) +{ + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_resize_cq cmd; + struct ibv_mr *mr; + void *buf; + int old_cqe; + int ret; + + pthread_spin_lock(&cq->lock); + + cqe = align_cq_size(cqe); + if (cqe == ibcq->cqe + 1) { + ret = 0; + goto out; + } + + buf = 
mthca_alloc_cq_buf(to_mdev(ibcq->context->device), cqe); + if (!buf) { + ret = ENOMEM; + goto out; + } + + mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf, + cqe * MTHCA_CQ_ENTRY_SIZE, + 0, IBV_ACCESS_LOCAL_WRITE); + if (!mr) { + free(buf); + ret = ENOMEM; + goto out; + } + + mr->context = ibcq->context; + + old_cqe = ibcq->cqe; + + cmd.lkey = mr->lkey; + ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); + if (ret) { + mthca_dereg_mr(mr); + free(buf); + goto out; + } + + mthca_cq_resize_copy_cqes(cq, buf, old_cqe); + + mthca_dereg_mr(cq->mr); + free(cq->buf); + + cq->buf = buf; + cq->mr = mr; + +out: + pthread_spin_unlock(&cq->lock); + return ret; +} + int mthca_destroy_cq(struct ibv_cq *cq) { int ret; Index: libmthca/src/mthca.h =================================================================== --- libmthca/src/mthca.h (revision 5201) +++ libmthca/src/mthca.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -281,6 +281,7 @@ extern int mthca_dereg_mr(struct ibv_mr struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); +extern int mthca_resize_cq(struct ibv_cq *cq, int cqe); extern int mthca_destroy_cq(struct ibv_cq *cq); extern int mthca_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); extern int mthca_tavor_arm_cq(struct ibv_cq *cq, int solicited); @@ -288,7 +289,8 @@ extern int mthca_arbel_arm_cq(struct ibv extern void mthca_arbel_cq_event(struct ibv_cq *cq); extern void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq); -extern void mthca_init_cq_buf(struct mthca_cq *cq, int nent); +extern void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int new_cqe); +extern void *mthca_alloc_cq_buf(struct mthca_device *dev, int cqe); extern struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr); Index: libmthca/src/cq.c =================================================================== --- libmthca/src/cq.c (revision 5201) +++ libmthca/src/cq.c (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -38,8 +39,9 @@ #endif /* HAVE_CONFIG_H */ #include -#include +#include #include +#include #include @@ -578,12 +580,38 @@ void mthca_cq_clean(struct mthca_cq *cq, pthread_spin_unlock(&cq->lock); } -void mthca_init_cq_buf(struct mthca_cq *cq, int nent) +void mthca_cq_resize_copy_cqes(struct mthca_cq *cq, void *buf, int old_cqe) +{ + int i; + + /* + * In Tavor mode, the hardware keeps the consumer and producer + * indices mod the CQ size. 
Since we might be making the CQ + * bigger, we need to deal with the case where the producer + * index wrapped around before the CQ was resized. + */ + if (!mthca_is_memfree(cq->ibv_cq.context) && old_cqe < cq->ibv_cq.cqe) { + cq->cons_index &= old_cqe; + if (cqe_sw(cq, old_cqe)) + cq->cons_index -= old_cqe + 1; + } + + for (i = cq->cons_index; cqe_sw(cq, i & old_cqe); ++i) + memcpy(buf + (i & cq->ibv_cq.cqe) * MTHCA_CQ_ENTRY_SIZE, + get_cqe(cq, i & old_cqe), MTHCA_CQ_ENTRY_SIZE); +} + +void *mthca_alloc_cq_buf(struct mthca_device *dev, int nent) { + void *buf; int i; + if (posix_memalign(&buf, dev->page_size, + align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size))) + return NULL; + for (i = 0; i < nent; ++i) - set_cqe_hw(get_cqe(cq, i)); + ((struct mthca_cqe *) buf)[i].owner = MTHCA_CQ_ENTRY_OWNER_HW; - cq->cons_index = 0; + return buf; } Index: libmthca/src/mthca-abi.h =================================================================== --- libmthca/src/mthca-abi.h (revision 5201) +++ libmthca/src/mthca-abi.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -65,6 +66,12 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_resize_cq { + struct ibv_resize_cq ibv_cmd; + __u32 lkey; + __u32 reserved; +}; + struct mthca_create_srq { struct ibv_create_srq ibv_cmd; __u32 lkey; Index: libmthca/src/mthca.c =================================================================== --- libmthca/src/mthca.c (revision 5201) +++ libmthca/src/mthca.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -105,6 +106,7 @@ static struct ibv_context_ops mthca_ctx_ .dereg_mr = mthca_dereg_mr, .create_cq = mthca_create_cq, .poll_cq = mthca_poll_cq, + .resize_cq = mthca_resize_cq, .destroy_cq = mthca_destroy_cq, .create_srq = mthca_create_srq, .modify_srq = mthca_modify_srq, Index: libmthca/ChangeLog =================================================================== --- libmthca/ChangeLog (revision 5201) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,10 @@ +2006-01-26 Roland Dreier + + * src/mthca.h, src/verbs.c, src/cq.c, src/mthca.c: Add + implementation of resize CQ operation. + + * src/mthca-abi.h: Add mthca-specific resize CQ ABI. + 2006-01-22 Roland Dreier * Release version 1.0-rc5. From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 5/6] [RFC] stupid test program for resize CQ In-Reply-To: <2006130917.eokqtUwcdWUA523N@cisco.com> Message-ID: <2006130917.bM0UCWw29ULQqmKk@cisco.com> Here's a simple test program I wrote that just creates a loopback QP on the first port of the first HCA it finds, and tries to keep 100 RDMA writes queued all the time. In another thread, it resizes the CQ at random intervals. The first thread makes sure that all the completions expected do arrive in the expected order. To build it, just compile and link with "-libverbs". It doesn't take any command line parameters to run. If it's working correctly, the output should just look like an endless stream of Resizing to 500... resized to 511 500000 writes done Resizing to 600... resized to 1023 Resizing to 500... resized to 511 Resizing to 800... resized to 1023 1000000 writes done 1500000 writes done Resizing to 1100... resized to 2047 Resizing to 100... resized to 127 Resizing to 1100... resized to 2047 2000000 writes done If either the "Resizing" or "xxx writes done" lines stop appearing, then something went wrong. 
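The "resized to" values in the sample output above can be predicted from the requested size. The sketch below assumes the provider is libmthca as changed in patch 4/6, where align_cq_size() rounds the request up to the smallest power of two strictly greater than it, and assumes the value that ends up in cq->cqe is that power of two minus one (which matches the sample output). reported_cqe() and check_sample_output() are hypothetical helper names used only for illustration; other providers may round differently.

```c
#include <assert.h>

/* Same loop as align_cq_size() in patch 4/6: smallest power of two > cqe. */
static int align_cq_size(int cqe)
{
	int nent;

	for (nent = 1; nent <= cqe; nent <<= 1)
		; /* nothing */

	return nent;
}

/*
 * Hypothetical helper: the value expected in cq->cqe after asking for
 * 'requested' entries, assuming the kernel reports back nent - 1.
 */
static int reported_cqe(int requested)
{
	return align_cq_size(requested) - 1;
}

/* Hypothetical self-check against the sample output quoted above. */
static void check_sample_output(void)
{
	assert(reported_cqe(500) == 511);
	assert(reported_cqe(600) == 1023);
	assert(reported_cqe(800) == 1023);
	assert(reported_cqe(1100) == 2047);
	assert(reported_cqe(100) == 127);
}
```

Under this rule a request that is already a power of two is bumped to the next one, so asking for 512 entries would report 1023.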
--- /* * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License version * 2 as published by the Free Software Foundation. * * $Id$ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include static int depth = 100; static int page_size; static uint16_t get_local_lid(struct ibv_context *context, int port) { struct ibv_port_attr attr; if (ibv_query_port(context, port, &attr)) return 0; return attr.lid; } static void *resize_task(void *cq_ptr) { struct ibv_cq *cq = cq_ptr; int new_size; while (1) { usleep(drand48() * 1000000); new_size = (lrand48() % 11 + 1) * depth; printf("Resizing to %4d... ", new_size); fflush(stdout); if (ibv_resize_cq(cq, new_size)) { fprintf(stderr, "\nResize failed\n"); exit(1); } printf("resized to %4d\n", cq->cqe); fflush(stdout); } } static int post_write(uint64_t wrid, void *buf, struct ibv_qp *qp, struct ibv_mr *mr) { struct ibv_sge list = { .addr = (uintptr_t) buf, .length = 1, .lkey = mr->lkey }; struct ibv_send_wr wr = { .wr_id = wrid, .sg_list = &list, .num_sge = 1, .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED, .wr.rdma = { .remote_addr = (uintptr_t) buf + 8, .rkey = mr->rkey } }; struct ibv_send_wr *bad_wr; return ibv_post_send(qp, &wr, &bad_wr); } int main(int argc, char *argv[]) { struct ibv_device **dev_list; struct ibv_device *ib_dev; struct ibv_context *context; struct ibv_pd *pd; struct ibv_mr *mr; struct ibv_cq *cq; struct ibv_qp *qp; struct ibv_wc wc; void *buf; uint16_t lid; uint64_t wrid, exp_wrid; int i; pthread_t resize_thread; srand48(getpid() * time(NULL)); page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } buf = memalign(page_size, 
page_size); if (!buf) { fprintf(stderr, "Couldn't allocate work buf.\n"); return 1; } context = ibv_open_device(ib_dev); if (!context) { fprintf(stderr, "Couldn't get context for %s\n", ibv_get_device_name(ib_dev)); return 1; } lid = get_local_lid(context, 1); pd = ibv_alloc_pd(context); if (!pd) { fprintf(stderr, "Couldn't allocate PD\n"); return 1; } mr = ibv_reg_mr(pd, buf, page_size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); if (!mr) { fprintf(stderr, "Couldn't allocate MR\n"); return 1; } cq = ibv_create_cq(context, depth, NULL, NULL, 0); if (!cq) { fprintf(stderr, "Couldn't create CQ\n"); return 1; } { struct ibv_qp_init_attr attr = { .send_cq = cq, .recv_cq = cq, .cap = { .max_send_wr = depth, .max_recv_wr = 0, .max_send_sge = 1, .max_recv_sge = 1 }, .qp_type = IBV_QPT_RC }; qp = ibv_create_qp(pd, &attr); if (!qp) { fprintf(stderr, "Couldn't create QP\n"); return 1; } } { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = 1; attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { fprintf(stderr, "Failed to modify QP to INIT\n"); return 1; } attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_1024; attr.dest_qp_num = qp->qp_num; attr.rq_psn = 1; attr.max_dest_rd_atomic = 4; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = lid; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.ah_attr.port_num = 1; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER)) { fprintf(stderr, "Failed to modify QP to RTR\n"); return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = 1; attr.max_rd_atomic = 4; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { 
fprintf(stderr, "Failed to modify QP to RTS\n"); return 1; } } wrid = exp_wrid = 0; if (pthread_create(&resize_thread, NULL, resize_task, cq)) { fprintf(stderr, "Couldn't start resize_task\n"); return 1; } for (i = 0; i < depth; ++i) { if (post_write(wrid, buf, qp, mr)) { fprintf(stderr, "Couldn't post work request %lld\n", (long long) wrid); return 1; } ++wrid; } while (1) { i = ibv_poll_cq(cq, 1, &wc); if (i < 0) { fprintf(stderr, "poll CQ failed %d\n", i); return 1; } if (i) { if (wc.status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %lld\n", wc.status, (long long) wc.wr_id); return 1; } if (wc.wr_id != exp_wrid) { fprintf(stderr, "wr_id mismatch %lld != %lld\n", (long long) wc.wr_id, (long long) exp_wrid); return 1; } ++exp_wrid; if (!(exp_wrid % 500000)) printf("%12lld writes done\n", (long long) exp_wrid); if (post_write(wrid, buf, qp, mr)) { fprintf(stderr, "Couldn't post work request %lld\n", (long long) wrid); return 1; } ++wrid; } } return 0; } From rolandd at cisco.com Mon Jan 30 09:17:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:17:27 -0800 Subject: [openib-general] [PATCH 6/6] [RFC] stupid test kernel module for resize CQ In-Reply-To: <2006130917.bM0UCWw29ULQqmKk@cisco.com> Message-ID: <2006130917.qz4EoqpDBE2Isand@cisco.com> And here's a simple test kernel module I wrote to test resizing kernel CQs. It is similar to the userspace test program, and should produce endless output like Resizing to 800... resized to 1023 28328000000 writes done when loaded. If either the "Resizing" or "xxx writes done" lines stop appearing, then something went wrong. --- /* * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License version * 2 as published by the Free Software Foundation. 
* * $Id$ */ #include #include #include #include #include #include #include #include MODULE_LICENSE("GPL"); static int depth = 100; static int wcsize = 2; static u32 seed; DEFINE_MUTEX(seed_mutex); struct kcq_ctx { struct ib_device *ib_dev; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; struct ib_qp *qp; void *buf; dma_addr_t dma; u64 wrid; u64 exp_wrid; struct task_struct *poll_task; struct work_struct resize_work; }; static u16 get_local_lid(struct ib_device *device, int port) { struct ib_port_attr attr; if (ib_query_port(device, port, &attr)) return 0; return attr.lid; } static u32 rand(void) { u32 ret; mutex_lock(&seed_mutex); /* 3-shift-register generator with period 2^32-1 */ seed ^= seed << 13; seed ^= seed >> 17; seed ^= seed << 5; ret = seed; mutex_unlock(&seed_mutex); return ret; } static void kcq_resize_work(void *ctx_ptr) { struct kcq_ctx *ctx = ctx_ptr; int new_size = (rand() % 11 + 1) * depth; printk(KERN_ERR "Resizing to %4d... ", new_size); if (ib_resize_cq(ctx->cq, new_size)) printk("failed\n"); else printk("resized to %4d\n", ctx->cq->cqe); schedule_delayed_work(&ctx->resize_work, msecs_to_jiffies(rand() % 1000)); } static int post_write(u64 wrid, dma_addr_t buf, struct ib_qp *qp, struct ib_mr *mr) { struct ib_sge list = { .addr = buf, .length = 1, .lkey = mr->lkey }; struct ib_send_wr wr = { .wr_id = wrid, .sg_list = &list, .num_sge = 1, .opcode = IB_WR_RDMA_WRITE, .send_flags = IB_SEND_SIGNALED, .wr.rdma = { .remote_addr = buf + 8, .rkey = mr->rkey } }; struct ib_send_wr *bad_wr; return ib_post_send(qp, &wr, &bad_wr); } static int kcq_poll_thread(void *ctx_ptr) { struct kcq_ctx *ctx = ctx_ptr; struct ib_wc *wc; int i, n; wc = kmalloc(wcsize * sizeof *wc, GFP_KERNEL); if (!wc) goto stall; for (i = 0; i < depth; ++i) { if (post_write(ctx->wrid, ctx->dma, ctx->qp, ctx->mr)) { printk(KERN_ERR "Couldn't post work request %lld\n", (long long) ctx->wrid); goto stall; } ++ctx->wrid; } schedule_delayed_work(&ctx->resize_work, 
msecs_to_jiffies(rand() % 1000)); while (!kthread_should_stop()) { cond_resched(); n = ib_poll_cq(ctx->cq, wcsize, wc); if (n < 0) { printk(KERN_ERR "poll CQ failed %d\n", n); goto stall; } for (i = 0; i < n; ++i) { if (wc[i].status != IB_WC_SUCCESS) { printk(KERN_ERR "Failed status %d for wr_id %lld\n", wc[i].status, (long long) wc[i].wr_id); goto stall; } if (wc[i].wr_id != ctx->exp_wrid) { printk(KERN_ERR "wr_id mismatch %lld != %lld\n", (long long) wc[i].wr_id, (long long) ctx->exp_wrid); goto stall; } ++ctx->exp_wrid; if (!(ctx->exp_wrid % 500000)) printk(KERN_ERR "%12lld writes done\n", (long long) ctx->exp_wrid); if (post_write(ctx->wrid, ctx->dma, ctx->qp, ctx->mr)) { printk(KERN_ERR "Couldn't post work request %lld\n", (long long) ctx->wrid); goto stall; } ++ctx->wrid; } } return 0; stall: while (!kthread_should_stop()) schedule_timeout_interruptible(1); return -1; } static void kcq_add_one(struct ib_device *device); static void kcq_remove_one(struct ib_device *device); static struct ib_client kcq_client = { .name = "kcq", .add = kcq_add_one, .remove = kcq_remove_one }; static void kcq_add_one(struct ib_device *device) { struct kcq_ctx *ctx; ctx = kzalloc(sizeof *ctx, GFP_KERNEL); if (!ctx) return; INIT_WORK(&ctx->resize_work, kcq_resize_work, ctx); ctx->pd = ib_alloc_pd(device); if (IS_ERR(ctx->pd)) { printk(KERN_ERR "%s: Couldn't allocate PD\n", device->name); goto err; } ctx->buf = dma_alloc_coherent(device->dma_device, PAGE_SIZE, &ctx->dma, GFP_KERNEL); if (!ctx->buf) { printk(KERN_ERR "%s: Couldn't allocate buf\n", device->name); goto err_pd; } ctx->mr = ib_get_dma_mr(ctx->pd, IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE); if (IS_ERR(ctx->mr)) { printk(KERN_ERR "%s: Couldn't allocate MR\n", device->name); goto err_buf; } ctx->cq = ib_create_cq(device, NULL, NULL, NULL, depth); if (IS_ERR(ctx->cq)) { printk(KERN_ERR "%s: Couldn't allocate CQ\n", device->name); goto err_mr; } { struct ib_qp_init_attr attr = { .send_cq = ctx->cq, .recv_cq = ctx->cq, 
.cap = { .max_send_wr = depth, .max_recv_wr = 0, .max_send_sge = 1, .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC }; ctx->qp = ib_create_qp(ctx->pd, &attr); if (!ctx->qp) { printk(KERN_ERR "%s: Couldn't create QP\n", device->name); goto err_cq; } } { struct ib_qp_attr attr; attr.qp_state = IB_QPS_INIT; attr.pkey_index = 0; attr.port_num = 1; attr.qp_access_flags = IB_ACCESS_REMOTE_WRITE; if (ib_modify_qp(ctx->qp, &attr, IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_ACCESS_FLAGS)) { printk(KERN_ERR "%s: Failed to modify QP to INIT\n", device->name); goto err_qp; } attr.qp_state = IB_QPS_RTR; attr.path_mtu = IB_MTU_1024; attr.dest_qp_num = ctx->qp->qp_num; attr.rq_psn = 1; attr.max_dest_rd_atomic = 4; attr.min_rnr_timer = 12; attr.ah_attr.ah_flags = 0; attr.ah_attr.dlid = get_local_lid(device, 1); attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.ah_attr.port_num = 1; if (ib_modify_qp(ctx->qp, &attr, IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER)) { printk(KERN_ERR "%s: Failed to modify QP to RTR\n", device->name); goto err_qp; } attr.qp_state = IB_QPS_RTS; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = 1; attr.max_rd_atomic = 4; if (ib_modify_qp(ctx->qp, &attr, IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT | IB_QP_RNR_RETRY | IB_QP_SQ_PSN | IB_QP_MAX_QP_RD_ATOMIC)) { printk(KERN_ERR "%s: Failed to modify QP to RTS\n", device->name); goto err_qp; } } ctx->poll_task = kthread_create(kcq_poll_thread, ctx, "kcq-%s", device->name); if (IS_ERR(ctx->poll_task)) { printk(KERN_ERR "%s: Failed to start poll thread\n", device->name); goto err_qp; } wake_up_process(ctx->poll_task); ib_set_client_data(device, &kcq_client, ctx); return; err_qp: ib_destroy_qp(ctx->qp); err_cq: ib_destroy_cq(ctx->cq); err_mr: ib_dereg_mr(ctx->mr); err_buf: dma_free_coherent(device->dma_device, PAGE_SIZE, ctx->buf, ctx->dma); err_pd: ib_dealloc_pd(ctx->pd); 
err: kfree(ctx); } static void kcq_remove_one(struct ib_device *device) { struct kcq_ctx *ctx; ctx = ib_get_client_data(device, &kcq_client); if (!ctx) return; cancel_rearming_delayed_work(&ctx->resize_work); kthread_stop(ctx->poll_task); ib_destroy_qp(ctx->qp); ib_destroy_cq(ctx->cq); ib_dereg_mr(ctx->mr); dma_free_coherent(device->dma_device, PAGE_SIZE, ctx->buf, ctx->dma); ib_dealloc_pd(ctx->pd); kfree(ctx); } static int __init kcq_init(void) { get_random_bytes(&seed, sizeof seed); return ib_register_client(&kcq_client); } static void __exit kcq_cleanup(void) { ib_unregister_client(&kcq_client); } module_init(kcq_init); module_exit(kcq_cleanup); From mst at mellanox.co.il Mon Jan 30 09:29:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 19:29:16 +0200 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <2006130917.xhWS7MGYQrADiDEW@cisco.com> References: <2006130917.xhWS7MGYQrADiDEW@cisco.com> Message-ID: <20060130172916.GD31887@mellanox.co.il> Quoting r. 
Roland Dreier : > +static int mthca_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata) > +{ > + struct mthca_dev *dev = to_mdev(ibcq->device); > + struct mthca_cq *cq = to_mcq(ibcq); > + struct mthca_resize_cq ucmd; > + u32 lkey; > + u8 status; > + int ret; > + > + if (entries < 1 || entries > dev->limits.max_cqes) > + return -EINVAL; > + > + entries = roundup_pow_of_two(entries + 1); > + if (entries == ibcq->cqe + 1) > + return 0; > + > + if (cq->is_kernel) { > + ret = mthca_alloc_resize_buf(dev, cq, entries); > + if (ret) > + return ret; > + lkey = cq->resize_buf->buf.mr.ibmr.lkey; > + } else { > + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) > + return -EFAULT; > + lkey = ucmd.lkey; > + } > + > + ret = mthca_RESIZE_CQ(dev, cq->cqn, lkey, long_log2(entries), &status); > + if (status) > + ret = -EINVAL; > + > + if (ret) { > + if (cq->resize_buf) { > + mthca_free_cq_buf(dev, &cq->resize_buf->buf, > + cq->resize_buf->cqe); > + kfree(cq->resize_buf); > + spin_lock_irq(&cq->lock); > + cq->resize_buf = NULL; > + spin_unlock_irq(&cq->lock); > + } > + return ret; > + } > + > + if (cq->is_kernel) { > + struct mthca_cq_buf tbuf; > + int tcqe; > + > + spin_lock_irq(&cq->lock); > + if (cq->resize_buf->state == CQ_RESIZE_READY) { > + mthca_cq_resize_copy_cqes(cq); > + tbuf = cq->buf; > + tcqe = cq->ibcq.cqe; > + cq->buf = cq->resize_buf->buf; > + cq->ibcq.cqe = cq->resize_buf->cqe; > + } else { > + tbuf = cq->resize_buf->buf; > + tcqe = cq->resize_buf->cqe; > + } > + > + kfree(cq->resize_buf); > + cq->resize_buf = NULL; > + spin_unlock_irq(&cq->lock); > + > + mthca_free_cq_buf(dev, &tbuf, tcqe); > + } else > + ibcq->cqe = entries - 1; > + > + return 0; > +} I think I see a problem with this approach: if a ULP performs CQ poll while mthca_RESIZE_CQ is in progress, it might get a false indication that the CQ is empty since CQEs are being written to the new buffer already. As a result e.g. 
a ULP that does: arm while (poll cq) { handle cqe } will not empty the CQ and will deadlock. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Jan 30 09:33:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 09:33:09 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <20060130172916.GD31887@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 30 Jan 2006 19:29:16 +0200") References: <2006130917.xhWS7MGYQrADiDEW@cisco.com> <20060130172916.GD31887@mellanox.co.il> Message-ID: Michael> I think I see a problem with this approach: if a ULP Michael> performs CQ poll while mthca_RESIZE_CQ is in progress, it Michael> might get a false indication that the CQ is empty since Michael> CQEs are being written to the new buffer already. I tried to handle this in the poll CQ operation: int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) /*...*/ /* * If a CQ resize is in progress and we discovered that the * old buffer is empty, then peek in the new buffer, and if * it's not empty, switch to the new buffer and continue * polling there. */ if (unlikely(cq->resize_buf && cq->resize_buf->state == CQ_RESIZE_READY && err == -EAGAIN)) { and so on... I'm not sure I got it right but it looks OK to me. I don't really see any other way to do this, because the RESIZE_CQ command can block, so we can't lock out poll CQ operations while the firmware command is executing. - R. From mst at mellanox.co.il Mon Jan 30 09:58:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 19:58:49 +0200 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: References: Message-ID: <20060130175849.GE31887@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ > > Michael> I think I see a problem with this approach: if a ULP > Michael> performs CQ poll while mthca_RESIZE_CQ is in progress, it > Michael> might get a false indication that the CQ is empty since > Michael> CQEs are being written to the new buffer already. > > I tried to handle this in the poll CQ operation: Right, sorry, I missed this one. > int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, > struct ib_wc *entry) > > /*...*/ > > /* > * If a CQ resize is in progress and we discovered that the > * old buffer is empty, then peek in the new buffer, and if > * it's not empty, switch to the new buffer and continue > * polling there. > */ > if (unlikely(cq->resize_buf && > cq->resize_buf->state == CQ_RESIZE_READY && > err == -EAGAIN)) { Might be a good idea to test err == -EAGAIN first: err is likely a register, and once you get a valid CQE returning it to ULP ASAP has direct latency impact. > and so on... > > I'm not sure I got it right but it looks OK to me. > > I don't really see any other way to do this, because the RESIZE_CQ > command can block, so we can't lock out poll CQ operations while the > firmware command is executing. I agree, neither do I. Did you have a chance to profile this code? Is there any performance impact on IPoIB? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Mon Jan 30 10:02:18 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 30 Jan 2006 10:02:18 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D17C@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > > I think I see a problem with this approach: if a ULP performs > CQ poll while mthca_RESIZE_CQ is in progress, it might get a > false indication that the CQ is empty since CQEs are being > written to the new buffer already. 
>
> As a result e.g. a ULP that does:
>
> arm
> while (poll cq) {
>     handle cqe
> }
>
> will not empty the CQ and will deadlock.

Allowing resizes of an active CQ is very tricky, and has many caveats.
It is also, in my opinion, something that applications should not be
encouraged to do unless CQ size is truly limited by on-chip resources.

When the CQ size is not constrained by on-chip resources, but only by
the amount of user-space memory that can be allocated to the CQ, then
the application should be encouraged to size the CQ to the maximum
desired from the very beginning.

I'm not saying that resize should not be supported, just that the
application developer should also have a device attribute that tells
it when resizing is best avoided AND not truly needed in the first
place.

Atomically transitioning a CQ to an alternate buffer with no potential
for a false empty is simply not possible without putting extra checks
in cq_poll() that have to be paid for on every cq_poll().

What exactly is the device level code *required* or expected to do?
Since different devices have different methods of interacting between
device, verbs and driver it is very important that the *requirements*
and expectations be stated.

Should the verbs be coded to enable CQ resize even if doing so adds a
layer of indirection to the data structures, needs an extra check,
and/or requires an extra lock that would not have been needed on a
fixed size CQ?

As I stated above, when the CQ size is not constrained by on-chip
resources, I believe the correct answer is that cq_poll should be made
as efficient as possible even if that means that cq_resize will not be
supported.

Is that what middleware should assume the devices are doing? Or do we
need middleware to be aware that there are devices where cq resizing
is needed and optimized as well as those where resizing is irrelevant
and not supported?

What we want to discourage is devices where resize is needed but not supported, and devices where resize is not needed but cq_poll is slowed down in order to support resizing. From mshefty at ichips.intel.com Mon Jan 30 10:13:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 30 Jan 2006 10:13:22 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DE1A0B.6030606@voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> Message-ID: <43DE5742.5030601@ichips.intel.com> Or Gerlitz wrote: > I see. The CMA code (cma_resolve_ib_route) just calls ib_get_path_rec > and if the latter fails returns error to its consumer. Two questions > come into my mind here: a) is it the final version or you plan to extend > the CMA to directly query the SA in that case? b) the code of > ib_get_path_rec seems to just search the path in the index and return > -ENODATA if nothing is found, does it also trigger an SA lookup for this > path? The CMA will eventually be extended to query the SA directly if the local look-up fails. The intent is to keep these failures hidden from the end-user. The SA cache does not use a failed lookup as a trigger to update its cache, but that could be added. My initial reaction is that it makes sense to do this. - Sean From mst at mellanox.co.il Mon Jan 30 10:35:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 20:35:23 +0200 Subject: [openib-general] Re: [PATCH 0/4] SA path record caching In-Reply-To: References: Message-ID: <20060130183523.GA30719@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH 0/4] SA path record caching > > The following patch series adds caching of path records with the local system. > I divided the changes up into 4 patches to make the review easier. The patches > are arranged as follows: > > 1. Add a new API to ib_sa.h to pack/unpack SA attributes. > 2. Create a fast indexing service. 
> 3. Create a local SA database. > 4. Modify the CMA to use the local SA database. > > There are some additional optimizations that can be added to these changes, but > I would prefer to add them incrementally to these changes. This includes > additional optimizations to the fast indexing service to help reduce its memory > footprint, registering the local SA database to receive SA events, and failing > over from the local SA database to using SA queries. > > Signed-off-by: Sean Hefty When is the cache invalidated? In the presence of failures in the network, an attempt to create a connection with a cached path might fail. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Jan 30 10:53:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 30 Jan 2006 10:53:19 -0800 Subject: [openib-general] Re: [PATCH 0/4] SA path record caching In-Reply-To: <20060130183523.GA30719@mellanox.co.il> References: <20060130183523.GA30719@mellanox.co.il> Message-ID: <43DE609F.7040100@ichips.intel.com> Michael S. Tsirkin wrote: > When is the cache invalidated? The cache is updated every 15 minutes, but this value is user configurable at module load time. The cache is also updated if a local event occurs, such as port up/down or SM LID change. > In the presence of failures in the network, > an attempt to create a connection with a cached path might fail. The same is also true for paths returned directly from the SA, and detection of the failure could depend on how often the SM sweeps the fabric. At some point, the cache will contain multiple paths, letting the connection be retried along another path. - Sean From jeff.walls at hp.com Mon Jan 30 11:14:07 2006 From: jeff.walls at hp.com (Walls, Jeffrey Joel) Date: Mon, 30 Jan 2006 14:14:07 -0500 Subject: [openib-general] What is proper recovery from ib_poll_cq failure (gen1) ? 
Message-ID:

Hi There --

I'm making progress on my IB transport system, but ran into a bit of a
snag and I can't seem to find an answer in the documentation, sample
apps, or Google.

I'm on Windows running gen1 code and after successfully sending a
number of packets my call to ib_poll_cq is failing with a status of
IB_WCS_RNR_RETRY_ERR.

First, am I correct that this means the receiver's resources are not
ready? If so, what does *that* mean? My receiver (Linux, gen2) has
posted a receive WR and is waiting for a CQ event which never comes.

Second, what is the correct recovery logic for this? I've tried
re-posting the send and re-polling the CQ, but that gives me
IB_WCS_WR_FLUSHED_ERR over and over again.

So it seems to me that I have a problem on my receive side, but I
don't have the foggiest idea what it could be.

Any help is greatly appreciated!!

Jeff

-- 
Jeffrey J. Walls
3404 E. Harmony Road MS-74
Fort Collins, CO 80525
970.898.1619

From sean.hefty at intel.com Mon Jan 30 12:07:50 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 12:07:50 -0800
Subject: [openib-general] [PATCH] extend ib_device node_type to include iWarp
Message-ID:

The following patch extends the node_type associated with an ib_device
to support iWarp without breaking the ABI. Comments on this approach?

Signed-off-by: Sean Hefty --- Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 5098) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -1078,13 +1078,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { s = 0; e = 0; } else { Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 5098) +++ ulp/srp/ib_srp.c (working copy) @@ -1581,13 +1581,16 @@ static void srp_add_one(struct ib_device struct srp_host *host; int s, e, p; + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { s = 0; e = 0; } else { Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5098) +++ include/rdma/ib_verbs.h (working copy) @@ -56,7 +56,17 @@ union ib_gid { } global; }; -enum ib_node_type { +/* + * 8-bit node type - for IB this maps to NodeInfo:NodeType. 
+ */ +enum { + /* bits 7:4 - transport type */ + IB_NODE_TRANSPORT_MASK = 0xF0, + IB_NODE_IB = (0<<4), + IB_NODE_IWARP = (1<<4), + + /* bits 3:0 - device type */ + IB_NODE_DEVICE_MASK = 0x0F, IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 5098) +++ include/rdma/ib_addr.h (working copy) @@ -42,7 +42,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + u8 dev_type; }; /** Index: core/cm.c =================================================================== --- core/cm.c (revision 5098) +++ core/cm.c (working copy) @@ -3245,6 +3245,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) Index: core/addr.c =================================================================== --- core/addr.c (revision 5098) +++ core/addr.c (working copy) @@ -63,7 +63,7 @@ static int copy_addr(struct rdma_dev_add { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = IB_NODE_IB | IB_NODE_CA; break; default: return -EADDRNOTAVAIL; Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 5194) +++ core/local_sa.c (working copy) @@ -362,6 +362,9 @@ static void sa_db_add_one(struct ib_devi struct sa_db_port *port; int i; + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, GFP_KERNEL); if (!dev) Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 5194) +++ core/sa_query.c (working copy) @@ -912,7 +912,10 @@ 
static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) s = e = 0; else { s = 1; Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 5098) +++ core/user_mad.c (working copy) @@ -936,7 +936,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) s = e = 0; else { s = 1; Index: core/cma.c =================================================================== --- core/cma.c (revision 5194) +++ core/cma.c (working copy) @@ -244,8 +244,9 @@ static int cma_acquire_ib_dev(struct rdm static int cma_acquire_dev(struct rdma_id_private *id_priv) { - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: + switch (id_priv->id.route.addr.dev_addr.dev_type & + IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: return cma_acquire_ib_dev(id_priv); default: return -ENODEV; @@ -324,8 +325,8 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = cma_init_ib_qp(id_priv, qp); break; default: @@ -413,8 +414,8 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (id_priv->id.device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) @@ -540,8 +541,8 @@ static int cma_notify_user(struct rdma_i static void 
cma_cancel_addr(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (id_priv->id.device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); break; default: @@ -560,8 +561,8 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (id_priv->id.device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -620,8 +621,8 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -782,7 +783,7 @@ static struct rdma_id_private* cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = IB_NODE_IB | IB_NODE_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -984,8 +985,8 @@ int rdma_listen(struct rdma_cm_id *id, i return -EINVAL; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = cma_ib_listen(id_priv); break; default: @@ -1077,8 +1078,8 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); 
break; default: @@ -1372,8 +1373,8 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = cma_connect_ib(id_priv, conn_param); break; default: @@ -1431,8 +1432,8 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else @@ -1464,8 +1465,8 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); @@ -1491,8 +1492,8 @@ int rdma_disconnect(struct rdma_cm_id *i if (ret) goto out; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: /* Initiate or respond to a disconnect. 
*/ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); Index: core/mad.c =================================================================== --- core/mad.c (revision 5098) +++ core/mad.c (working copy) @@ -2661,7 +2661,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { start = 0; end = 0; } else { Index: core/sysfs.c =================================================================== --- core/sysfs.c (revision 5098) +++ core/sysfs.c (working copy) @@ -590,7 +590,7 @@ static ssize_t show_node_type(struct cla if (!ibdev_is_alive(dev)) return -ENODEV; - switch (dev->node_type) { + switch (dev->node_type & IB_NODE_DEVICE_MASK) { case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); @@ -687,7 +687,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; Index: core/ucm.c =================================================================== --- core/ucm.c (revision 5117) +++ core/ucm.c (working copy) @@ -1255,7 +1255,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + (device->node_type & IB_NODE_IB) != IB_NODE_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); Index: core/ucma.c =================================================================== --- core/ucma.c (revision 5098) +++ core/ucma.c (working copy) @@ -479,8 +479,8 @@ static ssize_t ucma_query_route(struct u sizeof(struct sockaddr_in6)); resp.node_guid = 
ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (ctx->cm_id->device->node_type) { - case IB_NODE_CA: + switch (ctx->cm_id->device->node_type & IB_NODE_TRANSPORT_MASK) { + case IB_NODE_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); default: break; Index: core/ping.c =================================================================== --- core/ping.c (revision 5098) +++ core/ping.c (working copy) @@ -247,7 +247,10 @@ static void ib_ping_init_device(struct i { int num_ports, cur_port, i; - if (device->node_type == IB_NODE_SWITCH) { + if ((device->node_type & IB_NODE_IB) != IB_NODE_IB) + return; + + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { From rdreier at cisco.com Mon Jan 30 13:27:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 13:27:00 -0800 Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp In-Reply-To: (Sean Hefty's message of "Mon, 30 Jan 2006 12:07:50 -0800") References: Message-ID: Seems OK. The changes like this in ipoib, srp, sa_query, etc: > - if (device->node_type == IB_NODE_SWITCH) { > + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) { seem rather silly. We've already checked that the device is an IB device, so the change is pure obfuscation. Also, why do > - enum ib_node_type dev_type; > + u8 dev_type; it seems better to leave the field as an enum for better documentation. - R. From mst at mellanox.co.il Mon Jan 30 13:30:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jan 2006 23:30:31 +0200 Subject: [openib-general] Re: Re: [PATCH 0/4] SA path record caching In-Reply-To: <43DE609F.7040100@ichips.intel.com> References: <43DE609F.7040100@ichips.intel.com> Message-ID: <20060130213031.GA31076@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: [PATCH 0/4] SA path record caching > > Michael S. Tsirkin wrote: > > When is the cache invalidated? 
>
> The cache is updated every 15 minutes, but this value is user configurable at
> module load time. The cache is also updated if a local event occurs, such as
> port up/down or SM LID change.

Unfortunately port down on any link in the path has the same effect
but won't invalidate the cache. One way to solve this would be to
invalidate the cache, and retry, if an attempt to connect to the
remote node fails.

> > In the presence of failures in the network,
> > an attempt to create a connection with a cached path might fail.
>
> The same is also true for paths returned directly from the SA, and detection of
> the failure could depend on how often the SM sweeps the fabric.

SA gets trap notices on link failures, doesn't it? So, unlike with the
local cache, we don't depend on sweeps.

> At some point,
> the cache will contain multiple paths, letting the connection be retried along
> another path.
>
> - Sean
>

Is this possible in the current implementation?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mshefty at ichips.intel.com Mon Jan 30 13:37:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 13:37:39 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To:
References:
Message-ID: <43DE8723.60002@ichips.intel.com>

Roland Dreier wrote:
> The changes like this in ipoib, srp, sa_query, etc:
>
> > - if (device->node_type == IB_NODE_SWITCH) {
> > + if ((device->node_type & IB_NODE_SWITCH) == IB_NODE_SWITCH) {
>
> seem rather silly. We've already checked that the device is an IB
> device, so the change is pure obfuscation.

As long as IB_NODE_IB remains 0, we can eliminate this change. Or we
could go with something like:

	if (device->node_type == IB_NODE_IB | IB_NODE_SWITCH)

> Also, why do
>
> > - enum ib_node_type dev_type;
> > + u8 dev_type;
>
> it seems better to leave the field as an enum for better
> documentation.

The value being returned is no longer a member of the enum, but rather
a value such as:

	IB_NODE_IB | IB_NODE_CA or IB_NODE_IWARP | IB_NODE_CA.

I have no strong preferences with either of the changes that you
mentioned. Is there a precedent that we can follow?

- Sean

From rdreier at cisco.com Mon Jan 30 13:46:00 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 13:46:00 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To: <43DE8723.60002@ichips.intel.com> (Sean Hefty's message of "Mon, 30 Jan 2006 13:37:39 -0800")
References: <43DE8723.60002@ichips.intel.com>
Message-ID:

    Sean> As long as IB_NODE_IB remains 0, we can eliminate this
    Sean> change. Or we could go with something like:

    Sean> if (device->node_type == IB_NODE_IB | IB_NODE_SWITCH)

That's still silly. I wasn't very clear in my original email. But my
point was that if a function does

	if ((device->node_type & IB_NODE_IB) != IB_NODE_IB)
		return;

then we should just write tests like

	if (device->node_type == IB_NODE_SWITCH) {

a few lines later. We already know the node type is IB.

    Sean> The value being returned is no longer a member of the enum,
    Sean> but rather a value such as:

    Sean> IB_NODE_IB | IB_NODE_CA or IB_NODE_IWARP | IB_NODE_CA.

The C standard guarantees that an enum type is an integer type that
can hold any of the members |ed together.

 - R.

From mst at mellanox.co.il Mon Jan 30 13:54:17 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jan 2006 23:54:17 +0200
Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ
In-Reply-To:
References:
Message-ID: <20060130215417.GB31076@mellanox.co.il>

Quoting r. Roland Dreier :
> I'm not sure I got it right but it looks OK to me.
>
> I don't really see any other way to do this, because the RESIZE_CQ
> command can block, so we can't lock out poll CQ operations while the
> firmware command is executing.

I thought about this some more.
It's unfortunate that there's an overhead on each poll cq operation.

We could avoid this overhead by:
- adding poll_cq callback and lock into the struct ib_cq,
  initializing them on cq creation.
- modifying ib_poll_cq to

static inline int ib_poll_cq(struct ib_cq *cq, int num_entries,
                             struct ib_wc *wc)
{
        unsigned long flags;
        int ret;
        spin_lock_irqsave(cq->lock, flags);
        ret = cq->poll_cq(cq, num_entries, wc);
        spin_unlock_irqrestore(cq->lock, flags);
        return ret;
}

Now, the device driver can modify the poll_cq callback in the cq
to a slower version that checks two buffers for the duration of the
resize operation.

How does this sound?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From iod00d at hp.com Mon Jan 30 13:55:47 2006
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 30 Jan 2006 13:55:47 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To: <43DE8723.60002@ichips.intel.com>
References: <43DE8723.60002@ichips.intel.com>
Message-ID: <20060130215547.GE5720@esmail.cup.hp.com>

On Mon, Jan 30, 2006 at 01:37:39PM -0800, Sean Hefty wrote:
> As long as IB_NODE_IB remains 0, we can eliminate this change.
> Or we could go with something like:
>
>         if (device->node_type == (IB_NODE_IB | IB_NODE_SWITCH))

If node_type were a bitmask, I'd expect
        if (node_type & (IB_NODE_IB | IB_NODE_SWITCH))

> >it seems better to leave the field as an enum for better
> >documentation.
>
> The value being returned is no longer a member of the enum, but rather a
> value such as:
>
> IB_NODE_IB | IB_NODE_CA or IB_NODE_IWARP | IB_NODE_CA.

Ok. so it is a bitmask.

grant

From caitlinb at broadcom.com Mon Jan 30 14:01:39 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Mon, 30 Jan 2006 14:01:39 -0800
Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ
Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D1ED@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> Quoting r.
Roland Dreier :
>> I'm not sure I got it right but it looks OK to me.
>>
>> I don't really see any other way to do this, because the RESIZE_CQ
>> command can block, so we can't lock out poll CQ operations while the
>> firmware command is executing.
>
> I thought about this some more. It's unfortunate that there's
> an overhead on each poll cq operation.
>
> We could avoid this overhead by:
> - adding poll_cq callback and lock into the struct ib_cq,
>   initializing them on cq creation.
> - modifying ib_poll_cq to
>
> static inline int ib_poll_cq(struct ib_cq *cq, int num_entries,
>                              struct ib_wc *wc) {
>         unsigned long flags;
>         int ret;
>         spin_lock_irqsave(cq->lock, flags);
>         ret = cq->poll_cq(cq, num_entries, wc);
>         spin_unlock_irqrestore(cq->lock, flags);
>         return ret;
> }
>
> Now, device driver can modify the poll_cq callback in the cq
> to a slower version that checks two buffers for the duration of the
> resize operation.
>
> How does this sound?

You can get the overhead even lower by just switching to a new
cq ring when you hit the end of the current one.

But even that sounds far too expensive for any device that does
not provide a true benefit for resizing the CQ.

For those devices the best support is provided when the consumer
allocates the maximum, any logical resizing to a smaller size
is ignored, and there is *no* extra checking (let alone a spinlock)
for each cq poll.

From rdreier at cisco.com Mon Jan 30 14:10:36 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 14:10:36 -0800
Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ
In-Reply-To: <20060130175849.GE31887@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 30 Jan 2006 19:58:49 +0200")
References: <20060130175849.GE31887@mellanox.co.il>
Message-ID:

    Michael> Might be a good idea to test err == -EAGAIN first: err is
    Michael> likely a register, and once you get a valid CQE returning
    Michael> it to ULP ASAP has direct latency impact.

Makes sense.
The x86_64 asm looks slightly better at least, so I
put this change in my tree.

    Michael> I agree, neither do I. Did you have a chance to profile
    Michael> this code? Is there any performance impact on IPoIB?

I didn't profile but I didn't measure any performance delta on IPoIB.

 - R.

From rdreier at cisco.com Mon Jan 30 14:12:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 14:12:38 -0800
Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ
In-Reply-To: <20060130215417.GB31076@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 30 Jan 2006 23:54:17 +0200")
References: <20060130215417.GB31076@mellanox.co.il>
Message-ID:

 > static inline int ib_poll_cq(struct ib_cq *cq, int num_entries,
 >                              struct ib_wc *wc)
 > {
 >         unsigned long flags;
 >         int ret;
 >         spin_lock_irqsave(cq->lock, flags);
 >         ret = cq->poll_cq(cq, num_entries, wc);
 >         spin_unlock_irqrestore(cq->lock, flags);
 >         return ret;
 > }

Definitely an interesting idea, although it does penalize a
hypothetical device that can do lock-free CQ polling somehow.

My bias has been to leave locking to the low-level driver, but if you
can show an improvement on real hardware with this idea then I would
be inclined to use this approach.

 - R.

From mshefty at ichips.intel.com Mon Jan 30 14:13:47 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 14:13:47 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To:
References: <43DE8723.60002@ichips.intel.com>
Message-ID: <43DE8F9B.6070008@ichips.intel.com>

Roland Dreier wrote:
> That's still silly. I wasn't very clear in my original email. But my
> point was that if a function does
>
>         if ((device->node_type & IB_NODE_IB) != IB_NODE_IB)
>                 return;
>
> then we should just write tests like
>
>         if (device->node_type == IB_NODE_SWITCH) {

I understand.
This sort of test works since IB_NODE_IB = 0, but wouldn't work for
IB_NODE_IWARP, where node_type = IB_NODE_IWARP | IB_NODE_CA.

- Sean

From mshefty at ichips.intel.com Mon Jan 30 14:16:29 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 14:16:29 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To: <20060130215547.GE5720@esmail.cup.hp.com>
References: <43DE8723.60002@ichips.intel.com> <20060130215547.GE5720@esmail.cup.hp.com>
Message-ID: <43DE903D.2080106@ichips.intel.com>

Grant Grundler wrote:
> If node_type were a bitmask, I'd expect
>         if (node_type & (IB_NODE_IB | IB_NODE_SWITCH))

This would be true if node_type = IB_NODE_IB | IB_NODE_CA. Okay, if the check
for IB_NODE_IB is there, I think that all that's needed is:

        if (node_type & IB_NODE_SWITCH)

- Sean

From rdreier at cisco.com Mon Jan 30 14:16:46 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 14:16:46 -0800
Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122D1ED@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Mon, 30 Jan 2006 14:01:39 -0800")
References: <54AD0F12E08D1541B826BE97C98F99F122D1ED@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID:

    Caitlin> For those devices the best support is provided when the
    Caitlin> consumer allocates the maximum, any logical resizing to a
    Caitlin> smaller size is ignored, and there is *no* extra checking
    Caitlin> (let alone a spinlock) for each cq poll.

There is always a spinlock for a CQ poll with every driver that I've
seen (mthca, ehca, ipath and amso1100). But as I said I've been
reluctant to put locking in the midlayer, because it penalizes
super-smart hardware that might allow lockless operation.

Allocating the maximum CQ size all the time wastes a _lot_ of memory.
For example, on Mellanox HCAs, the CQ entries are 32 bytes, and CQs
can have up to 64K entries.
So a maximum size CQ takes 2 MB of memory, which
is a big waste when a consumer might only need a 4 KB CQ.

 - R.

From mshefty at ichips.intel.com Mon Jan 30 14:33:17 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 14:33:17 -0800
Subject: [openib-general] Re: [PATCH 0/4] SA path record caching
In-Reply-To: <20060130213031.GA31076@mellanox.co.il>
References: <43DE609F.7040100@ichips.intel.com> <20060130213031.GA31076@mellanox.co.il>
Message-ID: <43DE942D.1090304@ichips.intel.com>

Michael S. Tsirkin wrote:
>>The cache is updated every 15 minutes, but this value is user configurable at
>>module load time. The cache is also updated if a local event occurs, such as
>>port up/down or SM LID change.
>
> Unfortunately port down on any link in the path has the same effect
> but won't invalidate the cache.

Yes, but it was more difficult for me to detect remote events. At some point,
the cache needs to register with the SA for events, but that only works as long
as the SA is reachable.

> One way to solve this would be to invalidate the cache, and retry,
> if an attempt to connect to the remote node fails.

I didn't want to invalidate the cache too quickly. If the SA goes down, or the
link to it drops, then the cache can still be used to establish connections with
those nodes that are reachable. Do we expect most path failures to be permanent
or transient?

> SA gets trap notices on link failures, doesn't it?
> So, unlike with the local cache, we don't depend on sweeps.

Traps are optional.

>> At some point,
>>the cache will contain multiple paths, letting the connection be retried along
>>another path.
>
> Is this possible in the current implementation?

Currently, only a single path record is maintained to each remote node.
Supporting multiple paths would be possible with some additional work. (MPI
requires multiple paths for its routing algorithms.)
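
The refresh policy described in this message can be sketched roughly as below. This is only a user-space illustration and not the code from the posted patches: every name here (path_cache, cache_needs_refresh, the event enum) is hypothetical. Only the 15 minute default, the load-time tunable and the list of local triggers come from the description above; a failed connect is deliberately not treated as a trigger, matching the reluctance to invalidate the cache too quickly.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical names throughout; a sketch of the policy, not the patch. */
#define CACHE_REFRESH_SECS (15 * 60)    /* default interval from the text */

enum local_event {
    EV_NONE,
    EV_PORT_UP,
    EV_PORT_DOWN,
    EV_SM_LID_CHANGE,
    EV_CONNECT_FAILED   /* remote failure: not a refresh trigger here */
};

struct path_cache {
    time_t last_update;         /* when the cache was last refreshed */
    unsigned int refresh_secs;  /* would be set at module load time */
};

/* Return true if the cache should be refreshed from the SA. */
static inline bool cache_needs_refresh(const struct path_cache *c,
                                       time_t now, enum local_event ev)
{
    switch (ev) {
    case EV_PORT_UP:
    case EV_PORT_DOWN:
    case EV_SM_LID_CHANGE:
        return true;    /* local events force an immediate update */
    default:
        break;
    }
    /* Otherwise refresh only once the interval has elapsed. */
    return (now - c->last_update) >= (time_t)c->refresh_secs;
}
```

With this shape, retrying along an alternate path after a failed connect would need extra state (more than one cached record per destination), which the sketch does not attempt to model.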

- Sean

From rdreier at cisco.com Mon Jan 30 14:34:37 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 14:34:37 -0800
Subject: [openib-general] Re: [PATCH] libmthca: overflow test typo
In-Reply-To: <20060130165121.GC31887@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 30 Jan 2006 18:51:21 +0200")
References: <20060130165121.GC31887@mellanox.co.il>
Message-ID:

Thanks, applied.

From rdreier at cisco.com Mon Jan 30 14:36:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jan 2006 14:36:59 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To: <43DE8F9B.6070008@ichips.intel.com> (Sean Hefty's message of "Mon, 30 Jan 2006 14:13:47 -0800")
References: <43DE8723.60002@ichips.intel.com> <43DE8F9B.6070008@ichips.intel.com>
Message-ID:

    Sean> I understand. This sort of test works since IB_NODE_IB = 0,
    Sean> but wouldn't work for IB_NODE_IWARP, where node_type =
    Sean> IB_NODE_IWARP | IB_NODE_CA.

Oh, I see. I was thinking that we would have something like

        IB_NODE_RNIC = IB_NODE_IWARP | whatever

in the enum (where whatever could be IB_NODE_CA).

 - R.

From mshefty at ichips.intel.com Mon Jan 30 14:41:17 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 30 Jan 2006 14:41:17 -0800
Subject: [openib-general] Re: [PATCH] extend ib_device node_type to include iWarp
In-Reply-To: <43DE903D.2080106@ichips.intel.com>
References: <43DE8723.60002@ichips.intel.com> <20060130215547.GE5720@esmail.cup.hp.com> <43DE903D.2080106@ichips.intel.com>
Message-ID: <43DE960D.6030306@ichips.intel.com>

Sean Hefty wrote:
> This would be true if node_type = IB_NODE_IB | IB_NODE_CA. Okay, if the
> check for IB_NODE_IB is there, I think that all that's needed is:
>
>         if (node_type & IB_NODE_SWITCH)

Never mind - this doesn't work in a generic case, since IB_NODE_CA=1 and
IB_NODE_ROUTER = 3.
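
To make the failure mode concrete, here is a small user-space sketch. The device values (CA = 1, switch = 2, router = 3) and IB_NODE_IB = 0 are the ones quoted in this thread; placing the transport in the upper four bits, the value chosen for IB_NODE_IWARP, the two mask macros and all three helper names are assumptions made for illustration only. It shows why a plain bit test misclassifies a router as a switch, and how masking out the device field and comparing for equality avoids that.

```c
#include <stdbool.h>

/*
 * Device values as quoted in the thread. The transport placement
 * (bits 7:4) and the IWARP value are assumptions for this sketch.
 */
enum node_type {
    IB_NODE_IB     = 0 << 4,
    IB_NODE_IWARP  = 1 << 4,    /* assumed value */
    IB_NODE_CA     = 1,
    IB_NODE_SWITCH = 2,
    IB_NODE_ROUTER = 3
};

#define NODE_TRANSPORT_MASK 0xF0    /* assumed layout */
#define NODE_DEVICE_MASK    0x0F    /* assumed layout */

/* Naive bit test: wrong, because device types are enumerated, not flags. */
static inline bool is_switch_bit_test(int node_type)
{
    return (node_type & IB_NODE_SWITCH) != 0;
}

/* Mask out the device field first, then compare for equality. */
static inline bool is_switch_masked(int node_type)
{
    return (node_type & NODE_DEVICE_MASK) == IB_NODE_SWITCH;
}

/* Same idea for the transport field. */
static inline bool is_ib(int node_type)
{
    return (node_type & NODE_TRANSPORT_MASK) == IB_NODE_IB;
}
```

Since router is 3 (binary 11), it shares a set bit with switch (binary 10), so the bit test reports a router as a switch, while the masked comparison does not and keeps working when a non-zero transport value is OR-ed in.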
From rdreier at cisco.com Mon Jan 30 14:43:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 14:43:19 -0800 Subject: [openib-general] Re: [PATCH] uar size != 8M In-Reply-To: <20060123163415.GC26147@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 18:34:15 +0200") References: <20060123163415.GC26147@mellanox.co.il> Message-ID: Thanks, applied. From sean.hefty at intel.com Mon Jan 30 15:42:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 30 Jan 2006 15:42:22 -0800 Subject: [openib-general] [PATCH] extend ib_device node_type to include iWarp In-Reply-To: Message-ID: Here's an updated version that actually does the device checks correctly, and hopefully clarifies how node_type is formatted. I include two helper functions to extract the transport and device information from the node_type. I tried to include related changes to all files with this patch, so a few more files are affected than with the previous patch. Signed-off-by: Sean Hefty --- Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 5098) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -1078,13 +1078,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { s = 0; e = 0; } else { Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 5098) +++ ulp/srp/ib_srp.c (working copy) @@ -1581,13 +1581,16 @@ static void srp_add_one(struct ib_device struct srp_host *host; int s, e, p; + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if 
(!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { s = 0; e = 0; } else { Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5098) +++ include/rdma/ib_verbs.h (working copy) @@ -56,12 +56,32 @@ union ib_gid { } global; }; +/* + * 8-bit node type - for IB this maps to NodeInfo:NodeType. + */ enum ib_node_type { + /* bits 7:4 - transport type */ + IB_NODE_TRANSPORT_MASK = 0xF0, + IB_NODE_IB = (0<<4), + IB_NODE_IWARP = (1<<4), + + /* bits 3:0 - device type */ + IB_NODE_DEVICE_MASK = 0x0F, IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER }; +static inline int ib_node_get_transport(enum ib_node_type type) +{ + return type & IB_NODE_TRANSPORT_MASK; +} + +static inline int ib_node_get_device(enum ib_node_type type) +{ + return type & IB_NODE_DEVICE_MASK; +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), Index: core/cm.c =================================================================== --- core/cm.c (revision 5098) +++ core/cm.c (working copy) @@ -3245,6 +3245,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) Index: core/addr.c =================================================================== --- core/addr.c (revision 5098) +++ core/addr.c (working copy) @@ -63,7 +63,7 @@ static int copy_addr(struct rdma_dev_add { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = IB_NODE_IB | IB_NODE_CA; break; default: return -EADDRNOTAVAIL; Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 5194) +++ core/local_sa.c (working copy) @@ -362,6 +362,9 @@ 
static void sa_db_add_one(struct ib_devi struct sa_db_port *port; int i; + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, GFP_KERNEL); if (!dev) Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 5194) +++ core/sa_query.c (working copy) @@ -912,7 +912,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) s = e = 0; else { s = 1; Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 5098) +++ core/user_mad.c (working copy) @@ -936,7 +936,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) s = e = 0; else { s = 1; Index: core/device.c =================================================================== --- core/device.c (revision 5098) +++ core/device.c (working copy) @@ -514,7 +514,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -589,7 +589,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) Index: core/cma.c 
=================================================================== --- core/cma.c (revision 5194) +++ core/cma.c (working copy) @@ -244,8 +244,10 @@ static int cma_acquire_ib_dev(struct rdm static int cma_acquire_dev(struct rdma_id_private *id_priv) { - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: + enum ib_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; + + switch (ib_node_get_transport(dev_type)) { + case IB_NODE_IB: return cma_acquire_ib_dev(id_priv); default: return -ENODEV; @@ -324,8 +326,8 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: ret = cma_init_ib_qp(id_priv, qp); break; default: @@ -413,8 +415,8 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id_priv->id.device->node_type)) { + case IB_NODE_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) @@ -540,8 +542,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_addr(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id_priv->id.device->node_type)) { + case IB_NODE_IB: rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); break; default: @@ -560,8 +562,8 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id_priv->id.device->node_type)) { + case IB_NODE_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -620,8 +622,8 @@ void rdma_destroy_id(struct rdma_cm_id * 
cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -782,7 +784,7 @@ static struct rdma_id_private* cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = IB_NODE_IB | IB_NODE_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -984,8 +986,8 @@ int rdma_listen(struct rdma_cm_id *id, i return -EINVAL; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: ret = cma_ib_listen(id_priv); break; default: @@ -1077,8 +1079,8 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; default: @@ -1372,8 +1374,8 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: ret = cma_connect_ib(id_priv, conn_param); break; default: @@ -1431,8 +1433,8 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else @@ -1464,8 +1466,8 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return 
-EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); @@ -1491,8 +1493,8 @@ int rdma_disconnect(struct rdma_cm_id *i if (ret) goto out; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(id->device->node_type)) { + case IB_NODE_IB: /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); Index: core/mad.c =================================================================== --- core/mad.c (revision 5098) +++ core/mad.c (working copy) @@ -2661,7 +2661,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { start = 0; end = 0; } else { @@ -2708,7 +2711,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { Index: core/cache.c =================================================================== --- core/cache.c (revision 5098) +++ core/cache.c (working copy) @@ -61,12 +61,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) ? 0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) ? 
+ 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, Index: core/sysfs.c =================================================================== --- core/sysfs.c (revision 5098) +++ core/sysfs.c (working copy) @@ -590,7 +590,7 @@ static ssize_t show_node_type(struct cla if (!ibdev_is_alive(dev)) return -ENODEV; - switch (dev->node_type) { + switch (ib_node_get_device(dev->node_type)) { case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); @@ -687,7 +687,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; Index: core/ucm.c =================================================================== --- core/ucm.c (revision 5117) +++ core/ucm.c (working copy) @@ -1255,7 +1255,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + ib_node_get_transport(device->node_type) != IB_NODE_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); Index: core/ucma.c =================================================================== --- core/ucma.c (revision 5098) +++ core/ucma.c (working copy) @@ -479,8 +479,8 @@ static ssize_t ucma_query_route(struct u sizeof(struct sockaddr_in6)); resp.node_guid = ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (ctx->cm_id->device->node_type) { - case IB_NODE_CA: + switch (ib_node_get_transport(ctx->cm_id->device->node_type)) { + case IB_NODE_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); default: break; Index: core/smi.c =================================================================== --- core/smi.c (revision 5098) +++ core/smi.c (working copy) @@ -64,7 +64,7 @@ int 
smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (ib_node_get_device(node_type) != IB_NODE_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (ib_node_get_device(node_type) == IB_NODE_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (ib_node_get_device(node_type) != IB_NODE_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (ib_node_get_device(node_type) == IB_NODE_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (ib_node_get_device(node_type) != IB_NODE_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (ib_node_get_device(node_type) == IB_NODE_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (ib_node_get_device(node_type) != IB_NODE_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); + 
return (ib_node_get_device(node_type) == IB_NODE_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ Index: core/ping.c =================================================================== --- core/ping.c (revision 5098) +++ core/ping.c (working copy) @@ -247,7 +247,10 @@ static void ib_ping_init_device(struct i { int num_ports, cur_port, i; - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_transport(device->node_type) != IB_NODE_IB) + return; + + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { @@ -278,7 +281,7 @@ static void ib_ping_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (ib_node_get_device(device->node_type) == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; } else { Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 5098) +++ hw/mthca/mthca_provider.c (working copy) @@ -1122,7 +1122,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = IB_NODE_IB | IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; From ardavis at ichips.intel.com Mon Jan 30 15:44:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 30 Jan 2006 15:44:53 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal In-Reply-To: References: Message-ID: <43DEA4F5.3060706@ichips.intel.com> Kanevsky, Arkady wrote: >Arlin, >I am not convinced we need a new recv for immediate data. >But what is needed is change in normative text in many places. >Recv, RDMA Write, DTO completion events, error behavior. 
>Sure you can define immed data in extension but it still effects >behavior of the normative part of the spec. > > How does it effect the normative part of the spec outside of the DTO event extension? The post_recv behaves exactly the same. >This is why my preference is to put it into the main spec. > > ok, with no new recv_immed call we do get a little closer. >The xfer_size is minor thing. We just need to define it meaning >with respect to immed_data. Defining it either way is fine. > >Handling extra space on CQ can be handled by Provider. >We can add a new EVD attribute for the use for handling RDMA_write with >immed >data and Provider can automatically add extra space on CQ. >Provider is already responsible to handing user a single completion. >SO it will only be used for error handling. > > sounds good. >Error handling takes maost of the new write up anyhow. >Regardless where it is done in the spec or in extension. > >Question on do we want to support Send with immed_data have to be >decided. >Ditto remote RMR invalidation with new post(s) for immed_data. >Just because IB supports all possible correlation under one Send post >does not mean that uDAPL should follow that too. > > I would agree, strike them all except rdma_write_immed. Can you give some idea how you would write up the normative text for the transport independent receive that would accept immediate data? thanks, -arlin From caitlinb at broadcom.com Mon Jan 30 16:16:11 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 30 Jan 2006 16:16:11 -0800 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D228@NT-SJCA-0751.brcm.ad.broadcom.com> Arlin Davis wrote: > Kanevsky, Arkady wrote: > >> Arlin, >> I am not convinced we need a new recv for immediate data. >> But what is needed is change in normative text in many places. >> Recv, RDMA Write, DTO completion events, error behavior. 
>> Sure you can define immed data in extension but it still affects
>> behavior of the normative part of the spec.
>>
>>
> How does it affect the normative part of the spec outside of the DTO
> event extension? The post_recv behaves exactly the same.
>
>> This is why my preference is to put it into the main spec.
>>
>>
> ok, with no new recv_immed call we do get a little closer.
>
>> The xfer_size is minor thing. We just need to define its meaning with
>> respect to immed_data. Defining it either way is fine.
>>
>> Handling extra space on CQ can be handled by Provider.
>> We can add a new EVD attribute for the use for handling RDMA_write
>> with immed data and Provider can automatically add extra space on CQ.
>> Provider is already responsible to handing user a single completion.
>> So it will only be used for error handling.
>>
>>
> sounds good.
>
>> Error handling takes most of the new write up anyhow.
>> Regardless where it is done in the spec or in extension.
>>
>> Question on do we want to support Send with immed_data have to be
>> decided. Ditto remote RMR invalidation with new post(s) for
>> immed_data.
>> Just because IB supports all possible correlation under one Send post
>> does not mean that uDAPL should follow that too.
>>
>>
> I would agree, strike them all except rdma_write_immed.
>
> Can you give some idea how you would write up the normative
> text for the transport independent receive that would accept
> immediate data?
>
> thanks,
>
> -arlin

The data source: posts an rdma write with immediate DTO, supplying
the RDMA Write data source and an immediate value. This is translated
into one work request (if the device supports write with immediate),
or into an RDMA Write followed by an RDMA Send (if it does not). While
successful completion of the RDMA Write will be suppressed, the Consumer
must still allow for the extra space on the SendQ and the CQ. An IA
attribute will document how many work requests a write_with_immediate
will translate into.
The data sink: post a recv (to EP or SRQ) with a four byte buffer. When it reaps the completion it needs to be ready to see the data either in an "immediate" field in the work completion, or in the buffer originally specified in the recv DTO. A Provider MAY indicate that it supports immediate receives, but on iWARP or any transport where this is not the default optimized receive processing MUST be enabled by the user. Otherwise, RFC compliance would require that a four byte untagged message matched to a zero byte buffer was an error. Essentially the user is posting a receive operation that names the four bytes in the Work completion as the buffer. From rdreier at cisco.com Mon Jan 30 16:18:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 16:18:33 -0800 Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals In-Reply-To: <20060123155436.GB26147@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 23 Jan 2006 17:54:36 +0200") References: <20060123155436.GB26147@mellanox.co.il> Message-ID: Thanks, applied along with the mthca_mcg change: --- infiniband/hw/mthca/mthca_cmd.c (revision 5216) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -199,8 +199,7 @@ static int mthca_cmd_post(struct mthca_d { int err = 0; - if (down_interruptible(&dev->cmd.hcr_sem)) - return -EINTR; + down(&dev->cmd.hcr_sem); if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -255,8 +254,7 @@ static int mthca_cmd_poll(struct mthca_d int err = 0; unsigned long end; - if (down_interruptible(&dev->cmd.poll_sem)) - return -EINTR; + down(&dev->cmd.poll_sem); err = mthca_cmd_post(dev, in_param, out_param ? 
*out_param : 0, @@ -333,8 +331,7 @@ static int mthca_cmd_wait(struct mthca_d int err = 0; struct mthca_cmd_context *context; - if (down_interruptible(&dev->cmd.event_sem)) - return -EINTR; + down(&dev->cmd.event_sem); spin_lock(&dev->cmd.context_lock); BUG_ON(dev->cmd.free_head < 0); --- infiniband/hw/mthca/mthca_mcg.c (revision 5214) +++ infiniband/hw/mthca/mthca_mcg.c (working copy) @@ -152,10 +152,7 @@ int mthca_multicast_attach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) { - err = -EINTR; - goto err_sem; - } + down(&dev->mcg_table.sem); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -240,7 +237,7 @@ int mthca_multicast_attach(struct ib_qp mthca_free(&dev->mcg_table.alloc, index); } up(&dev->mcg_table.sem); - err_sem: + mthca_free_mailbox(dev, mailbox); return err; } @@ -261,10 +258,7 @@ int mthca_multicast_detach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) { - err = -EINTR; - goto err_sem; - } + down_interruptible(&dev->mcg_table.sem); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -370,7 +364,7 @@ int mthca_multicast_detach(struct ib_qp out: up(&dev->mcg_table.sem); - err_sem: + mthca_free_mailbox(dev, mailbox); return err; } From rdreier at cisco.com Mon Jan 30 16:28:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jan 2006 16:28:09 -0800 Subject: [openib-general] Re: [PATCH] RFC: mthca handling of signals In-Reply-To: (Roland Dreier's message of "Mon, 30 Jan 2006 16:18:33 -0800") References: <20060123155436.GB26147@mellanox.co.il> Message-ID: err, with this fixed: > + down_interruptible(&dev->mcg_table.sem); From rolandd at cisco.com Mon Jan 30 16:45:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 00:45:32 +0000 Subject: [openib-general] [git patch review 1/4] IB/mthca: Relax UAR size check Message-ID: <1138668332064-a06b57921710eb35@cisco.com> 
There are some cards around that have UAR (user access region) size different from 8 MB. Relax our sanity check to make sure that the PCI BAR is big enough to access the UAR size reported by the device firmware instead. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_main.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) cbd2981a97cb628431a987a8abd1731c74bcc32e diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 8b00d9a..9c849d2 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -155,6 +155,13 @@ static int __devinit mthca_dev_lim(struc return -ENODEV; } + if (dev_lim->uar_size > pci_resource_len(mdev->pdev, 2)) { + mthca_err(mdev, "HCA reported UAR size of 0x%x bigger than " + "PCI resource 2 size of 0x%lx, aborting.\n", + dev_lim->uar_size, pci_resource_len(mdev->pdev, 2)); + return -ENODEV; + } + mdev->limits.num_ports = dev_lim->num_ports; mdev->limits.vl_cap = dev_lim->max_vl; mdev->limits.mtu_cap = dev_lim->max_mtu; @@ -976,8 +983,7 @@ static int __devinit mthca_init_one(stru err = -ENODEV; goto err_disable_pdev; } - if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || - pci_resource_len(pdev, 2) != 1 << 23) { + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM)) { dev_err(&pdev->dev, "Missing UAR, aborting.\n"); err = -ENODEV; goto err_disable_pdev; -- 1.1.3 From rolandd at cisco.com Mon Jan 30 16:45:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 00:45:32 +0000 Subject: [openib-general] [git patch review 2/4] IB/srp: Semaphore to mutex conversion In-Reply-To: <1138668332064-a06b57921710eb35@cisco.com> Message-ID: <1138668332064-f9cb54cbde95165a@cisco.com> Convert srp_host->target_mutex from a semaphore to a mutex. 
Signed-off-by: Ingo Molnar Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/srp/ib_srp.c | 14 +++++++------- drivers/infiniband/ulp/srp/ib_srp.h | 5 ++--- 2 files changed, 9 insertions(+), 10 deletions(-) 8e9e5f4f5eb1d44ddabfd1ddea4ca4e4244a9ffb diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 31207e6..2d2d4ac 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -357,9 +357,9 @@ static void srp_remove_work(void *target target->state = SRP_TARGET_REMOVED; spin_unlock_irq(target->scsi_host->host_lock); - down(&target->srp_host->target_mutex); + mutex_lock(&target->srp_host->target_mutex); list_del(&target->list); - up(&target->srp_host->target_mutex); + mutex_unlock(&target->srp_host->target_mutex); scsi_remove_host(target->scsi_host); ib_destroy_cm_id(target->cm_id); @@ -1254,9 +1254,9 @@ static int srp_add_target(struct srp_hos if (scsi_add_host(target->scsi_host, host->dev->dma_device)) return -ENODEV; - down(&host->target_mutex); + mutex_lock(&host->target_mutex); list_add_tail(&target->list, &host->target_list); - up(&host->target_mutex); + mutex_unlock(&host->target_mutex); target->state = SRP_TARGET_LIVE; @@ -1525,7 +1525,7 @@ static struct srp_host *srp_add_port(str return NULL; INIT_LIST_HEAD(&host->target_list); - init_MUTEX(&host->target_mutex); + mutex_init(&host->target_mutex); init_completion(&host->released); host->dev = device; host->port = port; @@ -1626,7 +1626,7 @@ static void srp_remove_one(struct ib_dev * Mark all target ports as removed, so we stop queueing * commands and don't try to reconnect. 
*/ - down(&host->target_mutex); + mutex_lock(&host->target_mutex); list_for_each_entry_safe(target, tmp_target, &host->target_list, list) { spin_lock_irqsave(target->scsi_host->host_lock, flags); @@ -1634,7 +1634,7 @@ static void srp_remove_one(struct ib_dev target->state = SRP_TARGET_REMOVED; spin_unlock_irqrestore(target->scsi_host->host_lock, flags); } - up(&host->target_mutex); + mutex_unlock(&host->target_mutex); /* * Wait for any reconnection tasks that may have diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index b564f18..4e7727d 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -37,8 +37,7 @@ #include #include - -#include +#include #include #include @@ -85,7 +84,7 @@ struct srp_host { struct ib_mr *mr; struct class_device class_dev; struct list_head target_list; - struct semaphore target_mutex; + struct mutex target_mutex; struct completion released; struct list_head list; }; -- 1.1.3 From rolandd at cisco.com Mon Jan 30 16:45:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 00:45:32 +0000 Subject: [openib-general] [git patch review 4/4] IB/mthca: Semaphore to mutex conversions In-Reply-To: <1138668332064-ea39a6f0ffa0f630@cisco.com> Message-ID: <1138668332064-e487a8354298962c@cisco.com> Convert semaphores to mutexes in mthca. Leave firmware command interface poll_sem and event_sem as semaphores. 
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 6 ++-- drivers/infiniband/hw/mthca/mthca_dev.h | 8 ++++-- drivers/infiniband/hw/mthca/mthca_mcg.c | 10 ++++--- drivers/infiniband/hw/mthca/mthca_memfree.c | 36 +++++++++++++------------- drivers/infiniband/hw/mthca/mthca_memfree.h | 7 ++--- drivers/infiniband/hw/mthca/mthca_provider.c | 6 ++-- 6 files changed, 37 insertions(+), 36 deletions(-) fd9cfdd11be3b37b5c919b64b43990f14a1587bd diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 69128fe..f9b9b93 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -199,7 +199,7 @@ static int mthca_cmd_post(struct mthca_d { int err = 0; - down(&dev->cmd.hcr_sem); + mutex_lock(&dev->cmd.hcr_mutex); if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -237,7 +237,7 @@ static int mthca_cmd_post(struct mthca_d op), dev->hcr + 6 * 4); out: - up(&dev->cmd.hcr_sem); + mutex_unlock(&dev->cmd.hcr_mutex); return err; } @@ -435,7 +435,7 @@ static int mthca_cmd_imm(struct mthca_de int mthca_cmd_init(struct mthca_dev *dev) { - sema_init(&dev->cmd.hcr_sem, 1); + mutex_init(&dev->cmd.hcr_mutex); sema_init(&dev->cmd.poll_sem, 1); dev->cmd.use_events = 0; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index a104ab0..2a165fd 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -44,6 +44,8 @@ #include #include #include +#include + #include #include "mthca_provider.h" @@ -111,7 +113,7 @@ enum { struct mthca_cmd { struct pci_pool *pool; int use_events; - struct semaphore hcr_sem; + struct mutex hcr_mutex; struct semaphore poll_sem; struct semaphore event_sem; int max_cmds; @@ -256,7 +258,7 @@ struct mthca_av_table { }; struct mthca_mcg_table { - struct semaphore sem; + struct mutex mutex; struct mthca_alloc alloc; struct mthca_icm_table *table; }; @@ -301,7 +303,7 @@ struct 
mthca_dev { u64 ddr_end; MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) - struct semaphore cap_mask_mutex; + struct mutex cap_mask_mutex; void __iomem *hcr; void __iomem *kar; diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c index 55ff5e5..321f11e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mcg.c +++ b/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -154,7 +154,7 @@ int mthca_multicast_attach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - down(&dev->mcg_table.sem); + mutex_lock(&dev->mcg_table.mutex); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -238,7 +238,7 @@ int mthca_multicast_attach(struct ib_qp BUG_ON(index < dev->limits.num_mgms); mthca_free(&dev->mcg_table.alloc, index); } - up(&dev->mcg_table.sem); + mutex_unlock(&dev->mcg_table.mutex); mthca_free_mailbox(dev, mailbox); return err; @@ -260,7 +260,7 @@ int mthca_multicast_detach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - down(&dev->mcg_table.sem); + mutex_lock(&dev->mcg_table.mutex); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -365,7 +365,7 @@ int mthca_multicast_detach(struct ib_qp } out: - up(&dev->mcg_table.sem); + mutex_unlock(&dev->mcg_table.mutex); mthca_free_mailbox(dev, mailbox); return err; @@ -383,7 +383,7 @@ int __devinit mthca_init_mcg_table(struc if (err) return err; - init_MUTEX(&dev->mcg_table.sem); + mutex_init(&dev->mcg_table.mutex); return 0; } diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index 9fb985a..d709cb1 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -50,7 +50,7 @@ enum { }; struct mthca_user_db_table { - struct semaphore mutex; + struct mutex mutex; struct { u64 uvirt; struct scatterlist mem; @@ -158,7 +158,7 @@ int mthca_table_get(struct mthca_dev *de int ret = 0; u8 status; - down(&table->mutex); + mutex_lock(&table->mutex); if 
(table->icm[i]) { ++table->icm[i]->refcount; @@ -184,7 +184,7 @@ int mthca_table_get(struct mthca_dev *de ++table->icm[i]->refcount; out: - up(&table->mutex); + mutex_unlock(&table->mutex); return ret; } @@ -198,7 +198,7 @@ void mthca_table_put(struct mthca_dev *d i = (obj & (table->num_obj - 1)) * table->obj_size / MTHCA_TABLE_CHUNK_SIZE; - down(&table->mutex); + mutex_lock(&table->mutex); if (--table->icm[i]->refcount == 0) { mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, @@ -207,7 +207,7 @@ void mthca_table_put(struct mthca_dev *d table->icm[i] = NULL; } - up(&table->mutex); + mutex_unlock(&table->mutex); } void *mthca_table_find(struct mthca_icm_table *table, int obj) @@ -220,7 +220,7 @@ void *mthca_table_find(struct mthca_icm_ if (!table->lowmem) return NULL; - down(&table->mutex); + mutex_lock(&table->mutex); idx = (obj & (table->num_obj - 1)) * table->obj_size; icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE]; @@ -240,7 +240,7 @@ void *mthca_table_find(struct mthca_icm_ } out: - up(&table->mutex); + mutex_unlock(&table->mutex); return page ? lowmem_page_address(page) + offset : NULL; } @@ -301,7 +301,7 @@ struct mthca_icm_table *mthca_alloc_icm_ table->num_obj = nobj; table->obj_size = obj_size; table->lowmem = use_lowmem; - init_MUTEX(&table->mutex); + mutex_init(&table->mutex); for (i = 0; i < num_icm; ++i) table->icm[i] = NULL; @@ -380,7 +380,7 @@ int mthca_map_user_db(struct mthca_dev * if (index < 0 || index > dev->uar_table.uarc_size / 8) return -EINVAL; - down(&db_tab->mutex); + mutex_lock(&db_tab->mutex); i = index / MTHCA_DB_REC_PER_PAGE; @@ -424,7 +424,7 @@ int mthca_map_user_db(struct mthca_dev * db_tab->page[i].refcount = 1; out: - up(&db_tab->mutex); + mutex_unlock(&db_tab->mutex); return ret; } @@ -439,11 +439,11 @@ void mthca_unmap_user_db(struct mthca_de * pages until we clean up the whole db table. 
*/ - down(&db_tab->mutex); + mutex_lock(&db_tab->mutex); --db_tab->page[index / MTHCA_DB_REC_PER_PAGE].refcount; - up(&db_tab->mutex); + mutex_unlock(&db_tab->mutex); } struct mthca_user_db_table *mthca_init_user_db_tab(struct mthca_dev *dev) @@ -460,7 +460,7 @@ struct mthca_user_db_table *mthca_init_u if (!db_tab) return ERR_PTR(-ENOMEM); - init_MUTEX(&db_tab->mutex); + mutex_init(&db_tab->mutex); for (i = 0; i < npages; ++i) { db_tab->page[i].refcount = 0; db_tab->page[i].uvirt = 0; @@ -499,7 +499,7 @@ int mthca_alloc_db(struct mthca_dev *dev int ret = 0; u8 status; - down(&dev->db_tab->mutex); + mutex_lock(&dev->db_tab->mutex); switch (type) { case MTHCA_DB_TYPE_CQ_ARM: @@ -585,7 +585,7 @@ found: *db = (__be32 *) &page->db_rec[j]; out: - up(&dev->db_tab->mutex); + mutex_unlock(&dev->db_tab->mutex); return ret; } @@ -601,7 +601,7 @@ void mthca_free_db(struct mthca_dev *dev page = dev->db_tab->page + i; - down(&dev->db_tab->mutex); + mutex_lock(&dev->db_tab->mutex); page->db_rec[j] = 0; if (i >= dev->db_tab->min_group2) @@ -624,7 +624,7 @@ void mthca_free_db(struct mthca_dev *dev ++dev->db_tab->min_group2; } - up(&dev->db_tab->mutex); + mutex_unlock(&dev->db_tab->mutex); } int mthca_init_db_tab(struct mthca_dev *dev) @@ -638,7 +638,7 @@ int mthca_init_db_tab(struct mthca_dev * if (!dev->db_tab) return -ENOMEM; - init_MUTEX(&dev->db_tab->mutex); + mutex_init(&dev->db_tab->mutex); dev->db_tab->npages = dev->uar_table.uarc_size / 4096; dev->db_tab->max_group1 = 0; diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.h b/drivers/infiniband/hw/mthca/mthca_memfree.h index 4fdca26..36f1141 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.h +++ b/drivers/infiniband/hw/mthca/mthca_memfree.h @@ -39,8 +39,7 @@ #include #include - -#include +#include #define MTHCA_ICM_CHUNK_LEN \ ((256 - sizeof (struct list_head) - 2 * sizeof (int)) / \ @@ -64,7 +63,7 @@ struct mthca_icm_table { int num_obj; int obj_size; int lowmem; - struct semaphore mutex; + struct mutex mutex; 
struct mthca_icm *icm[0]; }; @@ -147,7 +146,7 @@ struct mthca_db_table { int max_group1; int min_group2; struct mthca_db_page *page; - struct semaphore mutex; + struct mutex mutex; }; enum mthca_db_type { diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 484a7e6..e88e39a 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -185,7 +185,7 @@ static int mthca_modify_port(struct ib_d int err; u8 status; - if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) + if (mutex_lock_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) return -ERESTARTSYS; err = mthca_query_port(ibdev, port, &attr); @@ -207,7 +207,7 @@ static int mthca_modify_port(struct ib_d } out: - up(&to_mdev(ibdev)->cap_mask_mutex); + mutex_unlock(&to_mdev(ibdev)->cap_mask_mutex); return err; } @@ -1185,7 +1185,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.post_recv = mthca_tavor_post_receive; } - init_MUTEX(&dev->cap_mask_mutex); + mutex_init(&dev->cap_mask_mutex); ret = ib_register_device(&dev->ib_dev); if (ret) -- 1.1.3 From rolandd at cisco.com Mon Jan 30 16:45:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 00:45:32 +0000 Subject: [openib-general] [git patch review 3/4] IB/mthca: Don't cancel commands on a signal In-Reply-To: <1138668332064-f9cb54cbde95165a@cisco.com> Message-ID: <1138668332064-ea39a6f0ffa0f630@cisco.com> We have run into the following problem: if a task receives a signal while in the process of e.g. destroying a resource (which could be because the relevant file was closed) mthca could bail out from trying to take a command interface semaphore without performing the appropriate command to tell hardware that the resource is being destroyed. As a result we see messages like ib_mthca 0000:04:00.0: HW2SW_CQ failed (-4) In this case, hardware could access the resource after the memory has been freed, possibly causing memory corruption. 
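The failure mode just described can be modelled in user space with a toy C sketch. Nothing here is mthca code: struct fake_cq and every fake_* function are hypothetical, and the model assumes the destroy path frees the memory whether or not the command succeeded, which is what the message above implies. On Linux the (-4) in the log corresponds to -EINTR.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a resource shared with "hardware". */
struct fake_cq {
    int owned_by_hw;   /* 1 while hardware may still access it */
    int memory_freed;  /* 1 once software has released the memory */
};

/* Stand-in for an interruptible lock acquisition: reports -EINTR
 * when a signal is pending. */
static int fake_down_interruptible(int signal_pending)
{
    return signal_pending ? -EINTR : 0;
}

/* Stand-in for the firmware command that returns the CQ to software. */
static int fake_hw2sw_cq(struct fake_cq *cq, int signal_pending,
                         int interruptible)
{
    if (interruptible) {
        int err = fake_down_interruptible(signal_pending);

        if (err)
            return err;   /* the command never reaches the hardware */
    }
    cq->owned_by_hw = 0;
    return 0;
}

/* Destroy path (assumption): memory is freed regardless of the
 * command status, so a skipped command leaves hardware pointing at it. */
static int fake_destroy_cq(struct fake_cq *cq, int signal_pending,
                           int interruptible)
{
    int err = fake_hw2sw_cq(cq, signal_pending, interruptible);

    cq->memory_freed = 1;
    return err;
}

/* Returns 1 when hardware may still touch freed memory. */
static int fake_cq_is_dangling(const struct fake_cq *cq)
{
    return cq->owned_by_hw && cq->memory_freed;
}
```

With a pending signal and interruptible set, fake_destroy_cq() returns -EINTR and leaves the CQ dangling; with interruptible cleared, the command always runs and the CQ is released cleanly.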
A simple solution is to replace down_interruptible() by down() in command interface activation. Signed-off-by: Michael S. Tsirkin [ It's also not safe to bail out on multicast table operations, since they may be invoked on the cleanup path too. So use down() for mcg_table.sem too. ] Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 9 +++------ drivers/infiniband/hw/mthca/mthca_mcg.c | 14 ++++---------- 2 files changed, 7 insertions(+), 16 deletions(-) e3aa31c517cb6fd0a3d8b23e6a7e71a6aafc2393 diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index be1791b..69128fe 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -199,8 +199,7 @@ static int mthca_cmd_post(struct mthca_d { int err = 0; - if (down_interruptible(&dev->cmd.hcr_sem)) - return -EINTR; + down(&dev->cmd.hcr_sem); if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -255,8 +254,7 @@ static int mthca_cmd_poll(struct mthca_d int err = 0; unsigned long end; - if (down_interruptible(&dev->cmd.poll_sem)) - return -EINTR; + down(&dev->cmd.poll_sem); err = mthca_cmd_post(dev, in_param, out_param ? 
*out_param : 0, @@ -333,8 +331,7 @@ static int mthca_cmd_wait(struct mthca_d int err = 0; struct mthca_cmd_context *context; - if (down_interruptible(&dev->cmd.event_sem)) - return -EINTR; + down(&dev->cmd.event_sem); spin_lock(&dev->cmd.context_lock); BUG_ON(dev->cmd.free_head < 0); diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c index 77bc6c7..55ff5e5 100644 --- a/drivers/infiniband/hw/mthca/mthca_mcg.c +++ b/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -154,10 +154,7 @@ int mthca_multicast_attach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) { - err = -EINTR; - goto err_sem; - } + down(&dev->mcg_table.sem); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -242,7 +239,7 @@ int mthca_multicast_attach(struct ib_qp mthca_free(&dev->mcg_table.alloc, index); } up(&dev->mcg_table.sem); - err_sem: + mthca_free_mailbox(dev, mailbox); return err; } @@ -263,10 +260,7 @@ int mthca_multicast_detach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) { - err = -EINTR; - goto err_sem; - } + down(&dev->mcg_table.sem); err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -372,7 +366,7 @@ int mthca_multicast_detach(struct ib_qp out: up(&dev->mcg_table.sem); - err_sem: + mthca_free_mailbox(dev, mailbox); return err; } -- 1.1.3 From ogerlitz at voltaire.com Mon Jan 30 22:23:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jan 2006 08:23:35 +0200 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DE5742.5030601@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> Message-ID: <43DF0267.4080007@voltaire.com> Sean Hefty wrote: > The CMA will eventually be extended to query the SA directly if the > local look-up fails. 
The intent is to keep these failures hidden from > the end-user. I see, it makes sense. > The SA cache does not use a failed lookup as a trigger to update its > cache, but that could be added. My initial reaction is that it makes > sense to do this. Why not have a failed lookup as the --only-- trigger to update the cache? That way the cache contains only paths that were demanded by some consumer. What implementation were you considering? Is it an SA replica having all those paths whose sgid is the local node gid? From mst at mellanox.co.il Mon Jan 30 22:38:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 08:38:34 +0200 Subject: [openib-general] Re: [PATCH 0/4] SA path record caching In-Reply-To: <43DE942D.1090304@ichips.intel.com> References: <43DE942D.1090304@ichips.intel.com> Message-ID: <20060131063834.GA27065@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH 0/4] SA path record caching > > Michael S. Tsirkin wrote: > >>The cache is updated every 15 minutes, but this value is user configurable at > >>module load time. The cache is also updated if a local event occurs, such as > >>port up/down or SM LID change. > > > > Unfortunately port down on any link in the path has the same effect > > but won't invalidate the cache. > > Yes, but it was more difficult for me to detect remote events. At some point, > the cache needs to register with the SA for events, but that only works as long > as the SA is reachable. I see how this would be nontrivial. Further, registration with SA events means that the SA has to initiate a storm of notification MADs on each link failure. See below for some ideas on how to improve on this. If an RC QP times out, there's no ARP reply, or a CM attempt does not get a reply, this is a strong hint that the path has a problem. I guess we could verify by sending getportinfo or something. Once there's no reply, I think we want to look for another path, without the cache getting in the way.
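One way to picture "without the cache getting in the way" is a cache whose entries can be dropped individually when a failure hint arrives, so that the next lookup misses and the caller goes back to the SA. The C sketch below is purely illustrative: struct path_cache, its fixed size, and all path_cache_* functions are hypothetical and are not taken from the SA cache patches under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATH_CACHE_SLOTS 8

/* Hypothetical cache entry: one path per destination LID. */
struct path_entry {
    uint16_t dlid;
    int valid;
};

struct path_cache {
    struct path_entry slot[PATH_CACHE_SLOTS];
};

/* Return the cached entry for dlid, or NULL so the caller can fall
 * back to a fresh SA query. */
static struct path_entry *path_cache_lookup(struct path_cache *c,
                                            uint16_t dlid)
{
    size_t i;

    assert(c != NULL);
    for (i = 0; i < PATH_CACHE_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].dlid == dlid)
            return &c->slot[i];
    return NULL;
}

/* Add an entry if it is not already cached.
 * Returns 0 on success, -1 if the cache is full. */
static int path_cache_insert(struct path_cache *c, uint16_t dlid)
{
    size_t i;

    if (path_cache_lookup(c, dlid))
        return 0;
    for (i = 0; i < PATH_CACHE_SLOTS; i++) {
        if (!c->slot[i].valid) {
            c->slot[i].dlid = dlid;
            c->slot[i].valid = 1;
            return 0;
        }
    }
    return -1;
}

/* Called on a failure hint (RC timeout, missing ARP or CM reply):
 * drop the suspect entry and leave every other entry usable. */
static void path_cache_report_failure(struct path_cache *c, uint16_t dlid)
{
    struct path_entry *e = path_cache_lookup(c, dlid);

    if (e)
        e->valid = 0;
}
```

After path_cache_report_failure(&cache, dlid), a lookup for that dlid returns NULL while paths to other nodes remain cached.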
> > One way to solve this would be to invalidate the cache, and retry, > > if an attempt to connect to the remote node fails. > > I didn't want to invalidate the cache too quickly. > If the SA goes down, or the > link to it drops, then the cache can still be used to establish connections with > those nodes that are reachable. Let's just invalidate the entry that has the problem then. And I guess we could start an SA query and delay the invalidation until we get the response. I also wonder what happens if the SA goes down at the exact 15 min interval when the cache is invalidated? > Do we expect most path failures to be permanent > or transient? No idea. Both kinds? What do you think? > > SA gets trap notices on link failures, doesn't it? > > So, unlike with the local cache, we don't depend on sweeps. > > Traps are optional. RC is also optional :) Implementations seem to have them. > >> At some point, > >>the cache will contain multiple paths, letting the connection be retried along > >>another path. > > > > Is this possible in the current implementation? > > Currently, only a single path record is maintained to each remote node. > Supporting multiple paths would be possible with some additional work. (MPI > requires multiple paths for its routing algorithms.) -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From dotanb at mellanox.co.il Mon Jan 30 22:42:51 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 31 Jan 2006 08:42:51 +0200 Subject: [openib-general] What is proper recovery from ib_poll_cq failure(gen1) ? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC90F@mtlexch01.mtl.com> Hi Jeff. Just a little obvious thing: it doesn't matter if you are working over gen1 / gen2 / windows / linux: the behavior will be the same. > I'm on Windows running gen1 code and after successfully > sending a number > Of packets my call to ib_poll_cq is failing with a status of > IB_WCS_RNR_RETRY_ERR.
> > First, am I correct that this means the receiver's resources are not > ready? Yes. A completion with the status RNR_RETRY means "Receiver Not Ready". The requestor sent a message which should consume a WR in the responder's RQ, but the RQ (of the responder) is empty. > If so, what does *that* mean? My receiver (Linux, gen2) has posted a > receive > WR and is waiting for a CQ event which never comes. Maybe all of the WRs from this RQ were used? Maybe there isn't any sync between the two sides? (Maybe you should increase the RNR retry count / timer?) > Second, what is the correct recovery logic for this? I've tried > re-posting the > Send and re-polling the CQ, but that gives me IB_WCS_WR_FLUSHED_ERR > Over and over again. So it seems to me that I have a problem on my > receive > Side, but I don't have the foggiest idea what it could be. You cannot recover after this status (and this is the reason why all of the following completions were flushed with error). In RC QPs if you got completion with error in the RQ/SQ the error causes the QP to go to ERR state (cannot be recovered). In UC/UD QPs if you got completion with error in the SQ the error causes the QP to go to SQE state (can be recovered, and the QP state can be changed to RTS). In UC/UD QPs if you got completion with error in the RQ the error causes the QP to go to ERR state (cannot be recovered). The only way to "recover" when at least one of the QPs is in error state is to establish the connection once again. Hope this info helped you ... Dotan From ogerlitz at voltaire.com Mon Jan 30 23:44:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jan 2006 09:44:35 +0200 (IST) Subject: [openib-general] [PATCH] iser: added -EAGAIN scheme to the tx posting and completion code Message-ID: committed in r5228 added -EAGAIN scheme to the tx posting and completion code. When the QP is full and we can't post to it, TX is suspended till there's room.
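The suspend-and-resume idea can be sketched in user-space C as below. This is a simplified, single-threaded model and not the iSER code: struct toy_conn, TOY_MAX_SEND_WR and the toy_* functions are hypothetical, and the locking and work scheduling that a real driver needs are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical limit standing in for the QP's send-queue depth. */
#define TOY_MAX_SEND_WR 4

struct toy_conn {
    int posted;        /* sends posted and not yet completed */
    int tx_suspended;  /* set while the send queue is full */
    int resumes;       /* how many times tx was restarted */
};

/* Check before posting: refuse with -EAGAIN and remember that tx is
 * suspended when no send slot is free. */
static int toy_check_xmit(struct toy_conn *c)
{
    if (c->posted == TOY_MAX_SEND_WR) {
        c->tx_suspended = 1;
        return -EAGAIN;
    }
    return 0;
}

/* Post one send, or return -EAGAIN so the caller can retry later. */
static int toy_post_send(struct toy_conn *c)
{
    int rc = toy_check_xmit(c);

    if (rc)
        return rc;
    c->posted++;
    return 0;
}

/* Send completion: free a slot and, if tx was suspended, resume it.
 * A real driver would reschedule its transmit work at this point. */
static void toy_send_completion(struct toy_conn *c)
{
    assert(c->posted > 0);
    c->posted--;
    if (c->tx_suspended) {
        c->tx_suspended = 0;
        c->resumes++;
    }
}
```

The caller treats -EAGAIN as "try again after a completion" rather than as a hard error, which is why the error printouts on the send paths can go away.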
Signed-off-by: Or Gerlitz Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 5182) +++ ulp/iser/iser_conn.c (revision 5228) @@ -204,7 +204,7 @@ int iser_conn_set_full_featured_mode(str int i, err = 0; /* no need to keep it in a var, we are after login so if this should * be negotiated, by now the result should be available here */ - int initial_post_recv_bufs_num = ISER_INITIAL_POST_RECV + 2; + int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS; iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5182) +++ ulp/iser/iscsi_iser.h (revision 5228) @@ -303,6 +303,7 @@ struct iscsi_iser_conn struct list_head item; /* maintains list of conns */ unsigned long suspend_tx; + spinlock_t lock; char name[ISER_OBJECT_NAME_SIZE]; }; @@ -328,13 +329,11 @@ struct iscsi_iser_cmd_task { struct iscsi_iser_conn *conn; spinlock_t task_lock; enum iser_task_status status; + int command_sent; int datasn; /* DataSN */ uint32_t unsol_datasn; int sent; - struct scatterlist *sg; /* per-cmd SG list */ - struct scatterlist *bad_sg; /* assert statement */ - int sg_count; /* SG's to process */ int imm_count; /* imm-data (bytes) */ int unsol_count; /* unsolicited (bytes)*/ @@ -344,7 +343,6 @@ struct iscsi_iser_cmd_task { struct scsi_cmnd *sc; /* associated SCSI cmd*/ int total_length; - int data_offset; struct iscsi_iser_mgmt_task *mtask; /* tmf mtask in progr */ unsigned int post_send_count; /* posted send buffers pending completion */ @@ -592,58 +590,30 @@ void iser_task_set_status(struct iscsi_i /* --------------------------------------------------------------------- * CONSTANTS & MACROS * ------------------------------------------------------------------ */ -#define ISER_MAX_NOP_IN 2 -#define ISER_MAX_ASYNC_EVT 2 +/* iSER Initiator QP settings */ +/* Maximal bounds on 
asynchronous PDUs received by iSER Initiator */ -#define ISER_MAX_LOGIN_REQ 1 -#define ISER_MAX_TEXT_REQ 1 -#define ISER_MAX_NOP_OUT 2 -#define ISER_MAX_TASK_MGT_REQ 2 -#define ISER_MAX_LOGOUT_REQ 1 - -#define ISER_MAX_IMMEDIATE_CMDS 2 - -#define ISER_MIN_RECV_DSL (8*1024) /* 8K */ -#define ISER_MAX_FIRST_BURST (128*1024) /* 128K */ - -#define ISER_MAX_CTRLS_PER_CMD(first_burst,recv_dsl,imm) \ - ((first_burst / recv_dsl) + \ - (first_burst % recv_dsl > 0 ? 1 : 0) + \ - (imm ? 0 : 1)) - - /* Maximal bounds on asynchronous PDUs received by iSER Initiator */ -#define ISER_MAX_RX_MISC_PDUS (ISER_MAX_NOP_IN + \ - ISER_MAX_ASYNC_EVT) - -#define ISER_MAX_TX_MISC_PDUS (ISER_MAX_TEXT_REQ + \ - ISER_MAX_NOP_OUT + \ - ISER_MAX_TASK_MGT_REQ + \ - ISER_MAX_LOGOUT_REQ) +#define ISER_MAX_RX_MISC_PDUS 4 /* NOOP_IN(2) , ASYNC_EVENT(2) */ -#define ISER_MAX_RX_CMD_RESP ISCSI_ISER_XMIT_CMDS_MAX +#define ISER_MAX_TX_MISC_PDUS 6 /* NOOP_OUT(2), TEXT(1), * + * SCSI_TMFUNC(2), LOGOUT(1) */ +#define ISER_QP_MAX_RECV_DTOS (ISCSI_ISER_XMIT_CMDS_MAX + \ + ISER_MAX_RX_MISC_PDUS + \ + ISER_MAX_TX_MISC_PDUS) -/* iSER Initiator QP settings */ -#define ISER_AVG_TASK_RELATED_SEND(first_burst, recv_dsl,imm,max_cmds) \ - (max_cmds * (ISER_MAX_CTRLS_PER_CMD(first_burst,recv_dsl,imm) + \ - ISER_MAX_IMMEDIATE_CMDS)) - -#define ISER_INITIAL_POST_RECV ISER_MAX_RX_MISC_PDUS - -#define ISER_QP_AVG_POST_RECV (ISER_MAX_RX_CMD_RESP + \ - ISER_MAX_RX_MISC_PDUS + \ - ISER_MAX_TX_MISC_PDUS) - -#define ISER_QP_MAX_RECV_DTOS (ISER_QP_AVG_POST_RECV + 8) - -#define ISER_QP_MAX_REQ_DTOS \ - (ISER_AVG_TASK_RELATED_SEND( \ - ISER_MAX_FIRST_BURST, \ - ISER_MIN_RECV_DSL, \ - 1, \ - ISCSI_ISER_XMIT_CMDS_MAX) + \ - ISER_MAX_TX_MISC_PDUS + \ - ISER_MAX_RX_MISC_PDUS) +/* the max TX (send) WR supported by the iSER QP is defined by * + * max_send_wr = T * (1 + D) + C ; D is how many inflight dataouts we expect * + * to have at max for SCSI command. 
The tx posting & completion handling code * + * supports -EAGAIN scheme where tx is suspended till the QP has room for more * + * send WR. D=8 comes from 64K/8K */ + +#define ISER_INFLIGHT_DATAOUTS 8 + +#define ISER_QP_MAX_REQ_DTOS (ISCSI_ISER_XMIT_CMDS_MAX * \ + (1 + ISER_INFLIGHT_DATAOUTS) + \ + ISER_MAX_TX_MISC_PDUS + \ + ISER_MAX_RX_MISC_PDUS) /* iSER Initiator CQ settings */ #define ISCSI_ISER_MAX_CONN 8 Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 5182) +++ ulp/iser/iser_initiator.c (revision 5228) @@ -266,6 +266,22 @@ iser_prepare_write_cmd(struct iscsi_iser return 0; } +static int +iser_check_xmit(struct iscsi_iser_conn *conn, void *task) +{ + int rc = 0; + + spin_lock_bh(&conn->lock); + if(atomic_read(&conn->post_send_buf_count) == ISER_QP_MAX_REQ_DTOS) { + iser_dbg("%ld can't xmit task %p, suspending tx\n",jiffies,task); + set_bit(SUSPEND_BIT, &conn->suspend_tx); + rc = -EAGAIN; + } + spin_unlock_bh(&conn->lock); + return rc; +} + + /** * iser_send_command - send command PDU */ @@ -284,6 +300,8 @@ int iser_send_command(struct iscsi_iser_ iser_err("Failed to send, conn: 0x%p is not up\n", p_iser_conn->ib_conn); return -EPERM; } + if(iser_check_xmit(p_iser_conn, p_ctask)) + return -EAGAIN; edtl = ntohl(hdr->data_length); @@ -372,6 +390,9 @@ int iser_send_data_out(struct iscsi_iser return -EPERM; } + if(iser_check_xmit(p_iser_conn, p_ctask)) + return -EAGAIN; + itt = ntohl(hdr->itt); data_seg_len = ntoh24(hdr->dlength); buf_offset = ntohl(hdr->offset); @@ -455,6 +476,9 @@ int iser_send_control(struct iscsi_iser_ return -EPERM; } + if(iser_check_xmit(p_iser_conn,p_mtask)) + return -EAGAIN; + /* build the tx desc regd header and add it to the tx desc dto */ p_mtask->desc.type = ISCSI_TX_CONTROL; p_send_dto = &p_mtask->desc.dto; @@ -627,6 +651,14 @@ void iser_snd_completion(struct iser_des atomic_dec(&p_iser_conn->post_send_buf_count); + 
spin_lock(&p_iser_conn->lock); + if(p_iser_conn->suspend_tx) { + iser_dbg("%ld resuming tx\n",jiffies); + clear_bit(SUSPEND_BIT, &p_iser_conn->suspend_tx); + schedule_work(&p_iser_conn->xmitwork); + } + spin_unlock(&p_iser_conn->lock); + /* if the last sent PDU of the task, task can be freed */ if (p_dto->p_task != NULL && iser_task_post_send_count_dec_and_test(p_dto->p_task)) Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5182) +++ ulp/iser/iscsi_iser.c (revision 5228) @@ -104,8 +104,7 @@ static void iscsi_iser_cmd_init(struct i MAX_COMMAND_SIZE - sc->cmd_len); ctask->mtask = NULL; - ctask->sent = 0; - ctask->sg_count = 0; + ctask->command_sent = 0; ctask->total_length = sc->request_bufflen; @@ -168,9 +167,6 @@ static void iscsi_iser_cmd_init(struct i * call it again later, or recover. '0' return code means successful * xmit. * - * Management xmit state machine consists of two states: - * IN_PROGRESS_IMM_HEAD - PDU Header xmit in progress - * IN_PROGRESS_IMM_DATA - PDU Data xmit in progress **/ static int iscsi_iser_mtask_xmit(struct iscsi_iser_conn *conn, struct iscsi_iser_mgmt_task *mtask) @@ -179,12 +175,8 @@ static int iscsi_iser_mtask_xmit(struct debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); - /* Send the control */ error = iser_send_control(conn, mtask); - if (error) - printk(KERN_ERR "send_control failed\n"); - return error; } @@ -253,10 +245,9 @@ static int iscsi_iser_ctask_xmit_unsol_d /* Send the command */ error = iser_send_data_out(conn, ctask, &hdr); if (error) { - printk(KERN_ERR "send_data_out failed\n"); + ctask->unsol_datasn--; goto iscsi_iser_ctask_xmit_unsol_data_exit; } - ctask->unsol_count -= ctask->data_count; debug_scsi("Need to send %d more as data-out PDUs\n", ctask->unsol_count); @@ -281,22 +272,19 @@ static int iscsi_iser_ctask_xmit(struct return error; /* Send the cmd PDU */ - error = iser_send_command(conn, ctask); - if (error) { 
- printk(KERN_ERR "Couldn't send a cmd PDU\n"); - goto iscsi_iser_ctask_xmit_exit; + if(!ctask->command_sent) { + error = iser_send_command(conn, ctask); + if (error) + goto iscsi_iser_ctask_xmit_exit; + ctask->command_sent = 1; } /* Send unsolicited data-out PDU(s) if necessary */ - if (ctask->unsol_count) { + if (ctask->unsol_count) error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); - if (error) - printk(KERN_ERR "Couldn't send unsolicited " - "data-out PDU(s)\n"); - } iscsi_iser_ctask_xmit_exit: - if(error) + if(error && error != -EAGAIN) iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); return error; } @@ -1099,6 +1087,8 @@ static iscsi_connh_t iscsi_iser_conn_cre atomic_set(&conn->post_send_buf_count, 0); init_waitqueue_head(&conn->disconnect_wait_q); + spin_lock_init(&conn->lock); + return iscsi_handle(conn); login_mtask_alloc_fail: From mst at mellanox.co.il Mon Jan 30 23:47:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 09:47:53 +0200 Subject: [openib-general] Re: [PATCH] fix minor typo in SDP In-Reply-To: <1137534902.4520.345.camel@brick.internal.keyresearch.com> References: <1137534902.4520.345.camel@brick.internal.keyresearch.com> Message-ID: <20060131074753.GF31887@mellanox.co.il> Quoting r. Ralph Campbell : > Subject: [PATCH] fix minor typo in SDP > > This patch fixes a minor misspelling in SDP. Thanks! Applied. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Jan 31 00:02:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 10:02:07 +0200 Subject: [openib-general] Re: [PATCH] SDP doesn't flush path cache if deviceremoved In-Reply-To: <1138392648.23076.13.camel@brick.internal.keyresearch.com> References: <1138392648.23076.13.camel@brick.internal.keyresearch.com> Message-ID: <20060131080207.GG31887@mellanox.co.il> Quoting r. 
Ralph Campbell : > Subject: [PATCH] SDP doesn't flush path cache if deviceremoved > > SDP doesn't remove cached SA path records if the device is removed. > Minor nit: flush_workqueue() doesn't need to be called before > destroy_workqueue(). > > Signed-off-by: Ralph Campbell Applied, thanks! > Index: sdp_proto.h > =================================================================== > --- sdp_proto.h (revision 5193) > +++ sdp_proto.h (working copy) Could you please build patches so that they apply cleanly with -p1 under the linux tree? They should look like Index: linux-2.6.15/drivers/infiniband/ulp/sdp/sdp_proto.h =================================================================== etc. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From RAISCH at de.ibm.com Tue Jan 31 01:07:02 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Tue, 31 Jan 2006 10:07:02 +0100 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: Message-ID: I also prefer the idea that queue poll locking should be part of the device driver. Especially on SMP systems there are interesting concepts around what else to do than just spinlock. From mst at mellanox.co.il Tue Jan 31 01:13:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 11:13:43 +0200 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: References: Message-ID: <20060131091343.GK31887@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ > > > static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, > > struct ib_wc *wc) > > { > > unsigned long flags; > > int ret; > > spin_lock_irqsave(cq->lock, flags); > > ret = cq->poll_cq(cq, num_entries, wc); > > spin_unlock_irqrestore(cq->lock, flags); > > return ret; > > } > > Definitely an interesting idea, although it doesn penalize a > hypothetical device that can do lock-free CQ polling somehow. 
True, although Documentation/SubmittingPatches does say "Don't try to anticipate nebulous future cases which may or may not be useful" :) And it seems you'll always need some kind of lock if you want to detect queue overruns on post_send/post_recv. > My bias has been to leave locking to the low-level driver, but if you > can show an improvement on real hardware with this idea then I would > be inclined to use this approach. Sounds sane. I'm a bit busy now, but I guess this optimization can wait. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Jan 31 02:58:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 12:58:22 +0200 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: References: Message-ID: <20060131105822.GN31887@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ > > > static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, > > struct ib_wc *wc) > > { > > unsigned long flags; > > int ret; > > spin_lock_irqsave(cq->lock, flags); > > ret = cq->poll_cq(cq, num_entries, wc); > > spin_unlock_irqrestore(cq->lock, flags); > > return ret; > > } > > Definitely an interesting idea, although it doesn penalize a > hypothetical device that can do lock-free CQ polling somehow. I guess we'll always be able to move back to the current code if/when such a device appears. > My bias has been to leave locking to the low-level driver, but if you > can show an improvement on real hardware with this idea then I would > be inclined to use this approach. BTW, just moving poll_cq and req_notify_cq to struct ib_cq will already remove one indirection level from the datapath. And this can be done without touching low-level drivers. Same for QP and various ib_post callbacks. Should I submit such a patch? Sean? -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Jan 31 03:03:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 13:03:05 +0200 Subject: [openib-general] Re: Re: [PATCH 2/6] [RFC] mthca kernel changes forresizeCQ In-Reply-To: References: Message-ID: <20060131110305.GO31887@mellanox.co.il> Quoting Christoph Raisch : > Subject: Re: Re: [PATCH 2/6] [RFC] mthca kernel changes forresizeCQ > > > I also prefer the idea that queue poll locking should be part of the device > driver. Seems to just be duplicated code at the moment. > Especially on SMP systems there are interesting concepts around what else > to do than just spinlock. I'm not aware of any interesting concepts. What are they? As Roland pointed out, all existing drivers do polling under the lock. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Tue Jan 31 04:16:34 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jan 2006 14:16:34 +0200 (IST) Subject: [openib-general] [PATCH] iser: cleanup and re org Message-ID: committed to r5235 removed iser_task.c & iscsi_iser_cmd_task->task_lock, post_send_count and sent fields. some reorg of the dma-map/mem-reg and mem-unreg/dma-unmap flow for ctasks. removed iser_regd_buf->data_cache and some more cleanups.
Signed-off-by: Or Gerlitz Makefile | 1 iscsi_iser.c | 4 iscsi_iser.h | 30 ++----- iser_initiator.c | 223 ++++++++++++++++++++++++++----------------------------- iser_memory.c | 1 iser_task.c | 130 -------------------------------- 6 files changed, 120 insertions(+), 269 deletions(-) Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5228) +++ ulp/iser/iscsi_iser.h (revision 5235) @@ -184,7 +184,6 @@ struct iser_mem_reg { struct iser_regd_buf { struct iser_mem_reg reg; /* memory registration info */ - kmem_cache_t *data_cache; /* data allocated from here, when set */ void *virt_addr; struct iser_adaptor *p_adaptor; /* p_adaptor->device for dma_unmap */ @@ -198,14 +197,6 @@ struct iser_regd_buf { #define MAX_REGD_BUF_VECTOR_LEN 2 -enum iser_dto_type { - ISER_DTO_RCV = 0, /* Receive buffer */ - ISER_DTO_SEND, /* Send buffer */ - ISER_DTO_PASSIVE, /* Passive side of a remote RDMA op */ - - ISER_DTO_TYPES_NUM -}; - struct iser_dto { struct iscsi_iser_cmd_task *p_task; struct iscsi_iser_conn *p_conn; @@ -327,13 +318,12 @@ struct iscsi_iser_cmd_task { struct iscsi_cmd *hdr; /* iSCSI PDU header points to desc->iscsi_hdr */ int itt; /* this ITT */ struct iscsi_iser_conn *conn; - spinlock_t task_lock; + enum iser_task_status status; int command_sent; int datasn; /* DataSN */ uint32_t unsol_datasn; - int sent; int imm_count; /* imm-data (bytes) */ int unsol_count; /* unsolicited (bytes)*/ @@ -345,8 +335,6 @@ struct iscsi_iser_cmd_task { int total_length; struct iscsi_iser_mgmt_task *mtask; /* tmf mtask in progr */ - unsigned int post_send_count; /* posted send buffers pending completion */ - int dir[ISER_DIRS_NUM]; /* set if direction used */ struct iser_regd_buf rdma_regd[ISER_DIRS_NUM]; /* regd rdma buffer */ unsigned long data_len[ISER_DIRS_NUM]; /* total data length */ @@ -525,12 +513,21 @@ void iser_dto_send_create(struct iscsi_i /* iser_initiator.h */ +int iser_dma_map_task_data(struct 
iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *p_data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir); + +void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task); void iser_rcv_completion(struct iser_desc *p_desc, unsigned long dto_xfer_len); void iser_snd_completion(struct iser_desc *p_desc); +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *p_iser_task); +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *p_iser_task); + /* iser_memory.h */ /* regd_buf */ @@ -576,15 +573,8 @@ int iser_page_vec_build(struct iser_data int count); -/* iser_task.h */ -void iser_task_init_lowpart(struct iscsi_iser_cmd_task *p_iser_task); -void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *iser_task); -void iser_task_post_send_count_inc(struct iscsi_iser_cmd_task *p_iser_task); -int iser_task_post_send_count_dec_and_test(struct iscsi_iser_cmd_task *p_iser_task); -void iser_task_set_status(struct iscsi_iser_cmd_task *p_iser_task, - enum iser_task_status status); /* iser_verbs.h */ /* --------------------------------------------------------------------- Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 5228) +++ ulp/iser/iser_initiator.c (revision 5235) @@ -40,7 +40,63 @@ #include "iscsi_iser.h" -static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task); +int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *p_data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir) +{ + struct device *dma_device; + dma_addr_t dma_addr; + int dma_nents; + + p_iser_task->dir[iser_dir] = 1; + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + if (p_data->type == ISER_BUF_TYPE_SINGLE) { + p_iser_task->data_len[iser_dir] = p_data->size; + dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size, + dma_dir); + if (dma_mapping_error(dma_addr)) { + 
iser_err("dma_map_single failed at %p\n", p_data->p_buf); + return -EINVAL; + } + p_data->dma_addr = dma_addr; + } else { + dma_nents = dma_map_sg(dma_device, p_data->p_buf, p_data->size, + dma_dir); + if (dma_nents == 0) { + iser_err("dma_map_sg failed!!!\n"); + return -EINVAL; + } + p_data->dma_nents = dma_nents; + p_iser_task->data_len[iser_dir] = iser_sg_size(p_data); + } + return 0; +} + +void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task) +{ + struct device *dma_device; + struct iser_data_buf *p_data; + + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + p_data = &p_iser_task->data[ISER_DIR_IN]; + if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) + dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, + DMA_FROM_DEVICE); + else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ + dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, + DMA_FROM_DEVICE); + + p_data = &p_iser_task->data[ISER_DIR_OUT]; + if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) + dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, + DMA_TO_DEVICE); + else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ + dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, + DMA_TO_DEVICE); +} /** * iser_reg_rdma_mem - Registers memory @@ -74,7 +130,6 @@ static int iser_reg_rdma_mem(struct iscs int aligned_len; iser_dbg("converting sg to page_vec\n"); - /* DMA_MAP: use task->data[IN/OUT] to check alignment */ aligned_len = iser_data_buf_aligned_len(p_mem,0); if (aligned_len == p_mem->size) cnt_to_reg = aligned_len; @@ -121,35 +176,15 @@ static int iser_prepare_read_cmd(struct { struct iser_regd_buf *p_regd_buf; int err; - dma_addr_t dma_addr; - int dma_nents; - struct device *dma_device; struct iser_hdr *hdr = &p_iser_task->desc.iser_header; - p_iser_task->dir[ISER_DIR_IN] = 1; - dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + err = 
iser_dma_map_task_data(p_iser_task, + buf_in, + ISER_DIR_IN, + DMA_FROM_DEVICE); + if(err) + return err; - if (buf_in->type == ISER_BUF_TYPE_SINGLE) { - p_iser_task->data_len[ISER_DIR_IN] = buf_in->size; - /* DMA_MAP: map single task->data[ISER_DIR_IN], store dma_addr */ - dma_addr = dma_map_single(dma_device,buf_in->p_buf, buf_in->size, - DMA_FROM_DEVICE); - if (dma_mapping_error(dma_addr)) { - iser_err("dma_map_single failed at %p\n", buf_in->p_buf); - return -EINVAL; - } - buf_in->dma_addr = dma_addr; - } else { - /* DMA_MAP: map sg task->data[ISER_DIR_IN], store in .dma_nents */ - dma_nents = dma_map_sg(dma_device, buf_in->p_buf, buf_in->size, - DMA_FROM_DEVICE); - if (dma_nents == 0) { - iser_err("dma_map_sg failed!!!\n"); - return -EINVAL; - } - buf_in->dma_nents = dma_nents; - p_iser_task->data_len[ISER_DIR_IN] = iser_sg_size(buf_in); - } if (edtl > p_iser_task->data_len[ISER_DIR_IN]) { iser_err("Total data length: %ld, less than EDTL: " "%d, in READ cmd BHS itt: %d, p_conn: 0x%p\n", @@ -192,38 +227,16 @@ iser_prepare_write_cmd(struct iscsi_iser { struct iser_regd_buf *p_regd_buf; int err; - dma_addr_t dma_addr; - int dma_nents; - struct device *dma_device; struct iser_dto *p_send_dto = &p_iser_task->desc.dto; struct iser_hdr *hdr = &p_iser_task->desc.iser_header; - p_iser_task->dir[ISER_DIR_OUT] = 1; - dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; - - iser_dbg("buf_out %p buf_out->type is %d\n", buf_out, buf_out->type); + err = iser_dma_map_task_data(p_iser_task, + buf_out, + ISER_DIR_OUT, + DMA_TO_DEVICE); + if(err) + return err; - if (buf_out->type == ISER_BUF_TYPE_SINGLE) { - p_iser_task->data_len[ISER_DIR_OUT] = buf_out->size; - /* DMA_MAP: map single task->data[ISER_DIR_OUT], store dma_addr */ - dma_addr = dma_map_single(dma_device, buf_out->p_buf, buf_out->size, - DMA_TO_DEVICE); - if (dma_mapping_error(dma_addr)) { - iser_err("dma_map_single failed at %p\n", buf_out->p_buf); - return -EINVAL; - } - buf_out->dma_addr = 
dma_addr; - } else { - /* DMA_MAP: map sg task->data[ISER_DIR_OUT], store dma_nents */ - dma_nents = dma_map_sg(dma_device, buf_out->p_buf, buf_out->size, - DMA_TO_DEVICE); - if (dma_nents == 0) { - iser_err("dma_map_sg failed!!!\n"); - return -EINVAL; - } - buf_out->dma_nents = dma_nents; - p_iser_task->data_len[ISER_DIR_OUT] = iser_sg_size(buf_out); - } if (edtl > p_iser_task->data_len[ISER_DIR_OUT]) { iser_err("Total data length: %ld, less than EDTL: %d, " "in WRITE cmd BHS itt: %d, p_conn: 0x%p\n", @@ -306,7 +319,7 @@ int iser_send_command(struct iscsi_iser_ edtl = ntohl(hdr->data_length); /* MERGE_CHANGE - temporal move it up */ - iser_task_init_lowpart(p_ctask); + /* build the tx desc regd header and add it to the tx desc dto */ p_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; @@ -337,7 +350,6 @@ int iser_send_command(struct iscsi_iser_ if (err) goto send_command_error; } - /* DMA_MAP: safe to dma_map now - map and flush the cache */ iser_reg_single(p_iser_conn->ib_conn->p_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); @@ -347,16 +359,11 @@ int iser_send_command(struct iscsi_iser_ goto send_command_error; } - iser_task_set_status(p_ctask,ISER_TASK_STATUS_STARTED); - iser_task_post_send_count_inc(p_ctask); + p_ctask->status = ISER_TASK_STATUS_STARTED; err = iser_start_send(&p_ctask->desc); - if (err) { - iser_task_post_send_count_dec_and_test(p_ctask); - goto send_command_error; - } - - return 0; + if (!err) + return 0; send_command_error: if (p_send_dto != NULL) { @@ -415,7 +422,6 @@ int iser_send_data_out(struct iscsi_iser p_send_dto->p_task = p_ctask; iser_dto_send_create(p_iser_conn, tx_desc); - /* DMA_MAP: safe to dma_map now - map and flush the cache */ iser_reg_single(p_iser_conn->ib_conn->p_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); @@ -436,15 +442,10 @@ int iser_send_data_out(struct iscsi_iser iser_dbg("data-out itt: %d, offset: %ld, sz: %ld\n", itt, buf_offset, data_seg_len); - iser_task_post_send_count_inc(p_ctask); err = iser_start_send(tx_desc); - if 
(err) { - iser_task_post_send_count_dec_and_test(p_ctask); - goto send_data_out_error; - } - - return 0; + if (!err) + return 0; send_data_out_error: if (p_send_dto != NULL) @@ -487,7 +488,6 @@ int iser_send_control(struct iscsi_iser_ p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; - /* DMA_MAP: safe to dma_map now - map and flush the cache */ iser_reg_single(p_iser_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); itt = ntohl(p_mtask->hdr->itt); @@ -561,7 +561,6 @@ void iser_rcv_completion(struct iser_des int rc, rx_data_size = 0; unsigned int itt; unsigned char opcode; - int no_more_task_sends = 0; p_hdr = &p_rx_desc->iscsi_header; @@ -602,11 +601,8 @@ void iser_rcv_completion(struct iser_des * sglist, anyway dma_unmap and free the copy */ iser_finalize_rdma_unaligned_sg(p_iser_task); - /* DMA_MAP: unmap according to task->data[dir].type/etc */ - iser_dma_unmap_task_data(p_iser_task); - p_dto->p_task = p_iser_task; - iser_task_set_status(p_iser_task, - ISER_TASK_STATUS_COMPLETED); + p_iser_task->status = ISER_TASK_STATUS_COMPLETED; + iser_ctask_rdma_finalize(p_iser_task); } } @@ -614,18 +610,6 @@ void iser_rcv_completion(struct iser_des if(rc) iscsi_iser_conn_failure(p_iser_conn, rc); - if(p_iser_task != NULL) { - spin_lock(&p_iser_task->task_lock); - if(p_iser_task->post_send_count == 0) - no_more_task_sends = 1; - spin_unlock(&p_iser_task->task_lock); - if(no_more_task_sends) - iser_task_finalize_lowpart(p_iser_task); - else - iser_err("can't free iSER task:0x%p more %d sends\n", - p_iser_task, p_iser_task->post_send_count); - } - iser_dto_free(p_dto); kfree(p_rx_desc->data); kmem_cache_free(ig.desc_cache, p_rx_desc); @@ -658,33 +642,42 @@ void iser_snd_completion(struct iser_des schedule_work(&p_iser_conn->xmitwork); } spin_unlock(&p_iser_conn->lock); +} + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *p_iser_task) + +{ + p_iser_task->status = ISER_TASK_STATUS_INIT; - /* if the last sent PDU of the task, task can be freed */ - if (p_dto->p_task != NULL 
&& - iser_task_post_send_count_dec_and_test(p_dto->p_task)) - iser_task_finalize_lowpart(p_dto->p_task); + p_iser_task->dir[ISER_DIR_IN] = 0; + p_iser_task->dir[ISER_DIR_OUT] = 0; + + p_iser_task->data_len[ISER_DIR_IN] = 0; + p_iser_task->data_len[ISER_DIR_OUT] = 0; + + memset(&p_iser_task->rdma_regd[ISER_DIR_IN], 0, + sizeof(struct iser_regd_buf)); + memset(&p_iser_task->rdma_regd[ISER_DIR_OUT], 0, + sizeof(struct iser_regd_buf)); } -static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task) +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *p_iser_task) { - struct device *dma_device; - struct iser_data_buf *p_data; + int deferred; - dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + if (p_iser_task->dir[ISER_DIR_IN]) { + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_IN]); + if (deferred) + iser_bug("References remain for BUF-IN rdma reg\n"); + } - p_data = &p_iser_task->data[ISER_DIR_IN]; - if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) - dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, - DMA_FROM_DEVICE); - else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ - dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, - DMA_FROM_DEVICE); + if (p_iser_task->dir[ISER_DIR_OUT]) { + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_OUT]); + if (deferred) + iser_bug("References remain for BUF-OUT rdma reg\n"); + } - p_data = &p_iser_task->data[ISER_DIR_OUT]; - if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) - dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, - DMA_TO_DEVICE); - else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ - dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, - DMA_TO_DEVICE); + iser_dma_unmap_task_data(p_iser_task); } Index: ulp/iser/iser_memory.c =================================================================== --- 
ulp/iser/iser_memory.c (revision 5228) +++ ulp/iser/iser_memory.c (revision 5235) @@ -103,7 +103,6 @@ int iser_regd_buff_release(struct iser_r if(p_regd_buf->reg.rkey != 0) iser_unreg_mem(&p_regd_buf->reg); - /* DMA_MAP: call dma_unmap_single */ if (p_regd_buf->dma_addr) dma_unmap_single( p_regd_buf->p_adaptor->device->dma_device, Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5228) +++ ulp/iser/iscsi_iser.c (revision 5235) @@ -135,8 +135,6 @@ static void iscsi_iser_cmd_init(struct i if (!ctask->unsol_count) /* No unsolicit Data-Out's */ ctask->hdr->flags |= ISCSI_FLAG_CMD_FINAL; - /*else - ctask->xmstate |= XMSTATE_UNS_HDR | XMSTATE_UNS_INIT;*/ /* bytes to be sent via RDMA operations */ ctask->rdma_data_count = ctask->total_length - @@ -155,6 +153,8 @@ static void iscsi_iser_cmd_init(struct i zero_data(ctask->hdr->dlength); ctask->rdma_data_count = ctask->total_length; } + + iser_ctask_rdma_init(ctask); } /** Index: ulp/iser/Makefile =================================================================== --- ulp/iser/Makefile (revision 5228) +++ ulp/iser/Makefile (revision 5235) @@ -6,7 +6,6 @@ ib_iser-y := iser_mod.o \ iser_verbs.o \ iser_initiator.o \ iser_memory.o \ - iser_task.o \ iser_dto.o \ iser_socket.o \ iscsi_iser.o \ Index: ulp/iser/iser_task.c =================================================================== --- ulp/iser/iser_task.c (revision 5228) +++ ulp/iser/iser_task.c (revision 5235) @@ -1,130 +0,0 @@ -/* - * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ -#include "iscsi_iser.h" - -/** - * iser_task_init_lowpart - Allocates and initializes a conn descriptor - */ -void iser_task_init_lowpart(struct iscsi_iser_cmd_task *p_iser_task) - -{ - spin_lock_init(&p_iser_task->task_lock); - p_iser_task->status = ISER_TASK_STATUS_INIT; - p_iser_task->post_send_count = 0; - - p_iser_task->dir[ISER_DIR_IN] = 0; - p_iser_task->dir[ISER_DIR_OUT] = 0; - - p_iser_task->data_len[ISER_DIR_IN] = 0; - p_iser_task->data_len[ISER_DIR_OUT] = 0; - - memset(&p_iser_task->rdma_regd[ISER_DIR_IN], 0, - sizeof(struct iser_regd_buf)); - memset(&p_iser_task->rdma_regd[ISER_DIR_OUT], 0, - sizeof(struct iser_regd_buf)); -} - - -/** - * iser_task_post_send_count_inc - Increments counter of - * post-send buffers pending send completion - */ -void iser_task_post_send_count_inc(struct iscsi_iser_cmd_task *p_iser_task) -{ - spin_lock_bh(&p_iser_task->task_lock); - p_iser_task->post_send_count++; - spin_unlock_bh(&p_iser_task->task_lock); -} - -/** - * iser_task_post_send_count_dec_and_test - Decrements counter - * of post-send buffers pending - * send completion and tests the task's eligibility for release. 
- */ -int iser_task_post_send_count_dec_and_test(struct iscsi_iser_cmd_task *p_iser_task) -{ - int ret_val = 0; - - spin_lock_bh(&p_iser_task->task_lock); - if (p_iser_task->post_send_count == 0) { - spin_unlock_bh(&p_iser_task->task_lock); - iser_bug("task: 0x%p, decrementing zero post_send_cnt\n", - p_iser_task); - } - p_iser_task->post_send_count--; - if (p_iser_task->status == ISER_TASK_STATUS_COMPLETED && - p_iser_task->post_send_count == 0) - ret_val = 1; - spin_unlock_bh(&p_iser_task->task_lock); - - return ret_val; -} - -/** - * iser_task_set_status - Sets atask status - */ -void -iser_task_set_status(struct iscsi_iser_cmd_task *p_iser_task, - enum iser_task_status status) -{ - spin_lock_bh(&p_iser_task->task_lock); - p_iser_task->status = status; - spin_unlock_bh(&p_iser_task->task_lock); -} - -/** - * iser_task_free - Frees all task res - */ -void iser_task_finalize_lowpart(struct iscsi_iser_cmd_task *p_iser_task) -{ - int deferred; - - if (p_iser_task == NULL) - iser_bug("NULL task descriptor\n"); - - spin_lock_bh(&p_iser_task->task_lock); - if (p_iser_task->dir[ISER_DIR_IN]) { - deferred = iser_regd_buff_release - (&p_iser_task->rdma_regd[ISER_DIR_IN]); - if (deferred) - iser_bug("References remain for BUF-IN rdma reg\n"); - } - if (p_iser_task->dir[ISER_DIR_OUT]) { - deferred = iser_regd_buff_release - (&p_iser_task->rdma_regd[ISER_DIR_OUT]); - if (deferred) - iser_bug("References remain for BUF-OUT rdma reg\n"); - } - spin_unlock_bh(&p_iser_task->task_lock); -} From grave at ipno.in2p3.fr Tue Jan 31 06:53:22 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Tue, 31 Jan 2006 15:53:22 +0100 Subject: [openib-general] OpenIb and Ada95 binding Message-ID: <1138719202.30748.12.camel@ipnnarval> Hi everybody, I'm using Ada95 and I would like to implement a thin binding in order to use infiniband. I'm using 2.6.15 linux kernel on ppc computers. 
I had a look at the wiki and FAQ sites, and after looking at the code examples (which run quite well) I don't know where to start. Can somebody point me to a very basic tutorial or documentation so that I can try to write a very simple program in both C and Ada? Thanks in advance, xavier From mst at mellanox.co.il Tue Jan 31 06:58:24 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 16:58:24 +0200 Subject: [openib-general] Re: OpenIb and Ada95 binding In-Reply-To: <1138719202.30748.12.camel@ipnnarval> References: <1138719202.30748.12.camel@ipnnarval> Message-ID: <20060131145824.GU31887@mellanox.co.il> Quoting r. Xavier Grave : > Subject: OpenIb and Ada95 binding > > Hi everybody, > > I'm using Ada95 and I would like to implement a thin binding in order to > use infiniband. > I'm using 2.6.15 linux kernel on ppc computers. > I had a look at the wiki and FAQ sites, and after looking at the code > examples (which run quite well) I don't know where to start. > Can somebody point me to a very basic tutorial or documentation so > that I can try to write a very simple program in both C and Ada? > > Thanks in advance, xavier > Look at examples under the libibverbs directory. Try porting them to Ada. -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Tue Jan 31 09:20:18 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 31 Jan 2006 09:20:18 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D2A4@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Quoting r. Roland Dreier : >> Subject: Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ >> >> > static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, >> > struct ib_wc *wc) >> > { >> > unsigned long flags; >> > int ret; >> > spin_lock_irqsave(cq->lock, flags); >> > ret = cq->poll_cq(cq, num_entries, wc); >> > spin_unlock_irqrestore(cq->lock, flags); >> > return ret; >> > } >> >> Definitely an interesting idea, although it doesn penalize a >> hypothetical device that can do lock-free CQ polling somehow.
> > True, although documentation/SubmittingPatches does say > "Don't try to anticipate nebulous future cases which may or may not > be useful" :) > This is not an issue of a nebulous future change. This is a matter of one device trying to force future devices to have the same limitation. Devices that do not require a host-wide spinlock to poll the CQ are not hypothetical, they exist even if drivers under OpenIB for them do not yet exist. Not planning for the needs of such drivers will not help the process of making those drivers exist. This is not a question of extra work, just assigning the responsibility to the correct layer -- that is to the device specific code. This is a proper recognition that the logic is in fact device dependent. From mshefty at ichips.intel.com Tue Jan 31 09:44:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 09:44:21 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DF0267.4080007@voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> Message-ID: <43DFA1F5.1050907@ichips.intel.com> Or Gerlitz wrote: > Why not having a failed lookup as the --only-- trigger to update the > cache? so the cache contains only paths that were demanded by some > consumer. What is implementation you were considering, is it an SA > replica having all those paths whose sgid is the local node gid? I view MPI as one of the primary reasons for having a cache. The cache is updated using an SA GET_TABLE request, which is more efficient than sending separate SA GET requests for each path record. Waiting for a failed lookup to create the initial cache would delay the startup time for apps wanting all-to-all connection establishment. In this case, we also get the side effect that the SA receives GET_TABLE requests from every node at roughly the same time. 
Your assumption is correct. The implementation will contain copies of all path records whose SGID is a local node GID. (Currently it contains only a single path record per SGID/DGID, but that will be expanded.) - Sean From mshefty at ichips.intel.com Tue Jan 31 09:45:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 09:45:45 -0800 Subject: [openib-general] Re: [PATCH 0/4] SA path record caching In-Reply-To: <20060131063834.GA27065@mellanox.co.il> References: <43DE942D.1090304@ichips.intel.com> <20060131063834.GA27065@mellanox.co.il> Message-ID: <43DFA249.4050101@ichips.intel.com> Michael S. Tsirkin wrote: > If an RC QP timeouts, there's no arp reply, or a CM attempt > does not get a reply, this is a strong hint that the path > has a problem. I guess we could verify with sending getportinfo > or something. Once there's no reply, I think we want to look > fro another path, without the cache getting in the way. If the cache contains multiple/all paths from a local SGID to some DGID, then it can be used to obtain another path. If only an error has occurred, then the SA shouldn't have any paths that aren't known to the local cache. > Lets just invalidate the entry that has the problem then. > And I guess we could start SA query and delay the invalidation > until we get the response. Note that the cache submits a single query for all path records, rather than updating only a single entry. The assumption here is that a failure on one path may result in failures on other paths as well. > I also wander what happens if the SA goes down at the exact 15 min > interval when the cache is invalidated? The SA query will timeout, and the old data will be used. A new update attempt will be re-scheduled. >>Do we expect most path failures to be permanent >>or transient? > > No idea. Both kinds? What do you think? I have no idea either. 
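The timeout behaviour described above (keep the old data, re-schedule a new update) could be sketched roughly as follows. This is only an illustration in plain userspace C, not the code from the patch series: `demo_cache`, `demo_path`, `demo_query_fn` and the two stub query functions are hypothetical names, and a real implementation would issue an SA GET_TABLE query instead of calling a stub.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PATHS 8

/* Hypothetical, heavily reduced stand-in for a path record. */
struct demo_path {
	unsigned int dlid;
};

/* Hypothetical local cache: all paths known for the local SGID. */
struct demo_cache {
	struct demo_path paths[DEMO_MAX_PATHS];
	size_t count;
	int retry_scheduled;	/* set when a refresh has to be retried */
};

/*
 * Hypothetical query hook: fills 'out' with up to 'max' records and
 * returns the number of records, or -1 if the query timed out.
 */
typedef int (*demo_query_fn)(struct demo_path *out, size_t max);

/*
 * Refresh the whole cache with a single query. On timeout the old
 * contents are left untouched and a retry is flagged.
 */
static int demo_cache_refresh(struct demo_cache *c, demo_query_fn query)
{
	struct demo_path fresh[DEMO_MAX_PATHS];
	int n = query(fresh, DEMO_MAX_PATHS);

	if (n < 0) {
		c->retry_scheduled = 1;	/* keep old data, try again later */
		return -1;
	}

	memcpy(c->paths, fresh, (size_t)n * sizeof(fresh[0]));
	c->count = (size_t)n;
	c->retry_scheduled = 0;
	return 0;
}

/* Stub: a query that succeeds and returns two records. */
static int demo_query_ok(struct demo_path *out, size_t max)
{
	if (max < 2)
		return -1;
	out[0].dlid = 10;
	out[1].dlid = 11;
	return 2;
}

/* Stub: a query that times out (e.g. the SA is down). */
static int demo_query_timeout(struct demo_path *out, size_t max)
{
	(void)out;
	(void)max;
	return -1;
}
```

Calling `demo_cache_refresh()` first with `demo_query_ok` and then with `demo_query_timeout` leaves the two records from the first call in place and sets `retry_scheduled`.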
- Sean From rdreier at cisco.com Tue Jan 31 09:47:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 09:47:28 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122D2A4@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 31 Jan 2006 09:20:18 -0800") References: <54AD0F12E08D1541B826BE97C98F99F122D2A4@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: > static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, > struct ib_wc *wc) > { > unsigned long flags; > int ret; > spin_lock_irqsave(cq->lock, flags); > ret = cq->poll_cq(cq, num_entries, wc); > spin_unlock_irqrestore(cq->lock, flags); > return ret; Caitlin> This is not an issue of a nebulous future change. This is Caitlin> a matter of one device trying to force future devices to Caitlin> have the same limitation. Well, all four devices for which we have drivers (mthca, ipath, ehca, amso1100) all take a lock while polling CQs. And that's all the devices for which we have driver code, as far as I know. So yes, as far as I can tell planning for lockless CQ polling is purely hypothetical. Caitlin> Devices that do not require a host-wide spinlock to poll Caitlin> the CQ are not hypothetical, they exist even if drivers Caitlin> under OpenIB for them do not yet exist. Not planning for Caitlin> the needs of such drivers will not help the process of Caitlin> making those drivers exist. The code above is not taking a host-wide lock -- it is taking the per-CQ lock "cq->lock". Do you know of a device that permits lockless CQ polling? If so a pointer to the driver code or at least a description of how consistency is maintained without locks would be interesting. - R. 
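For readers following the locking argument, here is a rough userspace analogue of the wrapper quoted above, showing that the lock is per CQ and taken in the generic layer around the device-specific poll routine. It is a sketch under stated assumptions: a pthread mutex stands in for the kernel spinlock with `spin_lock_irqsave()`, and `demo_cq`, `demo_wc`, `demo_hw_poll()` and `demo_poll_cq()` are hypothetical names, not the kernel API.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, reduced work completion. */
struct demo_wc {
	int wr_id;
};

/* Hypothetical CQ: one lock per CQ, not one lock per host. */
struct demo_cq {
	pthread_mutex_t lock;
	int (*poll_cq)(struct demo_cq *cq, int num_entries,
		       struct demo_wc *wc);
	int pending;	/* completions still queued */
};

/* Stand-in for the device-specific poll routine; does no locking. */
static int demo_hw_poll(struct demo_cq *cq, int num_entries,
			struct demo_wc *wc)
{
	int n = 0;

	while (n < num_entries && cq->pending > 0) {
		wc[n].wr_id = cq->pending;
		cq->pending--;
		n++;
	}
	return n;
}

static void demo_cq_init(struct demo_cq *cq, int pending)
{
	pthread_mutex_init(&cq->lock, NULL);
	cq->poll_cq = demo_hw_poll;
	cq->pending = pending;
}

/* Generic wrapper: the locking lives here, above the driver. */
static int demo_poll_cq(struct demo_cq *cq, int num_entries,
			struct demo_wc *wc)
{
	int ret;

	pthread_mutex_lock(&cq->lock);
	ret = cq->poll_cq(cq, num_entries, wc);
	pthread_mutex_unlock(&cq->lock);
	return ret;
}
```

Two CQs initialised with `demo_cq_init()` can be polled concurrently without contending, since each carries its own lock. A device that can poll without any lock would still pay for the lock in this layout, which is the trade-off under discussion.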
From rdreier at cisco.com Tue Jan 31 09:48:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 09:48:51 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122D2A4@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 31 Jan 2006 09:20:18 -0800") References: <54AD0F12E08D1541B826BE97C98F99F122D2A4@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Caitlin> This is not a question of extra work, just assigning the Caitlin> responsibility to the correct layer -- that is to the Caitlin> device specific code. This is a proper recognition that Caitlin> the logic is in fact device dependent. BTW in general I agree with this. But if we could get a measurable performance win by moving the locking up a layer, and not penalize any real hardware, then as I said I would be inclined not to worry about hypothetical future hardware. As Michael said, we can always change it back if we want to. - R. From halr at voltaire.com Tue Jan 31 09:42:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 12:42:42 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFA1F5.1050907@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> Message-ID: <1138729361.4453.17459.camel@hal.voltaire.com> On Tue, 2006-01-31 at 12:44, Sean Hefty wrote: > Or Gerlitz wrote: > > Why not having a failed lookup as the --only-- trigger to update the > > cache? so the cache contains only paths that were demanded by some > > consumer. What is implementation you were considering, is it an SA > > replica having all those paths whose sgid is the local node gid? > > I view MPI as one of the primary reasons for having a cache. 
The cache is > updated using an SA GET_TABLE request, which is more efficient than sending > separate SA GET requests for each path record. Waiting for a failed lookup to > create the initial cache would delay the startup time for apps wanting > all-to-all connection establishment. In this case, we also get the side effect > that the SA receives GET_TABLE requests from every node at roughly the same time. > > Your assumption is correct. The implementation will contain copies of all path > records whose SGID is a local node GID. (Currently it contains only a single > path record per SGID/DGID, but that will be expanded.) Ultimately, this should likely be using MultiPathRecord as it is able to do some things PathRecords can't. -- Hal From mshefty at ichips.intel.com Tue Jan 31 10:00:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 10:00:44 -0800 Subject: [openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <20060131105822.GN31887@mellanox.co.il> References: <20060131105822.GN31887@mellanox.co.il> Message-ID: <43DFA5CC.2030107@ichips.intel.com> Michael S. Tsirkin wrote: > BTW, just moving poll_cq and req_notify_cq to struct ib_cq will already remove > one indirection level from the datapath. And this can be done without touching > low-level drivers. > > Same for QP and various ib_post callbacks. > > Should I submit such a patch? Sean? Are you referring to copying the function pointers from ib_device to ib_cq and ib_qp? I don't object to such a patch, but it would be nice to get some data on the performance impact. 
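The idea of copying the function pointers from the device into the CQ might look roughly like the sketch below. It is an illustration only, with hypothetical `demo_` names rather than the real `ib_device`/`ib_cq` layout, and it makes no claim about the size of any performance gain.

```c
#include <stddef.h>

struct demo_cq;

/* Hypothetical device: holds the driver's method table. */
struct demo_device {
	int (*poll_cq)(struct demo_cq *cq, int num_entries);
};

/* Hypothetical CQ: caches a copy of the poll method at creation. */
struct demo_cq {
	struct demo_device *device;
	int (*poll_cq)(struct demo_cq *cq, int num_entries);
	int pending;
};

/* Stand-in for a driver's poll routine. */
static int demo_driver_poll(struct demo_cq *cq, int num_entries)
{
	int n = cq->pending < num_entries ? cq->pending : num_entries;

	cq->pending -= n;
	return n;
}

/* Copy the method pointer into the CQ when the CQ is created. */
static void demo_cq_create(struct demo_cq *cq, struct demo_device *dev,
			   int pending)
{
	cq->device = dev;
	cq->poll_cq = dev->poll_cq;
	cq->pending = pending;
}

/* Two dependent loads before the call: cq->device, then ->poll_cq. */
static int demo_poll_via_device(struct demo_cq *cq, int num_entries)
{
	return cq->device->poll_cq(cq, num_entries);
}

/* One load before the call: cq->poll_cq. */
static int demo_poll_direct(struct demo_cq *cq, int num_entries)
{
	return cq->poll_cq(cq, num_entries);
}
```

Both wrappers end up in the same driver routine; the second just avoids dereferencing the device on the data path, and the low-level driver does not need to change.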
- Sean From mshefty at ichips.intel.com Tue Jan 31 10:10:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 10:10:18 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <1138729361.4453.17459.camel@hal.voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> Message-ID: <43DFA80A.9020207@ichips.intel.com> Hal Rosenstock wrote: > Ultimately, this should likely be using MultiPathRecord as it is able to > do some things PathRecords can't. Can you elaborate on this? - Sean From halr at voltaire.com Tue Jan 31 10:12:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 13:12:43 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFA80A.9020207@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> Message-ID: <1138731162.4453.17689.camel@hal.voltaire.com> On Tue, 2006-01-31 at 13:10, Sean Hefty wrote: > Hal Rosenstock wrote: > > Ultimately, this should likely be using MultiPathRecord as it is able to > > do some things PathRecords can't. > > Can you elaborate on this? It's able to choose SGID and/or DGID as explicit GIDs, GIDs from same node or GIDs from same system, request multiple SGID/DGID pairs in one request, and request varying path independence (as fault tolerant as possible or not). 
-- Hal > - Sean From mshefty at ichips.intel.com Tue Jan 31 10:24:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 10:24:23 -0800 Subject: [openib-general] [PATCH] extend ib_device node_type to include iWarp In-Reply-To: References: Message-ID: <43DFAB57.4030502@ichips.intel.com> Sean Hefty wrote: > Here's an updated version that actually does the device checks correctly, and > hopefully clarifies how node_type is formatted. I include two helper functions > to extract the transport and device information from the node_type. > > I tried to include related changes to all files with this patch, so a few > more files are affected than with the previous patch. Any issues committing this? This does change ipoib, mthca, along with a slew of core files. I'll submit separate patches for the ehca and ipath drivers, but they should continue to work without changes. - Sean From mst at mellanox.co.il Tue Jan 31 10:30:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jan 2006 20:30:21 +0200 Subject: [openib-general] Re: Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ In-Reply-To: <43DFA5CC.2030107@ichips.intel.com> References: <43DFA5CC.2030107@ichips.intel.com> Message-ID: <20060131183021.GA29757@mellanox.co.il> Quoting r. Sean Hefty : > Are you referring to copying the function pointers from ib_device to ib_cq and > ib_qp? Yes. > I don't object to such a patch, but it would be nice to get some data > on the performance impact. Sure. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Tue Jan 31 10:36:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 10:36:49 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <1138731162.4453.17689.camel@hal.voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> Message-ID: <43DFAE41.5050401@ichips.intel.com> Hal Rosenstock wrote: > It's able to choose SGID and/or DGID as explicit GIDs, GIDs from same > node or GIDs from same system, request multiple SGID/DGID pairs in one > request, and request varying path independence (as fault tolerant as > possible or not). I see where a MultiPathRecord contains a path independence field. But since it only provides a single value for all other PathRecord fields for all SGID/DGID pairs, it seems less flexible than PathRecords. It's also missing LID information, and limits the number of DGIDs to 255. What I'm not understanding is how I can replace PathRecords with MultiPathRecords or how to make full use of MultiPathRecords. 
- Sean From halr at voltaire.com Tue Jan 31 10:39:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 13:39:45 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFAE41.5050401@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> Message-ID: <1138732784.4453.17876.camel@hal.voltaire.com> On Tue, 2006-01-31 at 13:36, Sean Hefty wrote: > Hal Rosenstock wrote: > > It's able to choose SGID and/or DGID as explicit GIDs, GIDs from same > > node or GIDs from same system, request multiple SGID/DGID pairs in one > > request, and request varying path independence (as fault tolerant as > > possible or not). > > I see where a MultiPathRecord contains a path independence field. But since it > only provides a single value for all other PathRecord fields for all SGID/DGID > pairs, it seems less flexible than PathRecords. It depends on whether those components are wildcarded or not. > It's also missing LID information, Yes but is LID needed in the query ? > and limits the number of DGIDs to 255. Actually the number if lower as I think it is limited by the remaining bits in the component mask. > What I'm not understanding is how I can replace PathRecords with > MultiPathRecords or how to make full use of MultiPathRecords. I'm not quite sure how to answer this as I don't think I have enough information on what the question(s) is/are. 
-- Hal > - Sean From mshefty at ichips.intel.com Tue Jan 31 11:01:57 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 11:01:57 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <1138732784.4453.17876.camel@hal.voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> Message-ID: <43DFB425.4000908@ichips.intel.com> Hal Rosenstock wrote: >>I see where a MultiPathRecord contains a path independence field. But since it >>only provides a single value for all other PathRecord fields for all SGID/DGID >>pairs, it seems less flexible than PathRecords. > > It depends on whether those components are wildcarded or not. I was referring to the query response. >>It's also missing LID information, > > Yes but is LID needed in the query ? It's needed for connection establishment and QP modification. >>and limits the number of DGIDs to 255. > > Actually the number if lower as I think it is limited by the remaining > bits in the component mask. For a fabric with 1000 or more nodes, the MultiPathRecord only returns paths to 255 of the nodes, and then only if their path information matches. (This is part of why I'm confused about how to actually use a MultiPathRecord.) From what I can tell, a MultiPathRecord does not eliminate the need to obtain PathRecords, and only provides a 2-bit path independence field that's always set to 1. It seems like I'm missing something here. 
- Sean From halr at voltaire.com Tue Jan 31 11:02:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 14:02:27 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFB425.4000908@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> <43DFB425.4000908@ichips.intel.com> Message-ID: <1138734146.4453.18021.camel@hal.voltaire.com> On Tue, 2006-01-31 at 14:01, Sean Hefty wrote: > Hal Rosenstock wrote: > >>I see where a MultiPathRecord contains a path independence field. But since it > >>only provides a single value for all other PathRecord fields for all SGID/DGID > >>pairs, it seems less flexible than PathRecords. > > > > It depends on whether those components are wildcarded or not. > > I was referring to the query response. The query response is PathRecords not MultiPathRecords. > > >>It's also missing LID information, > > > > Yes but is LID needed in the query ? > > It's needed for connection establishment and QP modification. but in the response, right ? > >>and limits the number of DGIDs to 255. > > > > Actually the number if lower as I think it is limited by the remaining > > bits in the component mask. > > For a fabric with 1000 or more nodes, the MultiPathRecord only returns paths to > 255 of the nodes, and then only if their path information matches. (This is > part of why I'm confused about how to actually use a MultiPathRecord.) You can wildcard the DGID (and DGIDCount). > From what I can tell, a MultiPathRecord does not eliminate the need to obtain > PathRecords, It's a better way to get PathRecords. 
> and only provides a 2-bit path independence field that's always set > to 1. It seems like I'm missing something here. Aside from the independence field, there are 2 new fields: SGID/DGIDScope, to be used for explicit/node/system GID indication. -- Hal From mshefty at ichips.intel.com Tue Jan 31 11:21:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 11:21:26 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <1138734146.4453.18021.camel@hal.voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> <43DFB425.4000908@ichips.intel.com> <1138734146.4453.18021.camel@hal.voltaire.com> Message-ID: <43DFB8B6.4040105@ichips.intel.com> Hal Rosenstock wrote: > The query response is PathRecords not MultiPathRecords. Okay - this is what I was missing. It makes a whole lot more sense now. Thanks.
- Sean From halr at voltaire.com Tue Jan 31 11:32:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 14:32:28 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFB8B6.4040105@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> <43DFB425.4000908@ichips.intel.com> <1138734146.4453.18021.camel@hal.voltaire.com> <43DFB8B6.4040105@ichips.intel.com> Message-ID: <1138735396.4453.18171.camel@hal.voltaire.com> On Tue, 2006-01-31 at 14:21, Sean Hefty wrote: > Hal Rosenstock wrote: > > The query response is PathRecords not MultiPathRecords. > > Okay - this is what I was missing. It makes a whole lot more sense now. Thanks. Indeed, that's the confusing part... It's the only nonsymmetric SA query/response where the query is one thing and the response another. -- Hal From iod00d at hp.com Tue Jan 31 13:55:21 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 31 Jan 2006 13:55:21 -0800 Subject: [openib-general] Re: SDP perf drop with 2.6.15 In-Reply-To: <20060112195004.GK3106@esmail.cup.hp.com> References: <20060112071431.GE29168@esmail.cup.hp.com> <20060112071803.GC5850@mellanox.co.il> <20060112195004.GK3106@esmail.cup.hp.com> Message-ID: <20060131215521.GA11678@esmail.cup.hp.com> On Thu, Jan 12, 2006 at 11:50:04AM -0800, Grant Grundler wrote: ... > I can't explain why q-syscollect *improves* perf by ~11 to 17%. Kudos to Stephane Eranian for sorting this out. Executive summary: Rebooting with the "nohalt" kernel option gets me the full performance.
Gory details: By default, ia64-linux goes to a "low power state" (no jokes about this please) in the idle loop. This is implemented with form of "halt" instruction. Perfmon subsystem disables use of "halt" since older (and possibly current) PAL support for "halt" was broken and it would break the Performance Monitoring HW state. Please ask Stephane offline if you need more details. Stephane also commented that this might be an issue with other architectures as well if they have a low power state. The transition from low power to "normal" will have a cost on every architecture. And every interrupt is likely to incur that cost. This is a real problem for benchmarking where "latency" (ie we idle for very short periods of time) is hurt as I saw with netperf TCP_RR. Q to ia64-linux: since perfmon can enable/disable this on the fly, can I add a /sys hook to do the same from userspace? Where under /sys could this live? cheers, grant > Details: > Since I prefer q-syscollect: > grundler at gsyprf3:~/openib-perf-2006/rx2600-r4929$ LD_PRELOAD=/usr/local/lib/libsdp.so q-syscollect /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_RR -T 1,1 -c -C -- -r 1,1 -s 0 -S 0 > libsdp.so: $LIBSDP_CONFIG_FILE not set. Using /usr/local/etc/libsdp.conf > libsdp.so: $LIBSDP_CONFIG_FILE not set. Using /usr/local/etc/libsdp.conf > bind_to_specific_processor: enter > TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 16300.11 10.53 8.26 12.925 10.137 > > Wierd. Performance jumps from 13900 to 16300 (+2400 or +%17). 
> Hrm...something got me to look at /proc/interrupts and I see that > mthca is interrupting on CPU0 now: > 70: 644084899 0 PCI-MSI-X ib_mthca (comp) > 71: 8 0 PCI-MSI-X ib_mthca (async) > 72: 27247 0 PCI-MSI-X ib_mthca (cmd) > > Retest with -T 0,1 : > > 16384 87380 1 1 60.00 17557.94 6.06 10.88 6.909 12.390 > > And again -T 0,1 but without q-syscollect: > 16384 87380 1 1 60.00 15891.41 6.13 7.61 7.713 9.571 > > Now with -T 0,0: > 16384 87380 1 1 60.00 20719.03 5.93 5.26 5.724 5.076 > > with -T 0,0 and without q-syscollect: > 16384 87380 1 1 60.00 18553.61 5.73 5.36 6.181 5.782 > > That's +11% on the last set. > I'm stumped why q-syscollect would *improve* performance. ... From lindahl at pathscale.com Tue Jan 31 14:27:25 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 31 Jan 2006 14:27:25 -0800 Subject: [openib-general] OpenIb and Ada95 binding In-Reply-To: <1138719202.30748.12.camel@ipnnarval> References: <1138719202.30748.12.camel@ipnnarval> Message-ID: <20060131222725.GD1917@greglaptop.internal.keyresearch.com> On Tue, Jan 31, 2006 at 03:53:22PM +0100, Xavier Grave wrote: > I'm using Ada95 and I would like to implement a thin binding in order to > use infiniband. If MPI meets your needs -- do you need recovery from failures? -- then you can use one of the existing MPI/ADA bindings. 
-- greg From mshefty at ichips.intel.com Tue Jan 31 14:44:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 14:44:19 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <1138735396.4453.18171.camel@hal.voltaire.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> <43DFB425.4000908@ichips.intel.com> <1138734146.4453.18021.camel@hal.voltaire.com> <43DFB8B6.4040105@ichips.intel.com> <1138735396.4453.18171.camel@hal.voltaire.com> Message-ID: <43DFE843.9090203@ichips.intel.com> Hal Rosenstock wrote: >>>The query response is PathRecords not MultiPathRecords. >> >>Okay - this is what I was missing. It makes a whole lot more sense now. Thanks. > > Indeed, that's the confusing part... It's the only nonsymettric SA > query/response where the query is one thing and the response another. To make things a little worse, it looks like MultiPathRecord support is optional. Do all SAs (outside of OpenSM) support this attribute (along with double-sided RMPP)? 
From rdreier at cisco.com Tue Jan 31 15:18:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 15:18:23 -0800 Subject: [openib-general] [git pull] InfiniBand fixes for 2.6.16 In-Reply-To: (Roland Dreier's message of "Mon, 23 Jan 2006 08:03:03 -0800") References: Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Ingo Molnar: IB/srp: Semaphore to mutex conversion Michael S. Tsirkin: IPoIB: Make sure path is fully initialized before using it IB/uverbs: Flush scheduled work before unloading module IB/sa_query: Flush scheduled work before unloading module IPoIB: Lock accesses to multicast packet queues IB/mthca: Use correct GID in MADs sent on port 2 IB/mthca: Relax UAR size check IB/mthca: Don't cancel commands on a signal Roland Dreier: IB/mthca: Semaphore to mutex conversions drivers/infiniband/core/sa_query.c | 2 + drivers/infiniband/core/uverbs_main.c | 1 + drivers/infiniband/hw/mthca/mthca_av.c | 2 + drivers/infiniband/hw/mthca/mthca_cmd.c | 13 +++------ drivers/infiniband/hw/mthca/mthca_dev.h | 8 +++-- drivers/infiniband/hw/mthca/mthca_main.c | 10 +++++-- drivers/infiniband/hw/mthca/mthca_mcg.c | 20 +++++-------- drivers/infiniband/hw/mthca/mthca_memfree.c | 36 ++++++++++++------------ drivers/infiniband/hw/mthca/mthca_memfree.h | 7 ++--- drivers/infiniband/hw/mthca/mthca_provider.c | 6 ++-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 +-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 25 +++++++++++++++-- drivers/infiniband/ulp/srp/ib_srp.c | 14 +++++---- drivers/infiniband/ulp/srp/ib_srp.h | 5 +-- 14 files changed, 86 insertions(+), 67 deletions(-) From rdreier at cisco.com Tue Jan 31 15:39:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 15:39:07 -0800 Subject: [openib-general] 
[PATCH] extend ib_device node_type to include iWarp In-Reply-To: <43DFAB57.4030502@ichips.intel.com> (Sean Hefty's message of "Tue, 31 Jan 2006 10:24:23 -0800") References: <43DFAB57.4030502@ichips.intel.com> Message-ID: Sean> Any issues committing this? This does change ipoib, mthca, Sean> along with a slew of core files. I'll submit separate Sean> patches for the ehca and ipath drivers, but they should Sean> continue to work without changes. Reading this patch over, I'm not convinced that it makes sense to separate the node type and transport into bitfields. The notion of CA/switch/router is IB-specific; iWARP has only RNICs, right? Since there are only four choices of node type, why not just do something direct like:

	enum rdma_node_type {
		RDMA_NODE_IB_CA = 1,
		RDMA_NODE_IB_SWITCH,
		RDMA_NODE_IB_ROUTER,
		RDMA_NODE_RNIC
	};

	enum rdma_transport_type {
		RDMA_TRANSPORT_IB,
		RDMA_TRANSPORT_IWARP
	};

	static inline enum rdma_transport_type
	rdma_node_get_transport(enum rdma_node_type node)
	{
		switch (node) {
		case RDMA_NODE_IB_CA:
		case RDMA_NODE_IB_SWITCH:
		case RDMA_NODE_IB_ROUTER:
			return RDMA_TRANSPORT_IB;
		case RDMA_NODE_RNIC:
			return RDMA_TRANSPORT_IWARP;
		default:
			BUG();
		}
	}

(and I'm quite consciously switching over to use "rdma_" naming instead of "ib_" naming, since otherwise it's a little odd to worry about non-IB transports). Perhaps I'm being too picky here but it just seems ugly to me for iWARP drivers to do something like device->node_type = IB_NODE_CA | IB_NODE_IWARP; or even device->node_type = RDMA_NODE_CA | RDMA_NODE_IWARP; and I don't see why we want to allow for impossible things like an iWARP switch. - R. From sean.hefty at intel.com Tue Jan 31 16:39:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 31 Jan 2006 16:39:45 -0800 Subject: [openib-general] [PATCH v3] extend ib_device node_type to include iWarp In-Reply-To: Message-ID: Here's an updated version of the patch based on Roland's comments.
Patch includes updates to mthca, ipath, and ehca drivers. I did not update any userspace verbs support code. Signed-off-by: Sean Hefty --- Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 5240) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -1078,13 +1078,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 5240) +++ ulp/srp/ib_srp.c (working copy) @@ -1581,13 +1581,16 @@ static void srp_add_one(struct ib_device struct srp_host *host; int s, e, p; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { Index: ulp/kdapl/ib/dapl_provider.c =================================================================== --- ulp/kdapl/ib/dapl_provider.c (revision 5240) +++ ulp/kdapl/ib/dapl_provider.c (working copy) @@ -331,7 +331,7 @@ static void dapl_add_dev(struct ib_devic dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "dapl_add_dev called for %s\n", device->name); - if (IB_NODE_CA == device->node_type) + if (RDMA_NODE_IB_CA == device->node_type) for (i = 1; i <= device->phys_port_cnt; i++) dapl_add_port(device, i); } Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5240) +++ include/rdma/ib_verbs.h (working copy) @@ -56,12 +56,34 @@ 
union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 5240) +++ include/rdma/ib_addr.h (working copy) @@ -42,7 +42,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** Index: core/cm.c =================================================================== --- core/cm.c (revision 5240) +++ core/cm.c (working copy) @@ -3245,6 +3245,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) Index: core/addr.c =================================================================== --- core/addr.c (revision 5240) +++ core/addr.c (working copy) @@ -63,7 +63,7 @@ static int copy_addr(struct rdma_dev_add { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = RDMA_NODE_IB_CA; break; default: return -EADDRNOTAVAIL; Index: core/local_sa.c 
=================================================================== --- core/local_sa.c (revision 5240) +++ core/local_sa.c (working copy) @@ -362,6 +362,9 @@ static void sa_db_add_one(struct ib_devi struct sa_db_port *port; int i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, GFP_KERNEL); if (!dev) Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 5240) +++ core/sa_query.c (working copy) @@ -912,7 +912,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; Index: core/device.c =================================================================== --- core/device.c (revision 5240) +++ core/device.c (working copy) @@ -514,7 +514,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -589,7 +589,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 5240) +++ core/user_mad.c (working copy) @@ -936,7 +936,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; Index: core/cma.c =================================================================== --- core/cma.c (revision 5240) +++ core/cma.c (working copy) @@ -244,8 +244,10 @@ static int cma_acquire_ib_dev(struct rdm static int cma_acquire_dev(struct rdma_id_private *id_priv) { - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; + + switch (rdma_node_get_transport(dev_type)) { + case RDMA_TRANSPORT_IB: return cma_acquire_ib_dev(id_priv); default: return -ENODEV; @@ -324,8 +326,8 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_init_ib_qp(id_priv, qp); break; default: @@ -413,8 +415,8 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) @@ -540,8 +542,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_addr(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); break; default: @@ -560,8 +562,8 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case 
RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -620,8 +622,8 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -782,7 +784,7 @@ static struct rdma_id_private* cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -984,8 +986,8 @@ int rdma_listen(struct rdma_cm_id *id, i return -EINVAL; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); break; default: @@ -1077,8 +1079,8 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; default: @@ -1372,8 +1374,8 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_connect_ib(id_priv, conn_param); break; default: @@ -1431,8 +1433,8 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch 
(rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else @@ -1464,8 +1466,8 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); @@ -1491,8 +1493,8 @@ int rdma_disconnect(struct rdma_cm_id *i if (ret) goto out; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); Index: core/mad.c =================================================================== --- core/mad.c (revision 5240) +++ core/mad.c (working copy) @@ -2661,7 +2661,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; end = 0; } else { @@ -2708,7 +2711,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { Index: core/cache.c =================================================================== --- core/cache.c (revision 5240) +++ core/cache.c (working copy) @@ -61,12 +61,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, Index: core/sysfs.c =================================================================== --- core/sysfs.c (revision 5240) +++ core/sysfs.c (working copy) @@ -591,10 +591,14 @@ static ssize_t show_node_type(struct cla return -ENODEV; switch (dev->node_type) { - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); - default: return sprintf(buf, "%d: <unknown>\n", dev->node_type); + case RDMA_NODE_IB_CA: + return sprintf(buf, "%d: CA\n", dev->node_type); + case RDMA_NODE_IB_SWITCH: + return sprintf(buf, "%d: switch\n", dev->node_type); + case RDMA_NODE_IB_ROUTER: + return sprintf(buf, "%d: router\n", dev->node_type); + default: + return sprintf(buf, "%d: <unknown>\n", dev->node_type); } } @@ -687,7 +691,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; Index: core/ucm.c =================================================================== --- core/ucm.c (revision 5240) +++ core/ucm.c (working copy) @@ -1255,7 +1255,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); Index: core/ucma.c =================================================================== --- core/ucma.c (revision 5240) +++ core/ucma.c (working copy) @@ -479,8 +479,8 @@ static ssize_t ucma_query_route(struct u sizeof(struct 
sockaddr_in6)); resp.node_guid = ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (ctx->cm_id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) { + case RDMA_TRANSPORT_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); default: break; Index: core/smi.c =================================================================== --- core/smi.c (revision 5240) +++ core/smi.c (working copy) @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 
+175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ Index: core/ping.c =================================================================== --- core/ping.c (revision 5240) +++ core/ping.c (working copy) @@ -247,7 +247,10 @@ static void ib_ping_init_device(struct i { int num_ports, cur_port, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { @@ -278,7 +281,7 @@ static void ib_ping_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { Index: hw/ehca/ehca_main.c =================================================================== --- hw/ehca/ehca_main.c (revision 5240) +++ hw/ehca/ehca_main.c (working copy) @@ -382,7 +382,7 @@ int ehca_register_device(struct ehca_shc (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); - shca->ib_device.node_type = IB_NODE_CA; + shca->ib_device.node_type = RDMA_NODE_IB_CA; shca->ib_device.phys_port_cnt = shca->num_ports; shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; shca->ib_device.query_device = ehca_query_device; Index: hw/ipath/ipath_verbs.c =================================================================== --- hw/ipath/ipath_verbs.c (revision 5240) +++ hw/ipath/ipath_verbs.c (working copy) @@ -6034,7 +6034,7 @@ static int ipath_register_ib_device(cons (1ull << 
IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_pcidev(t); dev->class_dev.dev = dev->dma_device; Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 5240) +++ hw/mthca/mthca_provider.c (working copy) @@ -1240,7 +1240,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; From halr at voltaire.com Tue Jan 31 17:12:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Jan 2006 20:12:59 -0500 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFE843.9090203@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> <1138729361.4453.17459.camel@hal.voltaire.com> <43DFA80A.9020207@ichips.intel.com> <1138731162.4453.17689.camel@hal.voltaire.com> <43DFAE41.5050401@ichips.intel.com> <1138732784.4453.17876.camel@hal.voltaire.com> <43DFB425.4000908@ichips.intel.com> <1138734146.4453.18021.camel@hal.voltaire.com> <43DFB8B6.4040105@ichips.intel.com> <1138735396.4453.18171.camel@hal.voltaire.com> <43DFE843.9090203@ichips.intel.com> Message-ID: <1138755950.4453.21746.camel@hal.voltaire.com> On Tue, 2006-01-31 at 17:44, Sean Hefty wrote: > Hal Rosenstock wrote: > >>>The query response is PathRecords not MultiPathRecords. > >> > >>Okay - this is what I was missing. 
It makes a whole lot more sense now. Thanks. > > > > Indeed, that's the confusing part... It's the only nonsymmetric SA > > query/response where the query is one thing and the response another. > > To make things a little worse, it looks like MultiPathRecord support is > optional. Yes, it is an optional feature. There is an SA ClassPortInfo capability mask bit for this (and another one for the enhanced feature). > Do all SAs (outside of OpenSM) support this attribute (along with > double-sided RMPP)? Not currently AFAIK but if this is truly useful, the customers will demand it and the vendors who currently don't support it will need to implement it. -- Hal From rdreier at cisco.com Tue Jan 31 17:44:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jan 2006 17:44:27 -0800 Subject: [openib-general] [PATCH v3] extend ib_device node_type to include iWarp In-Reply-To: (Sean Hefty's message of "Tue, 31 Jan 2006 16:39:45 -0800") References: Message-ID: This looks good to me. Thanks for bearing with all my comments. - R. 
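For anyone reading the v3 patch in this flattened archive form, the node-type to transport mapping it adds to include/rdma/ib_verbs.h can be tried outside the kernel. The following is only a minimal userspace sketch: the two enums and rdma_node_get_transport() are copied from the patch, abort() is an assumed stand-in for the kernel's BUG(), and the two sketch_* helpers are hypothetical names that just mirror the checks the patch puts into the IB-only clients (cm, mad, sa_query, ipoib, srp and so on).

```c
#include <stdlib.h>

/* Copied from the patch to include/rdma/ib_verbs.h. */
enum rdma_node_type {
	/* IB values map to NodeInfo:NodeType. */
	RDMA_NODE_IB_CA = 1,
	RDMA_NODE_IB_SWITCH,
	RDMA_NODE_IB_ROUTER,
	RDMA_NODE_RNIC
};

enum rdma_transport_type {
	RDMA_TRANSPORT_IB,
	RDMA_TRANSPORT_IWARP
};

static inline enum rdma_transport_type
rdma_node_get_transport(enum rdma_node_type node_type)
{
	switch (node_type) {
	case RDMA_NODE_IB_CA:
	case RDMA_NODE_IB_SWITCH:
	case RDMA_NODE_IB_ROUTER:
		return RDMA_TRANSPORT_IB;
	case RDMA_NODE_RNIC:
		return RDMA_TRANSPORT_IWARP;
	default:
		abort();	/* assumption: userspace stand-in for BUG() */
	}
}

/*
 * Hypothetical helper: the early-return test the patch adds to the
 * IB-only add_one()/init_device() callbacks.  Returns 1 if such a
 * client should attach to a device of this node type, 0 otherwise.
 */
static inline int sketch_ib_client_accepts(enum rdma_node_type node_type)
{
	return rdma_node_get_transport(node_type) == RDMA_TRANSPORT_IB;
}

/*
 * Hypothetical helper: same logic as start_port() in core/cache.c,
 * taking the node type directly.  Switches use port 0, everything
 * else starts at port 1.
 */
static inline int sketch_start_port(enum rdma_node_type node_type)
{
	return (node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
}
```

With this mapping an RNIC is skipped by the IB-only clients, while the CMA switch statements in core/cma.c fall into their default branch for it until an iWarp case is added. Keeping the switch/port-0 test on the node type rather than on the transport is what lets the existing IB switch handling stay unchanged.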